# One-shot Voice Conversion by Separating Speaker and Content Representations with Instance Normalization
This is the official implementation of the paper [One-shot Voice Conversion by Separating Speaker and Content Representations with Instance Normalization](https://arxiv.org/abs/1904.05742).
By separately learning speaker and content representations, we can achieve one-shot VC with only one utterance from the source speaker and one utterance from the target speaker.
You can find the demo webpage [here](https://jjery2243542.github.io/one-shot-vc-demo/), download the pretrained model from [here](http://speech.ee.ntu.edu.tw/~jjery2243542/resource/model/is19/vctk_model.ckpt), and download the corresponding normalization parameters for inference from [here](http://speech.ee.ntu.edu.tw/~jjery2243542/resource/model/is19/attr.pkl).
We also use some preprocessing scripts from [Kyubyong/tacotron](https://github.com/Kyubyong/tacotron) and [magenta/magenta/models/gansynth](https://github.com/tensorflow/magenta/tree/master/magenta/models/gansynth).
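After downloading, you can do a quick sanity check on the checkpoint and the normalization parameters with plain PyTorch and pickle. The snippet below is only a minimal sketch: it assumes the checkpoint was saved with `torch.save` and that `attr.pkl` is an ordinary pickle file, and the local file names are just examples.

```python
import pickle

import torch

# Local file names are examples; point these at wherever you saved the downloads.
CKPT_PATH = "vctk_model.ckpt"
ATTR_PATH = "attr.pkl"

# Load the pretrained checkpoint on CPU (assuming it was saved with torch.save).
checkpoint = torch.load(CKPT_PATH, map_location="cpu")
if isinstance(checkpoint, dict):
    # If the checkpoint is (or contains) a state dict, list a few tensor names.
    print(list(checkpoint.keys())[:10])

# Load the normalization parameters applied to mel features at inference time
# (assuming attr.pkl stores statistics such as per-dimension mean and std).
with open(ATTR_PATH, "rb") as f:
    attr = pickle.load(f)
print(attr.keys() if isinstance(attr, dict) else type(attr))
```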

# Differences from the paper
The implementation differs slightly from the paper in a few ways that I found helpful for stabilizing training or improving audio quality. Since evaluating these changes requires human listening tests, we only updated the code and did not update the paper. The differences are listed below:
- Dropout is not applied to the speaker encoder or the content encoder.
- Normalization is placed at the pre-activation position.
- The original KL-divergence loss for the VAE is used rather than the unit-variance version.
- KL annealing is used, with the weight gradually increasing to 1 (both KL-related changes are sketched below).
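To make the two KL-related points concrete, here is a minimal sketch of the standard diagonal-Gaussian KL term (encoder-predicted variance, not fixed to 1) combined with a linear KL-annealing schedule whose weight ramps up to 1. It only illustrates the idea; the function names, the warm-up length, and the training-loop variables are hypothetical, not the repository's actual code.

```python
import torch


def kl_divergence(mu: torch.Tensor, log_var: torch.Tensor) -> torch.Tensor:
    """Standard KL(q(z|x) || N(0, I)) for a diagonal Gaussian posterior.

    This is the original VAE KL term: the encoder predicts the variance
    instead of it being fixed to 1.
    """
    return 0.5 * torch.mean(
        torch.sum(log_var.exp() + mu.pow(2) - 1.0 - log_var, dim=-1)
    )


def kl_weight(step: int, warmup_steps: int = 50_000) -> float:
    """Linear KL annealing: the weight grows from 0 to 1 over warmup_steps."""
    return min(1.0, step / warmup_steps)


# Inside the training loop (hypothetical variable names):
#   loss = rec_loss + kl_weight(step) * kl_divergence(mu, log_var)
```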

# Preprocess
We provide preprocessing scripts for two datasets: VCTK and LibriTTS. The download links are below.
sql = "select id, recording_url from recording where user_id={} and created_on > {} and is_public = 1 and is_deleted = 0 and media_type in (1, 2, 3, 4, 9, 10) ".format(