Official implementation of SpeechSplit 2.0: Unsupervised Speech Disentanglement for Voice Conversion without Tuning Autoencoder Bottlenecks.
The audio demo can be found here.
| | Small Bottleneck | Large Bottleneck |
|---|---|---|
| Generator | link | link |
| F0 Converter | link | link |
The WaveNet vocoder is the same as in AutoVC. Please refer to the original repo to download the vocoder.
To run the demo, first create a new directory called `models` and download the pretrained models and the WaveNet vocoder into this directory. Then, run `demo.ipynb`. The converted results will be saved under `result`.
Download the VCTK Corpus and place it under `data/train`. The data directory should look like:
```
data
|__train
|  |__p225
|  |  |__wavfile1
|  |  |__wavfile2
|  |  |__...
|  |__p226
|  |__...
|__test
   |__p225_001.wav  # source audio for demo
   |__p258_001.wav  # target audio for demo
```
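As a quick sanity check, here is a minimal sketch (not part of the repo) that lists the speaker directories it finds under `data/train`, assuming the layout above:

```python
from pathlib import Path

# Collect the per-speaker subdirectories under data/train.
train_dir = Path("data/train")
speakers = sorted(p.name for p in train_dir.iterdir() if p.is_dir())
print(f"Found {len(speakers)} speaker directories, e.g. {speakers[:3]}")
```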
NOTE: The released models were trained only on a subset of speakers in the VCTK corpus. The full list of training speakers is encoded as a dictionary and saved in `spk_meta.pkl`. If you want to train with more speakers or use another dataset, please prepare the metadata in the following key-value format:

`speaker: (id, gender)`

where `speaker` should be a string, `id` should be a unique integer for each speaker (it will be used to generate the one-hot speaker vector), and `gender` should be either "M" (for male) or "F" (for female).
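As an illustration, here is a minimal sketch of writing such a metadata dictionary with Python's `pickle` module; the speaker names and ids below are hypothetical placeholders, not the contents of the released `spk_meta.pkl`:

```python
import pickle

# speaker -> (unique integer id, gender); values here are placeholders.
spk_meta = {
    "p225": (0, "F"),
    "p226": (1, "M"),
    "p227": (2, "M"),
}

with open("spk_meta.pkl", "wb") as f:
    pickle.dump(spk_meta, f)
```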
To generate features, run

```
python main.py --stage 0
```
By default, all generated features are saved in the `feat` directory.
To train a model from scratch, run

```
python main.py --stage 1 --config_name spsp2-large --model_type G
```
To finetune a pretrained model (make sure all pretrained models are downloaded into `models`), run

```
python main.py --stage 1 --config_name spsp2-large --model_type G --resume_iters 800000
```
If you want to train the variant with the smaller bottleneck, replace `spsp2-large` with `spsp2-small`. If you want to train the pitch converter, replace `G` with `F`.
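For example, to train the F0 converter with the small bottleneck from scratch, run

```
python main.py --stage 1 --config_name spsp2-small --model_type F
```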