CAB: An Energy-Based Speaker Clustering Model for Rapid Adaptation in Non-Parallel Voice Conversion

Nakashika, Toru

doi:10.21437/Interspeech.2017-133

This paper proposes a new energy-based probabilistic model for voice conversion (VC), called CAB (Cluster Adaptive restricted Boltzmann machine), that requires no parallel data during training and only a small amount of speech data during adaptation. Most existing VC methods require parallel data for training. Recently, VC methods that do not require parallel data (called non-parallel VC) have also been proposed and are attracting much attention because, unlike conventional approaches, they need no prepared or recorded parallel speech data. The proposed CAB model targets statistical non-parallel VC based on cluster adaptive training (CAT). It extends the VC method used in our previous model, the ARBM (adaptive restricted Boltzmann machine). The ARBM approach assumes that any speech signal can be decomposed into speaker-invariant phonetic information and speaker-identity information using the adaptation matrices of each speaker. VC is achieved by switching the source speaker’s identity to that of the target speaker while retaining the phonetic information obtained by decomposing the source speaker’s speech. In contrast, in CAB, speaker identities are represented as cluster vectors that determine the adaptation matrices. As the number of clusters is generally smaller than the number of speakers, the number of model parameters can be reduced compared to ARBM, which enables rapid adaptation to a new speaker. Our experimental results show that the proposed method performed better than the ARBM approach, particularly in adaptation.
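
The parameter-saving idea can be illustrated with a minimal sketch. It assumes, as is typical in cluster adaptive training but not stated in the abstract, that a speaker's adaptation matrix is a weighted sum of shared cluster matrices, with the speaker's cluster vector supplying the weights. The dimensions `D` and `K`, the random values, and the helper `adaptation_matrix` are hypothetical and chosen only for illustration; this is not the paper's actual model or training procedure.

```python
import numpy as np

rng = np.random.default_rng(0)

D = 32  # acoustic feature dimension (hypothetical)
K = 8   # number of clusters (hypothetical)

# Cluster matrices shared by all speakers (random stand-ins for trained values).
cluster_mats = 0.1 * rng.standard_normal((K, D, D))


def adaptation_matrix(cluster_vec, cluster_mats):
    """Assumed form: A = sum_k cluster_vec[k] * cluster_mats[k]."""
    # Contract the cluster axis: (K,) with (K, D, D) gives (D, D).
    return np.tensordot(cluster_vec, cluster_mats, axes=1)


# Each speaker is described only by a K-dimensional cluster vector.
src_vec = rng.random(K)
tgt_vec = rng.random(K)

A_src = adaptation_matrix(src_vec, cluster_mats)
A_tgt = adaptation_matrix(tgt_vec, cluster_mats)

# Parameters to estimate for one new speaker under each scheme.
params_full_matrix = D * D   # a free adaptation matrix per speaker
params_cluster_vec = K       # only the cluster vector, shared matrices fixed

print("per-speaker parameters, full matrix:", params_full_matrix)
print("per-speaker parameters, cluster vector:", params_cluster_vec)
```

Under this assumption, adapting to a new speaker means estimating only `K` weights while the shared cluster matrices stay fixed, which is one way to read the claim that fewer parameters enable rapid adaptation from little data.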
