Study of the Effect of Reducing Training Data in Speech Synthesis Adaptation Based on Frequency Warping

Alonso, Agustin; Erro, Daniel; Navas, Eva; Hernaez, Inma

doi:10.1007/978-3-319-49169-1_1

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 10077)

Included in the following conference series:

International Conference on Advances in Speech and Language Technologies for Iberian Languages

Abstract

Speaker adaptation techniques use a small amount of data to modify Hidden Markov Model (HMM) based speech synthesis systems so that they mimic a target voice. These techniques can provide personalized systems to people with a speech impairment, allowing them to communicate in a more natural way. Although adaptation techniques do not require a large amount of data, the recording process can be tedious for users with speaking difficulties. An important factor in improving the acceptance of these systems is therefore the ability to obtain acceptable results with a minimal amount of recordings. In this work we explore how the performance of an adaptation method based on frequency warping, which uses only vocalic segments, varies with the amount of available training data.

Acknowledgments

This work has been partially supported by MINECO/FEDER, UE (SpeechTech4All project, TEC2012-38939-C03-03 and RESTORE project, TEC2015-67163-C2-1-R), and the Basque Government (ELKAROLA project, KK-2015/00098).

Author information

Authors and Affiliations

AHOLAB, University of the Basque Country (UPV/EHU), Bilbao, Spain
Agustin Alonso, Daniel Erro, Eva Navas & Inma Hernaez
Basque Foundation for Science (IKERBASQUE), Bilbao, Spain
Daniel Erro

Corresponding author

Correspondence to Agustin Alonso.

Editor information

Editors and Affiliations

INESC-ID/IST, Universidade de Lisboa, Lisbon, Portugal
Alberto Abad
I3A/University of Zaragoza, Zaragoza, Spain
Alfonso Ortega
DETI/IEETA, University of Aveiro, Aveiro, Portugal
António Teixeira
AtlantTIC Research Center, Universidad de Vigo, Vigo, Spain
Carmen García Mateo
Universitat Politècnica de València, Valencia, Spain
Carlos D. Martínez Hinarejos
University of Coimbra, Coimbra, Portugal
Fernando Perdigão
INESC-ID/ISCTE-IUL, Lisbon, Portugal
Fernando Batista
INESC-ID/IST, Universidade de Lisboa, Lisbon, Portugal
Nuno Mamede

Cite this paper

Alonso, A., Erro, D., Navas, E., Hernaez, I. (2016). Study of the Effect of Reducing Training Data in Speech Synthesis Adaptation Based on Frequency Warping. In: Abad, A., et al. Advances in Speech and Language Technologies for Iberian Languages. IberSPEECH 2016. Lecture Notes in Computer Science, vol 10077. Springer, Cham. https://doi.org/10.1007/978-3-319-49169-1_1

DOI: https://doi.org/10.1007/978-3-319-49169-1_1
Published: 04 November 2016
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-49168-4
Online ISBN: 978-3-319-49169-1
eBook Packages: Computer Science, Computer Science (R0)
