Selection of Suitable Features for Modeling the Durations of Syllables
Krothapalli S. Rao, Shashidhar G. Koolagudi
DOI: 10.4236/jsea.2010.312129

Abstract

Acoustic analysis and synthesis experiments have shown that duration and intonation patterns are the two most important prosodic features responsible for the quality of synthesized speech. In this paper, a set of features is proposed that influences the duration patterns of sequences of sound units. These features are derived from the results of duration analysis. Duration analysis provides a rough estimate of the features that affect the duration patterns of sequences of sound units, but predicting durations from these features with either linear models or a fixed rulebase is not accurate. The analysis shows that there is a gross trend in syllable durations with respect to the position of the syllable in the phrase, the position of the syllable in the word, the position of the word in the phrase, the syllable identity, and the context of the syllable (the preceding and following syllables). These features can then be used to predict syllable durations more accurately by exploring various nonlinear models. For analyzing the durations of sound units, broadcast news data in Telugu is used as the speech corpus. The prediction accuracy of the duration models developed using rulebases and neural networks is evaluated with objective measures such as the percentage of syllables predicted within a specified deviation, the average prediction error (µ), the standard deviation (σ), and the correlation coefficient (γ).
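The objective measures named above can be sketched as follows. This is an illustrative NumPy implementation, not the authors' evaluation code; the function name, the 25 ms deviation threshold, and the example durations are assumptions for demonstration only.

```python
import numpy as np

def duration_metrics(actual_ms, predicted_ms, deviation_ms=25.0):
    """Compute the four objective measures for syllable-duration prediction:
    percentage of syllables predicted within the specified deviation,
    average prediction error (mu), standard deviation of the error (sigma),
    and correlation coefficient (gamma) between actual and predicted values."""
    actual = np.asarray(actual_ms, dtype=float)
    predicted = np.asarray(predicted_ms, dtype=float)
    error = predicted - actual
    within = 100.0 * np.mean(np.abs(error) <= deviation_ms)   # % within deviation
    mu = np.mean(np.abs(error))                               # average prediction error
    sigma = np.std(error)                                     # std. dev. of the error
    gamma = np.corrcoef(actual, predicted)[0, 1]              # correlation coefficient
    return within, mu, sigma, gamma
```

For example, with actual syllable durations [100, 200, 300] ms and predictions [110, 190, 310] ms, all errors fall within 25 ms and the correlation is close to 1.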

Share and Cite:

Rao, K. and Koolagudi, S. (2010) Selection of Suitable Features for Modeling the Durations of Syllables. Journal of Software Engineering and Applications, 3, 1107-1117. doi: 10.4236/jsea.2010.312129.

Conflicts of Interest

The authors declare no conflicts of interest.

References

[1] K. S. Rao, “Acquisition and Incorporation of Prosody Knowledge for Speech Systems in Indian Languages,” Ph.D. Thesis, Indian Institute of Technology Madras, Chennai, May 2005.
[2] L. Mary, K. S. Rao, S. V. Gangashetty and B. Yegnanarayana, “Neural Network Models for Capturing Duration and Intonation Knowledge for Language and Speaker Identification,” International Conference on Cognitive and Neural Systems, Boston, May 2004.
[3] A. S. M. Kumar, S. Rajendran and B. Yegnanarayana, “Intonation Component of Text-to-Speech System for Hindi,” Computer Speech and Language, Vol. 7, No. 3, 1993, pp. 283-301. doi:10.1006/csla.1993.1015
[4] S. Werner and E. Keller, “Prosodic Aspects of Speech,” In: E. Keller, Ed., Fundamentals of Speech Synthesis and Speech Recognition: Basic Concepts, State of the Art, the Future Challenges, John Wiley, Chichester, 1994, pp. 23-40.
[5] K. K. Kumar, “Duration and Intonation Knowledge for Text-to-Speech Conversion System for Telugu and Hindi,” Master’s Thesis, Indian Institute of Technology Madras, Chennai, May 2002.
[6] S. R. R. Kumar, “Significance of Durational Knowledge for a Text-to-Speech System in an Indian Language,” Master’s Thesis, Indian Institute of Technology Madras, Chennai, March 1990.
[7] O. Sayli, “Duration Analysis and Modeling for Turkish Text-to-Speech Synthesis,” Master’s Thesis, Bogazici University, Istanbul, 2002.
[8] A. Chopde, “ITRANS Indian Language Transliteration Package Version 5.2 Source.” http://www.aczone.con/itrans/.
[9] A. N. Khan, S. V. Gangashetty and S. Rajendran, “Speech Database for Indian Languages—A Preliminary Study,” International Conference on Natural Language Processing, Mumbai, December 2002, pp. 295-301.
[10] A. N. Khan, S. V. Gangashetty and B. Yegnanarayana, “Syllabic Properties of Three Indian Languages: Implications for Speech Recognition and Language Identification,” International Conference on Natural Language Processing, Mysore, December 2003, pp. 125-134.
[11] O. Fujimura, “Syllable as a Unit of Speech Recognition,” IEEE Transactions on Acoustics, Speech, and Signal Processing, Vol. 23, No. 1, 1975, pp. 82-87. doi:10.1109/TASSP.1975.1162631
[12] D. H. Klatt, “Review of Text-to-Speech Conversion for English,” Journal of the Acoustical Society of America, Vol. 82, No. 3, 1987, pp. 737-793. doi:10.1121/1.395275
[13] S. Haykin, “Neural Networks: A Comprehensive Foundation,” Pearson Education Asia, Inc., New Delhi, 1999.
[14] M. Riedi, “A Neural Network Based Model of Segmental Duration for Speech Synthesis,” Proceedings of European Conference on Speech Communication and Technology, Madrid, September 1995, pp. 599-602.
[15] K. S. Rao and B. Yegnanarayana, “Modeling Syllable Duration in Indian Languages Using Neural Networks,” Proceedings of IEEE International Conference on Acoustics, Speech, Signal Processing, Montreal, May 2004, pp. 313-316.
[16] W. N. Campbell, “Predicting Segmental Durations for Accommodation within a Syllable-Level Timing Framework,” Proceedings of European Conference on Speech Communication and Technology, Berlin, Vol. 2, September 1993, pp. 1081-1084.
[17] K. S. Rao and B. Yegnanarayana, “Intonation Modeling for Indian Languages,” Proceedings of International Conference on Spoken Language Processing, Jeju Island, October 2004, pp. 733-736.
[18] M. Vainio and T. Altosaar, “Modeling the Microprosody of Pitch and Loudness for Speech Synthesis with Neural Networks,” Proceedings of International Conference on Spoken Language Processing, Sydney, December 1998.
[19] S. Lee, K. Hirose and N. Minematsu, “Incorporation of Prosodic Modules for Large Vocabulary Continuous Speech Recognition,” Proceedings of ISCA Workshop on Prosody in Speech Recognition and Understanding, New Jersey, 2001, pp. 97-101.
[20] K. Iwano, T. Seki and S. Furui, “Noise Robust Speech Recognition Using F0 Contour Extracted by Hough Transform,” Proceedings of International Conference on Spoken Language Processing, Denver, 2002, pp. 941-944.
[21] L. Mary and B. Yegnanarayana, “Prosodic Features for Speaker Verification,” Proceedings of International Conference on Spoken Language Processing, Pittsburgh, September 2006, pp. 917-920.
[22] L. Mary, “Multi Level Implicit Features for Language and Speaker Recognition,” Ph.D. Thesis, Indian Institute of Technology Madras, Chennai, June 2006.
[23] L. Mary and B. Yegnanarayana, “Consonant-Vowel Based Features for Language Identification,” International Conference on Natural Language Processing, Kanpur, December 2005, pp. 103-106.
[24] L. Mary, K. S. Rao and B. Yegnanarayana, “Neural Network Classifiers for Language Identification Using Phonotactic and Prosodic Features,” Proceedings of International Conference on Intelligent Sensing and Information Processing (ICISIP), Chennai, January 2005, pp. 404-408. doi:10.1109/ICISIP.2005.1529486
[25] S. R. R. Kumar and B. Yegnanarayana, “Significance of Durational Knowledge for Speech Synthesis in Indian Languages,” Proceedings of IEEE Region 10 Conference Convergent Technologies for the Asia-Pacific, Bombay, November 1989, pp. 486-489.
[26] E. D. Sontag, “Feedback Stabilization Using Two Hidden Layer Nets,” IEEE Transactions on Neural Networks, Vol. 3, No. 6, November 1992, pp. 981-990. doi:10.1109/72.165599
[27] B. Yegnanarayana, “Artificial Neural Networks,” Prentice-Hall, New Delhi, India, 1999.

Copyright © 2024 by authors and Scientific Research Publishing Inc.


This work and the related PDF file are licensed under a Creative Commons Attribution 4.0 International License.