Abstract
Automatic Speaker Recognition (ASR) in mismatched conditions is a challenging task that requires robust feature extraction and classification techniques. The Long Short-Term Memory Recurrent Neural Network (LSTM-RNN) is an efficient network that can learn to recognize speakers, text-independently, when the training and testing recording conditions are similar; unfortunately, its performance degrades when these conditions differ. In this paper, features are extracted by applying the Radon projection to the spectrograms of speech signals and then computing the 2-D Discrete Cosine Transform (DCT), since the Radon Transform (RT) is less sensitive to noise and reverberation. The proposed features improve text-independent recognition accuracy and reduce the sensitivity of the system to noise and reverberation effects. The performance of the ASR system with the proposed features is compared to that of systems based on Mel Frequency Cepstral Coefficients (MFCCs) and spectrum features. For noisy utterances at an SNR of 25 dB, the recognition rate with the proposed features reaches 80%, compared with 27% for MFCCs and 28% for spectrum features. For reverberant speech, the recognition rate reaches 80.67% with the proposed features, compared with 54% for MFCCs and 62.67% for spectrum features.
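As an illustration of the feature pipeline described above (spectrogram, Radon projection, 2-D DCT), the following is a minimal Python sketch. It assumes SciPy and scikit-image for the spectrogram, Radon, and DCT steps; the frame length, set of projection angles, and number of retained DCT coefficients are illustrative assumptions, not values reported in the paper.

```python
# A minimal sketch of the feature-extraction pipeline described above
# (spectrogram -> Radon projection -> 2-D DCT). Library choices (SciPy,
# scikit-image) and parameter values (frame length, projection angles,
# number of retained DCT coefficients) are illustrative assumptions,
# not taken from the paper.
import numpy as np
from scipy.signal import spectrogram
from scipy.fft import dctn
from skimage.transform import radon


def radon_dct_features(signal, fs, n_angles=180, n_coeffs=32):
    """Return a Radon-DCT feature vector for one utterance."""
    # 1. Log-magnitude spectrogram of the utterance.
    _, _, sxx = spectrogram(signal, fs=fs, nperseg=512, noverlap=256)
    log_spec = np.log(sxx + 1e-10)

    # 2. Radon projections of the spectrogram image over a set of angles.
    #    Each projection integrates spectral energy along parallel lines,
    #    which is the property that reduces sensitivity to noise and
    #    reverberation.
    angles = np.linspace(0.0, 180.0, n_angles, endpoint=False)
    sinogram = radon(log_spec, theta=angles, circle=False)

    # 3. 2-D DCT of the projections; keep the low-order coefficients as a
    #    compact feature vector for the LSTM-RNN classifier.
    coeffs = dctn(sinogram, norm='ortho')
    return coeffs[:n_coeffs, :n_coeffs].flatten()


if __name__ == "__main__":
    fs = 16000
    dummy = np.random.randn(2 * fs)  # placeholder for a real 2-second utterance
    print(radon_dct_features(dummy, fs).shape)  # (1024,)
```

The resulting feature vectors would then be fed to the LSTM-RNN classifier in place of MFCC or raw-spectrum features.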