Abstract
Voice command in multi-room smart homes, aimed at assisting people losing autonomy in their daily activities, faces several challenges; one of them is the distant-speech condition, which degrades ASR performance. This paper presents an overview of several techniques for fusing multi-source audio (pre-, middle- and post-decoding fusion) for automatic speech recognition of in-home voice commands. Robust speech models are obtained by adaptation to the environment and to the task. Experiments are based on several publicly available realistic datasets in which participants enacted activities of daily living. The corpora were recorded in natural conditions, meaning that background noise is sporadic rather than pervasive. The smart home is equipped with one or two microphones per room, more than 1 m apart. An evaluation of the most suitable techniques shows that voice command recognition is improved at the decoding level by using multiple sources and model adaptation. Although the Word Error Rate (WER) lies between 26 and 40%, the Domotic Error Rate (defined like the WER, but at the level of the voice command) is below 5.8% for deep neural network models; the method using feature-space Maximum Likelihood Linear Regression (fMLLR) with speaker adaptive training and a Subspace Gaussian Mixture Model (SGMM) exhibits comparable results.
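The Domotic Error Rate mentioned above is computed exactly like the WER, except that the scoring unit is the whole voice command rather than the word. Below is a minimal sketch of both metrics, assuming a plain Levenshtein alignment; it is illustrative code, not the scoring tool used in the paper.

```python
# Illustrative sketch: WER, and a command-level error rate computed the same
# way but with whole commands as tokens. Not the authors' scoring tool.
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[len(ref)][len(hyp)]

def error_rate(refs, hyps):
    """Total edit distance divided by total reference length."""
    errors = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    return errors / sum(len(r) for r in refs)

# WER: tokens are words (one word deleted out of three -> 33%).
wer = error_rate([["allume", "la", "lumière"]], [["allume", "lumière"]])
# Command-level rate: each token is a whole recognized command, so the same
# utterance counts as at most one error.
der = error_rate([["TURN_ON_LIGHT"]], [["TURN_ON_LIGHT"]])
print(f"WER = {wer:.2f}, command-level error rate = {der:.2f}")
```

Under this definition, word-level errors that do not change the recognized command do not count, which is consistent with the command-level rate staying far below the WER.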
Notes
Note that, as with any assistive technology, the intrusiveness of an ICT can be accepted if the benefit is worth it.
See (Zhang et al. 2014) for an interesting study on these matters.
https://vocadom.imag.fr
References
Aman, F., Vacher, M., Rossato, S., & Portet, F. (2013). Speech recognition of aged voices in the AAL context: Detection of distress sentences. The 7th International Conference on Speech Technology and Human-Computer Dialogue, SpeD 2013 (pp. 177–184). Cluj-Napoca, Romania.
Aman, F., Aubergé, V., Vacher, M. (2016). Influence of expressive speech on ASR performances: Application to elderly assistance in smart home. In: Sojka, P., Horak, A., Kopecek, I., Pala, K. (eds) Text, speech, and dialogue: 19th International Conference, TSD 2016, New York: Springer International Publishing, pp. 522–530. 10.1007/978-3-319-45510-5_60
Anguera, X., Wooters, C., & Hernando, J. (2007). Acoustic beamforming for speaker diarization of meetings. IEEE Transactions on Audio, Speech, and Language Processing, 15(7), 2011–2022. https://doi.org/10.1109/TASL.2007.902460.
Audibert, N., Aubergé, V., Rilliard, A. (2005). The prosodic dimensions of emotion in speech: The relative weights of parameters. 9th european conference on speech communication and technology. Interspeech 2005, Lisbon, Portugal, pp. 525–528.
Baba, A., Lee, A., Saruwatari, H., Shikano, K. (2002). Speech recognition by reverberation adapted acoustic model. In: ASJ General Meeting, pp. 27–28.
Baba, A., Yoshizawa, S., Yamada, M., Lee, A., Shikano, K. (2004). Acoustic models of the elderly for large-vocabulary continuous speech recognition. Electronics and Communications in Japan, Part 2, 87(7), 49–57.
Badii, A., Boudy, J. (2009). CompanionAble—integrated cognitive assistive & domotic companion robotic systems for ability & security. 1st Congres of the Société Française des Technologies pour l’Autonomie et de Gérontechnologie (SFTAG’09), Troyes, pp. 18–20.
Barker, J., Vincent, E., Ma, N., Christensen, H., & Green, P. D. (2013). The PASCAL chime speech separation and recognition challenge. Computer Speech & Language, 27(3), 621–633.
Barker, J., Marxer, R., Vincent, E., Watanabe, S. (2015). The third ’chime’ speech separation and recognition challenge: Dataset, task and baselines. Workshop on Automatic Speech Recognition and Understanding (ASRU), pp. 504–511.
Barras, C., Geoffrois, E., Wu, Z., & Liberman, M. (2001). Transcriber: Development and use of a tool for assisting speech corpora production. Speech Communication, 33(1–2), 5–22.
Bouakaz, S., Vacher, M., Bobillier-Chaumon, M. E., Aman, F., Bekkadja, S., Portet, F., et al. (2014). CIRDO: Smart companion for helping elderly to live at home for longer. IRBM, 35(2), 101–108.
Brandstein, M., & Ward, D. (Eds.). (2001). Microphone arrays : signal processing techniques and applications. Berlin: Springer-Verlag.
Caballero-Morales, S. O., & Trujillo-Romero, F. (2014). Evolutionary approach for integration of multiple pronunciation patterns for enhancement of dysarthric speech recognition. Expert Systems with Applications, 41(3), 841–852.
Chahuara, P., Portet, F., & Vacher, M. (2017). Context-aware decision making under uncertainty for voice-based control of smart home. Expert Systems with Applications, 75, 63–79. https://doi.org/10.1016/j.eswa.2017.01.014.
Chan, M., Estéve, D., Escriba, C., & Campo, E. (2008). A review of smart homes—present state and future challenges. Computer Methods and Programs in Biomedicine, 91(1), 55–81.
Charalampos, D., Maglogiannis, I. (2008). Enabling human status awareness in assistive environments based on advanced sound and motion data classification. Proceedings of the 1st international conference on PErvasive Technologies Related to Assistive Environments, pp. 1:1–1:8.
Christensen, H., Casanuevo, I., Cunningham, S., Green, P., Hain, T. (2013). Homeservice: Voice-enabled assistive technology in the home using cloud-based automatic speech recognition. SLPAT, pp. 29–34.
Cristoforetti, L., Ravanelli, M., Omologo, M., Sosi, A., Abad, A., Hagmueller, M., et al. (2014). The DIRHA simulated corpus. The 9th edition of the Language Resources and Evaluation Conference (LREC) (pp. 2629–2634). Reykjavik, Iceland.
Deng, L., Acero, A., Plumpe, M., Huang, X. (2000). Large-vocabulary speech recognition under adverse acoustic environments. ICSLP-2000, ISCA, Beijing, China, Vol. 3, pp. 806–809.
Filho, G., & Moir, T. (2010). From science fiction to science fact: A smart-house interface using speech technology and a photorealistic avatar. International Journal of Computer Applications in Technology, 39(8), 32–39.
Fiscus, J. G. (1997). A post-processing system to yield reduced word error rates: Recognizer Output Voting Error Reduction (ROVER). Proceedings of IEEE Workshop on ASRU, pp. 347–354. https://doi.org/10.1109/ASRU.1997.659110
Fleury, A., Vacher, M., Portet, F., Chahuara, P., & Noury, N. (2013). A French corpus of audio and multimodal interactions in a health smart home. Journal on Multimodal User Interfaces, 7(1), 93–109.
Hamill, M., Young, V., Boger, J., Mihailidis, A. (2009). Development of an automated speech recognition interface for personal emergency response systems. Journal of NeuroEngineering and Rehabilitation, 6, 26
Hwang, Y., Shin, D., Yang, C. Y., Lee, S. Y., Kim, J., Kong, B., Chung, J., Kim, S., Chung, M. (2012). Developing a voice user interface with improved usability for people with dysarthria. 13th International Conference on Computers Helping People with Special Needs, ICCHP’12, pp. 117–124.
Lecouteux, B., Vacher, M., Portet, F. (2011). Distant speech recognition in a smart home: Comparison of several multisource asrs in realistic conditions. Proceedings of InterSpeech, pp. 2273–2276.
Lecouteux, B., Linares, G., Estève, Y., & Gravier, G. (2013). Dynamic combination of automatic speech recognition systems by driven decoding. IEEE Transactions on Audio, Speech & Language Processing, 21(6), 1251–1260.
Matos, M., Abad, A., Astudillo, R., & Trancoso, I. (2014). IberSPEECH 2014 (pp. 178–188). Las Palmas de Gran Canaria, Spain.
McCowan, I., Moore, D., Dines, J., Gatica-Perez, D., Flynn, M., Wellner, P., et al. (2005). On the use of information retrieval measures for speech recognition evaluation, Tech. rep.. Martigny: Idiap.
Michaut, F., & Bellanger, M. (2005). Filtrage adaptatif: Théorie et algorithmes. Hermes Science Publications, Lavoisier.
Mueller, P., Sweeney, R., & Baribeau, L. (1984). Acoustic and morphologic study of the senescent voice. Ear, Nose, and Throat Journal, 63, 71–75.
Ons, B., Gemmeke, J. F., Hamme, H. V. (2014). The self-taught vocal interface. EURASIP Journal on Audio, Speech, and Music Processing, 2014, 43.
Parker, M., Cunningham, S., Enderby, P., Hawley, M., & Green, P. (2006). Automatic speech recognition and training for severely dysarthric users of assistive technology: The stardust project. Clinical Linguistics & Phonetics, 20(2–3), 149–156.
Peetoom, K. K. B., Lexis, M. A. S., Joore, M., Dirksen, C. D., De Witte, L. P. (2014). Literature review on monitoring technologies and their outcomes in independently living elderly people. Disability and Rehabilitation: Assistive Technology, 10, 1–24.
Pellegrini, T., Trancoso, I., Hämäläinen, A., Calado, A., Dias, M. S., Braga, D. (2012). Impact of Age in ASR for the Elderly: Preliminary Experiments in European Portuguese. Advances in Speech and Language Technologies for Iberian Languages—IberSPEECH 2012 Conference, Madrid, Spain, November 21-23, 2012. Proceedings, pp. 139–147.
Popescu, M., Li, Y., Skubic, M., Rantz, M. (2008). An acoustic fall detector system that uses sound height information to reduce the false alarm rate. Proceedings of 30th Annual International Conference of the IEEE-EMBS 2008, pp. 4628–4631.
Portet, F., Vacher, M., Golanski, C., Roux, C., & Meillon, B. (2013). Design and evaluation of a smart home voice interface for the elderly—Acceptability and objection aspects. Personal and Ubiquitous Computing, 17(1), 127–144.
Portet, F., Christensen, H., Rudzicz, F., & Alexandersson, J. (2015). Perspectives on speech and language interaction for daily assistive technology: Overall introduction to the special issue Part 3. ACM Transactions on Speech and Language Processing, 7(2).
Potamianos, G., Neti, C. (2001). Automatic speechreading of impaired speech. AVSP 2001-International Conference on Auditory-Visual Speech Processing.
Povey, D., Burget, L., Agarwal, M., Akyazi, P., Kai, F., Ghoshal, A., et al. (2011a). The subspace Gaussian mixture model - A structured model for speech recognition. Computer Speech & Language, 25(2), 404–439.
Povey, D., Ghoshal, A., Boulianne, G., Burget, L., Glembek, O., Goel, N. et al. (2011b). The Kaldi speech recognition toolkit. IEEE 2011 Workshop on Automatic Speech Recognition and Understanding, IEEE Signal Processing Society, IEEE Catalog No. CFP11SRW-USB.
Ravanelli, M., & Omologo, M. (2015). Contaminated speech training methods for robust DNN-HMM distant speech recognition. INTERSPEECH 2015 (pp. 756–760). Dresden, Germany.
Ravanelli, M., Cristoforetti, L., Gretter, R., Pellin, M., Sosi, A., & Omologo, M. (2015). The DIRHA-english corpus and related tasks for distant-speech recognition in domestic environments. 2015 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), pp. 275–282.
Rudzicz, F. (2011). Acoustic transformations to improve the intelligibility of dysarthric speech. Proceedings of the Second Workshop on Speech and Language Processing for Assistive Technologies, pp. 11–21.
Rumelhart, D., Hinton, G., & Williams, R. (1986). Learning representations by back-propagating errors. Nature, 323, 533–536.
Ryan, W., & Burk, K. (1974). Perceptual and acoustic correlates in the speech of males. Journal of Communication Disorders, 7, 181–192.
Stowell, D., Giannoulis, D., Benetos, E., Lagrange, M., & Plumbley, M. D. (2015). Detection and classification of audio scenes and events. IEEE Transactions on Multimedia, 17(10), 1733–1746.
Takeda, N., Thomas, G., & Ludlow, C. (2000). Aging effects on motor units in the human thyroarytenoid muscle. Laryngoscope, 110, 1018–1025.
Thiemann, J., Vincent, E. (2013). An experimental comparison of source separation and beamforming techniques for microphone array signal enhancement. MLSP—23rd IEEE International Workshop on Machine Learning for Signal Processing, 2013, Southampton, United Kingdom.
Vacher, M., Serignat, J., Chaillol, S., Istrate, D., & Popescu, V. (2006). Speech and sound use in a remote monitoring system for health care. In P. Sojka, I. Kopeček, & K. Pala (Eds.), Text, speech and dialogue, LNCS 4188 (pp. 711–718). Berlin: Springer.
Vacher, M., Portet, F., Fleury, A., & Noury, N. (2011). Development of audio sensing technology for ambient assisted living: Applications and challenges. International Journal of E-Health and Medical Communications, 2(1), 35–54.
Vacher, M., Lecouteux, B., & Portet, F. (2012). Recognition of voice commands by multisource ASR and noise cancellation in a smart home environment. EUSIPCO (European Signal Processing Conference), Bucharest, Romania, pp. 1663–1667. https://hal.inria.fr/hal-00953511.
Vacher, M., Lecouteux, B., Chahuara, P., Portet, F., Meillon, B., & Bonnefond, N. (2014). The Sweet-Home speech and multimodal corpus for home automation interaction. The 9th edition of the Language Resources and Evaluation Conference (LREC) (pp. 4499–4506). Reykjavik, Iceland.
Vacher, M., Caffiau, S., Portet, F., Meillon, B., Roux, C., Elias, E., Lecouteux, B., Chahuara, P. (2015a). Evaluation of a context-aware voice interface for Ambient Assisted Living: Qualitative user study vs. quantitative system evaluation. ACM Transactions on Accessible Computing, 7(2), 5:1–5:36.
Vacher, M., Lecouteux, B., Serrano-Romero, J., Ajili, M., Portet, F., Rossato, S. (2015b). Speech and speaker recognition for home automation: Preliminary results. Proceedings of the 8th International Conference on Speech Technology and Human-Computer Dialogue (SpeD 2015), IEEE, Bucharest, Romania, pp. 181–190.
Vacher, M., Bouakaz, S., Bobillier Chaumon, M. E., Aman, F., Khan, R. A., & Bekkadja, S., et al. (2016). The CIRDO corpus: comprehensive audio/video database of domestic falls of elderly people. 10th International Conference on Language Resources and Evaluation (LREC 2016), ELRA (pp. 1389–1396). Portoroz, Slovenia.
Valin, J. M. (2006). Speex: A free codec for free speech. Australian National Linux Conference, Dunedin, New Zealand.
Vincent, E., Barker, J., Watanabe, S., Le Roux, J., Nesta, F., & Matassoni, M. (2013). The Second ’CHiME’ Speech Separation and Recognition Challenge: An overview of challenge systems and outcomes. 2013 IEEE Automatic Speech Recognition and Understanding Workshop (pp. 162–167). Olomouc, Czech Republic.
Vipperla, R., Renals, S., & Frankel, J. (2008). Longitudinal study of ASR performance on ageing voices. 9th International Conference on Speech Science and Speech Technology (InterSpeech 2008) (pp. 2550–2553). Brisbane, Australia.
Vipperla, R. C., Wolters, M., Georgila, K., Renals, S. (2009). Speech input from older users in smart environments: Challenges and perspectives. HCI International: Universal Access in Human-Computer Interaction. Intelligent and Ubiquitous Interaction Environments.
Vlasenko, B., Prylipko, D., Philippou-Hübner, D., & Wendemuth, A. (2011). Vowels formants analysis allows straightforward detection of high arousal acted and spontaneous emotions. Proceedings of Interspeech, 2011, 1577–1580.
Vlasenko, B., Prylipko, D., & Wendemuth, A. (2012). Towards robust spontaneous speech recognition with emotional speech adapted acoustic models. Proceedings of the KI 2012.
Wölfel, M., & McDonough, J. (2009). Distant speech recognition. Hoboken: Wiley.
World Health Organization (2003). What are the main risk factors for disability in old age and how can disability be prevented? Available from: http://www.euro.who.int/document/E82970.pdf.
Xu, H., Povey, D., Mangu, L., & Zhu, J. (2011). Minimum Bayes risk decoding and system combination based on a recursion for edit distance. Computer Speech & Language, 25(4), 802–828. https://doi.org/10.1016/j.csl.2011.03.001, http://www.sciencedirect.com/science/article/pii/S0885230811000192.
Yoshioka, T., Ito, N., Delcroix, M., Ogawa, A., Kinoshita, K., Yu, M. F. C., et al. (2015). The NTT CHiME-3 system: Advances in speech enhancement and recognition for mobile multi-microphone devices. IEEE Automatic Speech Recognition and Understanding Workshop.
Zhang, X., Trmal, J., Povey, D., & Khudanpur, S. (2014). Improving deep neural network acoustic models using generalized maxout networks. IEEE International Conference on Acoustics, Speech and Signal Processing (pp. 215–219). Florence, Italy.
Zouari, L., Chollet, G. (2006). Efficient gaussian mixture for speech recognition. 18th International Conference on Pattern Recognition, 2006. ICPR 2006, Vol. 4, pp. 294–297. https://doi.org/10.1109/ICPR.2006.475.
Acknowledgements
This work is supported by the Agence Nationale de la Recherche under grant ANR-09-VERS-011. The authors would like to thank the participants who accepted to perform the experiments.
Appendix: Composition of the different corpora
All corpora were recorded in distant-speech conditions, with the exception of Voix Détresse. The corpora used for training are detailed in Appendix Sect. 7.1 and those used for development and test in Appendix Sect. 7.2. The training corpora consist of distant, multichannel recordings, each microphone being about 1 to 2 m from its nearest neighbour, and of expressive speech. The development and testing corpora were recorded in distant, multichannel conditions by participants interacting with the sweet-home voice command system.
Each sentence was manually annotated on the channel with the best Signal-to-Noise Ratio (SNR) using Transcriber. In addition, for the User Specific set, an automatic transcription is available; it was obtained with the patsh software running online during the experiments while participants interacted with the sweet-home system. This set is important because it makes it possible to estimate the performance achievable by a fully automatic system in a smart home application.
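In the corpora, the best-SNR channel was chosen by manual annotation; the sketch below only illustrates the selection criterion with a crude energy-based SNR estimate (function names, frame length and thresholds are assumptions for illustration, not the project's code).

```python
import numpy as np

def estimate_snr_db(signal, frame_len=400, noise_quantile=0.1):
    """Crude SNR estimate: mean frame energy over the energy of the quietest
    frames, taken as a noise-floor proxy."""
    n_frames = len(signal) // frame_len
    frames = np.asarray(signal[:n_frames * frame_len], dtype=np.float64)
    energies = (frames.reshape(n_frames, frame_len) ** 2).mean(axis=1) + 1e-10
    noise_floor = np.quantile(energies, noise_quantile)
    return 10.0 * np.log10(energies.mean() / noise_floor)

def best_snr_channel(channels):
    """Index of the channel with the highest estimated SNR."""
    return int(np.argmax([estimate_snr_db(ch) for ch in channels]))

# Toy usage with two synthetic 1-second, 16 kHz channels; the second one is
# noisier, so channel 0 should be selected.
rng = np.random.default_rng(0)
tone = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)
channels = [tone + 0.01 * rng.standard_normal(16000),
            tone + 0.50 * rng.standard_normal(16000)]
print(best_snr_channel(channels))  # expected: 0
```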
1.1 Training corpora
The detailed composition of each corpus is presented in Appendix Sect. 7.1.1 (Appendix Table 5a) for the Multimodal subset, in Appendix Sect. 7.1.2 (Appendix Table 5b) for the Cirdo corpus, in Appendix Sect. 7.1.3 (Appendix Table 6a) for the Home Automation corpus, and in Appendix Sect. 7.1.4 (Appendix Table 6b) for the Voix Détresse corpus.
1.1.1 Multimodal
The Multimodal subset of the sweet-home corpus (Vacher et al. 2014) was recorded by 21 participants (7 females and 14 males) to train models for automatic recognition of human activity and location. These two types of information are crucial for context-aware decision making in a smart home. For instance, a voice command such as “allume la lumière” (turn on the light) cannot be handled properly without knowledge of the user’s location. The experiment consisted in following a scenario of activities, with no constraint on the time spent or the manner of achieving them (e.g. talking on the phone, having breakfast, simulating a shower, getting some sleep, cleaning up the flat with the vacuum cleaner, etc.). During the experiment, data from the home automation network and from the audio and video sensors were captured. Speech was recorded using 7 microphones set in the ceiling, directed towards the ground, in the domus smart home (see Fig. 1).
In total, more than 26 h of data were acquired (audio, home automation sensors and video). The speech part consists of a telephone conversation in the office, representing 1785 sentences and 43 min 27 s of speech signal. No instruction was given to the participants about how they should speak or in which direction.
1.1.2 Cirdo
The Cirdo corpus (Vacher et al. 2016) was recorded by 17 participants (9 men and 8 women) with an average age of 40 years (SD 19.5). It was recorded in the framework of a project aiming at developing a system able to recognize calls for help in the homes of seniors, in order to provide reassurance and assistance. Thirteen participants were under 60 and wore a simulator which hampered their mobility and reduced their vision and hearing, so as to reproduce the physical condition of old age. The participants in the aged group (4 people) were between 61 and 83 years old (mean 68.5); those in the young group were between 16 and 52 years old.
The participants simulated five situations chosen from the 28 risky situations identified (1 slip, 1 stumble, 2 falls from a stationary position and 1 case of the hip being blocked on the sofa). These situations were selected because they are representative of falls at home and because they could be simulated safely by the participants. During the scenario, the participant uttered calls for help; some were part of the scenario, others were spontaneous speech. From these recordings, 414 calls or sentences were isolated, representing 15 min and 45 s of speech. Because of the recording conditions, this corpus consists of expressive speech: the participants are in distressing situations, e.g. when they fall down on the carpet. A single microphone was used.
The interest of such a corpus for training is that participants spoke spontaneously, with affect in their voices, even though the sentences were learnt at the beginning of the experiment. It is therefore closer to real conditions at home than usual corpora.
1.1.3 Home automation
The Home Automation Speech subset of the sweet-home corpus (Vacher et al. 2014) was recorded by 23 speakers (9 females and 14 males) to develop robust automatic recognition of voice commands in a smart home in distant-speech conditions. The audio channels were recorded to acquire a representative speech corpus composed not only of home automation commands and distress calls, but also of colloquial sentences, in clean or noisy conditions. No instruction was given to the participants about how they should speak or in which direction. Speech was recorded using 7 microphones set in the ceiling, directed towards the ground, in the domus smart home (see Fig. 1).
The home automation commands follow a simpler grammar than the one defined for the test in Sect. 4.2.1. The non-noisy part is composed, for each speaker, of a text of 285 words for acoustic adaptation (36 min and 351 sentences in total for the 23 speakers) and of 240 short sentences (2 h 30 min per channel in total for the 23 speakers). In the clean condition, 1076 voice commands and 348 distress calls were uttered. With 5340 sentences overall, this corpus amounts to 3 h and 45 s of speech signal.
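The actual grammar is given in Sect. 4.2.1 and is not reproduced here; the sketch below only illustrates the kind of keyword-plus-command pattern involved, with a placeholder keyword, verbs and device names that are not necessarily those of the sweet-home grammar.

```python
import re

# Placeholder grammar: an optional key word followed by an action verb and a
# device. The vocabulary here is illustrative, not the sweet-home grammar.
KEYWORD = r"(?:nestor\s+)?"          # key word, often omitted by users
ACTION = r"(?P<action>allume|éteins|ouvre|ferme)"
DEVICE = r"(?P<device>la lumière|la radio|les rideaux)"
COMMAND = re.compile(rf"^{KEYWORD}{ACTION}\s+{DEVICE}$")

def parse_command(utterance):
    """Return (action, device) if the utterance matches the grammar, else None."""
    m = COMMAND.match(utterance.strip().lower())
    return (m.group("action"), m.group("device")) if m else None

print(parse_command("Nestor allume la lumière"))  # ('allume', 'la lumière')
print(parse_command("ferme les rideaux"))         # ('ferme', 'les rideaux')
print(parse_command("peux-tu allumer"))           # None (outside the grammar)
```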
1.1.4 Voix Détresse
The Voix Détresse French corpus was recorded in the domus smart home in order to determine whether ASR performance is affected by expressive speech (Aman et al. 2016). First, speakers had to read 20 distress sentences in a neutral manner; these sentences were extracted from the AD80 corpus (Vacher et al. 2006). Then, elicited emotions were recorded: a photograph showing a person in a distress situation was associated with each sentence, and the participants were asked to put themselves in that person’s shoes and to utter the sentence in an expressive manner. The targeted emotions were mainly negative ones such as fear, anger and sadness.
This corpus was recorded with a single microphone by 5 elderly speakers and 11 younger speakers. It contains 1164 neutral and expressive sentences, for a duration of 18 min 45 s. Its interest for training is that it mixes neutral and expressive sentences; it is therefore closer to real recording conditions at home than usual corpora.
1.2 Development and testing corpora
The Interaction and User Specific subsets of the sweet-home corpus were recorded in realistic conditions, as described in Sect. 3.2. The first one involved 16 participants (7 women and 9 men, aged from 19 to 62 years). The experiment lasted 8 h 52 min, and 993 sentences were recorded and annotated under the same conditions as the training subset. The participants were in realistic everyday conditions and had to work out for themselves the voice command appropriate to the situation, so they did not follow the grammar exactly: in particular, the keyword was frequently omitted or uttered long before the command itself. The second one involved elderly people (5 women, aged from 74 to 91 years) and visually impaired people (2 women and 3 men, aged from 49 to 66 years); for this reason, the scenarios were slightly simplified.
The manually transcribed development and testing corpora are detailed in Appendix Table 7 for the Interaction corpus and in Appendix Table 8 for the User Specific one. The number of voice commands differs between speakers because, when a voice command was not correctly recognized, the requested action (light on or off, curtains up or down, etc.) was not triggered by the intelligent controller, and the speaker therefore often uttered the command two or three times.
Moreover, for the User Specific set, an automatic transcription is available; it was obtained with the patsh software running online during the experiments while participants interacted with the sweet-home system. This set is described in Appendix Table 9 but was not used in this paper.
In short, two corpora recorded in realistic conditions are available for development and test, amounting to 21 min 28 s and 17 min 48 s of manually transcribed data, and 22 min 28 s of automatically transcribed data.
Cite this article
Lecouteux, B., Vacher, M. & Portet, F. Distant speech processing for smart home: comparison of ASR approaches in scattered microphone network for voice command. Int J Speech Technol 21, 601–618 (2018). https://doi.org/10.1007/s10772-018-9520-y