Abstract
In real multi-speaker scenarios, the signal collected by the microphone contains many time periods during which only one speaker is active; these are called isolated speech segments. Based on this observation, this paper proposes a single-channel multi-speaker speech separation method that exploits the similarity between a speaker feature center and the mixture features in a deep embedding space. Specifically, the isolated speech segments extracted from the observed signal are converted into deep embedding vectors, from which a speaker feature center is created. The similarity between this center and the deep embedding features of the mixture is used as a mask for the corresponding speaker, which is then applied to separate that speaker's speech. A residual deep embedding network built from stacked 2-D convolutional blocks, replacing bi-directional long short-term memory, is proposed for faster inference and better feature extraction. In addition, an isolated speech segment extraction method based on Chimera++ is proposed, since earlier experiments showed that the Chimera++ algorithm separates segments containing only one speaker well. Evaluation results on standard datasets show that the proposed method outperforms competing algorithms by up to 0.94 dB in Signal-to-Distortion Ratio.
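To make the masking step concrete, the following is a minimal sketch (not the authors' implementation) of how a speaker feature center could be formed from the embeddings of a speaker's isolated segments and compared against the mixture embeddings to produce a time-frequency mask. The use of cosine similarity, the sigmoid squashing, and the embedding dimension are illustrative assumptions rather than details taken from the paper.

```python
import numpy as np

def speaker_center(isolated_embeddings: np.ndarray) -> np.ndarray:
    """Average the unit-normalized embeddings of one speaker's isolated
    T-F bins (shape: [num_bins, D]) into a single unit-norm center."""
    v = isolated_embeddings / np.linalg.norm(isolated_embeddings, axis=-1, keepdims=True)
    c = v.mean(axis=0)
    return c / np.linalg.norm(c)

def similarity_mask(mixture_embeddings: np.ndarray, center: np.ndarray) -> np.ndarray:
    """Cosine similarity between each mixture T-F embedding and the center,
    mapped to [0, 1] with a sigmoid so it can act as a magnitude mask.
    The sharpening factor 5.0 is an assumed, illustrative choice."""
    v = mixture_embeddings / np.linalg.norm(mixture_embeddings, axis=-1, keepdims=True)
    sim = v @ center                           # values in [-1, 1]
    return 1.0 / (1.0 + np.exp(-5.0 * sim))

# Toy usage with random 20-dim embeddings: 1000 mixture bins, 200 isolated bins.
rng = np.random.default_rng(0)
mix_emb = rng.normal(size=(1000, 20))
iso_emb = rng.normal(size=(200, 20))
mask = similarity_mask(mix_emb, speaker_center(iso_emb))   # shape (1000,)
# separated_magnitude = mask.reshape(T, F) * mixture_magnitude
```

In this sketch, bins whose embeddings point in the same direction as the speaker's center receive mask values near 1 and are retained; bins dominated by other speakers receive values near 0 and are suppressed.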
Acknowledgements
This work is supported by the National Key R&D Program of China (No. 2017YFB1002803), the National Natural Science Foundation of China (Nos. 61761044, U1903214, U1736206), the Basic Research Project of the Science and Technology Plan of Shenzhen (JCYJ20170818143246278), and the Hubei Province Technological Innovation Major Project (Nos. 2017AAA123, 2019AAA049).
Cite this article
Ke, S., Wang, Z., Hu, R. et al. Single-channel Multi-speakers Speech Separation Based on Isolated Speech Segments. Neural Process Lett 55, 385–400 (2023). https://doi.org/10.1007/s11063-022-10887-6