
Single-channel Multi-speakers Speech Separation Based on Isolated Speech Segments

Abstract

In a real multi-speaker scenario, the signal collected by the microphone contains many time periods in which only one speaker is talking; these are called isolated speech segments. Motivated by this fact, this paper proposes a single-channel multi-speaker speech separation method based on the similarity between a speaker feature center and the mixture features in a deep embedding space. Specifically, the isolated speech segments extracted from the observed signal are converted to deep embedding vectors, from which a speaker feature center is created. The similarity between this center and the deep embedding features of the mixture is used as a mask for the corresponding speaker, which separates that speaker's speech from the mixture. A residual-based deep embedding network with stacked 2-D convolutional blocks, instead of bi-directional long short-term memory, is proposed for faster processing and better feature extraction. In addition, an isolated speech segment extraction method based on Chimera++ is proposed, since earlier experiments showed that the Chimera++ algorithm separates segments containing only one speaker well. Evaluation results on standard datasets show that the proposed method outperforms competing algorithms by up to 0.94 dB in Signal-to-Distortion Ratio.
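
To make the masking idea concrete, here is a minimal sketch of how an isolated segment could yield a speaker feature center and then a time-frequency mask. The array shapes, the dot-product-plus-sigmoid similarity, and the function names are illustrative assumptions, not the authors' implementation.

import numpy as np

def speaker_center(isolated_embeddings):
    # isolated_embeddings: (N, D) unit-norm deep embedding vectors for the
    # N time-frequency bins of one speaker's isolated segment.
    center = isolated_embeddings.mean(axis=0)
    return center / (np.linalg.norm(center) + 1e-8)  # re-normalise the mean

def similarity_mask(mixture_embeddings, center):
    # mixture_embeddings: (T*F, D) unit-norm embeddings of the mixture.
    # The dot product measures how similar each bin is to the speaker center;
    # a sigmoid squashes it to [0, 1] so it can act as a soft T-F mask.
    sim = mixture_embeddings @ center
    return 1.0 / (1.0 + np.exp(-sim))

# Assumed usage: reshape the mask to (T, F), multiply it with the mixture's
# STFT magnitude, and reconstruct with the mixture phase to obtain the
# corresponding speaker's estimated signal.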




Acknowledgements

This work is supported by the National Key R&D Program of China (No. 2017YFB1002803), the National Natural Science Foundation of China (Nos. 61761044, U1903214, U1736206), the Basic Research Project of the Science and Technology Plan of Shenzhen (JCYJ20170818143246278), and the Hubei Province Technological Innovation Major Project (Nos. 2017AAA123, 2019AAA049).

Author information

Correspondence to Zhongyuan Wang or Ruimin Hu.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Ke, S., Wang, Z., Hu, R. et al. Single-channel Multi-speakers Speech Separation Based on Isolated Speech Segments. Neural Process Lett 55, 385–400 (2023). https://doi.org/10.1007/s11063-022-10887-6

