
Single-channel Multi-speakers Speech Separation Based on Isolated Speech Segments

Abstract

In a real multi-speaker scenario, the signal collected by the microphone contains many time periods in which only one speaker is talking; these are called isolated speech segments. Motivated by this fact, this paper proposes a single-channel multi-speaker speech separation method based on the similarity between a speaker feature center and the mixture features in a deep embedding space. Specifically, the isolated speech segments extracted from the observed signal are converted to deep embedding vectors, from which a speaker feature center is created. The similarity between this center and the deep embedding features of the mixture is used as a mask for the corresponding speaker, which separates that speaker's speech from the mixture. A residual-based deep embedding network with stacked 2-D convolutional blocks, instead of bi-directional long short-term memory, is proposed for faster processing and better feature extraction. In addition, an isolated speech segment extraction method based on Chimera++ is proposed, since earlier experiments showed that the Chimera++ algorithm separates segments containing only one speaker well. Evaluation results on standard datasets show that the proposed method outperforms competing algorithms by up to 0.94 dB in Signal-to-Distortion Ratio.
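
To make the masking idea concrete, here is a minimal sketch of how an isolated segment could yield a speaker feature center and then a time-frequency mask. The array shapes, the dot-product-plus-sigmoid similarity, and the function names are illustrative assumptions, not the authors' implementation.

import numpy as np

def speaker_center(isolated_embeddings):
    # isolated_embeddings: (N, D) unit-norm deep embedding vectors for the
    # N time-frequency bins of one speaker's isolated segment.
    center = isolated_embeddings.mean(axis=0)
    return center / (np.linalg.norm(center) + 1e-8)  # re-normalise the mean

def similarity_mask(mixture_embeddings, center):
    # mixture_embeddings: (T*F, D) unit-norm embeddings of the mixture.
    # The dot product measures how similar each bin is to the speaker center;
    # a sigmoid squashes it to [0, 1] so it can act as a soft T-F mask.
    sim = mixture_embeddings @ center
    return 1.0 / (1.0 + np.exp(-sim))

# Assumed usage: reshape the mask to (T, F), multiply it with the mixture's
# STFT magnitude, and reconstruct with the mixture phase to obtain the
# corresponding speaker's estimated signal.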




Acknowledgements

This work is supported by the National Key R&D Program of China (No. 2017YFB1002803), the National Natural Science Foundation of China (Nos. 61761044, U1903214, U1736206), the Basic Research Project of the Science and Technology Plan of Shenzhen (JCYJ20170818143246278), and the Hubei Province Technological Innovation Major Project (Nos. 2017AAA123, 2019AAA049).

Author information

Correspondence to Zhongyuan Wang or Ruimin Hu.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Ke, S., Wang, Z., Hu, R. et al. Single-channel Multi-speakers Speech Separation Based on Isolated Speech Segments. Neural Process Lett 55, 385–400 (2023). https://doi.org/10.1007/s11063-022-10887-6

