The Multimodal Information based Speech Processing (MISP) 2022 Challenge: Audio-Visual Diarization and Recognition

Wang, Zhe; Wu, Shilong; Chen, Hang; He, Mao-Kui; Du, Jun; Lee, Chin-Hui; Chen, Jingdong; Watanabe, Shinji; Siniscalchi, Sabato; Scharenborg, Odette; Liu, Diyuan; Yin, Baocai; Pan, Jia; Gao, Jianqing; Liu, Cong

arXiv:2303.06326 (cs)

[Submitted on 11 Mar 2023]

Abstract: The Multimodal Information based Speech Processing (MISP) challenge aims to extend the application of signal processing technology in specific scenarios by promoting research into wake-up words, speaker diarization, speech recognition, and other technologies. The MISP2022 challenge has two tracks: 1) audio-visual speaker diarization (AVSD), which aims to solve "who spoke when" using both audio and visual data; 2) a novel audio-visual diarization and recognition (AVDR) task, which focuses on addressing "who spoke what when" using audio-visual speaker diarization results. Both tracks focus on the Chinese language and use far-field audio and video recorded in real home-TV scenarios: 2-6 people communicating with each other with TV noise in the background. This paper introduces the dataset, track settings, and baselines of the MISP2022 challenge. Our analyses of experiments and examples indicate the good performance of the AVDR baseline system, as well as the potential difficulties in this challenge due to, e.g., the far-field video quality, the presence of TV noise in the background, and the indistinguishable speakers.

Comments:	5 pages, 4 figures, to be published in ICASSP2023
Subjects:	Multimedia (cs.MM)
Cite as:	arXiv:2303.06326 [cs.MM]
	(or arXiv:2303.06326v1 [cs.MM] for this version)
	https://doi.org/10.48550/arXiv.2303.06326

Submission history

From: Zhe Wang
[v1] Sat, 11 Mar 2023 06:56:10 UTC (325 KB)
