Natural-Language-Driven Multimodal Representation Learning for Audio-Visual Scene-Aware Dialog System
Abstract
1. Introduction
- We introduce a novel audio-visual scene-aware dialog system with natural-language-driven multimodal representation learning, in which the system infers the information required for a response by sequentially encoding the keywords extracted from each modality into a transformer-based language model (a minimal input-serialization sketch follows this list);
- We also propose a response-driven temporal moment localization method, in which the system presents the user with the video segment it referred to when generating its response;
- In addition to generating responses of improved quality, the proposed model showed robust performance even when all three modalities, including audio, were used. On the system response reasoning task, our method achieved state-of-the-art performance.
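To make the first contribution concrete, the sketch below shows one way the modality-specific event keywords and the dialog history could be serialized into a single natural-language sequence for a GPT-2-style decoder. This is a minimal illustration assuming a Hugging Face GPT-2 backbone; the separator phrases, the `build_prompt` helper, and the example keywords are our own assumptions, not the authors' released code.

```python
# Minimal sketch: serializing modality keywords + dialog history into one text
# sequence for a transformer language model (illustrative, not the paper's code).
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

def build_prompt(audio_keywords, video_keywords, history, question):
    # Keywords from each modality are encoded as plain natural-language tokens,
    # so the language model consumes them the same way it consumes the dialog text.
    audio_part = "audio events: " + ", ".join(audio_keywords)
    video_part = "video events: " + ", ".join(video_keywords)
    dialog_part = " ".join(f"Q: {q} A: {a}" for q, a in history)
    return f"{audio_part} {video_part} {dialog_part} Q: {question} A:"

prompt = build_prompt(
    audio_keywords=["speech", "laughter"],
    video_keywords=["person", "holding a cup", "walking"],
    history=[("what is the man doing?", "he is walking into the kitchen.")],
    question="can you hear anyone talking?",
)
inputs = tokenizer(prompt, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=30, pad_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(output_ids[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```

Because the keywords are ordinary tokens, all cross-modal fusion happens in the text space of the language model rather than in separate modality-specific fusion layers.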
2. Related Works
2.1. Video-Grounded Text Generation
2.2. Audio-Visual Scene-Aware Dialog
3. Proposed Architecture
3.1. Event Keyword-Driven Multimodal Integration Using a Language Model
3.1.1. Audio Event Detector
3.1.2. Video Event Detector
3.2. Response Generation
3.3. Response-Driven Temporal Moment Localization for System-Generated Response Verification
3.3.1. Modality Detection
3.3.2. Modality-Specific Temporal Moment Localization Network
4. Experiment
4.1. Experimental Setup
4.1.1. Dataset
4.1.2. Implementation Details
4.2. Evaluation Metrics
4.3. Experimental Result
5. Discussion
5.1. The Performance of Modality-Specific Event Keyword Extraction
5.2. The Effects of the Number of Event Keywords
5.3. Ablation Study for Response Verification
6. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- Turk, M. Multimodal interaction: A review. Pattern Recognit. Lett. 2014, 36, 189–195. [Google Scholar] [CrossRef]
- Goyal, Y.; Khot, T.; Summers-Stay, D.; Batra, D.; Parikh, D. Making the V in VQA matter: Elevating the role of image understanding in visual question answering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 6904–6913. [Google Scholar]
- Tan, H.; Bansal, M. LXMERT: Learning Cross-Modality Encoder Representations from Transformers. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, 3–7 November 2019; pp. 5100–5111. [Google Scholar] [CrossRef]
- Li, X.; Yin, X.; Li, C.; Zhang, P.; Hu, X.; Zhang, L.; Wang, L.; Hu, H.; Dong, L.; Wei, F.; et al. Oscar: Object-semantics aligned pre-training for vision-language tasks. In Proceedings of the European Conference on Computer Vision (ECCV), Online, 23–28 August 2020; pp. 121–137. [Google Scholar]
- Mokady, R.; Hertz, A.; Bermano, A.H. ClipCap: CLIP Prefix for Image Captioning. arXiv 2021, arXiv:2111.09734. [Google Scholar]
- Zhang, P.; Li, X.; Hu, X.; Yang, J.; Zhang, L.; Wang, L.; Choi, Y.; Gao, J. Vinvl: Revisiting visual representations in vision-language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 5579–5588. [Google Scholar]
- Wang, J.; Yang, Z.; Hu, X.; Li, L.; Lin, K.; Gan, Z.; Liu, Z.; Liu, C.; Wang, L. Git: A generative image-to-text transformer for vision and language. arXiv 2022, arXiv:2205.14100. [Google Scholar]
- Hu, X.; Gan, Z.; Wang, J.; Yang, Z.; Liu, Z.; Lu, Y.; Wang, L. Scaling up vision-language pre-training for image captioning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 17980–17989. [Google Scholar]
- Aafaq, N.; Akhtar, N.; Liu, W.; Gilani, S.Z.; Mian, A. Spatio-temporal dynamics and semantic attribute enriched visual encoding for video captioning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 12487–12496. [Google Scholar]
- Li, L.; Lei, J.; Gan, Z.; Yu, L.; Chen, Y.C.; Pillai, R.; Cheng, Y.; Zhou, L.; Wang, X.E.; Wang, W.Y.; et al. Value: A multi-task benchmark for video-and-language understanding evaluation. arXiv 2021, arXiv:2106.04632. [Google Scholar]
- Liu, S.; Ren, Z.; Yuan, J. Sibnet: Sibling convolutional encoder for video captioning. In Proceedings of the 26th ACM International Conference on Multimedia, Seoul, Republic of Korea, 22–26 October 2018; pp. 1425–1434. [Google Scholar]
- Pan, B.; Cai, H.; Huang, D.A.; Lee, K.H.; Gaidon, A.; Adeli, E.; Niebles, J.C. Spatio-temporal graph for video captioning with knowledge distillation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 10870–10879. [Google Scholar]
- Pei, W.; Zhang, J.; Wang, X.; Ke, L.; Shen, X.; Tai, Y.W. Memory-attended recurrent network for video captioning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 8347–8356. [Google Scholar]
- Shi, B.; Ji, L.; Niu, Z.; Duan, N.; Zhou, M.; Chen, X. Learning semantic concepts and temporal alignment for narrated video procedural captioning. In Proceedings of the 28th ACM International Conference on Multimedia, Seattle, WA, USA, 12–16 October 2020; pp. 4355–4363. [Google Scholar]
- Zhang, Z.; Qi, Z.; Yuan, C.; Shan, Y.; Li, B.; Deng, Y.; Hu, W. Open-book video captioning with retrieve-copy-generate network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 9837–9846. [Google Scholar]
- Chen, S.; Yao, T.; Jiang, Y.G. Deep Learning for Video Captioning: A Review. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI-19, Macao, China, 10–16 August 2019; pp. 6283–6290. [Google Scholar] [CrossRef]
- Alamri, H.; Cartillier, V.; Lopes, R.G.; Das, A.; Wang, J.; Essa, I.; Batra, D.; Parikh, D.; Cherian, A.; Marks, T.K.; et al. Audio Visual Scene-Aware Dialog (AVSD) Challenge at DSTC7. arXiv 2018, arXiv:1806.00525. [Google Scholar]
- Hori, C.; Alamri, H.; Wang, J.; Wichern, G.; Hori, T.; Cherian, A.; Marks, T.K.; Cartillier, V.; Lopes, R.G.; Das, A.; et al. End-to-end Audio Visual Scene-aware Dialog Using Multimodal Attention-based Video Features. In Proceedings of the 2019 IEEE International Conference on Acoustics, Speech and Signal Processing, Brighton, UK, 12–17 May 2019; pp. 2352–2356. [Google Scholar] [CrossRef]
- Liao, L.; Ma, Y.; He, X.; Hong, R.; Chua, T.S. Knowledge-Aware Multimodal Dialogue Systems. In Proceedings of the 26th ACM International Conference on Multimedia, Seoul, Republic of Korea, 22–26 October 2018; pp. 801–809. [Google Scholar] [CrossRef]
- Nie, L.; Wang, W.; Hong, R.; Wang, M.; Tian, Q. Multimodal Dialog System: Generating Responses via Adaptive Decoders. In Proceedings of the 27th ACM International Conference on Multimedia, Nice, France, 21–25 October 2019; pp. 1098–1106. [Google Scholar] [CrossRef]
- Huang, Y.; Xue, H.; Liu, B.; Lu, Y. Unifying Multimodal Transformer for Bi-Directional Image and Text Generation. In Proceedings of the 29th ACM International Conference on Multimedia, Virtual, 20–24 October 2021; pp. 1138–1147. [Google Scholar] [CrossRef]
- Li, Z.; Li, Z.; Zhang, J.; Feng, Y.; Zhou, J. Bridging Text and Video: A Universal Multimodal Transformer for Audio-Visual Scene-Aware Dialog. IEEE/ACM Trans. Audio Speech Lang. Process. 2021, 29, 2476–2483. [Google Scholar] [CrossRef]
- Pasunuru, R.; Bansal, M. DSTC7-AVSD: Scene-Aware Video-Dialogue Systems with Dual Attention. In Proceedings of the DSTC7 at AAAI2019 Workshop, Honolulu, HI, USA, 27 January 2019. [Google Scholar]
- Schwartz, I.; Schwing, A.G.; Hazan, T. A Simple Baseline for Audio-Visual Scene-Aware Dialog. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 12548–12558. [Google Scholar]
- Das, P.; Xu, C.; Doell, R.F.; Corso, J.J. A thousand frames in just a few words: Lingual description of videos through latent topics and sparse object stitching. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Portland, OR, USA, 25–27 June 2013; pp. 2634–2641. [Google Scholar]
- Kojima, A.; Tamura, T.; Fukunaga, K. Natural language description of human activities from video images based on concept hierarchy of actions. Int. J. Comput. Vis. 2002, 50, 171–184. [Google Scholar] [CrossRef]
- Guadarrama, S.; Krishnamoorthy, N.; Malkarnenkar, G.; Venugopalan, S.; Mooney, R.; Darrell, T.; Saenko, K. Youtube2text: Recognizing and describing arbitrary activities using semantic hierarchies and zero-shot recognition. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Sydney, NSW, Australia, 1–8 December 2013; pp. 2712–2719. [Google Scholar]
- Krishnamoorthy, N.; Malkarnenkar, G.; Mooney, R.; Saenko, K.; Guadarrama, S. Generating natural-language video descriptions using text-mined knowledge. In Proceedings of the AAAI Conference on Artificial Intelligence, Bellevue, WA, USA, 14–18 July 2013; Volume 27, pp. 541–547. [Google Scholar]
- Rohrbach, M.; Qiu, W.; Titov, I.; Thater, S.; Pinkal, M.; Schiele, B. Translating video content to natural language descriptions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Portland, OR, USA, 25–27 June 2013; pp. 433–440. [Google Scholar]
- Gan, Z.; Gan, C.; He, X.; Pu, Y.; Tran, K.; Gao, J.; Carin, L.; Deng, L. Semantic Compositional Networks for Visual Captioning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 5630–5639. [Google Scholar]
- Yuan, J.; Tian, C.; Zhang, X.; Ding, Y.; Wei, W. Video Captioning with Semantic Guiding. In Proceedings of the 2018 IEEE Fourth International Conference on Multimedia Big Data (BigMM), Xi’an, China, 13–16 September 2018; pp. 1–5. [Google Scholar] [CrossRef]
- Perez-Martin, J.; Bustos, B.; Perez, J. Improving Video Captioning With Temporal Composition of a Visual-Syntactic Embedding. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Virtual, 5–9 January 2021; pp. 3039–3049. [Google Scholar]
- Chen, S.; Jiang, Y.G. Motion Guided Region Message Passing for Video Captioning. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Virtual, 11–17 October 2021; pp. 1543–1552. [Google Scholar]
- Lin, K.; Li, L.; Lin, C.C.; Ahmed, F.; Gan, Z.; Liu, Z.; Lu, Y.; Wang, L. SwinBERT: End-to-End Transformers with Sparse Attention for Video Captioning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 17949–17958. [Google Scholar]
- Seo, P.H.; Nagrani, A.; Arnab, A.; Schmid, C. End-to-End Generative Pretraining for Multimodal Video Captioning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 17959–17968. [Google Scholar]
- Arnab, A.; Dehghani, M.; Heigold, G.; Sun, C.; Lučić, M.; Schmid, C. ViViT: A Video Vision Transformer. In Proceedings of the International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 11–17 October 2021; pp. 6836–6846. [Google Scholar]
- Radford, A.; Wu, J.; Child, R.; Luan, D.; Amodei, D.; Sutskever, I. Language models are unsupervised multitask learners. OpenAI Blog 2019, 1, 9. [Google Scholar]
- Chu, Y.W.; Lin, K.Y.; Hsu, C.C.; Ku, L.W. Multi-step joint-modality attention network for scene-aware dialogue system. arXiv 2020, arXiv:2001.06206. [Google Scholar]
- Shah, A.P.; Hori, T.; Le Roux, J.; Hori, C. DSTC10-AVSD Submission System with Reasoning using Audio-Visual Transformers with Joint Student-Teacher Learning. In Proceedings of the DSTC10 Workshop at AAAI2022 Workshop, Virtual, 28 February–1 March 2022. [Google Scholar]
- Gong, Y.; Chung, Y.A.; Glass, J. AST: Audio Spectrogram Transformer. arXiv 2021, arXiv:2104.01778. [Google Scholar]
- Gemmeke, J.F.; Ellis, D.P.W.; Freedman, D.; Jansen, A.; Lawrence, W.; Moore, R.C.; Plakal, M.; Ritter, M. Audio Set: An ontology and human-labeled dataset for audio events. In Proceedings of the 2017 IEEE International Conference on Acoustics, Speech and Signal Processing, New Orleans, LA, USA, 5–9 March 2017; pp. 776–780. [Google Scholar] [CrossRef]
- Liu, Z.; Ning, J.; Cao, Y.; Wei, Y.; Zhang, Z.; Lin, S.; Hu, H. Video Swin Transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 21–24 June 2022; pp. 3202–3211. [Google Scholar]
- Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin Transformer: Hierarchical Vision Transformer Using Shifted Windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 11–17 October 2021; pp. 10012–10022. [Google Scholar]
- Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, MN, USA, 2–7 June 2019; pp. 4171–4186. [Google Scholar] [CrossRef]
- Kay, W.; Carreira, J.; Simonyan, K.; Zhang, B.; Hillier, C.; Vijayanarasimhan, S.; Viola, F.; Green, T.; Back, T.; Natsev, P.; et al. The kinetics human action video dataset. arXiv 2017, arXiv:1705.06950. [Google Scholar]
- Zhang, S.; Peng, H.; Fu, J.; Luo, J. Learning 2D Temporal Adjacent Networks for Moment Localization with Natural Language. In Proceedings of the 34th AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; AAAI Press: Washington, DC, USA, 2020; pp. 12870–12877. [Google Scholar]
- Papineni, K.; Roukos, S.; Ward, T.; Zhu, W.J. Bleu: A Method for Automatic Evaluation of Machine Translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, Philadelphia, PA, USA, 7–12 July 2002; pp. 311–318. [Google Scholar] [CrossRef]
- Lin, C.Y. ROUGE: A Package for Automatic Evaluation of Summaries. In Proceedings of the Workshop on Text Summarization Branches Out, Post-Conference Workshop of ACL 2004, Barcelona, Spain, 25 July 2004; pp. 74–81. [Google Scholar]
- Banerjee, S.; Lavie, A. METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, Ann Arbor, MI, USA, 25–30 June 2005; pp. 65–72. [Google Scholar]
- Vedantam, R.; Lawrence Zitnick, C.; Parikh, D. CIDEr: Consensus-based image description evaluation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 8–10 June 2015; pp. 4566–4575. [Google Scholar]
- Huang, X.; Tan, H.L.; Leong, M.C.; Sun, Y.; Li, L.; Jiang, R.; Kim, J.J. Investigation on Transformer-based Multi-modal Fusion for Audio-Visual Scene-Aware Dialog. In Proceedings of the DSTC10 Workshop at AAAI2022 Workshop, Virtual, 28 February–1 March 2022. [Google Scholar]
- Luo, H.; Ji, L.; Shi, B.; Huang, H.; Duan, N.; Li, T.; Li, J.; Bharti, T.; Zhou, M. UniVL: A Unified Video and Language Pre-Training Model for Multimodal Understanding and Generation. arXiv 2020, arXiv:2002.06353. [Google Scholar]
Audio-related keywords used for modality detection: audio, audible, noise, sound, hear anything, can you hear, do you hear, speak, talk, talking, conversation, say anything, saying, dialogue, bark, meow, crying, laughing, singing, cough, sneeze, knock, music, song
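A natural way to use this keyword list is a simple lexical match over the user's question. The sketch below is an assumed illustration of such a rule; the function name and matching strategy are ours, not necessarily the paper's exact procedure.

```python
# Illustrative sketch of keyword-based audio-modality detection (assumed logic):
# a question is routed to the audio branch if it contains any keyword listed above.
AUDIO_KEYWORDS = [
    "audio", "audible", "noise", "sound", "hear anything", "can you hear",
    "do you hear", "speak", "talk", "talking", "conversation", "say anything",
    "saying", "dialogue", "bark", "meow", "crying", "laughing", "singing",
    "cough", "sneeze", "knock", "music", "song",
]

def is_audio_question(question: str) -> bool:
    q = question.lower()
    return any(keyword in q for keyword in AUDIO_KEYWORDS)

print(is_audio_question("Can you hear any music in the background?"))  # True
print(is_audio_question("What color is the man's shirt?"))             # False
```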
| Models | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 | METEOR | ROUGE-L | CIDEr | Human Rating |
|---|---|---|---|---|---|---|---|---|
| Baseline | 0.5716 | 0.4223 | 0.3196 | 0.2469 | 0.1909 | 0.4386 | 0.5657 | 2.851 |
| Our model (T + V) | 0.6409 | 0.4897 | 0.3764 | 0.2946 | 0.2274 | 0.5022 | 0.7891 | - |
| Our model (T + V + A) | 0.6406 | 0.4885 | 0.3786 | 0.2984 | 0.2251 | 0.5016 | 0.8039 | - |
| Our model (T + V + A + S) | 0.6455 | 0.4889 | 0.3796 | 0.2986 | 0.2253 | 0.4991 | 0.7868 | 3.300 |
| MED-CAT [51] | 0.6730 | 0.5450 | 0.4480 | 0.3720 | 0.2430 | 0.5300 | 0.9120 | 3.569 |
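For reference, BLEU-n in the table above measures n-gram overlap between a generated response and the reference answer. The toy example below computes sentence-level BLEU with NLTK for a single prediction; the paper reports corpus-level scores over the whole test set, so this only illustrates the metric, not the reported numbers.

```python
# Toy illustration of BLEU-n scoring for a generated response against a
# reference answer (sentence-level BLEU via NLTK; the table reports corpus-level scores).
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = ["a", "man", "is", "walking", "into", "the", "kitchen"]
hypothesis = ["the", "man", "walks", "into", "the", "kitchen"]

smooth = SmoothingFunction().method1
for n in range(1, 5):
    weights = tuple(1.0 / n for _ in range(n))
    score = sentence_bleu([reference], hypothesis, weights=weights, smoothing_function=smooth)
    print(f"BLEU-{n}: {score:.3f}")
```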
| Models | IoU-1 | IoU-2 |
|---|---|---|
| Baseline | 0.3614 | 0.3798 |
| MED-CAT [51] | 0.4850 | 0.5100 |
| Proposed Model | 0.5157 | 0.5443 |
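The IoU scores above measure how well a predicted temporal segment overlaps the human-annotated segment. As a reference point, temporal IoU between two (start, end) intervals can be computed with the standard formulation sketched below; this is a minimal illustration, not the paper's evaluation script.

```python
# Minimal sketch of temporal IoU between a predicted segment and a reference
# segment, each given as (start_time, end_time) in seconds.
def temporal_iou(pred, ref):
    inter_start = max(pred[0], ref[0])
    inter_end = min(pred[1], ref[1])
    intersection = max(0.0, inter_end - inter_start)
    union = (pred[1] - pred[0]) + (ref[1] - ref[0]) - intersection
    return intersection / union if union > 0 else 0.0

print(temporal_iou((3.0, 9.0), (5.0, 12.0)))  # 4 / 9 ≈ 0.444
```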
| Top N | Precision@N (P@N) | Recall@N (R@N) | F1-Score (F1) |
|---|---|---|---|
| N = 5 | 0.333 | 0.219 | 0.264 |
| N = 6 | 0.367 | 0.291 | 0.324 |
| N = 7 | 0.348 | 0.322 | 0.334 |
| N = 8 | 0.358 | 0.381 | 0.370 |
| N = 9 | 0.363 | 0.439 | 0.398 |
| N = 10 | 0.367 | 0.492 | 0.420 |
| Top N | Precision@N (P@N) | Recall@N (R@N) | F1-Score (F1) |
|---|---|---|---|
| N = 1 | 0.30 | 0.120 | 0.171 |
| N = 2 | 0.28 | 0.223 | 0.248 |
| N = 3 | 0.253 | 0.313 | 0.280 |
| N = 4 | 0.22 | 0.353 | 0.271 |
| N = 5 | 0.208 | 0.409 | 0.276 |
| # of Keywords (K) | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 | METEOR | ROUGE-L | CIDEr |
|---|---|---|---|---|---|---|---|
| K = 3 | 0.601 | 0.451 | 0.347 | 0.282 | 0.225 | 0.499 | 0.607 |
| K = 5 | 0.624 | 0.475 | 0.366 | 0.286 | 0.225 | 0.502 | 0.7970 |
| K = 8 | 0.6455 | 0.4889 | 0.3796 | 0.2986 | 0.2253 | 0.503 | 0.7868 |
| K = 10 | 0.646 | 0.489 | 0.366 | 0.287 | 0.231 | 0.502 | 0.786 |
| # of Keywords (K) | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 | METEOR | ROUGE-L | CIDEr |
|---|---|---|---|---|---|---|---|
| K = 1 | 0.611 | 0.4781 | 0.3511 | 0.292 | 0.2254 | 0.5013 | 0.717 |
| K = 2 | 0.657 | 0.4875 | 0.3694 | 0.2911 | 0.2251 | 0.502 | 0.7810 |
| K = 4 | 0.6455 | 0.4889 | 0.3796 | 0.2986 | 0.2253 | 0.503 | 0.7868 |
| K = 5 | 0.611 | 0.4854 | 0.3610 | 0.2878 | 0.219 | 0.5021 | 0.694 |
| Models | IoU-1 | IoU-2 |
|---|---|---|
| Proposed Model | 0.5157 | 0.5443 |
| -S | 0.5061 | 0.5338 |
| -S -A | 0.5048 | 0.5329 |
| -Modality Detector | 0.5023 | 0.5304 |
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Heo, Y.; Kang, S.; Seo, J. Natural-Language-Driven Multimodal Representation Learning for Audio-Visual Scene-Aware Dialog System. Sensors 2023, 23, 7875. https://doi.org/10.3390/s23187875