Multimodal Sensing for Depression Risk Detection: Integrating Audio, Video, and Text Data
Abstract
1. Introduction
- (1) An experimental paradigm for data collection is designed, which stimulates emotions through reading and interview tasks. Using this paradigm, we collect data and preprocess the video, audio, and text recordings of each subject to establish a multimodal database.
- (2) We propose the AVTF-TBN model, which extracts audio, video, and text features with a three-branch architecture; a multimodal fusion (MMF) module then fuses the three modality features using attention mechanisms and residual connections (an illustrative sketch follows this list).
- (3) We apply the AVTF-TBN model to depression risk detection on data from the different tasks and questions in our dataset and compare how effectively the different tasks and questions stimulate emotions.
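Because the MMF design is only summarized in contribution (2), the following PyTorch sketch is a minimal illustration of attention-based fusion with a residual connection, not the paper's exact module; the class name `MMFSketch`, the shared feature dimension of 256, the head count, and the mean pooling over modalities are all assumptions.

```python
# Minimal sketch of attention-plus-residual fusion over three modality
# features (assumed dimensions and layer choices; not the paper's exact MMF).
import torch
import torch.nn as nn

class MMFSketch(nn.Module):
    def __init__(self, d: int = 256, n_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d)
        self.classifier = nn.Linear(d, 2)  # depression risk vs. healthy

    def forward(self, v, a, t):
        # v, a, t: (batch, d) features from the video, audio, and text branches
        x = torch.stack([v, a, t], dim=1)      # (batch, 3, d) modality tokens
        fused, _ = self.attn(x, x, x)          # attention across the three modalities
        x = self.norm(x + fused)               # residual connection, then normalization
        return self.classifier(x.mean(dim=1)) # pool modalities, output logits

logits = MMFSketch()(torch.randn(8, 256), torch.randn(8, 256), torch.randn(8, 256))
```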
2. Related Works
2.1. Depression Risk Detection Based on Unimodal Information
2.1.1. Depression Risk Detection Based on Video
2.1.2. Depression Risk Detection Based on Audio
2.1.3. Depression Risk Detection Based on Text
2.2. Depression Risk Detection Based on Multimodal Information
3. Datasets
3.1. Experimental Paradigm
3.2. Dataset Collection and Analysis
3.3. Dataset Preprocessing
3.4. Dataset Usage
4. Methodology
4.1. Overview of AVTF-TBN
4.2. Video Branch
4.3. Audio Branch
4.4. Text Branch
4.5. MMF Module
5. Experiment Setup and Result Discussion
5.1. Experimental Scheme
5.1.1. Implementation Details
5.1.2. Evaluation Metrics
5.2. Experimental Results
5.2.1. Detection Results Based on Different Tasks Data
5.2.2. Detection Results Based on Different Questions Data
6. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
1. World Health Organization. Depressive Disorder (Depression). Available online: https://www.who.int/zh/news-room/fact-sheets/detail/depression (accessed on 30 December 2023).
2. Institute of Health Metrics and Evaluation. Global Health Data Exchange (GHDx). Available online: https://vizhub.healthdata.org/gbd-results (accessed on 30 December 2023).
3. Perez, J.E.; Riggio, R.E. Nonverbal social skills and psychopathology. In Nonverbal Behavior in Clinical Settings; Oxford University Press: Oxford, UK, 2003; pp. 17–44.
4. Waxer, P. Nonverbal cues for depression. J. Abnorm. Psychol. 1974, 83, 319.
5. Cummins, N.; Scherer, S.; Krajewski, J.; Schnieder, S.; Epps, J.; Quatieri, T.F. A review of depression and suicide risk assessment using speech analysis. Speech Commun. 2015, 71, 10–49.
6. Segrin, C. Social skills deficits associated with depression. Clin. Psychol. Rev. 2000, 20, 379–403.
7. Zinken, J.; Zinken, K.; Wilson, J.C.; Butler, L.; Skinner, T. Analysis of syntax and word use to predict successful participation in guided self-help for anxiety and depression. Psychiatry Res. 2010, 179, 181–186.
8. Oxman, T.E.; Rosenberg, S.D.; Schnurr, P.P.; Tucker, G.J. Diagnostic classification through content analysis of patients’ speech. Am. J. Psychiatry 1988, 145, 464–468.
9. Yang, L.; Jiang, D.; Xia, X.; Pei, E.; Oveneke, M.C.; Sahli, H. Multimodal measurement of depression using deep learning models. In Proceedings of the 7th Annual Workshop on Audio/Visual Emotion Challenge, Mountain View, CA, USA, 23 October 2017; pp. 53–59.
10. Gratch, J.; Artstein, R.; Lucas, G.M.; Stratou, G.; Scherer, S.; Nazarian, A.; Wood, R.; Boberg, J.; DeVault, D.; Marsella, S.; et al. The distress analysis interview corpus of human and computer interviews. In Proceedings of the 2014 International Conference on Language Resources and Evaluation, Reykjavik, Iceland, 26–31 May 2014; pp. 3123–3128.
11. Valstar, M.; Schuller, B.; Smith, K.; Eyben, F.; Jiang, B.; Bilakhia, S.; Schnieder, S.; Cowie, R.; Pantic, M. AVEC 2013: The continuous audio/visual emotion and depression recognition challenge. In Proceedings of the 3rd ACM International Workshop on Audio/Visual Emotion Challenge, Barcelona, Spain, 21 October 2013; pp. 3–10.
12. Wang, Q.; Yang, H.; Yu, Y. Facial expression video analysis for depression detection in Chinese patients. J. Vis. Commun. Image Represent. 2018, 57, 228–233.
13. Mehrabian, A.; Russell, J.A. An Approach to Environmental Psychology; MIT Press: Cambridge, MA, USA, 1974.
14. Girard, J.M.; Cohn, J.F.; Mahoor, M.H.; Mavadati, S.M.; Hammal, Z.; Rosenwald, D.P. Nonverbal social withdrawal in depression: Evidence from manual and automatic analyses. Image Vis. Comput. 2014, 32, 641–647.
15. Alghowinem, S.; Goecke, R.; Wagner, M.; Parker, G.; Breakspear, M. Eye movement analysis for depression detection. In Proceedings of the 2013 IEEE International Conference on Image Processing, Melbourne, VIC, Australia, 15–18 September 2013; pp. 4220–4224.
16. Jan, A.; Meng, H.; Gaus, Y.F.A.; Zhang, F.; Turabzadeh, S. Automatic depression scale prediction using facial expression dynamics and regression. In Proceedings of the 4th International Workshop on Audio/Visual Emotion Challenge, Orlando, FL, USA, 7 November 2014; pp. 73–80.
17. Ojala, T.; Pietikainen, M.; Maenpaa, T. Multiresolution gray-scale and rotation invariant texture classification with local binary patterns. IEEE Trans. Pattern Anal. Mach. Intell. 2002, 24, 971–987.
18. Tadalagi, M.; Joshi, A.M. AutoDep: Automatic depression detection using facial expressions based on linear binary pattern descriptor. Med. Biol. Eng. Comput. 2021, 59, 1339–1354.
19. Zhao, G.; Pietikainen, M. Dynamic texture recognition using local binary patterns with an application to facial expressions. IEEE Trans. Pattern Anal. Mach. Intell. 2007, 29, 915–928.
20. He, L.; Jiang, D.; Sahli, H. Automatic depression analysis using dynamic facial appearance descriptor and Dirichlet process Fisher encoding. IEEE Trans. Multimed. 2018, 21, 1476–1486.
21. Yang, L.; Jiang, D.; Sahli, H. Integrating deep and shallow models for multi-modal depression analysis—Hybrid architectures. IEEE Trans. Affect. Comput. 2018, 12, 239–253.
22. Dibeklioğlu, H.; Hammal, Z.; Cohn, J.F. Dynamic multimodal measurement of depression severity using deep autoencoding. IEEE J. Biomed. Health Inform. 2017, 22, 525–536.
23. Zhu, Y.; Shang, Y.; Shao, Z.; Guo, G. Automated depression diagnosis based on deep networks to encode facial appearance and dynamics. IEEE Trans. Affect. Comput. 2017, 9, 578–584.
24. He, L.; Chan, J.C.-W.; Wang, Z. Automatic depression recognition using CNN with attention mechanism from videos. Neurocomputing 2021, 422, 165–175.
25. Song, S.; Shen, L.; Valstar, M. Human behaviour-based automatic depression analysis using hand-crafted statistics and deep learned spectral features. In Proceedings of the 2018 13th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2018), Xi’an, China, 15–19 May 2018; pp. 158–165.
26. Niu, M.; He, L.; Li, Y.; Liu, B. Depressioner: Facial dynamic representation for automatic depression level prediction. Expert Syst. Appl. 2022, 204, 117512.
27. Xu, J.; Song, S.; Kusumam, K.; Gunes, H.; Valstar, M. Two-stage temporal modelling framework for video-based depression recognition using graph representation. arXiv 2021, arXiv:2111.15266.
28. Cannizzaro, M.; Harel, B.; Reilly, N.; Chappell, P.; Snyder, P.J. Voice acoustical measurement of the severity of major depression. Brain Cogn. 2004, 56, 30–35.
29. Moore, E., II; Clements, M.A.; Peifer, J.W.; Weisser, L. Critical analysis of the impact of glottal features in the classification of clinical depression in speech. IEEE Trans. Biomed. Eng. 2007, 55, 96–107.
30. Chen, W.; Xing, X.; Xu, X.; Pang, J.; Du, L. SpeechFormer: A hierarchical efficient framework incorporating the characteristics of speech. arXiv 2022, arXiv:2203.03812.
31. Zhao, Y.; Liang, Z.; Du, J.; Zhang, L.; Liu, C.; Zhao, L. Multi-head attention-based long short-term memory for depression detection from speech. Front. Neurorobotics 2021, 15, 684037.
32. Zhao, Z.; Li, Q.; Cummins, N.; Liu, B.; Wang, H.; Tao, J.; Schuller, B. Hybrid network feature extraction for depression assessment from speech. In Proceedings of the INTERSPEECH 2020, Shanghai, China, 25–29 October 2020.
33. Sardari, S.; Nakisa, B.; Rastgoo, M.N.; Eklund, P. Audio based depression detection using Convolutional Autoencoder. Expert Syst. Appl. 2022, 189, 116076.
34. Hosseini-Saravani, S.H.; Besharati, S.; Calvo, H.; Gelbukh, A. Depression detection in social media using a psychoanalytical technique for feature extraction and a cognitive based classifier. In Mexican International Conference on Artificial Intelligence; Springer: Cham, Switzerland, 2020; pp. 282–292.
35. Rude, S.; Gortner, E.-M.; Pennebaker, J. Language use of depressed and depression-vulnerable college students. Cogn. Emot. 2004, 18, 1121–1133.
36. Chiong, R.; Budhi, G.S.; Dhakal, S.; Chiong, F. A textual-based featuring approach for depression detection using machine learning classifiers and social media texts. Comput. Biol. Med. 2021, 135, 104499.
37. Jang, B.; Kim, M.; Harerimana, G.; Kang, S.-U.; Kim, J.W. Bi-LSTM model to increase accuracy in text classification: Combining Word2vec CNN and attention mechanism. Appl. Sci. 2020, 10, 5841.
38. Ansari, L.; Ji, S.; Chen, Q.; Cambria, E. Ensemble hybrid learning methods for automated depression detection. IEEE Trans. Comput. Soc. Syst. 2022, 10, 211–219.
39. Jan, A.; Meng, H.; Gaus, Y.F.B.A.; Zhang, F. Artificial intelligent system for automatic depression level analysis through visual and vocal expressions. IEEE Trans. Cogn. Dev. Syst. 2017, 10, 668–680.
40. Niu, M.; Tao, J.; Liu, B.; Huang, J.; Lian, Z. Multimodal spatiotemporal representation for automatic depression level detection. IEEE Trans. Affect. Comput. 2020, 14, 294–307.
41. Dai, Z.; Li, Q.; Shang, Y.; Wang, X.A. Depression detection based on facial expression, audio and gait. In Proceedings of the 2023 IEEE 6th Information Technology, Networking, Electronic and Automation Control Conference (ITNEC), Chongqing, China, 24–26 February 2023; pp. 1568–1573.
42. Solieman, H.; Pustozerov, E.A. The detection of depression using multimodal models based on text and voice quality features. In Proceedings of the 2021 IEEE Conference of Russian Young Researchers in Electrical and Electronic Engineering (ElConRus), St. Petersburg, Russia, 26–29 January 2021; pp. 1843–1848.
43. Shen, Y.; Yang, H.; Lin, L. Automatic depression detection: An emotional audio-textual corpus and a GRU/BiLSTM-based model. In Proceedings of the ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore, 23–27 May 2022; pp. 6247–6251.
44. Arandjelovic, R.; Gronat, P.; Torii, A.; Pajdla, T.; Sivic, J. NetVLAD: CNN architecture for weakly supervised place recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26 June–1 July 2016; pp. 5297–5307.
45. Fang, M.; Peng, S.; Liang, Y.; Hung, C.-C.; Liu, S. A multimodal fusion model with multi-level attention mechanism for depression detection. Biomed. Signal Process. Control 2023, 82, 104561.
46. Zheng, W.; Yan, L.; Gou, C.; Wang, F.-Y. Graph attention model embedded with multi-modal knowledge for depression detection. In Proceedings of the 2020 IEEE International Conference on Multimedia and Expo (ICME), London, UK, 6–10 July 2020; pp. 1–6.
47. Sudhan, H.M.; Kumar, S.S. Multimodal depression severity detection using deep neural networks and depression assessment scale. In Proceedings of the International Conference on Computational Intelligence and Data Engineering: ICCIDE 2021, Vijayawada, India, 13–14 August 2021; Springer: Singapore, 2022; pp. 361–375.
48. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30, 6000–6010.
49. Zhang, S.; Zhao, Z.; Guan, C. Multimodal continuous emotion recognition: A technical report for ABAW5. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 5763–5768.
50. Sun, H.; Wang, H.; Liu, J.; Chen, Y.-W.; Lin, L. CubeMLP: An MLP-based model for multimodal sentiment analysis and depression estimation. In Proceedings of the 30th ACM International Conference on Multimedia, Lisboa, Portugal, 10–14 October 2022; pp. 3722–3729.
51. Rajan, V.; Brutti, A.; Cavallaro, A. Is cross-attention preferable to self-attention for multi-modal emotion recognition? In Proceedings of the ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore, 23–27 May 2022; pp. 4693–4697.
Emotion Elicitation Paradigm

| Stage | Presenter | Contents |
|---|---|---|
| Reading task | Interviewer | Please read “The Little Match Girl” (fragment). |
| Interview task | Interviewer | Are your physical and mental conditions healthy and good recently? |
| If the interviewee answered “No” | Interviewer | Can you talk about it in detail? |
| | Interviewer | When did this situation start? |
| | Interviewer | Is the situation worse, better, or completely better now than it was at the beginning? |
| If the interviewee answered “Yes” | Interviewer | Have you ever had any troubles or pain in the past? If so, can you talk about it? |
| | Interviewer | How much did it affect your life at that time? |
| | Interviewer | What were the influences on your work, life, learning, interests, and interpersonal communication at that time? |
| All interviewees | Interviewer | Can you evaluate yourself? |
| | Interviewer | What kind of person do you think you are in the eyes of the people around you? |
| | Interviewer | What do you think your ideal self is like? |
| PHQ-9 Score Range | Subject Category |
|---|---|
| 0–4 | Healthy |
| 5–9 | Mild depression risk |
| 10–14 | Moderate depression risk |
| 15–19 | Moderate to severe depression risk |
| 20–27 | Severe depression risk |
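For concreteness, the score-to-category mapping above can be written as a small helper. This is a direct transcription of the table; the function name `phq9_category` is our own choice, not from the paper.

```python
# Map a PHQ-9 total score (0-27) to the subject categories listed above.
def phq9_category(score: int) -> str:
    if not 0 <= score <= 27:
        raise ValueError("PHQ-9 total score must be in [0, 27]")
    if score <= 4:
        return "Healthy"
    if score <= 9:
        return "Mild depression risk"
    if score <= 14:
        return "Moderate depression risk"
    if score <= 19:
        return "Moderate to severe depression risk"
    return "Severe depression risk"

assert phq9_category(3) == "Healthy"
assert phq9_category(12) == "Moderate depression risk"
```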
| Information | Depression Risk Category | Health Category |
|---|---|---|
| Number of subjects (Male/Female) | 621 (240/381) | 1290 (527/763) |
| Age (years), mean (SD) | 34.6 (10.0) | 37.5 (10.7) |
| Height (cm), mean (SD) | 164.2 (8.1) | 164.0 (8.1) |
| Weight (kg), mean (SD) | 61.0 (12.4) | 61.4 (13.0) |
Question Number | Question Content | Number of Respondents
---|---|---|
Q1 | Are your physical and mental conditions healthy and good recently? | 240 |
Q1.1 | Can you talk about it in detail? | 47 |
Q1.2 | When did this situation start? | 51 |
Q1.3 | Is the situation worse, better, or completely better now than it was at the beginning? | 50 |
Q2 | Have you ever had any troubles or pain in the past? If so, can you talk about it? | 140 |
Q2.1 | How much did it affect your life at that time? | 81 |
Q2.2 | What were the influences on your work, life, learning, interests, and interpersonal communication at that time? | 46 |
Q3 | Can you evaluate yourself? | 205 |
Q4 | What kind of person do you think you are in the eyes of the people around you? | 172 |
Q5 | What do you think your ideal self is like? | 127 |
| Model | Data Sources | Modality | F1 Score | Precision | Recall |
|---|---|---|---|---|---|
| **Unimodal data** | | | | | |
| AVTF-TBN | Reading task | V | 0.65 | 0.62 | 0.69 |
| AVTF-TBN | Reading task | A | 0.62 | 0.63 | 0.61 |
| **Multimodal data** | | | | | |
| AVTF-TBN (Concatenate) | Reading task | V + A | 0.66 | 0.65 | 0.67 |
| AVTF-TBN (MMF) | Reading task | V + A | 0.68 | 0.61 | 0.78 |
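Since F1 is the harmonic mean of precision and recall, the rounded values in these tables can be cross-checked. The helper below is our own, not part of the paper's code.

```python
# F1 = 2PR / (P + R); e.g., AVTF-TBN (MMF) on the reading task.
def f1_score(precision: float, recall: float) -> float:
    return 2 * precision * recall / (precision + recall)

print(round(f1_score(0.61, 0.78), 2))  # 0.68, matching the table row
```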
| Model | Data Sources | Modality | F1 Score | Precision | Recall |
|---|---|---|---|---|---|
| **Unimodal data** | | | | | |
| AVTF-TBN | Interview task | V | 0.65 | 0.66 | 0.64 |
| AVTF-TBN | Interview task | A | 0.62 | 0.63 | 0.61 |
| AVTF-TBN | Interview task | T | 0.64 | 0.64 | 0.64 |
| **Multimodal data** | | | | | |
| AVTF-TBN (Concatenate) | Interview task | V + A | 0.67 | 0.57 | 0.81 |
| AVTF-TBN (Concatenate) | Interview task | V + T | 0.67 | 0.62 | 0.72 |
| AVTF-TBN (Concatenate) | Interview task | A + T | 0.66 | 0.68 | 0.64 |
| AVTF-TBN (Concatenate) | Interview task | V + A + T | 0.74 | 0.70 | 0.78 |
| AVTF-TBN (MMF) | Interview task | V + A + T | 0.76 | 0.74 | 0.78 |
| Model | Data Sources | MHA | Modality | F1 Score | Precision | Recall |
|---|---|---|---|---|---|---|
| AVTF-TBN | Interview task | Yes | V | 0.65 | 0.66 | 0.64 |
| AVTF-TBN | Interview task | Yes | A | 0.62 | 0.63 | 0.61 |
| AVTF-TBN | Interview task | Yes | T | 0.64 | 0.64 | 0.64 |
| AVTF-TBN | Interview task | No | V | 0.63 | 0.62 | 0.64 |
| AVTF-TBN | Interview task | No | A | 0.61 | 0.61 | 0.61 |
| AVTF-TBN | Interview task | No | T | 0.62 | 0.61 | 0.64 |
| Model | Data Sources | Modality | F1 Score | Precision | Recall |
|---|---|---|---|---|---|
| AVTF-TBN (MMF) | Reading task | V + A | 0.68 | 0.61 | 0.78 |
| AVTF-TBN (MMF) | Interview task | V + A + T | 0.76 | 0.74 | 0.78 |
| AVTF-TBN (MMF) | Reading task + Interview task | V + A + T | 0.78 | 0.76 | 0.81 |
| Model | Data Sources | Modality | F1 Score | Precision | Recall |
|---|---|---|---|---|---|
| CubeMLP [50] | Reading task | V + A | 0.56 | 0.53 | 0.59 |
| Self-Attention Model [51] | Reading task | V + A | 0.57 | 0.54 | 0.61 |
| AVTF-TBN (MMF) (Ours) | Reading task | V + A | 0.68 | 0.61 | 0.78 |
| CubeMLP [50] | Interview task | V + A + T | 0.67 | 0.70 | 0.64 |
| Self-Attention Model [51] | Interview task | V + A + T | 0.70 | 0.71 | 0.69 |
| AVTF-TBN (MMF) (Ours) | Interview task | V + A + T | 0.76 | 0.74 | 0.78 |
| CubeMLP [50] | Reading task + Interview task | V + A + T | 0.68 | 0.70 | 0.66 |
| Self-Attention Model [51] | Reading task + Interview task | V + A + T | 0.73 | 0.72 | 0.74 |
| AVTF-TBN (MMF) (Ours) | Reading task + Interview task | V + A + T | 0.78 | 0.76 | 0.81 |
| Model | Data Sources | Modality | F1 Score | Precision | Recall |
|---|---|---|---|---|---|
| AVTF-TBN (MMF) | Q1 | V + A + T | 0.65 | 0.53 | 0.86 |
| AVTF-TBN (MMF) | Q2 | V + A + T | 0.61 | 0.56 | 0.67 |
| AVTF-TBN (MMF) | Q3 | V + A + T | 0.67 | 0.54 | 0.88 |
| AVTF-TBN (MMF) | Q4 | V + A + T | 0.63 | 0.60 | 0.67 |
| AVTF-TBN (MMF) | Q5 | V + A + T | 0.65 | 0.57 | 0.77 |