Abstract
Argument visual states are helpful for detecting the structured components of events in videos, and existing methods typically use object detectors to generate argument candidates. However, directly using object features cropped by bounding boxes overlooks the relations among objects and the differences between detected objects and real arguments. In this work, we propose a novel framework that generates selective contextual representations of videos, thereby reducing interference from useless or incorrect object features. First, we construct grid-based object features as graphs according to their internal grid connections and apply a graph convolutional network to aggregate features. Second, we design a weighted geometric attention module that obtains contextual object representations by explicitly combining visual similarity and geometric correlation with different importance proportions. We then propose a dual relation-aware selection module for further feature selection. Finally, we use labels as a ladder to bridge the gap between object features and semantic roles while accounting for proximity in the semantic space. Experimental results and extensive ablation studies on the VidSitu benchmark show that our method achieves a deep understanding of events in videos and outperforms state-of-the-art models.
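To make the attention design concrete, the following is a minimal PyTorch sketch of a weighted geometric attention layer in the spirit of the module described above: it mixes dot-product visual similarity with a score derived from pairwise bounding-box geometry through a learnable importance proportion. All names here (WeightedGeometricAttention, the (cx, cy, w, h) box encoding, the single mixing scalar alpha) are illustrative assumptions, not the authors' implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class WeightedGeometricAttention(nn.Module):
    # Illustrative sketch only; the module and parameter names are
    # assumptions, not the authors' released code.
    def __init__(self, dim: int):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        self.geo = nn.Linear(4, 1)                    # scores pairwise box geometry
        self.alpha = nn.Parameter(torch.tensor(0.5))  # learnable importance proportion

    def forward(self, feats: torch.Tensor, boxes: torch.Tensor) -> torch.Tensor:
        # feats: (N, dim) object appearance features
        # boxes: (N, 4) normalized (cx, cy, w, h) bounding boxes
        d = feats.size(-1)
        vis = self.q(feats) @ self.k(feats).t() / d ** 0.5            # visual similarity (N, N)
        # pairwise geometry: center offsets and log width/height ratios
        delta = boxes[:, None, :2] - boxes[None, :, :2]               # (N, N, 2)
        ratio = (boxes[:, None, 2:].clamp(min=1e-6)
                 / boxes[None, :, 2:].clamp(min=1e-6)).log()          # (N, N, 2)
        geo = self.geo(torch.cat([delta, ratio], dim=-1)).squeeze(-1)  # (N, N)
        a = torch.sigmoid(self.alpha)                 # keep the mixing weight in (0, 1)
        attn = F.softmax(a * vis + (1.0 - a) * geo, dim=-1)
        return attn @ self.v(feats)                   # contextual object representations

A two-line smoke test such as layer = WeightedGeometricAttention(dim=512); ctx = layer(torch.randn(10, 512), torch.rand(10, 4)) exercises the layer; in the described framework the appearance features would instead come from the grid-based graph aggregation step.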
Availability of supporting data
Data will be available upon request.
Acknowledgements
This work was supported by the Major Program of the National Natural Science Foundation of China (No. 61991410), the Natural Science Foundation of Shanghai (No. 23ZR1422800), and the Program of the Pujiang National Laboratory (No. P22KN00391).
Author information
Contributions
Wei Liu: Writing - original draft, Writing - editing, Software, Data curation. Qing He: Writing - original draft, Writing - editing, Software, Data curation. Chao Wang: Writing - original draft, Writing - editing, Software, Data curation. Yan Peng: Writing - review & editing. Shaorong Xie: Writing - review & editing.
Ethics declarations
Conflict of interest
The authors declare no conflict of interest.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Liu, W., He, Q., Wang, C. et al. Selective arguments representation with dual relation-aware network for video situation recognition. Neural Comput & Applic 36, 9945–9961 (2024). https://doi.org/10.1007/s00521-024-09655-5