Abstract
Argument visual states are helpful for detecting the structured components of events in videos, and existing methods typically use object detectors to generate argument candidates. However, directly using object features cropped by bounding boxes overlooks the relations among objects and the differences between detected objects and real arguments. In this work, we propose a novel framework that generates selective contextual representations of videos, thereby reducing interference from useless or incorrect object features. First, we construct grid-based object features as graphs according to their internal grid connections and apply a graph convolutional network to aggregate features. Second, we design a weighted geometric attention module that obtains contextual object representations by explicitly combining visual similarity and geometric correlation with different importance proportions. We then propose a dual relation-aware selection module for further feature selection. Finally, we use labels as a ladder to bridge the gap between object features and semantic roles while accounting for proximity in the semantic space. Experimental results and extensive ablation studies on the VidSitu benchmark show that our method achieves a deep understanding of events in videos and outperforms state-of-the-art models.
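To make the attention design concrete, the following is a minimal PyTorch sketch of a weighted geometric attention layer in the spirit of the module described above: it mixes dot-product visual similarity with a score derived from pairwise bounding-box geometry through a learnable importance proportion. All names here (WeightedGeometricAttention, the (cx, cy, w, h) box encoding, the single mixing scalar alpha) are illustrative assumptions, not the authors' implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class WeightedGeometricAttention(nn.Module):
    # Illustrative sketch only; the module and parameter names are
    # assumptions, not the authors' released code.
    def __init__(self, dim: int):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        self.geo = nn.Linear(4, 1)                    # scores pairwise box geometry
        self.alpha = nn.Parameter(torch.tensor(0.5))  # learnable importance proportion

    def forward(self, feats: torch.Tensor, boxes: torch.Tensor) -> torch.Tensor:
        # feats: (N, dim) object appearance features
        # boxes: (N, 4) normalized (cx, cy, w, h) bounding boxes
        d = feats.size(-1)
        vis = self.q(feats) @ self.k(feats).t() / d ** 0.5            # visual similarity (N, N)
        # pairwise geometry: center offsets and log width/height ratios
        delta = boxes[:, None, :2] - boxes[None, :, :2]               # (N, N, 2)
        ratio = (boxes[:, None, 2:].clamp(min=1e-6)
                 / boxes[None, :, 2:].clamp(min=1e-6)).log()          # (N, N, 2)
        geo = self.geo(torch.cat([delta, ratio], dim=-1)).squeeze(-1)  # (N, N)
        a = torch.sigmoid(self.alpha)                 # keep the mixing weight in (0, 1)
        attn = F.softmax(a * vis + (1.0 - a) * geo, dim=-1)
        return attn @ self.v(feats)                   # contextual object representations

A two-line smoke test such as layer = WeightedGeometricAttention(dim=512); ctx = layer(torch.randn(10, 512), torch.rand(10, 4)) exercises the layer; in the described framework the appearance features would instead come from the grid-based graph aggregation step.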
Availability of supporting data
Data will be available upon request.
Acknowledgements
This work was supported by the Major Program of the National Natural Science Foundation of China (No. 61991410), the Natural Science Foundation of Shanghai (No. 23ZR1422800), and the Program of the Pujiang National Laboratory (No. P22KN00391).
Author information
Contributions
Wei Liu: Writing - original draft, Writing - editing, Software, Data curation. Qing He: Writing - original draft, Writing - editing, Software, Data curation. Chao Wang: Writing - original draft, Writing - editing, Software, Data curation. Yan Peng: Writing - review & editing. Shaorong Xie: Writing - review & editing.
Ethics declarations
Conflict of interest
The authors declare no conflict of interest.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Liu, W., He, Q., Wang, C. et al. Selective arguments representation with dual relation-aware network for video situation recognition. Neural Comput & Applic 36, 9945–9961 (2024). https://doi.org/10.1007/s00521-024-09655-5