Selective arguments representation with dual relation-aware network for video situation recognition

  • Review
  • Published in: Neural Computing and Applications

Abstract

Argument visual states are helpful for detecting the structured components of events in videos, and existing methods tend to use object detectors to generate argument candidates. However, directly leveraging object features captured by bounding boxes overlooks a deeper understanding of object relations and of the differences between detected objects and real arguments. In this work, we propose a novel framework that generates selective contextual representations of videos, thereby reducing interference from useless or incorrect object features. First, we construct grid-based object features as graphs based on internal grid connections and use a graph convolutional network to aggregate features. Second, a weighted geometric attention module is designed to obtain contextual representations of objects, explicitly combining visual similarity and geometric correlation with different importance proportions. Then, we propose a dual relation-aware selection module for further feature selection. Finally, we use labels as a ladder to bridge the gap between object features and semantic roles, while considering proximity in the semantic space. Experimental results and extensive ablation studies on the VidSitu dataset indicate that our method effectively obtains a deep understanding of events in videos and outperforms state-of-the-art models.
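To make the weighted geometric attention step described above more concrete, the sketch below fuses a scaled dot-product visual similarity with a learned pairwise geometric term under a fixed mixing proportion. It is a minimal PyTorch illustration under our own assumptions; the class name WeightedGeometricAttention, the mixing weight alpha, and the box-geometry encoding are illustrative choices, not the authors' released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class WeightedGeometricAttention(nn.Module):
    """Hypothetical sketch: mix appearance similarity with a geometric prior."""

    def __init__(self, dim: int, alpha: float = 0.7):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        # small MLP scoring pairwise box geometry (4 relative features -> 1 score)
        self.geo = nn.Sequential(nn.Linear(4, dim), nn.ReLU(), nn.Linear(dim, 1))
        self.alpha = alpha          # importance proportion between visual and geometric terms
        self.scale = dim ** -0.5

    def forward(self, feats: torch.Tensor, boxes: torch.Tensor) -> torch.Tensor:
        # feats: (N, dim) object/grid features; boxes: (N, 4) normalized (cx, cy, w, h)
        q, k, v = self.q(feats), self.k(feats), self.v(feats)
        visual = (q @ k.t()) * self.scale                        # (N, N) visual similarity
        cx, cy, w, h = boxes.unbind(-1)
        dx = (cx[:, None] - cx[None, :]) / (w[:, None] + 1e-6)   # relative centre offsets
        dy = (cy[:, None] - cy[None, :]) / (h[:, None] + 1e-6)
        dw = torch.log(w[:, None] / (w[None, :] + 1e-6) + 1e-6)  # relative log sizes
        dh = torch.log(h[:, None] / (h[None, :] + 1e-6) + 1e-6)
        geo = self.geo(torch.stack([dx, dy, dw, dh], dim=-1)).squeeze(-1)  # (N, N) geometric correlation
        attn = F.softmax(self.alpha * visual + (1 - self.alpha) * geo, dim=-1)
        return attn @ v                                          # contextual object representations


# Purely illustrative usage with random objects:
feats = torch.randn(6, 256)                   # 6 object features of dimension 256
boxes = torch.rand(6, 4)                      # normalized (cx, cy, w, h) boxes
ctx = WeightedGeometricAttention(256)(feats, boxes)
print(ctx.shape)                              # torch.Size([6, 256])
```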



Availability of supporting data

Data will be available upon request.


Acknowledgements

This work was supported by the Major Program of the National Natural Science Foundation of China (No. 61991410), the Natural Science Foundation of Shanghai (No. 23ZR1422800), and the Program of the Pujiang National Laboratory (No. P22KN00391).

Author information


Contributions

Wei Liu: Writing - Original Draft, Writing - Editing, Software, Data Curation. Qing He: Writing - Original Draft, Writing - Editing, Software, Data Curation. Chao Wang: Writing - Original Draft, Writing - Editing, Software, Data Curation. Yan Peng: Writing - Review & Editing. Shaorong Xie: Writing - Review & Editing.

Corresponding author

Correspondence to Chao Wang.

Ethics declarations

Conflict of interest

The authors declare no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Liu, W., He, Q., Wang, C. et al. Selective arguments representation with dual relation-aware network for video situation recognition. Neural Comput & Applic 36, 9945–9961 (2024). https://doi.org/10.1007/s00521-024-09655-5

