Abstract
Two-stream convolutional networks have shown strong performance in action recognition. However, in such networks the spatial and temporal features are learned separately, and both streams are processed with identical operations, with little consideration given to their different characteristics. In this paper, we build upon two-stream convolutional networks and propose a novel spatial–temporal injection network (STIN) with two different auxiliary losses. To build spatial–temporal features as the video representation, an apparent difference module is designed in the spatial injection stream to impose auxiliary temporal constraints on the spatial features. In the temporal injection stream, a self-attention mechanism attends to the regions of interest, reducing the influence of optical-flow noise in irrelevant regions. These auxiliary losses enable efficient training of two complementary streams that capture interactions between spatial and temporal information from different perspectives. Experiments conducted on two well-known datasets, UCF101 and HMDB51, demonstrate the effectiveness of the proposed STIN.
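The abstract only names the building blocks; the sketch below shows, under our own assumptions, one plausible PyTorch rendering of the two components it describes: an apparent-difference module that turns frame-to-frame feature differences into an auxiliary temporal constraint on the spatial stream, and a spatial self-attention block for the temporal stream. The module names, layer choices, and loss weighting here are hypothetical illustrations, not the authors' exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ApparentDifference(nn.Module):
    """Hypothetical apparent-difference module: frame-wise feature differences
    projected by a 1x1 convolution, used as an auxiliary temporal signal on
    spatial-stream features (a sketch, not the paper's exact formulation)."""

    def __init__(self, channels):
        super().__init__()
        self.proj = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, feats):
        # feats: (batch, time, channels, h, w) spatial-stream features
        diff = feats[:, 1:] - feats[:, :-1]            # frame-to-frame differences
        b, t, c, h, w = diff.shape
        out = self.proj(diff.reshape(b * t, c, h, w))  # project each difference map
        return out.reshape(b, t, c, h, w)


class SpatialSelfAttention(nn.Module):
    """Simple spatial self-attention for the temporal (optical-flow) stream,
    intended to down-weight noisy, uninteresting regions (again a sketch)."""

    def __init__(self, channels):
        super().__init__()
        self.query = nn.Conv2d(channels, channels // 8, kernel_size=1)
        self.key = nn.Conv2d(channels, channels // 8, kernel_size=1)
        self.value = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x):
        # x: (batch, channels, h, w) temporal-stream feature map
        b, c, h, w = x.shape
        q = self.query(x).flatten(2).transpose(1, 2)   # (b, hw, c//8)
        k = self.key(x).flatten(2)                     # (b, c//8, hw)
        attn = F.softmax(q @ k, dim=-1)                # (b, hw, hw) attention weights
        v = self.value(x).flatten(2)                   # (b, c, hw)
        out = (v @ attn.transpose(1, 2)).reshape(b, c, h, w)
        return x + out                                 # residual connection


# Illustrative loss combination (weights lambda1, lambda2 are placeholders,
# not values reported in the paper):
#   loss = ce_main + lambda1 * ce_aux_spatial + lambda2 * ce_aux_temporal
```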
Acknowledgements
This work is supported by grants from the Natural Science Foundation of Shandong Province (No. ZR2020MF136), the Key Research and Development Plan of Shandong Province (No. 2019GGX101015), and the Fundamental Research Funds for the Central Universities (No. 20CX05018A).
Cite this article
Cao, H., Wu, C., Lu, J. et al. Spatial–temporal injection network: exploiting auxiliary losses for action recognition with apparent difference and self-attention. SIViP 17, 1173–1180 (2023). https://doi.org/10.1007/s11760-022-02324-x