Link to original content: https://doi.org/10.1007/s11042-022-12972-3
Separable 3D residual attention network for human action recognition

  • 1178: Pattern Recognition for Adaptive User Interfaces
Multimedia Tools and Applications

Abstract

As an important research issue in computer vision, human action recognition is regarded as a crucial means of communication and interaction between humans and computers. To help computers automatically recognize human behaviors and accurately understand human intentions, this paper proposes a separable three-dimensional residual attention network (Sep-3D RAN), a lightweight network that extracts informative spatial-temporal representations for video-based human-computer interaction applications. Specifically, Sep-3D RAN is constructed by stacking multiple separable three-dimensional residual attention blocks. In each block, every standard three-dimensional convolution is approximated by a two-dimensional spatial convolution cascaded with a one-dimensional temporal convolution. A dual attention mechanism is then built by embedding a channel attention sub-module and a spatial attention sub-module sequentially in each residual block, thereby acquiring more discriminative features and improving the model's guidance capability. Furthermore, a multi-stage training strategy is used to train Sep-3D RAN, which effectively alleviates over-fitting. Finally, experimental results demonstrate that Sep-3D RAN can surpass existing state-of-the-art methods.
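To make the "lightweight" claim concrete, the following is a minimal sketch of the weight count for a standard 3D convolution versus a cascaded 2D spatial plus 1D temporal convolution. The channel widths (64), the 3×3×3 kernel, the intermediate channel count, and the function names are illustrative assumptions; the abstract does not state the configuration actually used in Sep-3D RAN, and biases are ignored.

```python
def conv3d_params(c_in, c_out, k_t, k_h, k_w):
    """Weights in a standard 3D convolution (bias ignored)."""
    return c_in * c_out * k_t * k_h * k_w


def separable_params(c_in, c_mid, c_out, k_t, k_h, k_w):
    """Weights in a 2D spatial conv (1 x k_h x k_w) followed by
    a 1D temporal conv (k_t x 1 x 1); c_mid is an assumed
    intermediate channel count, not a value from the paper."""
    spatial = c_in * c_mid * k_h * k_w   # per-frame spatial filtering
    temporal = c_mid * c_out * k_t       # filtering along the time axis
    return spatial + temporal


# Hypothetical layer: 64 -> 64 channels with a 3x3x3 kernel.
full = conv3d_params(64, 64, 3, 3, 3)
factored = separable_params(64, 64, 64, 3, 3, 3)
print(full, factored, round(factored / full, 3))  # 110592 49152 0.444
```

Under these assumed sizes, the factorized pair needs fewer than half the weights of the full 3D kernel, which is the kind of saving the separable design targets.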

Acknowledgements

The authors are grateful to the anonymous reviewers and the editor for their valuable comments and suggestions. This work is supported by the Natural Science Foundation of China (Grant Nos. 61702066 and 61903056), the Major Project of Science and Technology Research Program of Chongqing Education Commission of China (Grant No. KJZDM201900601), the Chongqing Research Program of Basic Research and Frontier Technology (Grant Nos. cstc2021jcyj-msxmX0761 and cstc2018jcyjAX0154), the Chongqing Municipal Key Laboratory of Institutions of Higher Education (Grant No. cqupt-mct-201901), the Chongqing Key Laboratory of Mobile Communications Technology (Grant No. cqupt-mct-202002), and the Engineering Research Center of Mobile Communications, Ministry of Education (Grant No. cqupt-mct202006).

Author information

Corresponding author

Correspondence to Chenquan Gan.

Ethics declarations

Conflict of Interests

The authors declare that they have no conflict of interest.

Additional information

Ethical Approval

This article does not contain any studies with human participants or animals performed by any of the authors.

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Zhang, Z., Peng, Y., Gan, C. et al. Separable 3D residual attention network for human action recognition. Multimed Tools Appl 82, 5435–5453 (2023). https://doi.org/10.1007/s11042-022-12972-3

  • DOI: https://doi.org/10.1007/s11042-022-12972-3
