Abstract
A video file is a sequence of images, and this image sequence carries both spatial and temporal information. Optical flow and motion history images are two well-known representations for recognizing human activities. Optical flow describes the velocity of every individual pixel in the image, but this motion information alone cannot capture a complete action or distinguish different movement speeds. In a motion history image, local body parts that move for similar durations produce almost identical intensities, so similar actions are not identified with good precision. In this paper, a deep convolutional neural model for human activity recognition in video is proposed, in which multiple CNN streams are combined so that the model exploits both spatial and temporal information. Two schemes for fusing the spatial and temporal streams, average fusion and convolution fusion, are discussed. The proposed method outperforms other human activity recognition approaches on the benchmark datasets UCF101 and HMDB51: on UCF101, average fusion achieves 95.4% test accuracy and convolution fusion 97.2%; on HMDB51, average fusion achieves 84.3% and convolution fusion 85.1%, respectively.
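The two fusion schemes named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation; the function names, array shapes, and example scores below are illustrative assumptions: average fusion is taken as a per-class mean of the two streams' softmax scores, and convolution fusion as stacking the two streams' feature maps along the channel axis and mixing them with a 1x1 convolution (here written as a per-pixel matrix multiply).

```python
import numpy as np

def average_fusion(spatial_scores, temporal_scores):
    """Late fusion: average the per-class scores of the two streams."""
    return (np.asarray(spatial_scores) + np.asarray(temporal_scores)) / 2.0

def conv_fusion(spatial_feat, temporal_feat, weights):
    """Stack the streams' feature maps along the channel axis (C, H, W) ->
    (2C, H, W), then mix channels with a 1x1 convolution, i.e. a matrix
    multiply applied independently at every spatial location."""
    stacked = np.concatenate([spatial_feat, temporal_feat], axis=0)  # (2C, H, W)
    # weights has shape (out_channels, 2C); einsum applies it per pixel.
    return np.einsum('oc,chw->ohw', weights, stacked)                # (O, H, W)

# Hypothetical per-class softmax scores from each stream for one clip.
spatial = [0.7, 0.2, 0.1]   # spatial (RGB appearance) stream
temporal = [0.5, 0.4, 0.1]  # temporal (optical-flow) stream
fused = average_fusion(spatial, temporal)
predicted_class = int(np.argmax(fused))  # class with highest fused score
```

In average fusion no parameters are learned at the fusion point, whereas the 1x1 weights of convolution fusion are trained, letting the network learn how to weight and combine channels from the two streams, which is consistent with convolution fusion scoring higher in the reported results.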
Cite this article
Varshney, N., Bakariya, B. Deep convolutional neural model for human activities recognition in a sequence of video by combining multiple CNN streams. Multimed Tools Appl 81, 42117–42129 (2022). https://doi.org/10.1007/s11042-021-11220-4