Short-video action recognition based on deep learning has made important progress; most proposed methods build on 3D convolutional neural networks (3D CNNs) or the Two-Stream architecture. However, 3D CNNs carry a large number of parameters, and Two-Stream networks do not learn features well enough. This work aims to build a network that learns better features while reducing the parameter count. A Hierarchy Spatial-Temporal Transformer model is proposed, based on the Two-Stream architecture and hierarchical inference. The model comprises three modules: a Hierarchy Residual Reformer, a Spatial Attention Module, and a Temporal-Spatial Attention Module. Each frame is first transformed into a spatial visual feature map. Spatial attention is then applied to this map to generate an attention spatial feature map. Finally, the attention spatial feature map is fused with temporal feature vectors to produce the final representation used for classification. Experiments on the HMDB51 and UCF101 datasets show that the proposed model achieves better accuracy than state-of-the-art baseline models.
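The abstract's pipeline (spatial feature map → spatial attention → fusion with temporal features) can be illustrated with a minimal NumPy sketch. All shapes, the scoring scheme, and the pooling/concatenation fusion here are illustrative assumptions, not the paper's actual module internals:

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def spatial_attention(feature_map, w_query):
    # feature_map: (H*W, C) spatial visual features of one frame
    # score each spatial position, normalize, and reweight the map
    scores = feature_map @ w_query              # (H*W,)
    weights = softmax(scores, axis=0)           # attention over positions
    return feature_map * weights[:, None]       # attention spatial feature map

def fuse(attn_map, temporal_vec):
    # pool the attended map and concatenate with a temporal feature vector
    pooled = attn_map.mean(axis=0)              # (C,)
    return np.concatenate([pooled, temporal_vec])

rng = np.random.default_rng(0)
frame_feats = rng.standard_normal((49, 64))     # assumed 7x7 grid, 64 channels
w = rng.standard_normal(64)                     # assumed learned query vector
t_vec = rng.standard_normal(32)                 # assumed temporal feature vector
rep = fuse(spatial_attention(frame_feats, w), t_vec)
print(rep.shape)  # (96,)
```

The resulting 96-dimensional vector stands in for the final representation that would be passed to a classifier; in the actual model each module is a learned transformer block rather than a fixed projection.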