2.1. Methods for Human Action Recognition
In the realm of deep learning, the pursuit of human action recognition began with the deployment of 2D CNNs [21,22], which are adept at processing spatial features within images [2,3]. Karpathy et al. [2] provided an extensive empirical evaluation of CNNs for large-scale video classification, using a new dataset of one million YouTube videos to demonstrate that CNNs can accurately classify a vast array of video content. As research progressed, 3D CNNs were introduced to capture the temporal dynamics within video data, enhancing the comprehension of action sequences over time [1,23,24]. Tran et al. [23] proposed an effective approach to spatio-temporal feature learning using deep 3D convolutional networks (C3D). Carreira and Zisserman [1] advanced the state of the art by re-evaluating existing architectures: they inflated 2D kernels into 3D and initialized the model with pre-trained 2D CNN weights to reach better recognition performance. However, the reliance on large annotated datasets posed a challenge for these methods.
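To make the inflation strategy mentioned above concrete, the following minimal PyTorch sketch (our own illustrative names and shapes, not code from [1]) copies a pre-trained 2D convolution kernel along a new temporal axis and rescales it so that the inflated 3D filter initially reproduces the 2D response on a temporally constant clip.

```python
import torch
import torch.nn as nn

def inflate_conv2d(conv2d: nn.Conv2d, time_dim: int = 3) -> nn.Conv3d:
    """Inflate a pre-trained 2D convolution into a 3D convolution (illustrative sketch)."""
    conv3d = nn.Conv3d(
        conv2d.in_channels, conv2d.out_channels,
        kernel_size=(time_dim, *conv2d.kernel_size),
        stride=(1, *conv2d.stride),
        padding=(time_dim // 2, *conv2d.padding),
        bias=conv2d.bias is not None,
    )
    # Repeat the 2D kernel along the temporal axis and rescale by 1/time_dim so that
    # the response on a temporally constant clip matches the original 2D filter.
    weight_2d = conv2d.weight.data                               # (out, in, kH, kW)
    weight_3d = weight_2d.unsqueeze(2).repeat(1, 1, time_dim, 1, 1) / time_dim
    conv3d.weight.data.copy_(weight_3d)
    if conv2d.bias is not None:
        conv3d.bias.data.copy_(conv2d.bias.data)
    return conv3d

# Example: inflate a single layer and apply it to a clip of 8 frames.
conv2d = nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3)
conv3d = inflate_conv2d(conv2d)
clip = torch.randn(1, 3, 8, 224, 224)                            # (batch, channels, T, H, W)
features = conv3d(clip)                                          # (1, 64, 8, 112, 112)
```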
Graph Convolutional Networks (GCNs) [5,25,26,27] later emerged to more adeptly handle non-Euclidean data, such as skeletal information, capturing the intricate interplay of human motions and joint relationships. Yan et al. [5] introduced a novel model called Spatial-Temporal Graph Convolutional Networks (ST-GCN), which addressed skeleton-based human action recognition by capturing both spatial and temporal patterns in dynamic skeletons with graph convolutions. As a follow-up, Shi and Zhang [26] enhanced ST-GCN with a two-stream adaptive graph convolutional network (2s-AGCN) that adapts the graph topology dynamically for more effective skeleton-based action recognition. Liu et al. [27] improved upon ST-GCN by disentangling and unifying graph convolutions, allowing a more efficient and comprehensive treatment of the action recognition task. To address the long-term dependencies present in video and sequence data, LSTMs [28,29,30] were widely adopted to construct temporal relationships across different spatial alignments. More recently, the Transformer model [20], celebrated for its success in natural language processing, made its foray into the visual domain. Its self-attention mechanism [31] is particularly effective at capturing global dependencies, a vital attribute for complex action recognition tasks. Li et al. [6] proposed a group activity recognition network that captures spatial-temporal contextual information through a clustered Transformer approach, enhancing the understanding of group dynamics. The Spatial-Temporal Transformer network (ST-TR) [32] models dependencies between joints with the Transformer architecture, offering a significant improvement in capturing complex human movements.
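To make the self-attention mechanism referred to above concrete, the short sketch below implements generic scaled dot-product attention over a sequence of joint or frame tokens; it is an illustration of the principle, not the exact formulation used in [6] or [32].

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v):
    """Scaled dot-product attention: every token attends to every other token,
    which is what lets Transformers capture global spatial-temporal dependencies."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5    # (batch, tokens, tokens)
    weights = F.softmax(scores, dim=-1)              # attention distribution over all tokens
    return weights @ v                               # (batch, tokens, d_k)

# Example: 25 skeleton joints, each embedded into a 64-dimensional token.
tokens = torch.randn(2, 25, 64)
out = scaled_dot_product_attention(tokens, tokens, tokens)   # self-attention
```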
This evolutionary trajectory reflects a continuous ambition within the deep learning community to craft more precise and interpretable action recognition systems. Each advancement has supplemented and optimized the capabilities of prior techniques, striving to enhance the representational power and recognition accuracy of models. From the localized feature extraction of 2D CNNs to the spatio-temporal comprehension afforded by 3D CNNs, the structural data processing of GCNs, the long-term sequence handling of LSTMs, and the global contextual relationships captured by Transformers, each step has further strengthened action recognition capabilities. These developments not only signify technological progress but also point to future research directions, such as multimodal learning and adversarial training, which align with the central tenets of our work. In our study, we adopt a video encoder based on 3D CNNs, in line with the PoseConv3D model [4], for fair comparison. Our model, named SkeletonCLIP++, goes beyond that approach by implementing a weighted frame integration technique: it departs from global average pooling to highlight the significance of key frames in action sequences and leverages semantic information through an adversarial learning approach, in keeping with the current research trajectory.
2.2. Advancements in Natural Language Processing
The field of natural language understanding has witnessed significant advances with the advent of the Transformer model, particularly due to its self-attention mechanism. Unlike traditional recursive [33,34] or convolutional structures [35], self-attention enables the model to consider all other words in a sentence simultaneously when processing each word, capturing intricate inter-word relationships with ease. This mechanism directly addresses long-range dependencies, overcoming the limitations previously encountered by sequential models. The innovation of BERT (Bidirectional Encoder Representations from Transformers) [20] lies in its bidirectional context learning. By employing two pre-training tasks, Masked Language Modeling (MLM) and Next Sentence Prediction (NSP), BERT learns comprehensive word representations. These tasks are designed to imbue the model with an understanding of language structure, preparing it to tackle complex linguistic constructs more effectively.
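As a small illustration of the MLM objective only (NSP is, analogously, a binary classification over sentence pairs), the sketch below asks a pre-trained BERT to fill in a masked token. It assumes the Hugging Face transformers package and the public bert-base-uncased checkpoint are available; the example sentence is our own.

```python
# Minimal illustration of BERT's Masked Language Modeling objective
# (assumes `pip install transformers` and access to the bert-base-uncased checkpoint).
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")
predictions = fill_mask("A person is [MASK] a basketball.")
for p in predictions[:3]:
    # Each prediction carries the proposed token and its probability score.
    print(f"{p['token_str']:>12s}  score={p['score']:.3f}")
```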
BERT’s effectiveness has been proven across various NLP tasks, showcasing its versatility and superior performance in semantic understanding [36], sentiment analysis [20], and question-answering systems [37]. Zhang et al. [36] enhanced BERT’s language understanding capabilities by integrating semantic information, evaluated across multiple NLP tasks including natural language inference and text classification, thereby improving the model’s ability to grasp deeper semantic relationships. Koroteev [37] highlighted BERT’s application in question-answering systems, showing that it improves the accuracy and relevance of responses by deeply understanding the context and nuances of natural language queries. BERT’s ability to extract profound semantic features from text has prompted researchers to explore its application in the visual domain, particularly for the semantic understanding of videos and images, underscoring its broad application potential.
Owing to its extensive pre-training on large datasets, BERT has become an invaluable tool for reducing the training load and preventing overfitting, especially when textual data are scarce in downstream applications. Therefore, in this study, we opt for a pre-trained BERT model as our text encoder, leveraging its pre-training to enhance our network’s performance. This section lays the groundwork for the next, which delves into the specifics of applying BERT’s methodologies within the visual domain, marking a cross-disciplinary effort to unify the understanding of textual and visual modalities under a cohesive learning framework.
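A minimal sketch of how a pre-trained BERT can serve as a frozen text encoder is given below. This is our own illustrative code using the Hugging Face transformers API, not the exact SkeletonCLIP++ implementation; the [CLS] hidden state is taken as the sentence-level feature for an action-label description.

```python
import torch
from transformers import BertTokenizer, BertModel

# Illustrative only: encode action-label descriptions into fixed-size text features.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
text_encoder = BertModel.from_pretrained("bert-base-uncased")
text_encoder.eval()                                    # use BERT as a frozen feature extractor

labels = ["a person drinks water", "a person throws a ball"]
inputs = tokenizer(labels, padding=True, return_tensors="pt")
with torch.no_grad():
    outputs = text_encoder(**inputs)
text_features = outputs.last_hidden_state[:, 0]        # [CLS] token feature, shape (2, 768)
```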
2.3. Applications of the CLIP Model
Following the significant achievements of BERT in natural language processing, researchers have extended its methodologies to the visual domain, particularly in the semantic understanding of images and videos. Among the most notable outcomes of interdisciplinary endeavors is the CLIP model. CLIP’s innovation lies in its ability to learn visual concepts from natural language supervision, effectively bridging the gap between vision and language understanding. The model operates on the principle of contrasting text–image pairs, learning to associate images with captions in a manner that generalizes to a wide array of visual tasks. This has led to exceptional performance in various image-related applications, such as zero-shot classification, object detection, and even complex scene understanding.
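The contrastive principle behind CLIP can be summarized with the hedged PyTorch sketch below (our own variable names and an illustrative temperature value, in the spirit of the pseudocode in the original CLIP paper rather than its exact implementation): matching image-text pairs in a batch are pulled together, and all other pairings are pushed apart, via a symmetric cross-entropy loss.

```python
import torch
import torch.nn.functional as F

def clip_style_loss(image_features, text_features, temperature=0.07):
    """Symmetric contrastive loss over a batch of matching image-text pairs."""
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)
    logits = image_features @ text_features.t() / temperature   # (N, N) similarity matrix
    targets = torch.arange(logits.size(0))                      # i-th image matches i-th text
    loss_i2t = F.cross_entropy(logits, targets)                 # image-to-text direction
    loss_t2i = F.cross_entropy(logits.t(), targets)             # text-to-image direction
    return (loss_i2t + loss_t2i) / 2

# Example with random features standing in for encoder outputs.
loss = clip_style_loss(torch.randn(8, 512), torch.randn(8, 512))
```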
In the realm of video understanding, CLIP’s principles have been adapted to leverage the temporal information inherent in videos. Researchers treat videos as sequences of images, extending the use of Vision Transformers (ViT) [38] as video encoders. Some methodologies [14,16,17,18] employ the output of the ‘cls_token’ as the feature representation for each image frame, while others [15,19] average all patch output features. To transition from frame features to video features, several models [14,15,18,19] average frame features directly; others [16,17] fuse temporal information using LSTM or Transformer structures before averaging. Innovations such as the introduction of a message token for temporal feature extraction and fusion layers within ViT models have also been explored to enhance temporal understanding [17]. Prompting schemes have likewise been used to fine-tune models for specific video understanding tasks, leveraging the pre-trained nature of these models to improve performance with minimal additional training. Vita-CLIP [18] introduced a novel multi-modal prompt learning scheme that effectively balances supervised and zero-shot performance in action recognition, leveraging the strengths of CLIP in a unified framework for both video and text modalities.
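To illustrate the two families of frame-to-video aggregation mentioned above, the schematic sketch below (our own tensor names and arbitrary hyperparameters, not the code of any cited method) contrasts simple mean pooling of per-frame features with temporal fusion by a small Transformer encoder before pooling.

```python
import torch
import torch.nn as nn

# (batch, frames, feature_dim), e.g. one ViT cls_token feature per frame.
frame_features = torch.randn(4, 16, 512)

# Strategy 1: average frame features directly (the simple pooling used by several methods above).
video_feat_mean = frame_features.mean(dim=1)                     # (4, 512)

# Strategy 2: fuse temporal information with a Transformer encoder before averaging
# (schematic of the LSTM/Transformer fusion idea; layer sizes are illustrative).
temporal_encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True),
    num_layers=1,
)
video_feat_fused = temporal_encoder(frame_features).mean(dim=1)  # (4, 512)
```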
However, a common thread among these studies is the use of ViT as the video encoder, owing to the rich image data available for pre-training. Yet, for the domain-specific task of action recognition from skeleton data, ViT cannot capitalize on this pre-training advantage due to the distinct nature of the data. In our work, we select a 3D CNN as the video encoder, similar to our previous work SkeletonCLIP [19], which has shown robust performance on skeleton-based action recognition tasks. Unlike most current methodologies, our model does not employ global average pooling to compute the video feature from frame features. Instead, we opt for a weighted integration approach, where each frame’s weight is correlated with the textual features, yielding a semantically rich video feature conducive to accurate recognition. Additionally, our work introduces the task of contrastive sample identification, which challenges a binary classifier to discern closely resembling samples. This novel task aims to enhance the model’s discriminative capacity, ultimately improving overall recognition performance.
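A hedged sketch of the two ideas just described is given below, written in our own simplified notation rather than as the exact SkeletonCLIP++ implementation: frame features are weighted by their similarity to the text feature instead of being averaged uniformly, and a small binary head scores whether a video-text pair truly corresponds.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def weighted_frame_integration(frame_features, text_feature):
    """Weight each frame by its similarity to the text feature (simplified sketch)."""
    # frame_features: (T, D), text_feature: (D,)
    sims = F.normalize(frame_features, dim=-1) @ F.normalize(text_feature, dim=-1)  # (T,)
    weights = F.softmax(sims, dim=0)                  # key frames receive larger weights
    return weights @ frame_features                   # (D,) semantically weighted video feature

class MatchClassifier(nn.Module):
    """Simplified binary head for contrastive sample identification:
    decide whether a (video, text) pair is a true match or a hard negative."""
    def __init__(self, dim=512):
        super().__init__()
        self.head = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, 1))

    def forward(self, video_feature, text_feature):
        return self.head(torch.cat([video_feature, text_feature], dim=-1))  # matching logit

video_feat = weighted_frame_integration(torch.randn(16, 512), torch.randn(512))
logit = MatchClassifier()(video_feat, torch.randn(512))
```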
In summarizing the related work within this section, we have traced the evolution of techniques from vision-based models to language models and their eventual convergence in our current research. Our work with the SkeletonCLIP++ model seeks to assimilate these developments, offering improvements over existing methods in skeleton-based human action recognition.
Our approach utilizes the proven efficacy of 3D CNNs for video encoding, informed by prior research, and enhances feature computation by integrating semantic context—a method not yet widely adopted in current frameworks. We introduce a discriminative task to refine the model’s ability to distinguish between closely resembling actions. While we build upon the established foundation set by BERT and CLIP, our model is an extension rather than a reinvention, aiming to optimize the synergy between textual and visual modalities in action recognition. The incorporation of adversarial learning tasks reflects a step towards improving the nuanced understanding of complex actions.
In the next section, we will introduce the specifics of our model, particularly focusing on the implementation details of the Weighted Frame Integration (WFI) module and the Contrastive Sample Identification (CSI) task. Additionally, we will present a thorough description of the training process.