1. Introduction
Micro-expressions (MEs) are involuntary facial movements characterized by short duration, low intensity, and occurrence in sparse facial action units [1,2]. It is generally believed that the duration of an ME is between 1/25 s and 1/2 s [3]. ME recognition is a challenging task; even the recognition accuracy achieved by people with specialized training is below 50% [4,5]. Because MEs can reveal the genuine emotions people try to hide [1,6], ME recognition has many potential applications in fields such as criminal investigation, commercial negotiation, and clinical diagnosis [7,8]. Owing to their short duration and subtlety, extracting discriminative features from ME video clips is a key problem in ME recognition [9]. In recent years, the automatic detection and recognition of MEs has become an active research topic in computer vision [10,11,12].
In 2011, Pfister et al. [13] applied LBP-TOP (local binary patterns from three orthogonal planes) [14] to extract dynamic features of MEs on the SMIC dataset [12] and proposed a benchmark framework for automatic ME recognition. In 2014, Yan et al. [15] established a new ME dataset called CASME2 and used LBP-TOP for ME recognition. Huang et al. [16] proposed completed local quantization patterns (CLQP), which extend LQP by using sign-based, magnitude-based, and orientation-based differences and converting them into binary codes. Wang et al. [17] proposed LBP with six intersection points (LBP-SIP) to obtain a more compact feature representation. The STLBP-IP method proposed by Huang et al. [18] uses integral projection based on difference images together with LBP to extract the spatiotemporal features of MEs. In addition, Zong et al. [19] extended the effectiveness of the LBP operator with layered STLBP-IP features and reduced the feature dimension using a sparse learning method.
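To make the idea behind this family of descriptors concrete, the following is a minimal Python sketch of an LBP-TOP-style feature. It is a drastic simplification rather than the method of [14]: full LBP-TOP pools codes from the XY, XT, and YT planes passing through every pixel of the video volume, whereas this sketch takes a single central slice per plane; the function name and parameters are ours.

```python
import numpy as np
from skimage.feature import local_binary_pattern

def lbp_top_sketch(volume, P=8, R=1):
    """Simplified LBP-TOP: uniform-LBP histograms from one central
    XY, XT, and YT slice of a T x H x W grayscale volume (full
    LBP-TOP pools codes over the planes through every pixel)."""
    T, H, W = volume.shape
    planes = [volume[T // 2],        # XY: the middle frame (appearance)
              volume[:, H // 2, :],  # XT: horizontal motion over time
              volume[:, :, W // 2]]  # YT: vertical motion over time
    hists = []
    for plane in planes:
        codes = local_binary_pattern(plane, P, R, method="uniform")
        # The 'uniform' mapping yields P + 2 distinct code values.
        h, _ = np.histogram(codes, bins=P + 2, range=(0, P + 2), density=True)
        hists.append(h)
    return np.concatenate(hists)     # 3 * (P + 2) dimensional descriptor
```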
Lu et al. [20] proposed a Delaunay-based temporal coding model (DTCM) to represent spatiotemporally important features of MEs. Xu et al. [21] proposed the Facial Dynamics Map (FDM) to represent the movement patterns of MEs based on dense optical flow. Liu et al. [22] proposed an ME recognition method called Main Directional Mean Optical flow (MDMO), in which the face image is divided into 36 subregions and the main-direction mean optical flow of each region is concatenated into a low-dimensional feature vector. Liong et al. [23] proposed a method for ME detection and recognition using optical strain information, which can better represent fine, subtle facial movements.
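As an illustration of this family of optical-flow descriptors, below is a minimal MDMO-style sketch. It is a simplification, not the implementation of [22]: Farneback flow stands in for the flow method of the original, a regular 6 x 6 grid stands in for its 36 landmark-based regions, and all names and parameters are ours.

```python
import numpy as np
import cv2

def mdmo_like_features(frames, grid=(6, 6), n_bins=8):
    """MDMO-style descriptor: per region, average the flow vectors
    falling in the dominant direction bin, then average over time."""
    ref = cv2.cvtColor(frames[0], cv2.COLOR_BGR2GRAY)
    h, w = ref.shape
    gh, gw = h // grid[0], w // grid[1]
    per_frame = []
    for frame in frames[1:]:
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        # Dense optical flow from the onset frame to the current frame.
        flow = cv2.calcOpticalFlowFarneback(ref, gray, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        feat = []
        for i in range(grid[0]):
            for j in range(grid[1]):
                block = flow[i*gh:(i+1)*gh, j*gw:(j+1)*gw].reshape(-1, 2)
                ang = np.arctan2(block[:, 1], block[:, 0])     # [-pi, pi]
                bins = ((ang + np.pi) / (2*np.pi) * n_bins).astype(int) % n_bins
                main = np.bincount(bins, minlength=n_bins).argmax()
                # Mean flow vector within the main direction bin.
                feat.extend(block[bins == main].mean(axis=0))
        per_frame.append(feat)
    # Temporal average gives one 36 * 2 = 72-D vector per clip.
    return np.mean(per_frame, axis=0)
```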
Since deep learning methods have achieved good performance in facial expression recognition, researchers have recently attempted to apply deep learning to ME recognition. Kim et al. [24] proposed using a convolutional neural network (CNN) to encode the spatial features of MEs at different expression states and then feeding these spatial features into a Long Short-Term Memory (LSTM) network to learn spatiotemporal features. Peng et al. [25] proposed a dual time-scale convolutional neural network, in which the different stream structures of the network adapt to ME clips of different frame rates. Li et al. [26] proposed spotting ME apex frames in the frequency domain and fine-tuning a VGG-Face model with magnified apex frames. Khor et al. [27] introduced an Enriched Long-term Recurrent Convolutional Network (ELRCN) for ME recognition, which encodes ME features by combining a deep spatial feature learning module with a temporal learning module. Li et al. [28] presented a 3D flow-based CNN (3D-FCNN) for ME recognition, which feeds optical flow together with raw grayscale frames into a 12-layer deep network.
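For concreteness, a minimal PyTorch sketch of the CNN-to-LSTM idea of [24] is shown below; the layer sizes and names are our own illustrative choices and do not follow the original architecture.

```python
import torch.nn as nn

class CNNLSTM(nn.Module):
    """Sketch: a CNN encodes each frame, an LSTM models the sequence."""
    def __init__(self, n_classes=3):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4), nn.Flatten())   # -> 32 * 4 * 4 = 512
        self.lstm = nn.LSTM(512, 128, batch_first=True)
        self.fc = nn.Linear(128, n_classes)

    def forward(self, clips):                 # clips: (B, T, 1, H, W)
        B, T = clips.shape[:2]
        feats = self.cnn(clips.flatten(0, 1)).view(B, T, -1)
        out, _ = self.lstm(feats)              # spatial codes -> temporal model
        return self.fc(out[:, -1])             # classify from the last step
```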
Due to the difficulties of ME elicitation and sample annotation, the datasets available for training are very small, which limits the performance of deep neural networks for ME recognition. This article investigates the application of the shallow PCANet+ [29] model to the task of ME recognition. PCANet [30] combines principal component analysis (PCA) with a CNN-style architecture. Despite its simplicity, PCANet has achieved promising results in image classification tasks such as face recognition. As an extension of PCANet, PCANet+ eliminates its complete linearity and alleviates the problem of feature dimension explosion by adding a pooling unit between adjacent layers. In this article, we propose a novel ME recognition method (OF-PCANet+) that combines dense optical flow computation with the PCANet+ network. Considering the subtlety of MEs, we first compute the optical flow of input ME video clips to enhance the motion information; we then construct multi-channel images by stacking the optical flow fields of consecutive frames and feed them into a two-layer PCANet+ network to learn more powerful spatiotemporal features. A linear SVM is adopted to classify the ME video clips; a minimal sketch of this pipeline is given after the contribution list below. Experimental results on the publicly available SMIC [12] and CASME2 [15] datasets demonstrate the effectiveness of the proposed method. The main contributions of this article are summarized as follows:
We propose a lightweight OF-PCANet+ method for ME recognition, which is computationally simple yet produces promising recognition performance.
We present a spatiotemporal feature learning strategy for ME recognition. Discriminative spatiotemporal features can be learned automatically by feeding stacked optical flow sequences into the PCANet+ network.
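To make the pipeline concrete, the following is a minimal sketch of its three stages: flow stacking, PCA filter learning, and filtering plus pooling. It is not our full implementation: Farneback flow is used as a generic dense-flow stand-in, only the first PCANet+ stage is shown, and mean pooling over resized response maps replaces PCANet+'s binary hashing and block histograms. All function names are illustrative, and clips are assumed to be temporally normalized to a fixed number of frames.

```python
import numpy as np
import cv2
from sklearn.decomposition import PCA
from sklearn.svm import LinearSVC

def stacked_flow(frames):
    """Stage 1: stack the flow fields of consecutive frames into one
    multi-channel image (2 * (T - 1) channels for T frames)."""
    grays = [cv2.cvtColor(f, cv2.COLOR_BGR2GRAY) for f in frames]
    flows = [cv2.calcOpticalFlowFarneback(a, b, None,
                                          0.5, 3, 15, 3, 5, 1.2, 0)
             for a, b in zip(grays[:-1], grays[1:])]
    return np.concatenate(flows, axis=2)        # H x W x 2(T-1)

def learn_pca_filters(stacks, k=7, n_filters=8, n_patches=200):
    """Stage 2 (first PCANet+ layer): PCA over mean-removed k x k
    multi-channel patches sampled from the training stacks."""
    rng = np.random.default_rng(0)
    patches = []
    for img in stacks:
        H, W, _ = img.shape
        for _ in range(n_patches):
            y, x = rng.integers(0, H - k), rng.integers(0, W - k)
            p = img[y:y + k, x:x + k].reshape(-1)
            patches.append(p - p.mean())
    pca = PCA(n_components=n_filters).fit(np.array(patches))
    return pca.components_.reshape(n_filters, k, k, -1)

def pcanet_features(img, filters):
    """Stage 3: filter responses, mean-pooled to a fixed-length vector
    (PCANet+ instead hashes responses and builds block histograms)."""
    feats = []
    for f in filters:
        resp = sum(cv2.filter2D(img[:, :, c], cv2.CV_32F, f[:, :, c])
                   for c in range(img.shape[2]))
        feats.append(cv2.resize(resp, (8, 8)).reshape(-1))  # crude pooling
    return np.concatenate(feats)

# Usage: X = [pcanet_features(stacked_flow(c), filters) for c in clips]
#        clf = LinearSVC().fit(X, labels)
```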
The rest of this article is organized as follows. Section 2 gives a brief introduction to optical flow computation and the PCANet+ model. Section 3 describes the proposed method in detail. Section 4 presents the experimental results and discussions, and the conclusions are given in Section 5.