Figure 1. Overall architecture of the proposed SimFLE. SimFLE is divided into two branches: GFL and FFL. In GFL, the backbone CNN learns global facial features through a contrastive method. In FFL, FaceMAE encodes patch-level representations of fine-grained facial landmarks and transfers them to the backbone CNN.
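The two objectives in the caption can be illustrated with a minimal NumPy sketch (the function names, shapes, and loss weighting here are assumptions for illustration, not the paper's actual code): the GFL branch pulls embeddings of two augmented views of the same face together with a cosine-similarity contrastive objective, while the FFL branch reconstructs masked facial patches.

```python
import numpy as np

rng = np.random.default_rng(0)

def cosine_similarity(a, b):
    """Cosine similarity between two feature vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def global_contrastive_loss(z1, z2):
    """GFL-style objective (simplified): negative cosine similarity
    between embeddings of two augmented views of the same face."""
    return -cosine_similarity(z1, z2)

def masked_reconstruction_loss(pred_patches, target_patches, mask):
    """FFL-style objective (simplified): mean-squared error computed
    only on the masked patches, as in masked autoencoding."""
    diff = (pred_patches - target_patches) ** 2
    return float(diff[mask].mean())

# Toy example: 16 patches of 48 values each, 75% of them masked.
patches = rng.normal(size=(16, 48))
mask = rng.random(16) < 0.75
pred = patches + 0.1 * rng.normal(size=patches.shape)
z1, z2 = rng.normal(size=128), rng.normal(size=128)

# Hypothetical combined training signal (equal weighting assumed).
total_loss = global_contrastive_loss(z1, z2) + masked_reconstruction_loss(pred, patches, mask)
```

In practice both branches would share gradients into the backbone CNN; this sketch only shows the shape of the two loss terms.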
Figure 2. Examples of masked autoencoding results of FaceMAE. The encoder must encode relationships between facial landmarks, or the overall facial context, into patch-level representations of the visible facial landmarks.
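The masking step that produces the inputs shown in Figure 2 can be sketched as follows (a minimal NumPy version; the patch size and 75% mask ratio are assumptions, not values taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

def patchify(img, patch):
    """Split an (H, W) image into non-overlapping (patch x patch) tiles,
    returned as (num_patches, patch*patch) rows."""
    h, w = img.shape
    tiles = img.reshape(h // patch, patch, w // patch, patch)
    return tiles.transpose(0, 2, 1, 3).reshape(-1, patch * patch)

def random_mask(num_patches, mask_ratio, rng):
    """Choose which patches to hide; the encoder only sees the rest."""
    num_masked = int(num_patches * mask_ratio)
    perm = rng.permutation(num_patches)
    mask = np.zeros(num_patches, dtype=bool)
    mask[perm[:num_masked]] = True
    return mask

img = rng.random((32, 32))
patches = patchify(img, 8)          # 16 patches of 64 pixels each
mask = random_mask(len(patches), 0.75, rng)
visible = patches[~mask]            # only these reach the encoder
```

The decoder is then asked to reproduce the hidden patches from the visible ones, which is what forces the encoder to capture relationships between landmarks rather than local texture alone.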
Our SimFLE tends to be more attentive to facial landmarks than the supervised baseline and other self-supervised methods, which leads to higher performance.
Figure 3. Visualization of attention maps using Grad-CAM. The first row shows the attention maps of the supervised FER-W model. The second and third rows show the attention maps of the FER-W model trained with BYOL and with our proposed SimFLE, respectively. Compared to the supervised counterpart, the BYOL-trained model rarely localizes facial landmarks. In contrast, our SimFLE-trained model tends to be more attentive to facial landmarks, even more so than the supervised one.
Figure 4. Visualization of attention maps for more challenging examples. The SimFLE-trained model remains attentive to facial landmarks even in harder cases, such as rotated faces or occluded landmarks.
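The attention maps in Figures 3 and 4 follow the standard Grad-CAM recipe: channel weights are the spatially averaged gradients of the class score with respect to the last convolutional feature map, and the heatmap is the ReLU of the weighted sum of activation channels. A minimal NumPy version of that formula (the activation/gradient shapes are illustrative assumptions):

```python
import numpy as np

def grad_cam(activations, gradients):
    """Grad-CAM heatmap from last-conv activations and their gradients,
    both of shape (C, H, W)."""
    weights = gradients.mean(axis=(1, 2))             # alpha_k: GAP of grads
    cam = np.tensordot(weights, activations, axes=1)  # sum_k alpha_k * A_k
    return np.maximum(cam, 0)                         # ReLU keeps positive evidence

rng = np.random.default_rng(0)
acts = rng.random((64, 7, 7))        # hypothetical feature map
grads = rng.normal(size=(64, 7, 7))  # hypothetical gradients
heatmap = grad_cam(acts, grads)      # (7, 7); upsampled onto the face image
```

In an actual PyTorch pipeline the activations and gradients would be captured with forward/backward hooks on the last convolutional layer before applying this formula.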
Code: the official PyTorch implementation of SimFLE is available at https://github.com/jymoon0613/simfle.