Figure 1. Overall architecture of the proposed SimFLE. SimFLE is divided into two branches: GFL and FFL. In GFL, the backbone CNN learns global facial features through a contrastive method. In FFL, FaceMAE encodes patch-level representations of fine-grained facial landmarks and transfers them to the backbone CNN.
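The two objectives in the caption can be illustrated with a minimal NumPy sketch (the function names, shapes, and loss weighting here are assumptions for illustration, not the paper's actual code): the GFL branch pulls embeddings of two augmented views of the same face together with a cosine-similarity contrastive objective, while the FFL branch reconstructs masked facial patches.

```python
import numpy as np

rng = np.random.default_rng(0)

def cosine_similarity(a, b):
    """Cosine similarity between two feature vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def global_contrastive_loss(z1, z2):
    """GFL-style objective (simplified): negative cosine similarity
    between embeddings of two augmented views of the same face."""
    return -cosine_similarity(z1, z2)

def masked_reconstruction_loss(pred_patches, target_patches, mask):
    """FFL-style objective (simplified): mean-squared error computed
    only on the masked patches, as in masked autoencoding."""
    diff = (pred_patches - target_patches) ** 2
    return float(diff[mask].mean())

# Toy example: 16 patches of 48 values each, 75% of them masked.
patches = rng.normal(size=(16, 48))
mask = rng.random(16) < 0.75
pred = patches + 0.1 * rng.normal(size=patches.shape)
z1, z2 = rng.normal(size=128), rng.normal(size=128)

# Hypothetical combined training signal (equal weighting assumed).
total_loss = global_contrastive_loss(z1, z2) + masked_reconstruction_loss(pred, patches, mask)
```

In practice both branches would share gradients into the backbone CNN; this sketch only shows the shape of the two loss terms.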
Figure 2. Examples of masked autoencoding results of FaceMAE. The encoder must encode relationships between facial landmarks, or the overall facial context, into patch-level representations of the visible facial landmarks.
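The masking step that produces the inputs shown in Figure 2 can be sketched as follows (a minimal NumPy version; the patch size and 75% mask ratio are assumptions, not values taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

def patchify(img, patch):
    """Split an (H, W) image into non-overlapping (patch x patch) tiles,
    returned as (num_patches, patch*patch) rows."""
    h, w = img.shape
    tiles = img.reshape(h // patch, patch, w // patch, patch)
    return tiles.transpose(0, 2, 1, 3).reshape(-1, patch * patch)

def random_mask(num_patches, mask_ratio, rng):
    """Choose which patches to hide; the encoder only sees the rest."""
    num_masked = int(num_patches * mask_ratio)
    perm = rng.permutation(num_patches)
    mask = np.zeros(num_patches, dtype=bool)
    mask[perm[:num_masked]] = True
    return mask

img = rng.random((32, 32))
patches = patchify(img, 8)          # 16 patches of 64 pixels each
mask = random_mask(len(patches), 0.75, rng)
visible = patches[~mask]            # only these reach the encoder
```

The decoder is then asked to reproduce the hidden patches from the visible ones, which is what forces the encoder to capture relationships between landmarks rather than local texture alone.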
Our SimFLE tends to be more attentive to facial landmarks than the supervised baseline and other self-supervised methods, which leads to higher performance.
Figure 3. Visualization of attention maps using Grad-CAM. The first row shows the attention maps of the supervised FER-W model. The second and third rows show the attention maps of the FER-W model trained with BYOL and with our proposed SimFLE, respectively. Compared to the supervised counterpart, the BYOL-trained model rarely localizes facial landmarks. In contrast, our SimFLE-trained model tends to be more attentive to facial landmarks, even more so than the supervised one.
Figure 4. Visualization of attention maps for more challenging examples. The SimFLE-trained model remains attentive to facial landmarks even in harder cases, such as rotated faces or occluded landmarks.
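The attention maps in Figures 3 and 4 follow the standard Grad-CAM recipe: channel weights are the spatially averaged gradients of the class score with respect to the last convolutional feature map, and the heatmap is the ReLU of the weighted sum of activation channels. A minimal NumPy version of that formula (the activation/gradient shapes are illustrative assumptions):

```python
import numpy as np

def grad_cam(activations, gradients):
    """Grad-CAM heatmap from last-conv activations and their gradients,
    both of shape (C, H, W)."""
    weights = gradients.mean(axis=(1, 2))             # alpha_k: GAP of grads
    cam = np.tensordot(weights, activations, axes=1)  # sum_k alpha_k * A_k
    return np.maximum(cam, 0)                         # ReLU keeps positive evidence

rng = np.random.default_rng(0)
acts = rng.random((64, 7, 7))        # hypothetical feature map
grads = rng.normal(size=(64, 7, 7))  # hypothetical gradients
heatmap = grad_cam(acts, grads)      # (7, 7); upsampled onto the face image
```

In an actual PyTorch pipeline the activations and gradients would be captured with forward/backward hooks on the last convolutional layer before applying this formula.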
Code: the official PyTorch implementation of SimFLE is available at https://github.com/jymoon0613/simfle.