EMS2L: Enhanced Multi-Task Self-Supervised Learning for 3D Skeleton Representation Learning

Lilang Lin, Wangxuan Institute of Computer Technology, Peking University, China
Jiaying Liu, Wangxuan Institute of Computer Technology, Peking University, China, liujiaying@pku.edu.cn
 
Suggested Citation
Lilang Lin and Jiaying Liu (2023), "EMS2L: Enhanced Multi-Task Self-Supervised Learning for 3D Skeleton Representation Learning", APSIPA Transactions on Signal and Information Processing: Vol. 12: No. 4, e100. http://dx.doi.org/10.1561/116.00000022

Publication Date: 15 May 2023
© 2023 L. Lin and J. Liu
 
Keywords
Self-supervised learning, skeleton-based action recognition, multi-task learning
 

Open Access

This article is published under the terms of the CC BY-NC license.

In this article:
Introduction 
Related Work 
Enhanced Multiple Self-Supervised Learning 
Experiment Results 
Conclusion 
References 

Abstract

To learn from the abundance of unlabeled data available for smart infrastructure, we propose Enhanced Multi-Task Self-Supervised Learning (EMS2L) for self-supervised action recognition based on 3D human skeletons. Unlike previous methods that rely on a single self-supervised task, EMS2L integrates multiple self-supervised tasks to learn more comprehensive information. The tasks employed here include task-specific methods (i.e., motion prediction and a jigsaw puzzle task) and a task-agnostic method, contrastive learning. By combining these three self-supervised tasks, we learn rich feature representations. Specifically, motion prediction extracts detailed information by reconstructing the original data from temporally masked and noisy sequences. The jigsaw puzzle task enables the model to explore temporally discriminative features for human action recognition by predicting the correct order of shuffled sequences. In addition, to regularize the feature space, we apply contrastive learning to increase intra-class compactness and inter-class separability. To learn invariant representations, we propose an attention model for contrastive representation learning that reduces the distance between original features and attention features. To avoid degrading the learned representation through the pursuit of excessive invariance, this attention-based contrastive learning assigns different weights to the features of differently transformed data. We evaluate EMS2L on downstream tasks under a variety of settings, including fully supervised, semi-supervised, unsupervised, and transfer learning, and we explore different network architectures (i.e., GRU and GCN). Strong results on the NW-UCLA, NTU RGB+D, and PKU-MMD datasets illustrate the generality of our approach. Extensive experiments demonstrate that our method learns features that are more general and discriminative, and we provide further experimental analysis of the individual self-supervised tasks.
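To make the multi-task objective concrete, the sketch below outlines, in PyTorch, one way the three pretext losses described in the abstract could be combined over a shared skeleton encoder. This is a minimal illustration under stated assumptions, not the paper's implementation: all names (EMS2LSketch, recon_head, jigsaw_head, proj_head, the loss weights w) are hypothetical, the encoder is treated as a black box producing one feature vector per sequence, and the paper's attention-based weighting of transformed views is omitted for brevity.

import torch
import torch.nn as nn
import torch.nn.functional as F

class EMS2LSketch(nn.Module):
    """Hypothetical multi-task wrapper: a shared encoder (e.g., GRU or GCN)
    with one head per pretext task, following the abstract's description."""

    def __init__(self, encoder, feat_dim=256, num_joints=25, num_perms=64):
        super().__init__()
        self.encoder = encoder                                 # shared backbone
        self.recon_head = nn.Linear(feat_dim, num_joints * 3)  # motion prediction
        self.jigsaw_head = nn.Linear(feat_dim, num_perms)      # permutation classifier
        self.proj_head = nn.Sequential(                        # contrastive projection
            nn.Linear(feat_dim, feat_dim), nn.ReLU(),
            nn.Linear(feat_dim, 128))

    def forward(self, x):
        return self.encoder(x)  # (batch, feat_dim)

def info_nce(z1, z2, temperature=0.07):
    """Standard InfoNCE between two views; matching indices are positives."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / temperature
    targets = torch.arange(z1.size(0), device=z1.device)
    return F.cross_entropy(logits, targets)

def total_loss(model, masked_seq, target_pose, shuffled_seq, perm_labels,
               view_a, view_b, w=(1.0, 1.0, 1.0)):
    # 1) Motion prediction: reconstruct joint coordinates (flattened to
    #    num_joints * 3) from a temporally masked, noisy input sequence.
    l_rec = F.mse_loss(model.recon_head(model(masked_seq)), target_pose)

    # 2) Jigsaw puzzle: classify which of num_perms segment orderings
    #    was applied to the shuffled clip.
    l_jig = F.cross_entropy(model.jigsaw_head(model(shuffled_seq)), perm_labels)

    # 3) Contrastive learning between two augmented views of each sequence.
    l_con = info_nce(model.proj_head(model(view_a)), model.proj_head(model(view_b)))

    return w[0] * l_rec + w[1] * l_jig + w[2] * l_con

In a setup like this, the jigsaw classifier assumes a fixed vocabulary of num_perms segment orderings, as in standard jigsaw pretext formulations, and the loss weights w would be tuned per dataset.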

DOI:10.1561/116.00000022

Companion

APSIPA Transactions on Signal and Information Processing Special Issue - Emerging AI Technologies for Smart Infrastructure