APSIPA Transactions on Signal and Information Processing > Vol 10 > Issue 1

3D skeletal movement-enhanced emotion recognition networks

Jiaqi Shi, Graduate School of Engineering Science, Osaka University, Japan and Guardian Robot Project, RIKEN, Japan (shi.jiaqi@irl.sys.es.osaka-u.ac.jp)
Chaoran Liu, Advanced Telecommunications Research Institute International, Japan
Carlos Toshinori Ishi, Guardian Robot Project, RIKEN, Japan and Advanced Telecommunications Research Institute International, Japan
Hiroshi Ishiguro, Graduate School of Engineering Science, Osaka University, Japan and Advanced Telecommunications Research Institute International, Japan
 
Suggested Citation
Jiaqi Shi, Chaoran Liu, Carlos Toshinori Ishi and Hiroshi Ishiguro (2021), "3D skeletal movement-enhanced emotion recognition networks", APSIPA Transactions on Signal and Information Processing: Vol. 10: No. 1, e12. http://dx.doi.org/10.1017/ATSIP.2021.11

Publication Date: 05 Aug 2021
© 2021 Jiaqi Shi, Chaoran Liu, Carlos Toshinori Ishi and Hiroshi Ishiguro
 
Keywords
Deep learning, emotion recognition, gesture, skeleton
 

Open Access

This article is published under the terms of the Creative Commons Attribution licence.

In this article:
I. INTRODUCTION 
II. RELATED STUDIES 
III. METHODOLOGY 
IV. EXPERIMENTS AND RESULTS 
V. ABLATION STUDY AND DISCUSSION 
VI. CONCLUSIONS 
FINANCIAL SUPPORT 

Abstract

Automatic emotion recognition has become an important trend in the fields of natural human–computer interaction and artificial intelligence. Although gesture is one of the most important components of nonverbal communication and has a considerable impact on the perception of emotion, it is rarely considered in studies of emotion recognition. An important reason is the lack of large open-source emotional databases containing skeletal movement data. In this paper, we extract three-dimensional skeleton information from videos and apply the method to the IEMOCAP database to add a new modality. We propose an attention-based convolutional neural network that takes the extracted skeletal data as input to predict the speaker's emotional state. We also propose a graph attention-based fusion method that combines our model with models using other modalities, so that complementary information is provided for the emotion classification task and multimodal cues are fused effectively. The combined model utilizes audio signals, text information, and skeletal data, and it significantly outperforms the bimodal model and other fusion strategies, demonstrating the effectiveness of the method.
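To make the fusion idea concrete, the following is a minimal, simplified sketch of attention-weighted multimodal fusion: each modality (audio, text, skeleton) contributes an embedding, a scoring vector assigns each modality an attention weight, and the fused representation is the weighted sum. This is an illustrative stand-in, not the paper's graph attention network; the embedding dimension, the random features, and the scoring vector `w` are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_fuse(modality_feats, w):
    """Fuse per-modality feature vectors by attention weighting.

    modality_feats: (M, D) array, one D-dim embedding per modality.
    w: (D,) scoring vector (a stand-in for learned parameters).
    Returns the (D,) fused representation and the (M,) attention weights.
    """
    scores = modality_feats @ w      # one scalar relevance score per modality
    alpha = softmax(scores)          # normalize scores to attention weights
    fused = alpha @ modality_feats   # convex combination of the embeddings
    return fused, alpha

# Toy embeddings for three modalities: audio, text, skeleton (hypothetical values).
feats = rng.standard_normal((3, 8))
w = rng.standard_normal(8)
fused, alpha = attention_fuse(feats, w)
print(alpha, fused.shape)  # weights are non-negative and sum to 1; fused is 8-dim
```

In the paper's actual method, the per-modality weighting is learned jointly with the rest of the network via graph attention, so modalities that carry more emotional evidence for a given utterance receive larger weights.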

DOI:10.1017/ATSIP.2021.11