APSIPA Transactions on Signal and Information Processing > Vol 13 > Issue 1

Robust Multi-Domain Multi-Turn Dialogue Policy via Student-Teacher Offline Reinforcement Learning

Mahdin Rohmatillah, National Yang Ming Chiao Tung University, Taiwan; Jen-Tzung Chien, National Yang Ming Chiao Tung University, Taiwan, jtchien@nycu.edu.tw
 
Suggested Citation
Mahdin Rohmatillah and Jen-Tzung Chien (2024), "Robust Multi-Domain Multi-Turn Dialogue Policy via Student-Teacher Offline Reinforcement Learning", APSIPA Transactions on Signal and Information Processing: Vol. 13: No. 1, e18. http://dx.doi.org/10.1561/116.20240024

Publication Date: 09 Sep 2024
© 2024 M. Rohmatillah and J.-T. Chien
 
Subjects
Topic detection and tracking, Question answering, Reinforcement learning, Speech and spoken language processing, Statistical/Machine learning, Markov decision processes, Stochastic optimization, Optimization, Applied mathematics
 
Keywords
Dialogue system, dialogue policy optimization, student-teacher learning, offline reinforcement learning
 

Open Access

This is published under the terms of CC BY-NC.

In this article:
Introduction 
Multi-Domain Task-Oriented Dialogue 
Robust Multi-Domain Multi-Turn Dialogue Policy Learning 
Experiments 
Conclusion 
References 

Abstract

The dialogue policy plays a crucial role in a dialogue system, as it determines the system response to a given user input. In a pipeline system, the dialogue policy is susceptible to performance degradation when the preceding components fail to produce correct output. To address this issue, this paper proposes a new method for training a robust dialogue policy that can handle noisy representations caused by mispredicted user dialogue acts from the natural language understanding component. The method combines two strategies: student-teacher learning and offline reinforcement learning. Student-teacher learning forces the student model to map the features it extracts from the noisy input close to the clean features extracted by the teacher model. Meanwhile, offline reinforcement learning with a multi-label classification objective trains the dialogue policy to provide an appropriate response to the user input using only the trajectories stored in the dataset. Experimental results show that the proposed hybrid learning substantially improves multi-turn end-to-end performance in a pipeline dialogue system on the MultiWOZ 2.1 dataset under the ConvLab-2 evaluation framework. Furthermore, the results are competitive with the end-to-end performance of a pre-trained GPT-2 model while incurring lower computational cost.
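The two training signals described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the exact losses, so the mean-squared feature distance, the binary cross-entropy over dialogue-act labels, the additive combination, and the names `feature_matching_loss`, `multilabel_bce_loss`, `hybrid_loss`, and `weight` are all assumptions chosen for illustration. The sketch also omits how reward enters the offline reinforcement learning objective, which the abstract does not specify.

```python
import math


def feature_matching_loss(student_feat, teacher_feat):
    """Mean squared distance between student features (from the noisy input)
    and teacher features (from the clean input). Assumed distance measure."""
    assert len(student_feat) == len(teacher_feat)
    return sum((s - t) ** 2 for s, t in zip(student_feat, teacher_feat)) / len(student_feat)


def multilabel_bce_loss(logits, targets):
    """Binary cross-entropy averaged over labels. Each system dialogue act is
    treated as an independent yes/no label (assumed multi-label objective)."""
    assert len(logits) == len(targets)
    total = 0.0
    for z, y in zip(logits, targets):
        # Numerically stable form of -[y*log(sigmoid(z)) + (1-y)*log(1-sigmoid(z))]
        total += max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
    return total / len(logits)


def hybrid_loss(student_feat, teacher_feat, logits, targets, weight=1.0):
    """Hypothetical combined objective: action classification on logged data
    plus a weighted feature-matching term. The weighting is an assumption."""
    return multilabel_bce_loss(logits, targets) + weight * feature_matching_loss(
        student_feat, teacher_feat
    )


if __name__ == "__main__":
    # Toy values only: 3-dimensional features and 4 candidate dialogue acts.
    teacher = [0.2, -0.1, 0.5]   # features of the clean user dialogue acts
    student = [0.3, 0.0, 0.4]    # features of the noisy (NLU-predicted) acts
    logits = [2.0, -1.0, 0.5, -3.0]
    targets = [1.0, 0.0, 1.0, 0.0]  # acts taken in the logged trajectory
    print(hybrid_loss(student, teacher, logits, targets, weight=0.5))
```

In this sketch the feature-matching term vanishes when the student reproduces the teacher's features exactly, so a student that is unaffected by the noise is trained on the action classification term alone.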

DOI:10.1561/116.20240024