APSIPA Transactions on Signal and Information Processing > Vol 13 > Issue 1

Robust Multi-Domain Multi-Turn Dialogue Policy via Student-Teacher Offline Reinforcement Learning

Mahdin Rohmatillah, National Yang Ming Chiao Tung University, Taiwan, Jen-Tzung Chien, National Yang Ming Chiao Tung University, Taiwan, jtchien@nycu.edu.tw
 
Suggested Citation
Mahdin Rohmatillah and Jen-Tzung Chien (2024), "Robust Multi-Domain Multi-Turn Dialogue Policy via Student-Teacher Offline Reinforcement Learning", APSIPA Transactions on Signal and Information Processing: Vol. 13: No. 1, e18. http://dx.doi.org/10.1561/116.20240024

Publication Date: 09 Sep 2024
© 2024 M. Rohmatillah and J.-T. Chien
 
Subjects
Topic detection and tracking,  Question answering,  Reinforcement learning,  Speech and spoken language processing,  Statistical/Machine learning,  Markov decision processes,  Stochastic optimization,  Optimization,  Applied mathematics
 
Keywords
Dialogue system, dialogue policy optimization, student-teacher learning, offline reinforcement learning
 


Open Access

This article is published under the terms of the CC BY-NC license.


In this article:
Introduction 
Multi-Domain Task-Oriented Dialogue 
Robust Multi-Domain Multi-Turn Dialogue Policy Learning 
Experiments 
Conclusion 
References 

Abstract

Dialogue policy plays a crucial role in a dialogue system, as it determines the system response given a user input. In a pipeline system, the dialogue policy is susceptible to performance degradation when the preceding components fail to produce correct outputs. To address this issue, this paper proposes a new method to train a robust dialogue policy that can handle noisy representations caused by mispredicted user dialogue acts from the natural language understanding component. The method is built on two strategies: student-teacher learning and offline reinforcement learning. Student-teacher learning forces the student model to map the features extracted from the noisy input close to the clean features extracted by the teacher model. Meanwhile, offline reinforcement learning with a multi-label classification objective trains the dialogue policy to provide appropriate responses to user inputs by utilizing only the trajectories stored in the dataset. The experimental results show that the proposed hybrid learning substantially improves the multi-turn end-to-end performance of a pipeline dialogue system on the MultiWOZ 2.1 dataset under the ConvLab-2 evaluation framework. Furthermore, competitive results are obtained compared to the end-to-end performance of the pre-trained GPT-2 model, at a lower computational cost.
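The two objectives described in the abstract can be sketched as follows in PyTorch. All names, dimensions, and the simple MLP architecture here are illustrative assumptions rather than the paper's actual implementation: the feature-matching term pulls the student's features (computed from noisy NLU output) toward the teacher's features (computed from the clean oracle dialogue acts), while the multi-label classification term imitates the system acts logged in the offline dataset.

```python
import torch
import torch.nn as nn

# Illustrative sizes only; the paper's actual architecture is not given here.
STATE_DIM, HIDDEN_DIM, NUM_ACTIONS = 32, 64, 10

class PolicyNet(nn.Module):
    """Feature extractor followed by a multi-label action head."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(STATE_DIM, HIDDEN_DIM), nn.ReLU())
        self.head = nn.Linear(HIDDEN_DIM, NUM_ACTIONS)

    def forward(self, x):
        feat = self.encoder(x)
        return feat, self.head(feat)

teacher = PolicyNet()  # assumed pre-trained on clean (oracle) dialogue acts
student = PolicyNet()  # trained on noisy NLU predictions

# Toy batch: clean oracle states, a noisy version, and logged system acts.
clean_state = torch.randn(4, STATE_DIM)
noisy_state = clean_state + 0.1 * torch.randn_like(clean_state)
logged_acts = torch.randint(0, 2, (4, NUM_ACTIONS)).float()

with torch.no_grad():  # teacher is frozen; only the student is updated
    teacher_feat, _ = teacher(clean_state)
student_feat, student_logits = student(noisy_state)

# Student-teacher feature matching + offline multi-label imitation.
match_loss = nn.functional.mse_loss(student_feat, teacher_feat)
policy_loss = nn.functional.binary_cross_entropy_with_logits(
    student_logits, logged_acts)
total_loss = match_loss + policy_loss  # relative weighting is an assumption
total_loss.backward()
```

Because the response is multi-label (a system turn can carry several dialogue acts at once), a per-action sigmoid with binary cross-entropy is the natural imitation objective here, rather than a single softmax over actions.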

DOI:10.1561/116.20240024