APSIPA Transactions on Signal and Information Processing > Vol 10 > Issue 1

Cross-layer knowledge distillation with KL divergence and offline ensemble for compressing deep neural network

Hsing-Hung Chou, Institute of Communications Engineering, National Tsing Hua University, Taiwan, paul8301526@gmail.com; Ching-Te Chiu, Institute of Communications Engineering and Institute of Computer Science, National Tsing Hua University, Taiwan; Yi-Ping Liao, Institute of Computer Science, National Tsing Hua University, Taiwan
 
Suggested Citation
Hsing-Hung Chou, Ching-Te Chiu and Yi-Ping Liao (2021), "Cross-layer knowledge distillation with KL divergence and offline ensemble for compressing deep neural network", APSIPA Transactions on Signal and Information Processing: Vol. 10: No. 1, e18. http://dx.doi.org/10.1017/ATSIP.2021.16

Publication Date: 17 Nov 2021
© 2021 Hsing-Hung Chou, Ching-Te Chiu and Yi-Ping Liao
 
Keywords
Deep convolutional model compression, Knowledge distillation, Transfer learning
 

Open Access

This is published under the terms of the Creative Commons Attribution licence.

In this article:
I. INTRODUCTION 
II. RELATED WORK 
III. PROPOSED ARCHITECTURE 
IV. EXPERIMENTAL RESULTS 
V. DISCUSSION 
VI. CONCLUSION 

Abstract

Deep neural networks (DNNs) have solved many tasks, including image classification, object detection, and semantic segmentation. However, a DNN model with a large number of parameters and a high computational cost is difficult to deploy on mobile devices. To address this difficulty, we propose an efficient compression method that consists of three parts. First, we propose a cross-layer matrix to extract more features from the teacher model. Second, we adopt the Kullback-Leibler (KL) divergence in an offline environment to help the student model find a wider, more robust minimum. Finally, we propose an offline ensemble of pre-trained teachers to teach a student model. To address the dimension mismatch between teacher and student models, we adopt a $1\times 1$ convolution and two-stage knowledge distillation to relax this constraint. We conducted experiments with VGG and ResNet models on the CIFAR-100 dataset. With VGG-11 as the teacher model and VGG-6 as the student model, Top-1 accuracy increased by 3.57% with a $2.08\times$ compression rate and a $3.5\times$ computation rate. With ResNet-32 as the teacher model and ResNet-8 as the student model, Top-1 accuracy increased by 4.38% with a $6.11\times$ compression rate and a $5.27\times$ computation rate. In addition, we conducted experiments on the ImageNet $64\times 64$ dataset. With MobileNet-16 as the teacher model and MobileNet-9 as the student model, Top-1 accuracy increased by 3.98% with a $1.59\times$ compression rate and a $2.05\times$ computation rate.
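As an illustration of the second and third parts of the method, the following is a minimal sketch of a KL-divergence distillation loss against an offline ensemble of teachers. It is not the authors' implementation. The abstract does not say how the teacher outputs are combined or whether a softmax temperature is used, so the simple averaging, the `temperature` parameter, the $T^2$ scaling, and all function names here are assumptions made for illustration only.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-softened softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


def ensemble_soft_targets(teacher_logits_list, temperature):
    """Average the softened outputs of several pre-trained teachers.

    Assumption: the teachers are frozen ("offline"), so their logits can be
    computed once and reused; plain averaging is a hypothetical choice.
    """
    probs = [softmax(logits, temperature) for logits in teacher_logits_list]
    return [sum(column) / len(probs) for column in zip(*probs)]


def distillation_loss(student_logits, teacher_logits_list, temperature=4.0):
    """KL divergence from the ensemble target to the student distribution."""
    target = ensemble_soft_targets(teacher_logits_list, temperature)
    student = softmax(student_logits, temperature)
    # Scaling by T^2 is a common convention, assumed here rather than
    # taken from the article.
    return temperature ** 2 * kl_divergence(target, student)


if __name__ == "__main__":
    # Hypothetical logits for a 3-class problem.
    teachers = [[2.0, 1.0, 0.1], [1.5, 1.2, 0.3]]
    student = [0.5, 0.4, 0.2]
    print(distillation_loss(student, teachers))
```

In a training loop, this term would typically be added to the usual cross-entropy loss on the ground-truth labels; the weighting between the two is another choice that the abstract does not specify.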

DOI:10.1017/ATSIP.2021.16

Companion

APSIPA Transactions on Signal and Information Processing Deep Neural Networks: Representation, Interpretation, and Applications: Articles Overview