APSIPA Transactions on Signal and Information Processing > Vol 5 > Issue 1

Deep learning: from speech recognition to language and multimodal processing

Industrial Technology Advances

Li Deng, Microsoft Research, USA, deng@microsoft.com
 
Suggested Citation
Li Deng (2016), "Deep learning: from speech recognition to language and multimodal processing", APSIPA Transactions on Signal and Information Processing: Vol. 5: No. 1, e1. http://dx.doi.org/10.1017/ATSIP.2015.22

Publication Date: 19 Jan 2016
© 2016 Li Deng
 
 
Keywords
Deep learning, Multimodal, Speech recognition, Language processing, Deep neural networks
 


Open Access

This is published under the terms of the Creative Commons Attribution licence.


In this article:
I. INTRODUCTION 
II. SOME BRIEF HISTORY OF “DEEP” SPEECH RECOGNITION 
III. ACHIEVEMENTS OF DEEP LEARNING IN SPEECH RECOGNITION 
IV. DEEP LEARNING FOR NATURAL LANGUAGE AND MULTIMODAL PROCESSING 
V. CONCLUSIONS AND CHALLENGES FOR FUTURE WORK 

Abstract

While artificial neural networks have existed for over half a century, it was not until 2010 that they made a significant impact on speech recognition, in a deep form of such networks. This invited paper, based on my keynote talk given at the Interspeech conference in Singapore in September 2014, first reflects on the historical path to this transformative success, after briefly reviewing earlier studies on (shallow) neural networks and on (deep) generative models relevant to the introduction of deep neural networks (DNNs) to speech recognition several years ago. The role of well-timed academic-industrial collaboration is highlighted, as are the advances in big data and big compute and the seamless integration of application-domain knowledge of speech with the general principles of deep learning. Then, an overview is given of the sweeping achievements of deep learning in speech recognition since its initial success. These achievements, summarized in six major areas in this article, have resulted in across-the-board, industry-wide deployment of deep learning in speech recognition systems. Next, more challenging applications of deep learning, namely natural language and multimodal processing, are selectively reviewed and analyzed. Examples include machine translation, knowledge base completion, information retrieval, and automatic image captioning, where fresh ideas from deep learning, continuous-space embedding in particular, are shown to be revolutionizing these application areas, albeit at a less rapid pace than for speech and image recognition. Finally, a number of key issues in deep learning are discussed, and future directions are analyzed for perceptual tasks such as speech, image, and video, as well as for cognitive tasks involving natural language.

DOI:10.1017/ATSIP.2015.22