By Xiaogang Wang, The Chinese University of Hong Kong, Hong Kong, xgwang@ee.cuhk.edu.hk
As a major breakthrough in artificial intelligence, deep learning has achieved impressive success in solving grand challenges in many fields, including speech recognition, natural language processing, computer vision, image and video processing, and multimedia. This article provides a historical overview of deep learning and focuses on its applications in object recognition, detection, and segmentation, which are key challenges of computer vision and have numerous applications to images and videos. The research topics discussed under object recognition include image classification on ImageNet, face recognition, and video classification. The detection part covers general object detection on ImageNet, pedestrian detection, face landmark detection (face alignment), and human landmark detection (pose estimation). On the segmentation side, the article discusses the most recent progress on scene labeling, semantic segmentation, face parsing, human parsing, and saliency detection. Object recognition is treated as whole-image classification, while detection and segmentation are pixelwise classification tasks. Their fundamental differences will be discussed in this article. Fully convolutional neural networks and highly efficient forward and backward propagation algorithms specially designed for pixelwise classification tasks will be introduced. The application domains covered are also diverse. Human and face images have regular structures, while general object and scene images have much more complex variations in geometric structure and layout. Videos add the temporal dimension. These domains therefore need to be processed with different deep models. All the selected domain applications have received tremendous attention in the computer vision and multimedia communities. Through concrete examples of these applications, we explain the key points that make deep learning outperform conventional computer vision systems.
(1) Unlike traditional pattern recognition systems, which rely heavily on manually designed features, deep learning automatically learns hierarchical feature representations from massive training data and disentangles hidden factors of the input data through multi-level nonlinear mappings. (2) Unlike existing pattern recognition systems, which design or train their key components sequentially, deep learning is able to jointly optimize all the components and create synergy through close interactions among them. (3) While most machine learning models can be approximated by neural networks with shallow structures, for some tasks the expressive power of deep models increases exponentially as their architectures go deeper. Deep models are especially good at learning global contextual feature representations with their deep structures. (4) Benefiting from the large learning capacity of deep models, some classical computer vision challenges can be recast as high-dimensional data transform problems and solved from new perspectives. Finally, some open questions and future work regarding deep learning in object recognition, detection, and segmentation will be discussed.
As a major breakthrough in artificial intelligence, deep learning has achieved impressive success in solving grand challenges in many fields, including speech recognition, natural language processing, computer vision, image and video processing, and multimedia. This monograph provides a historical overview of deep learning and focuses on its applications in object recognition, detection, and segmentation, which are key challenges of computer vision and have numerous applications to images and videos.
Specifically, the topics covered under object recognition include image classification on ImageNet, face recognition, and video classification. In detection, the monograph covers general object detection on ImageNet, pedestrian detection, face landmark detection (face alignment), and human landmark detection (pose estimation). Finally, within segmentation, it covers the most recent progress on scene labeling, semantic segmentation, face parsing, human parsing, and saliency detection. Concrete examples of these applications explain the key points that make deep learning outperform conventional computer vision systems.
Deep Learning in Object Recognition, Detection, and Segmentation provides a comprehensive introductory overview of a topic that is having a major impact on many areas of research in signal processing, computer vision, and machine learning. It is a must-read for students and researchers new to these fields.