Foundations and Trends® in Computer Graphics and Vision, Vol. 16, Issue 1-2

Multimodal Foundation Models: From Specialists to General-Purpose Assistants

By Chunyuan Li, Microsoft Corporation, USA, chunyl@microsoft.com | Zhe Gan, Microsoft Corporation, USA, zhgan@microsoft.com | Zhengyuan Yang, Microsoft Corporation, USA, zhengyang@microsoft.com | Jianwei Yang, Microsoft Corporation, USA, jianwyan@microsoft.com | Linjie Li, Microsoft Corporation, USA, linjli@microsoft.com | Lijuan Wang, Microsoft Corporation, USA, lijuanw@microsoft.com | Jianfeng Gao, Microsoft Corporation, USA, jfgao@microsoft.com

 
Suggested Citation
Chunyuan Li, Zhe Gan, Zhengyuan Yang, Jianwei Yang, Linjie Li, Lijuan Wang and Jianfeng Gao (2024), "Multimodal Foundation Models: From Specialists to General-Purpose Assistants", Foundations and Trends® in Computer Graphics and Vision: Vol. 16: No. 1-2, pp 1-214. http://dx.doi.org/10.1561/0600000110

Publication Date: 06 May 2024
© 2024 C. Li et al.
 
Subjects
Multimodal interaction, Perception and the user interface, Natural language processing for IR, Question answering, Text mining, Deep learning, Language paradigms, Languages on the web, Object and scene recognition, Image and video processing, Statistical/Machine learning
 


Abstract

This monograph presents a comprehensive survey of the taxonomy and evolution of multimodal foundation models that demonstrate vision and vision-language capabilities, focusing on the transition from specialist models to general-purpose assistants. The research landscape encompasses five core topics, categorized into two classes: (i) well-established research areas, namely multimodal foundation models pre-trained for specific purposes, covering methods of learning vision backbones for visual understanding and text-to-image generation; and (ii) exploratory, open research areas, namely multimodal foundation models that aim to play the role of general-purpose assistants, covering unified vision models inspired by large language models (LLMs), end-to-end training of multimodal LLMs, and chaining multimodal tools with LLMs. The target audience comprises researchers, graduate students, and professionals in the computer vision and vision-language multimodal communities who are eager to learn the basics and recent advances of multimodal foundation models.

DOI: 10.1561/0600000110
ISBN (paperback): 978-1-63828-336-2, 230 pp., $99.00
ISBN (e-book, PDF): 978-1-63828-337-9, 230 pp., $310.00
Table of contents:
1. Introduction
2. Visual Understanding
3. Visual Generation
4. Unified Vision Models
5. Large Multimodal Models: Training with LLMs
6. Multimodal Agents: Chaining Tools with LLM
7. Conclusion and Research Trends
Acknowledgments
References

Multimodal Foundation Models: From Specialists to General-Purpose Assistants

This monograph presents a comprehensive survey of the taxonomy and evolution of multimodal foundation models that demonstrate vision and vision-language capabilities, focusing on the transition from specialist models to general-purpose assistants.

The focus encompasses five core topics, categorized into two classes: (i) a survey of well-established research areas, namely multimodal foundation models pre-trained for specific purposes, including two topics: methods of learning vision backbones for visual understanding, and text-to-image generation; and (ii) recent advances in exploratory, open research areas, namely multimodal foundation models that aim to play the role of general-purpose assistants, including three topics: unified vision models inspired by large language models (LLMs), end-to-end training of multimodal LLMs, and chaining multimodal tools with LLMs.

The target audience of the monograph comprises researchers, graduate students, and professionals in the computer vision and vision-language multimodal communities who are eager to learn the basics and recent advances of multimodal foundation models.

 
CGV-110