Release of BAAI's Emu3: Validating a New Paradigm for Multimodal Models
Emu3 redefines multimodal AI by using next-token prediction to generate text, images, and video, offering a unified paradigm for Any-to-Any tasks.
Former OpenAI Chief Scientist and co-founder Ilya Sutskever has expressed his view on multiple occasions: if we can perfectly predict the next token, it will help humanity achieve Artificial General Intelligence (AGI).
While next-token prediction has led to breakthroughs such as ChatGPT in large language models, its applicability to multimodal models remains unclear. Multimodal tasks are still dominated by diffusion models (e.g., Stable Diffusion) and compositional approaches (e.g., combining a CLIP vision encoder with an LLM).
On October 21, 2024, the Beijing Academy of Artificial Intelligence (BAAI) officially released Emu3, a native multimodal world model.
The model relies solely on next-token prediction, with no diffusion models or compositional components, to understand and generate content across three modalities: text, images, and video.
Emu3 outperforms well-known open-source models such as SDXL, LLaVA, and OpenSora in tasks including image generation, video generation, and visual language understanding, yet it uses no diffusion models, CLIP vision encoders, or pre-trained LLMs, relying only on next-token prediction.
Emu3 introduces a powerful visual tokenizer that can convert video and images into discrete tokens. These visual tokens can be processed alongside the discrete tokens output by the text tokenizer.
At the same time, the model's output tokens can be decoded back into text, images, and video, providing the more unified research paradigm for Any-to-Any tasks that the community had been lacking.
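To make this concrete, the sketch below shows how discrete visual tokens from such a tokenizer might be interleaved with text tokens into a single sequence, so one autoregressive model can be trained on all modalities with plain next-token prediction. This is a minimal illustration under assumed conventions (special tokens, vocabulary sizes, function names), not Emu3's actual implementation.

```python
# Minimal sketch (not Emu3's actual API): interleave text tokens and discrete
# visual tokens into one sequence so a single autoregressive transformer can
# model all modalities with plain next-token prediction.
from typing import List

# Hypothetical special tokens marking sequence and modality boundaries.
BOS, BOI, EOI = 0, 1, 2  # begin-of-sequence, begin-of-image, end-of-image

def build_sequence(text_tokens: List[int], image_tokens: List[int]) -> List[int]:
    """Concatenate text and visual tokens into a single token stream.

    Visual tokens are assumed to come from a discrete visual tokenizer
    (e.g. a VQ-style codebook) and are offset so their IDs do not collide
    with the text vocabulary.
    """
    text_vocab_size = 32000  # assumed text vocabulary size, for illustration only
    shifted_image = [t + text_vocab_size for t in image_tokens]
    return [BOS] + text_tokens + [BOI] + shifted_image + [EOI]

# Example: a text prompt followed by the discrete codes of the image it describes.
# The model is then trained to predict every token from its prefix.
sequence = build_sequence(text_tokens=[17, 942, 88], image_tokens=[5, 11, 3, 7])
print(sequence)
```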
Moreover, thanks to Emu3’s flexible next-token prediction framework, Direct Preference Optimization (DPO) can be seamlessly applied to autoregressive visual generation, aligning the model with human preferences.
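As a rough illustration of why this transfer is straightforward, the sketch below applies the standard DPO objective to preferred and dispreferred token sequences: because generated images and videos are themselves discrete token sequences under next-token prediction, their sequence log-probabilities plug directly into the loss. The function and tensor names here are assumptions for illustration, not Emu3's training code.

```python
# Hedged sketch: the standard DPO loss applied to autoregressive sequences.
# "Chosen" and "rejected" are simply token sequences (e.g. discrete visual
# tokens), so the usual sequence log-probabilities are all that is needed.
import torch
import torch.nn.functional as F

def sequence_logprob(logits: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
    """Sum of per-token log-probabilities of `tokens` under `logits`.

    logits: (batch, seq_len, vocab); tokens: (batch, seq_len).
    Logits are assumed to already be aligned with the target tokens.
    """
    logp = F.log_softmax(logits, dim=-1)
    return logp.gather(-1, tokens.unsqueeze(-1)).squeeze(-1).sum(dim=-1)

def dpo_loss(policy_chosen_logits, policy_rejected_logits,
             ref_chosen_logits, ref_rejected_logits,
             chosen_tokens, rejected_tokens, beta: float = 0.1) -> torch.Tensor:
    """Direct Preference Optimization loss over preferred / dispreferred sequences."""
    pi_chosen = sequence_logprob(policy_chosen_logits, chosen_tokens)
    pi_rejected = sequence_logprob(policy_rejected_logits, rejected_tokens)
    ref_chosen = sequence_logprob(ref_chosen_logits, chosen_tokens)
    ref_rejected = sequence_logprob(ref_rejected_logits, rejected_tokens)
    # Reward margin implied by the policy relative to the reference model.
    margin = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    return -F.logsigmoid(margin).mean()
```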
Research on Emu3 demonstrates that next-token prediction is a powerful paradigm for multimodal models, enabling large-scale multimodal learning beyond just language and achieving advanced performance in multimodal tasks.
Converging complex multimodal designs into a token-based framework unlocks enormous potential for large-scale training and inference.
Next-token prediction offers a promising path toward building multimodal AGI.
Emu3’s key technologies and models have now been open-sourced. (Links to the open-source model and code are provided at the end.)
Upon release, Emu3 sparked lively discussions on social media and in the tech community: