🌰 SEED (SEE & Draw): Towards a Multimodal Generalist
The project began in May 2023 and continues to be updated.
For any inquiries, please email email@example.com
With the goal of Artificial General Intelligence (AGI), we aim to develop an AI agent that not only handles multimodal, multi-task workloads but also exhibits emergent and self-evolving abilities in an open-world context. This is a long-term endeavor. As a first step, we study a foundation model that supports flexible input/output formats, transitioning and reasoning seamlessly across multimodal signals while acquiring knowledge from an inherently multimodal world. Starting from the visual modality, the underlying premise is to unify visual comprehension and generation tasks within an end-to-end framework, that is, to enable the foundation model to SEE and Draw (SEED) at the same time.
We are actively looking for interns to collaborate on this project or related research topics. Please feel free to reach out if you are interested. 👐
[2 Oct 2023]
We are excited to unveil SEED-LLaMA, which offers unified multimodal comprehension and generation capabilities, featuring multi-turn in-context emergent abilities, akin to an AI assistant.
- Links: Project Page, Tech Report, Code, Models, Online Demo
- We upgrade the SEED tokenizer to v2, which generates more realistic images.
- We release the SEED-LLaMA-8/14B models, including both pre-trained and instruction-tuned versions. SEED-LLaMA is empowered by the improved SEED-2 tokenizer.
[30 July 2023]
SEED-Bench, the most comprehensive MLLM benchmark, has been released!
- Links: Tech Report, Code, Data, Leaderboard
- Consists of 19K multiple-choice questions with accurate human annotations.
- Spans 12 evaluation dimensions, covering both spatial and temporal comprehension.
[16 July 2023]
We are glad to release the initial version of SEED, a tailored image tokenizer that empowers Large Language Models with the emergent ability to see and draw.
- Links: Project Page, Tech Report, Code & Models
- SEED is a discrete image tokenizer. Its visual tokens capture high-level semantics while being generated with a 1D causal dependency.
- SEED enables LLMs to be trained with multimodal data following the original recipe of text (i.e., next-word prediction), which is mature and scalable.
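To make the second point concrete, here is a minimal sketch of how discrete visual tokens can share one flat vocabulary with text tokens, so that a single next-token objective covers both modalities. The vocabulary sizes, token IDs, and special `<boi>`/`<eoi>` markers below are illustrative assumptions, not the actual SEED implementation.

```python
# Sketch: unified next-token prediction over interleaved text and image tokens.
# Vocabulary layout and token IDs are illustrative, not the real SEED setup.

TEXT_VOCAB = 32000          # hypothetical text vocabulary size
IMG_VOCAB = 8192            # hypothetical visual codebook size
BOI = TEXT_VOCAB + IMG_VOCAB        # begin-of-image marker
EOI = TEXT_VOCAB + IMG_VOCAB + 1    # end-of-image marker

def img_token(code):
    """Map a discrete visual code (0..IMG_VOCAB-1) into the shared ID space."""
    return TEXT_VOCAB + code

def interleave(text_ids, image_codes):
    """Build one training sequence: text tokens, then the image's tokens
    wrapped in <boi> ... <eoi>, all in a single flat vocabulary."""
    return text_ids + [BOI] + [img_token(c) for c in image_codes] + [EOI]

def next_token_pairs(seq):
    """Standard causal LM supervision: predict token t+1 from the prefix."""
    return list(zip(seq[:-1], seq[1:]))

seq = interleave([5, 17, 9], [3, 1024, 42])
pairs = next_token_pairs(seq)
```

Because image tokens are just extra rows in the embedding table and output head, the LLM's original next-word-prediction recipe applies to the mixed sequence unchanged.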
The overall training paradigm of a multimodal LLM can be summarised into three stages: visual tokenizer training, multimodal pretraining, and multimodal instruction tuning.
(a) Stage I: A proper visual tokenizer can facilitate the follow-up multimodal training by (i) easing the semantic alignment between visual and word tokens, and (ii) enabling the LLM's original training recipe for multimodal data without specific adaptation for visual tokens. It is intuitive to represent visual signals as a sequence of discrete tokens, the same as words. However, existing visual tokenizers (e.g., VQ-VAE) fall far short of what effective visual understanding and generation require. SEED was born to meet this challenge.
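For readers unfamiliar with discrete tokenizers, the toy example below shows the core quantization step behind approaches such as VQ-VAE: each continuous feature is replaced by the index of its nearest codebook entry. The codebook and features are made up for illustration.

```python
# Toy vector quantization: the mechanism underlying discrete visual
# tokenizers such as VQ-VAE. Codebook and features are illustrative only.

def quantize(feature, codebook):
    """Return the index of the codebook entry nearest to `feature`
    (squared Euclidean distance)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda i: sq_dist(feature, codebook[i]))

# A hypothetical 4-entry codebook of 2-D embeddings.
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]

# Tokenizing per-patch features left-to-right yields a 1D token sequence.
features = [(0.1, 0.2), (0.9, 0.1), (0.8, 0.95)]
tokens = [quantize(f, codebook) for f in features]
```

Note that a plain VQ-VAE quantizes patches independently and targets pixel-level reconstruction; SEED differs by producing semantically high-level tokens with a 1D causal dependency between them, which is what makes them suitable for left-to-right language modeling.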
(b) Stages II & III: With SEED tokens, the LLM can perform scalable multimodal autoregression on large-scale textual and visual data, capturing rich multimodal knowledge from web pages and videos during the pre-training stage. The pre-trained SEED-LLM can then be tuned to follow human instructions and align with human preferences via supervised fine-tuning.
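For the supervised fine-tuning step, a common recipe (assumed here; the source does not detail SEED-LLaMA's exact setup) is to compute the loss only on response tokens, masking the instruction prompt with the standard `-100` ignore index:

```python
# Sketch of label masking for supervised fine-tuning: loss is computed only
# on the assistant's response, not on the instruction prompt.
# The -100 ignore index follows common practice (e.g., PyTorch cross-entropy);
# token IDs are illustrative.

IGNORE_INDEX = -100

def build_labels(prompt_ids, response_ids):
    """Inputs cover the full conversation; labels mask out the prompt so
    gradients flow only through the response tokens."""
    input_ids = prompt_ids + response_ids
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(response_ids)
    return input_ids, labels

inp, lab = build_labels([101, 7, 8], [42, 43, 2])
```

With SEED tokens in the shared vocabulary, the same masking applies whether the response is text, an image, or an interleaved mix of both.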
Quantitative evaluation [up-to-date leaderboard]
Our core contributors include Yuying Ge, Sijie Zhao, Yixiao Ge, Chen Li, Kun Yi, Xiaohan Ding, Xintao Wang, and Ying Shan, as well as the present or previous interns, Bohao Li, Guangzhi Wang, Jinguo Zhu, Rui Wang, Xiaohu Jiang, and Ziyun Zeng (names listed in alphabetical order). Sincere thanks for their efforts. We would also like to thank Lin Song, Yanpei Cao, Dong Yu, Zhengyou Zhang, and Yu Zeng for their great help, valuable suggestions and support.
The website template was borrowed from Open X-Embodiment.