🌰 SEED (SEE & Draw): Towards a Multimodal Generalist
The project began in May 2023 and continues to be updated.
For any inquiries, please email email@example.com
With the goal of Artificial General Intelligence (AGI), we aim to develop an AI agent that not only handles multimodal, multi-task workloads but also exhibits emergent and self-evolving abilities in an open-world context. This is a long-term endeavor. As a first step, we study a foundation model that supports flexible input/output formats, transitioning and reasoning seamlessly across multimodal signals while acquiring knowledge from an inherently multimodal world. Starting from the visual modality, the underlying premise is to unify visual comprehension and generation tasks within an end-to-end framework, that is, to enable the foundation model to SEE and Draw (SEED) at the same time.
We are actively looking for interns to collaborate on this project or related research topics. Please feel free to reach out if you are interested. 👐
[2 Oct 2023]
We are excited to unveil SEED-LLaMA, which offers unified multimodal comprehension and generation capabilities, featuring multi-turn in-context emergent abilities, akin to an AI assistant.
- Links: Project Page, Tech Report, Code, Models, Online Demo
- We upgrade the SEED tokenizer to v2, which generates more realistic images.
- We release the SEED-LLaMA-8/14B models, including both pre-trained and instruction-tuned versions. SEED-LLaMA is empowered by the improved SEED-2 tokenizer.
[30 July 2023]
SEED-Bench, the most comprehensive MLLM benchmark, has been released!
- Links: Tech Report, Code, Data, Leaderboard
- Consists of 19K multiple-choice questions with accurate human annotations.
- Spans 12 evaluation dimensions, covering both spatial and temporal comprehension.
[16 July 2023]
We are glad to release the initial version of SEED, a tailored image tokenizer that empowers Large Language Models with the emergent ability to see and draw.
- Links: Project Page, Tech Report, Code & Models
- SEED is a discrete image tokenizer. Its visual tokens capture high-level semantics while being generated with a 1D causal dependency.
- SEED enables LLMs to be trained with multimodal data following the original recipe of text (i.e., next-word prediction), which is mature and scalable.
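To make the second point concrete, here is a minimal sketch of how discrete visual tokens can share one flat vocabulary with text tokens, so that a single next-token objective covers both modalities. The vocabulary sizes, token IDs, and special `<boi>`/`<eoi>` markers below are illustrative assumptions, not the actual SEED implementation.

```python
# Sketch: unified next-token prediction over interleaved text and image tokens.
# Vocabulary layout and token IDs are illustrative, not the real SEED setup.

TEXT_VOCAB = 32000          # hypothetical text vocabulary size
IMG_VOCAB = 8192            # hypothetical visual codebook size
BOI = TEXT_VOCAB + IMG_VOCAB        # begin-of-image marker
EOI = TEXT_VOCAB + IMG_VOCAB + 1    # end-of-image marker

def img_token(code):
    """Map a discrete visual code (0..IMG_VOCAB-1) into the shared ID space."""
    return TEXT_VOCAB + code

def interleave(text_ids, image_codes):
    """Build one training sequence: text tokens, then the image's tokens
    wrapped in <boi> ... <eoi>, all in a single flat vocabulary."""
    return text_ids + [BOI] + [img_token(c) for c in image_codes] + [EOI]

def next_token_pairs(seq):
    """Standard causal LM supervision: predict token t+1 from the prefix."""
    return list(zip(seq[:-1], seq[1:]))

seq = interleave([5, 17, 9], [3, 1024, 42])
pairs = next_token_pairs(seq)
```

Because image tokens are just extra rows in the embedding table and output head, the LLM's original next-word-prediction recipe applies to the mixed sequence unchanged.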
The overall training paradigm of a multimodal LLM can be summarised into three stages: visual tokenizer training, multimodal pretraining, and multimodal instruction tuning.
(a) Stage I: A proper visual tokenizer can facilitate the follow-up multimodal training by (i) easing the semantic alignment between visual and word tokens, and (ii) enabling the LLM's original training recipe for multimodal data without specific adaptation for visual tokens. It is intuitive to represent visual signals as a sequence of discrete tokens, the same as words. However, existing visual tokenizers (e.g., VQ-VAE) fall far short of what effective visual understanding and generation require. SEED was born to meet this challenge.
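For readers unfamiliar with discrete tokenizers, the toy example below shows the core quantization step behind approaches such as VQ-VAE: each continuous feature is replaced by the index of its nearest codebook entry. The codebook and features are made up for illustration.

```python
# Toy vector quantization: the mechanism underlying discrete visual
# tokenizers such as VQ-VAE. Codebook and features are illustrative only.

def quantize(feature, codebook):
    """Return the index of the codebook entry nearest to `feature`
    (squared Euclidean distance)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda i: sq_dist(feature, codebook[i]))

# A hypothetical 4-entry codebook of 2-D embeddings.
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]

# Tokenizing per-patch features left-to-right yields a 1D token sequence.
features = [(0.1, 0.2), (0.9, 0.1), (0.8, 0.95)]
tokens = [quantize(f, codebook) for f in features]
```

Note that a plain VQ-VAE quantizes patches independently and targets pixel-level reconstruction; SEED differs by producing semantically high-level tokens with a 1D causal dependency between them, which is what makes them suitable for left-to-right language modeling.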
(b) Stages II & III: With SEED tokens, the LLM can perform scalable multimodal autoregression on large-scale textual and visual data, capturing rich multimodal knowledge from web pages and videos during the pre-training stage. The pre-trained SEED-LLM can then be tuned to follow human instructions and align with human preferences via supervised fine-tuning.
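For the supervised fine-tuning step, a common recipe (assumed here; the source does not detail SEED-LLaMA's exact setup) is to compute the loss only on response tokens, masking the instruction prompt with the standard `-100` ignore index:

```python
# Sketch of label masking for supervised fine-tuning: loss is computed only
# on the assistant's response, not on the instruction prompt.
# The -100 ignore index follows common practice (e.g., PyTorch cross-entropy);
# token IDs are illustrative.

IGNORE_INDEX = -100

def build_labels(prompt_ids, response_ids):
    """Inputs cover the full conversation; labels mask out the prompt so
    gradients flow only through the response tokens."""
    input_ids = prompt_ids + response_ids
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(response_ids)
    return input_ids, labels

inp, lab = build_labels([101, 7, 8], [42, 43, 2])
```

With SEED tokens in the shared vocabulary, the same masking applies whether the response is text, an image, or an interleaved mix of both.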
Quantitative evaluation [up-to-date leaderboard]
Our core contributors include Yuying Ge, Sijie Zhao, Yixiao Ge, Chen Li, Kun Yi, Xiaohan Ding, Xintao Wang, and Ying Shan, as well as the present or previous interns, Bohao Li, Guangzhi Wang, Jinguo Zhu, Rui Wang, Xiaohu Jiang, and Ziyun Zeng (names listed in alphabetical order). Sincere thanks for their efforts. We would also like to thank Lin Song, Yanpei Cao, Dong Yu, Zhengyou Zhang, and Yu Zeng for their great help, valuable suggestions and support.
The website template was borrowed from Open X-Embodiment.