Making LLaMA SEE and Draw with SEED Tokenizer

SEED-LLaMA: Comparison with SOTAs

We discuss two recent multimodal LLMs (i.e., Emu and Next-GPT) that unify visual comprehension and generation, and further compare SEED-LLaMA to the powerful GPT-4V. Since GPT-4V cannot generate images directly, we pair it with DALLE-3 as an additional tool, connecting the two through text prompts (see the sketch below).
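For reference, the following is a minimal sketch of how such a text-bridged GPT-4V + DALLE-3 pipeline can be wired, assuming the official OpenAI Python SDK; the model identifiers, prompts, and file names are illustrative and not necessarily the exact setup used in our evaluation. The key point is that everything DALLE-3 knows about the input image must pass through the intermediate text description.

# Sketch: bridge GPT-4V and DALLE-3 through a text prompt.
# Assumes the OpenAI Python SDK and an OPENAI_API_KEY in the environment.
import base64
from openai import OpenAI

client = OpenAI()

def describe_image(image_path: str, instruction: str) -> str:
    """Ask GPT-4V to turn the input image + user instruction into a text prompt."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")
    response = client.chat.completions.create(
        model="gpt-4-vision-preview",  # illustrative model name
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": instruction},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content

def generate_image(prompt: str) -> str:
    """Hand the textual description to DALLE-3 and return the generated image URL."""
    result = client.images.generate(model="dall-e-3", prompt=prompt, n=1)
    return result.data[0].url

# Example: "edit" an image by describing it, then regenerating from the description.
# Any visual detail GPT-4V omits from the description is lost at this step.
description = describe_image(
    "input.jpg",
    "Describe this scene in detail so it can be redrawn with a red umbrella added.",
)
print(generate_image(description))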

Our conclusions are as follows:

  • Emu retains context within multi-turn dialogue to some extent; however, it can barely follow human instructions.
  • Next-GPT can follow human instructions to generate images and text; however, it is unable to comprehend multi-turn context.
  • GPT-4V is an expert in multimodal comprehension and description. However, since it cannot generate images directly (we use DALLE-3 as a plugin instead), any semantics not captured in the textual prompt are lost when generating images.
  • SEED-LLaMA is an all-rounder that properly captures both multi-turn semantics and user instructions. Although its image quality is not as good as DALLE-3's (we directly use the open-source SD-UNet checkpoints without refinement), we believe this gap can be closed in the future.


Example #1

Citation

@article{ge2023making,
  title={Making LLaMA SEE and Draw with SEED Tokenizer},
  author={Ge, Yuying and Zhao, Sijie and Zeng, Ziyun and Ge, Yixiao and Li, Chen and Wang, Xintao and Shan, Ying},
  journal={arXiv preprint arXiv:2310.01218},
  year={2023}
}

Learn more about our Project SEED.

Acknowledgements

The website template was borrowed from Open X-Embodiment.