Making LLaMA SEE and Draw with SEED Tokenizer

SEED-LLaMA: Comparison with SOTAs

We discuss two recent multimodal LLMs (i.e., Emu and Next-GPT) that unify visual comprehension and generation, and further compare SEED-LLaMA to the powerful GPT-4V. Since GPT-4V cannot generate images directly, we pair it with DALLE-3 as an additional tool, connecting the two through text prompts (see the sketch below).
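For reference, the following is a minimal sketch of how such a text-bridged GPT-4V + DALLE-3 pipeline can be wired, assuming the official OpenAI Python SDK; the model identifiers, prompts, and file names are illustrative and not necessarily the exact setup used in our evaluation. The key point is that everything DALLE-3 knows about the input image must pass through the intermediate text description.

# Sketch: bridge GPT-4V and DALLE-3 through a text prompt.
# Assumes the OpenAI Python SDK and an OPENAI_API_KEY in the environment.
import base64
from openai import OpenAI

client = OpenAI()

def describe_image(image_path: str, instruction: str) -> str:
    """Ask GPT-4V to turn the input image + user instruction into a text prompt."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")
    response = client.chat.completions.create(
        model="gpt-4-vision-preview",  # illustrative model name
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": instruction},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content

def generate_image(prompt: str) -> str:
    """Hand the textual description to DALLE-3 and return the generated image URL."""
    result = client.images.generate(model="dall-e-3", prompt=prompt, n=1)
    return result.data[0].url

# Example: "edit" an image by describing it, then regenerating from the description.
# Any visual detail GPT-4V omits from the description is lost at this step.
description = describe_image(
    "input.jpg",
    "Describe this scene in detail so it can be redrawn with a red umbrella added.",
)
print(generate_image(description))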

Our conclusions are as follows:

  • Emu retains context within multi-turn dialogue to some extent; however, it can barely follow human instructions.
  • Next-GPT can follow human instructions to generate images and text; however, it is unable to comprehend multi-turn context.
  • GPT-4V is an expert in multimodal comprehension and description. However, since it cannot generate images directly (we use DALLE-3 as a plugin instead), any semantics not captured in the textual prompt are lost when generating images.
  • SEED-LLaMA is an all-rounder that properly captures both multi-turn semantics and user instructions. Although its image quality is not as good as DALLE-3's (we directly use the open-source SD-UNet checkpoints without refinement), we believe this gap can be closed in the future.


Example #1

Citation

@article{ge2023making,
  title={Making LLaMA SEE and Draw with SEED Tokenizer},
  author={Ge, Yuying and Zhao, Sijie and Zeng, Ziyun and Ge, Yixiao and Li, Chen and Wang, Xintao and Shan, Ying},
  journal={arXiv preprint arXiv:2310.01218},
  year={2023}
}

Learn more about our Project SEED.

Acknowledgements

The website template was borrowed from Open X-Embodiment.