Making LLaMA SEE and Draw with SEED Tokenizer
For any inquiries, please email seed-x@googlegroups.com
SEED-LLaMA: Comparison with SOTAs
We discuss two recent multimodal LLMs (i.e., Emu and Next-GPT) that unify visual comprehension and generation.
We further compare SEED-LLaMA to the powerful GPT-4V. Since GPT-4V cannot generate images directly, we use DALLE-3 as an additional tool, connecting GPT-4V and DALLE-3 through text prompts (a sketch of this pipeline follows the list below).
Our conclusions are as follows:
- Emu can retain context within a multi-turn dialogue to some extent; however, it can barely follow human instructions.
- Next-GPT can follow human instructions to generate images and text; however, it is unable to comprehend multi-turn context.
- GPT-4V is an expert in multimodal comprehension and description. However, since it inherently cannot generate images directly (it uses DALLE-3 as a plugin instead), any semantics not captured by the textual prompt are lost when generating images.
- SEED-LLaMA is an all-rounder that properly captures multi-turn semantics and user instructions. Although its image quality is not as good as DALLE-3's (we directly use the open-source SD-UNet checkpoints without refinement), we believe this gap will close in the future.
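For reference, here is a minimal sketch of how a GPT-4V + DALLE-3 pipeline can be wired through text prompts, which also illustrates why visual semantics that never make it into the intermediate prompt are lost. The model names, prompt wording, and helper functions are illustrative assumptions, not the exact harness used for this comparison.

```python
# Minimal sketch: chain GPT-4V and DALLE-3 through a text prompt.
# Model names and prompt wording are assumptions for illustration only.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def describe_image(image_url: str, instruction: str) -> str:
    """Ask GPT-4V to turn the input image + user instruction into a text prompt."""
    response = client.chat.completions.create(
        model="gpt-4-vision-preview",  # assumed GPT-4V endpoint name
        max_tokens=300,
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": f"{instruction}\nDescribe the desired output image "
                         "as a single prompt for an image generator."},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }],
    )
    return response.choices[0].message.content


def generate_image(prompt: str) -> str:
    """Pass the text prompt on to DALLE-3 and return the generated image URL."""
    result = client.images.generate(model="dall-e-3", prompt=prompt, n=1)
    return result.data[0].url


# DALLE-3 only ever sees the intermediate string, so any visual detail
# GPT-4V omits from that string cannot appear in the generated image.
prompt = describe_image("https://example.com/input.jpg", "Make it snowy.")
print(generate_image(prompt))
```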
Example #1
Citation
@article{ge2023making,
  title={Making LLaMA SEE and Draw with SEED Tokenizer},
  author={Ge, Yuying and Zhao, Sijie and Zeng, Ziyun and Ge, Yixiao and Li, Chen and Wang, Xintao and Shan, Ying},
  journal={arXiv preprint arXiv:2310.01218},
  year={2023}
}
Learn more about our Project SEED.
Acknowledgements
The website template was borrowed from Open X-Embodiment.