Towards Omni-modal Representations

ViT-Lens is a versatile framework for omni-modal representation learning.
With Artificial General Intelligence (AGI) as the goal, we aim to develop an AI agent with human-level multi-sensory capabilities that can tackle varied user-specified tasks. This is a long-term endeavor. In pursuit of omni-modal AI agents, the research community has used large-scale web data to make substantial strides in language and vision. However, extending this success to a broader array of modalities remains challenging, especially for less common ones. We introduce ViT-Lens, a straightforward yet effective method for advancing omni-modal representations. ViT-Lens uses a pre-trained ViT to encode features for diverse modalities. By drawing on the rich knowledge within the pre-trained ViT, our method reduces the burden of extensive data collection. We train ViT-Lens for various modalities, including 3D point cloud, depth, audio, tactile, and EEG.

Design Pattern and Results

(I) Understanding across modalities

(Left) Training Pipeline: ViT-Lens extends the capabilities of a pre-trained ViT to diverse modalities. For each new modality, it first uses a Modality Embedding (ModEmbed) and a Lens to learn to map modality-specific data into an intermediate embedding space. It then applies a set of pre-trained ViT layers to encode the feature. Finally, the output feature is aligned with the feature that an off-the-shelf foundation model extracts from the new modality's anchor data (image, text, etc.).
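The pipeline can be illustrated with a minimal, dependency-free sketch. Everything here is an assumption made for illustration: `ViTLensSketch`, its single-matrix "layers", and the contrastive (InfoNCE-style) `alignment_loss` are hypothetical stand-ins, since the text above does not specify the Lens architecture or the exact alignment objective.

```python
import math
import random

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def l2_normalize(x):
    """Scale x to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(v * v for v in x)) or 1.0
    return [v / norm for v in x]

def rand_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

class ViTLensSketch:
    """Hypothetical stand-in: trainable ModEmbed and Lens, frozen ViT layers."""

    def __init__(self, in_dim, dim, seed=0):
        rng = random.Random(seed)
        self.mod_embed = rand_matrix(dim, in_dim, rng)  # would be trained
        self.lens = rand_matrix(dim, dim, rng)          # would be trained
        self.frozen_vit = rand_matrix(dim, dim, rng)    # pre-trained, kept fixed

    def encode(self, x):
        h = matvec(self.mod_embed, x)   # modality-specific data -> embedding
        h = matvec(self.lens, h)        # map into the intermediate space
        h = [math.tanh(v) for v in matvec(self.frozen_vit, h)]  # ViT layers
        return l2_normalize(h)

def alignment_loss(modality_feats, anchor_feats, temperature=0.07):
    """One-directional InfoNCE (an assumed objective): feature i should be
    closer to anchor i than to any other anchor in the batch."""
    total = 0.0
    for i, f in enumerate(modality_feats):
        logits = [sum(a * b for a, b in zip(f, g)) / temperature
                  for g in anchor_feats]
        m = max(logits)  # subtract the max for numerical stability
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - logits[i]
    return total / len(modality_feats)

# Toy batch: three samples of a new modality and their anchor features,
# which in practice would come from the off-the-shelf foundation model.
rng = random.Random(1)
model = ViTLensSketch(in_dim=6, dim=4)
batch = [[rng.gauss(0.0, 1.0) for _ in range(6)] for _ in range(3)]
feats = [model.encode(x) for x in batch]
anchors = [l2_normalize([rng.gauss(0.0, 1.0) for _ in range(4)]) for _ in range(3)]
loss = alignment_loss(feats, anchors)
```

In a real training loop, only the ModEmbed and Lens parameters would receive gradient updates from this loss; the ViT layers and the foundation model would stay fixed.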

(Right) Performance on Understanding Tasks: ViT-Lens consistently improves performance and outperforms previous methods on understanding tasks, including classification, zero-shot classification (ZS), and linear probing (LP), across the 3D point cloud, depth, audio, tactile, and EEG modalities.
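For readers unfamiliar with zero-shot classification, the sketch below shows the usual recipe under one assumption: that class names are embedded by the text encoder of the same foundation model used for alignment. The function names and the toy embeddings are hypothetical and only illustrate the nearest-text-embedding rule.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def zero_shot_classify(feature, class_text_embeddings):
    """Return the class whose text embedding is most similar to the feature."""
    return max(class_text_embeddings,
               key=lambda name: cosine(feature, class_text_embeddings[name]))

# Toy values: a feature for, say, a point cloud, and two class-name embeddings.
class_text_embeddings = {"chair": [1.0, 0.0, 0.0], "table": [0.0, 1.0, 0.0]}
predicted = zero_shot_classify([0.9, 0.1, 0.0], class_text_embeddings)
```

No labeled examples of the new modality are needed at test time; the class set is defined entirely by the text embeddings passed in.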

(II) ViT-Lens integration into MLLMs

Illustration of training-free ViT-Lens integration: By using the MLLM's ViT both as part of the modality encoder and as the foundation model during ViT-Lens training, the resulting modality Lenses can be integrated seamlessly into MLLMs (e.g., InstructBLIP and SEED) for plug-and-play use.
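A rough way to picture the plug-and-play idea: because the Lens feeds the same ViT that the MLLM already uses, everything downstream of the ViT can stay untouched. The sketch below uses toy stand-in functions only; none of the names correspond to the actual InstructBLIP or SEED code.

```python
from typing import Callable, List

Tokens = List[List[float]]  # a sequence of token vectors

def make_encoder(front_end: Callable[[List[float]], Tokens],
                 shared_vit: Callable[[Tokens], Tokens]) -> Callable[[List[float]], Tokens]:
    """Chain a modality-specific front end with the MLLM's own ViT layers."""
    return lambda x: shared_vit(front_end(x))

# Hypothetical stand-ins; the real components are neural networks.
def shared_vit(tokens: Tokens) -> Tokens:
    return [[2.0 * v for v in tok] for tok in tokens]

def image_patch_embed(image: List[float]) -> Tokens:
    return [[p, p] for p in image]

def audio_lens(waveform: List[float]) -> Tokens:
    # Trained ModEmbed + Lens: emits tokens in the format the ViT expects.
    return [[s, -s] for s in waveform]

def mllm_downstream(tokens: Tokens, prompt: str) -> str:
    # Stands in for the rest of the MLLM, which is left unchanged.
    return f"{prompt} [{len(tokens)} tokens of width {len(tokens[0])}]"

image_encoder = make_encoder(image_patch_embed, shared_vit)
audio_encoder = make_encoder(audio_lens, shared_vit)

# Swapping the front end is the only change needed to accept a new modality.
reply = mllm_downstream(audio_encoder([0.1, 0.2, 0.3]), "Describe what you hear.")
```

The design point is that both encoders end in the same `shared_vit`, so their outputs live in the space the downstream model was trained on.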

Plug ViT-Lens into InstructBLIP to enable Any Instruction Following out of the box.
Plug ViT-Lens into SEED to enable compound Any-to-Image Generation out of the box.

(3D to Image, 4 examples) Generate an image based on what you see.

(Audio to Image, 6 examples) Generate an image based on what you perceive.

(EEG to Image, 3 examples) Generate an image based on what you see.

(Tactile to Image, 3 examples) Generate an image based on what you see.

(Compound 3D to Image) Add Christmas atmosphere. Generate an image.

(Compound 3D to Image) Add Halloween atmosphere. Generate an image.

(Compound 3D to Image) Add a cat. Generate an image.

(Compound 3D to Image) Add beach in the background. Generate an image.

(Mixed Modalities to Image, 2 examples) Combine these two visual concepts. Generate an image.


- Citation: If you are using the ViT-Lens code and models in your research or are inspired by our work, please cite.

- License: Our code, models, and data are distributed under the Apache 2.0 License.

- Others: Check out other projects of our team.


The website template was borrowed from Open X-Embodiment.