Multimodal with GPT-4V and LLaVA

November 6, 2023 · 3 min read

Beibin Li

Senior Research Engineer at Microsoft

LMM Teaser

In Brief:

Introducing the Multimodal Conversable Agent and the LLaVA Agent to enhance LMM functionalities.
Users can input text and images simultaneously using the <img img_path> tag to specify image loading.
Demonstrated through the GPT-4V notebook.
Demonstrated through the LLaVA notebook.

Introduction

Large multimodal models (LMMs) augment large language models (LLMs) with the ability to process multi-sensory data.

This blog post and the latest AutoGen update concentrate on visual comprehension. Users can input images, pose questions about them, and receive text-based responses from these LMMs. We support the gpt-4-vision-preview model from OpenAI and LLaVA model from Microsoft now.

Here, we emphasize the Multimodal Conversable Agent and the LLaVA Agent due to their growing popularity. GPT-4V represents the forefront in image comprehension, while LLaVA is an efficient model, fine-tuned from LLama-2.

Installation

Incorporate the lmm feature during AutoGen installation:

pip install "autogen-agentchat[lmm]~=0.2"

Subsequently, import the Multimodal Conversable Agent or LLaVA Agent from AutoGen:

from autogen.agentchat.contrib.multimodal_conversable_agent import MultimodalConversableAgent  # for GPT-4V
from autogen.agentchat.contrib.llava_agent import LLaVAAgent  # for LLaVA

Usage

A simple syntax has been defined to incorporate both messages and images within a single string.

Example of an in-context learning prompt:

prompt = """You are now an image classifier for facial expressions. Here are
some examples.

<img happy.jpg> depicts a happy expression.
<img http://some_location.com/sad.jpg> represents a sad expression.
<img obama.jpg> portrays a neutral expression.

Now, identify the facial expression of this individual: <img unknown.png>
"""

agent = MultimodalConversableAgent()
user = UserProxyAgent()
user.initiate_chat(agent, message=prompt)

The MultimodalConversableAgent interprets the input prompt, extracting images from local or internet sources.

Advanced Usage

Similar to other AutoGen agents, multimodal agents support multi-round dialogues with other agents, code generation, factual queries, and management via a GroupChat interface.

For example, the FigureCreator in our GPT-4V notebook and LLaVA notebook integrates two agents: a coder (an AssistantAgent) and critics (a multimodal agent). The coder drafts Python code for visualizations, while the critics provide insights for enhancement. Collaboratively, these agents aim to refine visual outputs. With human_input_mode=ALWAYS, you can also contribute suggestions for better visualizations.

Reference

Future Enhancements

For further inquiries or suggestions, please open an issue in the AutoGen repository or contact me directly at beibin.li@microsoft.com.

AutoGen will continue to evolve, incorporating more multimodal functionalities such as DALLE model integration, audio interaction, and video comprehension. Stay tuned for these exciting developments.

Introduction​

Installation​

Usage​

Advanced Usage​

Reference​

Future Enhancements​