
Vision input for the Planner

· 3 min read
Liqun Li
Principal Researcher

Introduction

TaskWeaver now supports vision input for the Planner role. The Planner is responsible for generating the high-level plan for a task. Vision input is a new input type that contains images, which is useful when a task requires visual understanding.

How vision input is supported in TaskWeaver

In TaskWeaver, we added a new role called ImageReader that reads images and provides either the image URL (for remote images) or the image encoded in Base64 (for local images) to the Planner role. To enable this new role, include it in the project configuration file as follows:

{
  "session.roles": [
    "planner",
    "code_interpreter",
    "image_reader"
  ]
}

The ImageReader role takes the path or URL of an image as input and prepares a response Post for the Planner role. As described in the Azure OpenAI API documentation, if the image is local, ImageReader needs to encode it in Base64 before passing it to the API; if the image is remote, ImageReader only needs to provide its URL. The Planner role can then use the image information for various tasks.
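For reference, here is a minimal sketch of what a helper like local_image_to_data_url (used in the snippet later in this post) could look like; the actual TaskWeaver implementation may differ:

import base64
import mimetypes

def local_image_to_data_url(image_path: str) -> str:
    # Guess the MIME type from the file extension; fall back to a generic type.
    mime_type, _ = mimetypes.guess_type(image_path)
    if mime_type is None:
        mime_type = "application/octet-stream"
    # Read the image bytes and embed them, Base64-encoded, in a data URL,
    # the format the Azure OpenAI vision API accepts for local images.
    with open(image_path, "rb") as f:
        encoded = base64.b64encode(f.read()).decode("utf-8")
    return f"data:{mime_type};base64,{encoded}"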

An example

Let's ask the agent to describe an uploaded image.


In the example above, the User talks to the agent in the Web UI and uploads an image. TaskWeaver also supports providing the image path in console mode, either with the /load command or by including the image path directly in the input message.

Extension

If you look into the implementation of the ImageReader role, you will find that it is quite simple. The key logic is shown in the following code snippet:

# A remote image can be passed to the API by URL directly.
if image_url.startswith("http"):
    image_content = image_url
    attachment_message = f"Image from {image_url}."
else:
    # A local image must be encoded as a Base64 data URL; relative paths
    # are resolved against the session's execution working directory.
    if os.path.isabs(image_url):
        image_content = local_image_to_data_url(image_url)
    else:
        image_content = local_image_to_data_url(
            os.path.join(self.session_metadata.execution_cwd, image_url),
        )
    attachment_message = f"Image from {image_url} encoded as a Base64 data URL."

# Attach the image content to the response Post for the Planner.
post_proxy.update_attachment(
    message=attachment_message,
    type=AttachmentType.image_url,
    extra={"image_url": image_content},
    is_end=True,
)

After the image URL is obtained, the ImageReader role encodes the image in Base64 if it is local. It then creates an attachment in the response Post and passes the image content to the Planner role: the attachment is created with the type AttachmentType.image_url, and the image content is passed as extra data under the key image_url.
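To make the data flow concrete, here is a hedged sketch of how such an attachment could be turned into a multimodal chat message for an OpenAI-compatible API; the function name attachment_to_chat_message is illustrative, not part of TaskWeaver:

def attachment_to_chat_message(attachment_message: str, image_content: str) -> dict:
    # The content list mixes a text part and an image part; "url" accepts
    # either a remote image URL or a Base64 data URL.
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": attachment_message},
            {"type": "image_url", "image_url": {"url": image_content}},
        ],
    }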

Therefore, to support other scenarios with vision input, we can extend the ImageReader role with more logic to handle different types of content. One example is reading a document that contains both text and images: we can add an attachment for each image in the document and pass the list of attachments to the Planner role, as sketched below.
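As an illustration, a hypothetical extension could loop over the images extracted from a document and emit one attachment each; the loop below reuses the names from the snippet above, but it is a sketch rather than TaskWeaver's actual code:

def attach_document_images(post_proxy, image_paths: list) -> None:
    for i, path in enumerate(image_paths):
        post_proxy.update_attachment(
            message=f"Image from {path} encoded as a Base64 data URL.",
            type=AttachmentType.image_url,
            extra={"image_url": local_image_to_data_url(path)},
            # Assuming is_end marks the last attachment of the response Post.
            is_end=(i == len(image_paths) - 1),
        )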