Skip to main content

· 10 min read
Victor Dibia
Gagan Bansal
Saleema Amershi

AutoGen Studio Playground View: Solving a task with multiple agents that generate a pdf document with images.

AutoGen Studio: Solving a task with multiple agents that generate a pdf document with images.

TL;DR

To help you rapidly prototype multi-agent solutions for your tasks, we are introducing AutoGen Studio, an interface powered by AutoGen. It allows you to:

  • Declaratively define and modify agents and multi-agent workflows through a point and click, drag and drop interface (e.g., you can select the parameters of two agents that will communicate to solve your task).
  • Use our UI to create chat sessions with the specified agents and view results (e.g., view chat history, generated files, and time taken).
  • Explicitly add skills to your agents and accomplish more tasks.
  • Publish your sessions to a local gallery.

AutoGen Studio is open source code here, and can be installed via pip. Give it a try!

pip install autogenstudio

Introduction

The accelerating pace of technology has ushered us into an era where digital assistants (or agents) are becoming integral to our lives. AutoGen has emerged as a leading framework for orchestrating the power of agents. In the spirit of expanding this frontier and democratizing this capability, we are thrilled to introduce a new user-friendly interface: AutoGen Studio.

With AutoGen Studio, users can rapidly create, manage, and interact with agents that can learn, adapt, and collaborate. As we release this interface into the open-source community, our ambition is not only to enhance productivity but to inspire a level of personalized interaction between humans and agents.

Note: AutoGen Studio is meant to help you rapidly prototype multi-agent workflows and demonstrate an example of end user interfaces built with AutoGen. It is not meant to be a production-ready app.

Getting Started with AutoGen Studio

The following guide will help you get AutoGen Studio up and running on your system.

Configuring an LLM Provider

To get started, you need access to a language model. You can get this set up by following the steps in the AutoGen documentation here. Configure your environment with either OPENAI_API_KEY or AZURE_OPENAI_API_KEY.

For example, in your terminal, you would set the API key like this:

export OPENAI_API_KEY=<your_api_key>

You can also specify the model directly in the agent's configuration as shown below.

llm_config = LLMConfig(
config_list=[{
"model": "gpt-4",
"api_key": "<azure_api_key>",
"base_url": "<azure api base url>",
"api_type": "azure",
"api_version": "2024-02-15-preview"
}],
temperature=0,
)

Installation

There are two ways to install AutoGen Studio - from PyPi or from source. We recommend installing from PyPi unless you plan to modify the source code.

  1. Install from PyPi

    We recommend using a virtual environment (e.g., conda) to avoid conflicts with existing Python packages. With Python 3.10 or newer active in your virtual environment, use pip to install AutoGen Studio:

    pip install autogenstudio
  2. Install from Source

    Note: This approach requires some familiarity with building interfaces in React.

    If you prefer to install from source, ensure you have Python 3.10+ and Node.js (version above 14.15.0) installed. Here's how you get started:

    • Clone the AutoGen Studio repository and install its Python dependencies:

      pip install -e .
    • Navigate to the samples/apps/autogen-studio/frontend directory, install dependencies, and build the UI:

      npm install -g gatsby-cli
      npm install --global yarn
      yarn install
      yarn build

    For Windows users, to build the frontend, you may need alternative commands provided in the autogen studio readme.

Running the Application

Once installed, run the web UI by entering the following in your terminal:

autogenstudio ui --port 8081

This will start the application on the specified port. Open your web browser and go to http://localhost:8081/ to begin using AutoGen Studio.

Now that you have AutoGen Studio installed and running, you are ready to explore its capabilities, including defining and modifying agent workflows, interacting with agents and sessions, and expanding agent skills.

What Can You Do with AutoGen Studio?

The AutoGen Studio UI is organized into 3 high level sections - Build, Playground, and Gallery.

Build

Specify Agents.

This section focuses on defining the properties of agents and agent workflows. It includes the following concepts:

Skills: Skills are functions (e.g., Python functions) that describe how to solve a task. In general, a good skill has a descriptive name (e.g. generate_images), extensive docstrings and good defaults (e.g., writing out files to disk for persistence and reuse). You can add new skills to AutoGen Studio via the provided UI. At inference time, these skills are made available to the assistant agent as they address your tasks.

View and add skills.

AutoGen Studio Build View: View, add or edit skills that an agent can leverage in addressing tasks.

Agents: This provides an interface to declaratively specify properties for an AutoGen agent (mirrors most of the members of a base AutoGen conversable agent class).

Agent Workflows: An agent workflow is a specification of a set of agents that can work together to accomplish a task. The simplest version of this is a setup with two agents – a user proxy agent (that represents a user i.e. it compiles code and prints result) and an assistant that can address task requests (e.g., generating plans, writing code, evaluating responses, proposing error recovery steps, etc.). A more complex flow could be a group chat where even more agents work towards a solution.

Playground

AutoGen Studio Playground View: Solving a task with multiple agents that generate a pdf document with images.

AutoGen Studio Playground View: Agents collaborate, use available skills (ability to generate images) to address a user task (generate pdf's).

The playground section is focused on interacting with agent workflows defined in the previous build section. It includes the following concepts:

Session: A session refers to a period of continuous interaction or engagement with an agent workflow, typically characterized by a sequence of activities or operations aimed at achieving specific objectives. It includes the agent workflow configuration, the interactions between the user and the agents. A session can be “published” to a “gallery”.

Chat View: A chat is a sequence of interactions between a user and an agent. It is a part of a session.

This section is focused on sharing and reusing artifacts (e.g., workflow configurations, sessions, etc.).

AutoGen Studio comes with 3 example skills: fetch_profile, find_papers, generate_images. Please feel free to review the repo to learn more about how they work.

The AutoGen Studio API

While AutoGen Studio is a web interface, it is powered by an underlying python API that is reusable and modular. Importantly, we have implemented an API where agent workflows can be declaratively specified (in JSON), loaded and run. An example of the current API is shown below. Please consult the AutoGen Studio repo for more details.

import json
from autogenstudio import AutoGenWorkFlowManager, AgentWorkFlowConfig

# load an agent specification in JSON
agent_spec = json.load(open('agent_spec.json'))

# Create an AutoGen Workflow Configuration from the agent specification
agent_work_flow_config = FlowConfig(**agent_spec)

# Create a Workflow from the configuration
agent_work_flow = AutoGenWorkFlowManager(agent_work_flow_config)

# Run the workflow on a task
task_query = "What is the height of the Eiffel Tower?"
agent_work_flow.run(message=task_query)

Road Map and Next Steps

As we continue to develop and refine AutoGen Studio, the road map below outlines an array of enhancements and new features planned for future releases. Here's what users can look forward to:

  • Complex Agent Workflows: We're working on integrating support for more sophisticated agent workflows, such as GroupChat, allowing for richer interaction between multiple agents or dynamic topologies.
  • Improved User Experience: This includes features like streaming intermediate model output for real-time feedback, better summarization of agent responses, information on costs of each interaction. We will also invest in improving the workflow for composing and reusing agents. We will also explore support for more interactive human in the loop feedback to agents.
  • Expansion of Agent Skills: We will work towards improving the workflow for authoring, composing and reusing agent skills.
  • Community Features: Facilitation of sharing and collaboration within AutoGen Studio user community is a key goal. We're exploring options for sharing sessions and results more easily among users and contributing to a shared repository of skills, agents, and agent workflows.

Contribution Guide

We welcome contributions to AutoGen Studio. We recommend the following general steps to contribute to the project:

  • Review the overall AutoGen project contribution guide.
  • Please review the AutoGen Studio roadmap to get a sense of the current priorities for the project. Help is appreciated especially with Studio issues tagged with help-wanted.
  • Please initiate a discussion on the roadmap issue or a new issue to discuss your proposed contribution.
  • Please review the autogenstudio dev branch here [dev branch].(https://github.com/microsoft/autogen/tree/autogenstudio) and use as a base for your contribution. This way, your contribution will be aligned with the latest changes in the AutoGen Studio project.
  • Submit a pull request with your contribution!
  • If you are modifying AutoGen Studio in vscode, it has its own devcontainer to simplify dev work. See instructions in .devcontainer/README.md on how to use it.
  • Please use the tag studio for any issues, questions, and PRs related to Studio.

FAQ

Q: Where can I adjust the default skills, agent and workflow configurations? A: You can modify agent configurations directly from the UI or by editing the autogentstudio/utils/dbdefaults.json file which is used to initialize the database.

Q: If I want to reset the entire conversation with an agent, how do I go about it? A: To reset your conversation history, you can delete the database.sqlite file. If you need to clear user-specific data, remove the relevant autogenstudio/web/files/user/<user_id_md5hash> folder.

Q: Is it possible to view the output and messages generated by the agents during interactions? A: Yes, you can view the generated messages in the debug console of the web UI, providing insights into the agent interactions. Alternatively, you can inspect the database.sqlite file for a comprehensive record of messages.

Q: Where can I find documentation and support for AutoGen Studio? A: We are constantly working to improve AutoGen Studio. For the latest updates, please refer to the AutoGen Studio Readme. For additional support, please open an issue on GitHub or ask questions on Discord.

Q: Can I use Other Models with AutoGen Studio? Yes. AutoGen standardizes on the openai model api format, and you can use any api server that offers an openai compliant endpoint. In the AutoGen Studio UI, each agent has an llm_config field where you can input your model endpoint details including model name, api key, base url, model type and api version. For Azure OpenAI models, you can find these details in the Azure portal. Note that for Azure OpenAI, the model name is the deployment id or engine, and the model type is "azure". For other OSS models, we recommend using a server such as vllm to instantiate an openai compliant endpoint.

Q: The Server Starts But I Can't Access the UI A: If you are running the server on a remote machine (or a local machine that fails to resolve localhost correstly), you may need to specify the host address. By default, the host address is set to localhost. You can specify the host address using the --host <host> argument. For example, to start the server on port 8081 and local address such that it is accessible from other machines on the network, you can run the following command:

autogenstudio ui --port 8081 --host 0.0.0.0

· 7 min read
Linxin Song
Jieyu Zhang

Overall structure of AutoBuild

TL;DR: Introducing AutoBuild, building multi-agent system automatically, fast, and easily for complex tasks with minimal user prompt required, powered by a new designed class AgentBuilder. AgentBuilder also supports open-source LLMs by leveraging vLLM and FastChat. Checkout example notebooks and source code for reference:

Introduction

In this blog, we introduce AutoBuild, a pipeline that can automatically build multi-agent systems for complex tasks. Specifically, we design a new class called AgentBuilder, which will complete the generation of participant expert agents and the construction of group chat automatically after the user provides descriptions of a building task and an execution task.

AgentBuilder supports open-source models on Hugging Face powered by vLLM and FastChat. Once the user chooses to use open-source LLM, AgentBuilder will set up an endpoint server automatically without any user participation.

Installation

  • AutoGen:
pip install pyautogen[autobuild]
  • (Optional: if you want to use open-source LLMs) vLLM and FastChat
pip install vllm fastchat

Basic Example

In this section, we provide a step-by-step example of how to use AgentBuilder to build a multi-agent system for a specific task.

Step 1: prepare configurations

First, we need to prepare the Agent configurations. Specifically, a config path containing the model name and API key, and a default config for each agent, are required.

config_file_or_env = '/home/elpis_ubuntu/LLM/autogen/OAI_CONFIG_LIST'  # modify path
default_llm_config = {
'temperature': 0
}

Step 2: create an AgentBuilder instance

Then, we create an AgentBuilder instance with the config path and default config. You can also specific the builder model and agent model, which are the LLMs used for building and agent respectively.

from autogen.agentchat.contrib.agent_builder import AgentBuilder

builder = AgentBuilder(config_file_or_env=config_file_or_env, builder_model='gpt-4-1106-preview', agent_model='gpt-4-1106-preview')

Step 3: specify the building task

Specify a building task with a general description. Building task will help the build manager (a LLM) decide what agents should be built. Note that your building task should have a general description of the task. Adding some specific examples is better.

building_task = "Find a paper on arxiv by programming, and analyze its application in some domain. For example, find a latest paper about gpt-4 on arxiv and find its potential applications in software."

Step 4: build group chat agents

Use build() to let the build manager (with a builder_model as backbone) complete the group chat agents generation. If you think coding is necessary for your task, you can use coding=True to add a user proxy (a local code interpreter) into the agent list as:

agent_list, agent_configs = builder.build(building_task, default_llm_config, coding=True)

If coding is not specified, AgentBuilder will determine on its own whether the user proxy should be added or not according to the task. The generated agent_list is a list of AssistantAgent instances. If coding is true, a user proxy (a UserProxyAssistant instance) will be added as the first element to the agent_list. agent_configs is a list of agent configurations including agent name, backbone LLM model, and system message. For example

// an example of agent_configs. AgentBuilder will generate agents with the following configurations.
[
{
"name": "ArXiv_Data_Scraper_Developer",
"model": "gpt-4-1106-preview",
"system_message": "You are now in a group chat. You need to complete a task with other participants. As an ArXiv_Data_Scraper_Developer, your focus is to create and refine tools capable of intelligent search and data extraction from arXiv, honing in on topics within the realms of computer science and medical science. Utilize your proficiency in Python programming to design scripts that navigate, query, and parse information from the platform, generating valuable insights and datasets for analysis. \n\nDuring your mission, it\u2019s not just about formulating queries; your role encompasses the optimization and precision of the data retrieval process, ensuring relevance and accuracy of the information extracted. If you encounter an issue with a script or a discrepancy in the expected output, you are encouraged to troubleshoot and offer revisions to the code you find in the group chat.\n\nWhen you reach a point where the existing codebase does not fulfill task requirements or if the operation of provided code is unclear, you should ask for help from the group chat manager. They will facilitate your advancement by providing guidance or appointing another participant to assist you. Your ability to adapt and enhance scripts based on peer feedback is critical, as the dynamic nature of data scraping demands ongoing refinement of techniques and approaches.\n\nWrap up your participation by confirming the user's need has been satisfied with the data scraping solutions you've provided. Indicate the completion of your task by replying \"TERMINATE\" in the group chat.",
"description": "ArXiv_Data_Scraper_Developer is a specialized software development role requiring proficiency in Python, including familiarity with web scraping libraries such as BeautifulSoup or Scrapy, and a solid understanding of APIs and data parsing. They must possess the ability to identify and correct errors in existing scripts and confidently engage in technical discussions to improve data retrieval processes. The role also involves a critical eye for troubleshooting and optimizing code to ensure efficient data extraction from the ArXiv platform for research and analysis purposes."
},
...
]

Step 5: execute the task

Let agents generated in build() complete the task collaboratively in a group chat.

import autogen

def start_task(execution_task: str, agent_list: list, llm_config: dict):
config_list = autogen.config_list_from_json(config_file_or_env, filter_dict={"model": ["gpt-4-1106-preview"]})

group_chat = autogen.GroupChat(agents=agent_list, messages=[], max_round=12)
manager = autogen.GroupChatManager(
groupchat=group_chat, llm_config={"config_list": config_list, **llm_config}
)
agent_list[0].initiate_chat(manager, message=execution_task)

start_task(
execution_task="Find a recent paper about gpt-4 on arxiv and find its potential applications in software.",
agent_list=agent_list,
llm_config=default_llm_config
)

Step 6 (Optional): clear all agents and prepare for the next task

You can clear all agents generated in this task by the following code if your task is completed or if the next task is largely different from the current task.

builder.clear_all_agents(recycle_endpoint=True)

If the agent's backbone is an open-source LLM, this process will also shut down the endpoint server. More details are in the next section. If necessary, you can use recycle_endpoint=False to retain the previous open-source LLM's endpoint server.

Save and Load

You can save all necessary information of the built group chat agents by

saved_path = builder.save()

Configurations will be saved in JSON format with the following content:

// FILENAME: save_config_TASK_MD5.json
{
"building_task": "Find a paper on arxiv by programming, and analysis its application in some domain. For example, find a latest paper about gpt-4 on arxiv and find its potential applications in software.",
"agent_configs": [
{
"name": "...",
"model": "...",
"system_message": "...",
"description": "..."
},
...
],
"manager_system_message": "...",
"code_execution_config": {...},
"default_llm_config": {...}
}

You can provide a specific filename, otherwise, AgentBuilder will save config to the current path with the generated filename save_config_TASK_MD5.json.

You can load the saved config and skip the building process. AgentBuilder will create agents with those information without prompting the build manager.

new_builder = AgentBuilder(config_file_or_env=config_file_or_env)
agent_list, agent_config = new_builder.load(saved_path)
start_task(...) # skip build()

Use OpenAI Assistant

Assistants API allows you to build AI assistants within your own applications. An Assistant has instructions and can leverage models, tools, and knowledge to respond to user queries. AutoBuild also supports the assistant API by adding use_oai_assistant=True to build().

# Transfer to the OpenAI Assistant API.
agent_list, agent_config = new_builder.build(building_task, default_llm_config, use_oai_assistant=True)
...

(Experimental) Use Open-source LLM

AutoBuild supports open-source LLM by vLLM and FastChat. Check the supported model list here. After satisfying the requirements, you can add an open-source LLM's huggingface repository to the config file,

// Add the LLM's huggingface repo to your config file and use EMPTY as the api_key.
[
...
{
"model": "meta-llama/Llama-2-13b-chat-hf",
"api_key": "EMPTY"
}
]

and specify it when initializing AgentBuilder. AgentBuilder will automatically set up an endpoint server for open-source LLM. Make sure you have sufficient GPUs resources.

Future work/Roadmap

  • Let the builder select the best agents from a given library/database to solve the task.

Summary

We propose AutoBuild with a new class AgentBuilder. AutoBuild can help user solve their complex task with an automatically built multi-agent system. AutoBuild supports open-source LLMs and GPTs API, giving users more flexibility to choose their favorite models. More advanced features are coming soon.

· 10 min read
Julia Kiseleva
Negar Arabzadeh

Fig.1: A verification framework

Fig.1 illustrates the general flow of AgentEval

TL;DR:

  • As a developer of an LLM-powered application, how can you assess the utility it brings to end users while helping them with their tasks?
  • To shed light on the question above, we introduce AgentEval — the first version of the framework to assess the utility of any LLM-powered application crafted to assist users in specific tasks. AgentEval aims to simplify the evaluation process by automatically proposing a set of criteria tailored to the unique purpose of your application. This allows for a comprehensive assessment, quantifying the utility of your application against the suggested criteria.
  • We demonstrate how AgentEval work using math problems dataset as an example in the following notebook. Any feedback would be useful for future development. Please contact us on our Discord.

Introduction

AutoGen aims to simplify the development of LLM-powered multi-agent systems for various applications, ultimately making end users' lives easier by assisting with their tasks. Next, we all yearn to understand how our developed systems perform, their utility for users, and, perhaps most crucially, how we can enhance them. Directly evaluating multi-agent systems poses challenges as current approaches predominantly rely on success metrics – essentially, whether the agent accomplishes tasks. However, comprehending user interaction with a system involves far more than success alone. Take math problems, for instance; it's not merely about the agent solving the problem. Equally significant is its ability to convey solutions based on various criteria, including completeness, conciseness, and the clarity of the provided explanation. Furthermore, success isn't always clearly defined for every task.

Rapid advances in LLMs and multi-agent systems have brought forth many emerging capabilities that we're keen on translating into tangible utilities for end users. We introduce the first version of AgentEval framework - a tool crafted to empower developers in swiftly gauging the utility of LLM-powered applications designed to help end users accomplish the desired task.

Fig.2: An overview of the tasks taxonomy

Fig. 2 provides an overview of the tasks taxonomy

Let's first look into an overview of the suggested task taxonomy that a multi-agent system can be designed for. In general, the tasks can be split into two types, where:

  • Success is not clearly defined - refer to instances when users utilize a system in an assistive manner, seeking suggestions rather than expecting the system to solve the task. For example, a user might request the system to generate an email. In many cases, this generated content serves as a template that the user will later edit. However, defining success precisely for such tasks is relatively complex.
  • Success is clearly defined - refer to instances where we can clearly define whether a system solved the task or not. Consider agents that assist in accomplishing household tasks, where the definition of success is clear and measurable. This category can be further divided into two separate subcategories:
    • The optimal solution exits - these are tasks where only one solution is possible. For example, if you ask your assistant to turn on the light, the success of this task is clearly defined, and there is only one way to accomplish it.
    • Multiple solutions exist - increasingly, we observe situations where multiple trajectories of agent behavior can lead to either success or failure. In such cases, it is crucial to differentiate between the various successful and unsuccessful trajectories. For example, when you ask the agent to suggest you a food recipe or tell you a joke.

In our AgentEval framework, we are currently focusing on tasks where Success is clearly defined. Next, we will introduce the suggested framework.

AgentEval Framework

Our previous research on assistive agents in Minecraft suggested that the most optimal way to obtain human judgments is to present humans with two agents side by side and ask for preferences. In this setup of pairwise comparison, humans can develop criteria to explain why they prefer the behavior of one agent over another. For instance, 'the first agent was faster in execution,' or 'the second agent moves more naturally.' So, the comparative nature led humans to come up with a list of criteria that helps to infer the utility of the task. With this idea in mind, we designed AgentEval (shown in Fig. 1), where we employ LLMs to help us understand, verify, and assess task utility for the multi-agent system. Namely:

  • The goal of CriticAgent is to suggest the list of criteria (Fig. 1), that can be used to assess task utility. This is an example of how CriticAgent is defined using Autogen:
critic = autogen.AssistantAgent(
name="critic",
llm_config={"config_list": config_list},
system_message="""You are a helpful assistant. You suggest criteria for evaluating different tasks. They should be distinguishable, quantifiable, and not redundant.
Convert the evaluation criteria into a dictionary where the keys are the criteria.
The value of each key is a dictionary as follows {"description": criteria description, "accepted_values": possible accepted inputs for this key}
Make sure the keys are criteria for assessing the given task. "accepted_values" include the acceptable inputs for each key that are fine-grained and preferably multi-graded levels. "description" includes the criterion description.
Return only the dictionary."""
)

Next, the critic is given successful and failed examples of the task execution; then, it is able to return a list of criteria (Fig. 1). For reference, use the following notebook.

  • The goal of QuantifierAgent is to quantify each of the suggested criteria (Fig. 1), providing us with an idea of the utility of this system for the given task. Here is an example of how it can be defined:
quantifier = autogen.AssistantAgent(
name="quantifier",
llm_config={"config_list": config_list},
system_message = """You are a helpful assistant. You quantify the output of different tasks based on the given criteria.
The criterion is given in a dictionary format where each key is a distinct criteria.
The value of each key is a dictionary as follows {"description": criteria description , "accepted_values": possible accepted inputs for this key}
You are going to quantify each of the criteria for a given task based on the task description.
Return a dictionary where the keys are the criteria and the values are the assessed performance based on accepted values for each criteria.
Return only the dictionary."""

)

AgentEval Results based on Math Problems Dataset

As an example, after running CriticAgent, we obtained the following criteria to verify the results for math problem dataset:

CriteriaDescriptionAccepted Values
Problem InterpretationAbility to correctly interpret the problem["completely off", "slightly relevant", "relevant", "mostly accurate", "completely accurate"]
Mathematical MethodologyAdequacy of the chosen mathematical or algorithmic methodology for the question["inappropriate", "barely adequate", "adequate", "mostly effective", "completely effective"]
Calculation CorrectnessAccuracy of calculations made and solutions given["completely incorrect", "mostly incorrect", "neither", "mostly correct", "completely correct"]
Explanation ClarityClarity and comprehensibility of explanations, including language use and structure["not at all clear", "slightly clear", "moderately clear", "very clear", "completely clear"]
Code EfficiencyQuality of code in terms of efficiency and elegance["not at all efficient", "slightly efficient", "moderately efficient", "very efficient", "extremely efficient"]
Code CorrectnessCorrectness of the provided code["completely incorrect", "mostly incorrect", "partly correct", "mostly correct", "completely correct"]

Then, after running QuantifierAgent, we obtained the results presented in Fig. 3, where you can see three models:

  • AgentChat
  • ReAct
  • GPT-4 Vanilla Solver

Lighter colors represent estimates for failed cases, and brighter colors show how discovered criteria were quantified.

Fig.3: Results based on overall math problems dataset _s stands for successful cases, _f - stands for failed cases

Fig.3 presents results based on overall math problems dataset _s stands for successful cases, _f - stands for failed cases

We note that while applying agentEval to math problems, the agent was not exposed to any ground truth information about the problem. As such, this figure illustrates an estimated performance of the three different agents, namely, Autogen (blue), Gpt-4 (red), and ReAct (green). We observe that by comparing the performance of any of the three agents in successful cases (dark bars of any color) versus unsuccessful cases (lighter version of the same bar), we note that AgentEval was able to assign higher quantification to successful cases than that of failed ones. This observation verifies AgentEval's ability for task utility prediction. Additionally, AgentEval allows us to go beyond just a binary definition of success, enabling a more in-depth comparison between successful and failed cases.

It's important not only to identify what is not working but also to recognize what and why actually went well.

Limitations and Future Work

The current implementation of AgentEval has a number of limitations which are planning to overcome in the future:

  • The list of criteria varies per run (unless you store a seed). We would recommend to run CriticAgent at least two times, and pick criteria you think is important for your domain.
  • The results of the QuantifierAgent can vary with each run, so we recommend conducting multiple runs to observe the extent of result variations.

To mitigate the limitations mentioned above, we are working on VerifierAgent, whose goal is to stabilize the results and provide additional explanations.

Summary

CriticAgent and QuantifierAgent can be applied to the logs of any type of application, providing you with an in-depth understanding of the utility your solution brings to the user for a given task.

We would love to hear about how AgentEval works for your application. Any feedback would be useful for future development. Please contact us on our Discord.

Previous Research

@InProceedings{pmlr-v176-kiseleva22a,
title = "Interactive Grounded Language Understanding in a Collaborative Environment: IGLU 2021",
author = "Kiseleva, Julia and Li, Ziming and Aliannejadi, Mohammad and Mohanty, Shrestha and ter Hoeve, Maartje and Burtsev, Mikhail and Skrynnik, Alexey and Zholus, Artem and Panov, Aleksandr and Srinet, Kavya and Szlam, Arthur and Sun, Yuxuan and Hofmann, Katja and C{\^o}t{\'e}, Marc-Alexandre and Awadallah, Ahmed and Abdrazakov, Linar and Churin, Igor and Manggala, Putra and Naszadi, Kata and van der Meer, Michiel and Kim, Taewoon",
booktitle = "Proceedings of the NeurIPS 2021 Competitions and Demonstrations Track",
pages = "146--161",
year = 2022,
editor = "Kiela, Douwe and Ciccone, Marco and Caputo, Barbara",
volume = 176,
series = "Proceedings of Machine Learning Research",
month = "06--14 Dec",
publisher = "PMLR",
pdf = {https://proceedings.mlr.press/v176/kiseleva22a/kiseleva22a.pdf},
url = {https://proceedings.mlr.press/v176/kiseleva22a.html}.
}
@InProceedings{pmlr-v220-kiseleva22a,
title = "Interactive Grounded Language Understanding in a Collaborative Environment: Retrospective on Iglu 2022 Competition",
author = "Kiseleva, Julia and Skrynnik, Alexey and Zholus, Artem and Mohanty, Shrestha and Arabzadeh, Negar and C\^{o}t\'e, Marc-Alexandre and Aliannejadi, Mohammad and Teruel, Milagro and Li, Ziming and Burtsev, Mikhail and ter Hoeve, Maartje and Volovikova, Zoya and Panov, Aleksandr and Sun, Yuxuan and Srinet, Kavya and Szlam, Arthur and Awadallah, Ahmed and Rho, Seungeun and Kwon, Taehwan and Wontae Nam, Daniel and Bivort Haiek, Felipe and Zhang, Edwin and Abdrazakov, Linar and Qingyam, Guo and Zhang, Jason and Guo, Zhibin",
booktitle = "Proceedings of the NeurIPS 2022 Competitions Track",
pages = "204--216",
year = 2022,
editor = "Ciccone, Marco and Stolovitzky, Gustavo and Albrecht, Jacob",
volume = 220,
series = "Proceedings of Machine Learning Research",
month = "28 Nov--09 Dec",
publisher = "PMLR",
pdf = "https://proceedings.mlr.press/v220/kiseleva22a/kiseleva22a.pdf",
url = "https://proceedings.mlr.press/v220/kiseleva22a.html".
}

· 3 min read
Gagan Bansal

OpenAI Assistant

AutoGen enables collaboration among multiple ChatGPTs for complex tasks.

TL;DR

OpenAI assistants are now integrated into AutoGen via GPTAssistantAgent. This enables multiple OpenAI assistants, which form the backend of the now popular GPTs, to collaborate and tackle complex tasks. Checkout example notebooks for reference:

Introduction

Earlier last week, OpenAI introduced GPTs, giving users ability to create custom ChatGPTs tailored for them. But what if these individual GPTs could collaborate to do even more? Fortunately, because of AutoGen, this is now a reality! AutoGen has been pioneering agents and supporting multi-agent workflows since earlier this year, and now (starting with version 0.2.0b5) we are introducing compatibility with the Assistant API, which is currently in beta preview.

To accomplish this, we've added a new (experimental) agent called the GPTAssistantAgent that lets you seamlessly add these new OpenAI assistants into AutoGen-based multi-agent workflows. This integration shows great potential and synergy, and we plan to continue enhancing it.

Installation

pip install pyautogen==0.2.0b5

Basic Example

Here's a basic example that uses a UserProxyAgent to allow an interface with the GPTAssistantAgent.

First, import the new agent and setup config_list:

from autogen import config_list_from_json
from autogen.agentchat.contrib.gpt_assistant_agent import GPTAssistantAgent
from autogen import UserProxyAgent

config_list = config_list_from_json("OAI_CONFIG_LIST")

Then simply define the OpenAI assistant agent and give it the task!

# creates new assistant using Assistant API
gpt_assistant = GPTAssistantAgent(
name="assistant",
llm_config={
"config_list": config_list,
"assistant_id": None
})

user_proxy = UserProxyAgent(name="user_proxy",
code_execution_config={
"work_dir": "coding"
},
human_input_mode="NEVER")

user_proxy.initiate_chat(gpt_assistant, message="Print hello world")

GPTAssistantAgent supports both creating new OpenAI assistants or reusing existing assistants (e.g, by providing an assistant_id).

Code Interpreter Example

GPTAssistantAgent allows you to specify an OpenAI tools (e.g., function calls, code interpreter, etc). The example below enables an assistant that can use OpenAI code interpreter to solve tasks.

# creates new assistant using Assistant API
gpt_assistant = GPTAssistantAgent(
name="assistant",
llm_config={
"config_list": config_list,
"assistant_id": None,
"tools": [
{
"type": "code_interpreter"
}
],
})

user_proxy = UserProxyAgent(name="user_proxy",
code_execution_config={
"work_dir": "coding"
},
human_input_mode="NEVER")

user_proxy.initiate_chat(gpt_assistant, message="Print hello world")

Checkout more examples here.

Limitations and Future Work

  • Group chat managers using GPT assistant are pending.
  • GPT assistants with multimodal capabilities haven't been released yet but we are committed to support them.

Acknowledgements

GPTAssistantAgent was made possible through collaboration with @IANTHEREAL, Jiale Liu, Yiran Wu, Qingyun Wu, Chi Wang, and many other AutoGen maintainers.

· 5 min read
Jieyu Zhang

system

TL;DR:

  • Introducing the EcoAssistant, which is designed to solve user queries more accurately and affordably.
  • We show how to let the LLM assistant agent leverage external API to solve user query.
  • We show how to reduce the cost of using GPT models via Assistant Hierarchy.
  • We show how to leverage the idea of Retrieval-augmented Generation (RAG) to improve the success rate via Solution Demonstration.

EcoAssistant

In this blog, we introduce the EcoAssistant, a system built upon AutoGen with the goal of solving user queries more accurately and affordably.

Problem setup

Recently, users have been using conversational LLMs such as ChatGPT for various queries. Reports indicate that 23% of ChatGPT user queries are for knowledge extraction purposes. Many of these queries require knowledge that is external to the information stored within any pre-trained large language models (LLMs). These tasks can only be completed by generating code to fetch necessary information via external APIs that contain the requested information. In the table below, we show three types of user queries that we aim to address in this work.

DatasetAPIExample query
PlacesGoogle PlacesI’m looking for a 24-hour pharmacy in Montreal, can you find one for me?
WeatherWeather APIWhat is the current cloud coverage in Mumbai, India?
StockAlpha Vantage Stock APICan you give me the opening price of Microsoft for the month of January 2023?

Leveraging external APIs

To address these queries, we first build a two-agent system based on AutoGen, where the first agent is a LLM assistant agent (AssistantAgent in AutoGen) that is responsible for proposing and refining the code and the second agent is a code executor agent (UserProxyAgent in AutoGen) that would extract the generated code and execute it, forwarding the output back to the LLM assistant agent. A visualization of the two-agent system is shown below.

chat

To instruct the assistant agent to leverage external APIs, we only need to add the API name/key dictionary at the beginning of the initial message. The template is shown below, where the red part is the information of APIs and black part is user query.

template

Importantly, we don't want to reveal our real API key to the assistant agent for safety concerns. Therefore, we use a fake API key to replace the real API key in the initial message. In particular, we generate a random token (e.g., 181dbb37) for each API key and replace the real API key with the token in the initial message. Then, when the code executor execute the code, the fake API key would be automatically replaced by the real API key.

Solution Demonstration

In most practical scenarios, queries from users would appear sequentially over time. Our EcoAssistant leverages past success to help the LLM assistants address future queries via Solution Demonstration. Specifically, whenever a query is deemed successfully resolved by user feedback, we capture and store the query and the final generated code snippet. These query-code pairs are saved in a specialized vector database. When new queries appear, EcoAssistant retrieves the most similar query from the database, which is then appended with the associated code to the initial prompt for the new query, serving as a demonstration. The new template of initial message is shown below, where the blue part corresponds to the solution demonstration.

template

We found that this utilization of past successful query-code pairs improves the query resolution process with fewer iterations and enhances the system's performance.

Assistant Hierarchy

LLMs usually have different prices and performance, for example, GPT-3.5-turbo is much cheaper than GPT-4 but also less accurate. Thus, we propose the Assistant Hierarchy to reduce the cost of using LLMs. The core idea is that we use the cheaper LLMs first and only use the more expensive LLMs when necessary. By this way, we are able to reduce the reliance on expensive LLMs and thus reduce the cost. In particular, given multiple LLMs, we initiate one assistant agent for each and start the conversation with the most cost-effective LLM assistant. If the conversation between the current LLM assistant and the code executor concludes without successfully resolving the query, EcoAssistant would then restart the conversation with the next more expensive LLM assistant in the hierarchy. We found that this strategy significantly reduces costs while still effectively addressing queries.

A Synergistic Effect

We found that the Assistant Hierarchy and Solution Demonstration of EcoAssistant have a synergistic effect. Because the query-code database is shared by all LLM assistants, even without specialized design, the solution from more powerful LLM assistant (e.g., GPT-4) could be later retrieved to guide weaker LLM assistant (e.g., GPT-3.5-turbo). Such a synergistic effect further improves the performance and reduces the cost of EcoAssistant.

Experimental Results

We evaluate EcoAssistant on three datasets: Places, Weather, and Stock. When comparing it with a single GPT-4 assistant, we found that EcoAssistant achieves a higher success rate with a lower cost as shown in the figure below. For more details about the experimental results and other experiments, please refer to our paper.

exp

Further reading

Please refer to our paper and codebase for more details about EcoAssistant.

If you find this blog useful, please consider citing:

@article{zhang2023ecoassistant,
title={EcoAssistant: Using LLM Assistant More Affordably and Accurately},
author={Zhang, Jieyu and Krishna, Ranjay and Awadallah, Ahmed H and Wang, Chi},
journal={arXiv preprint arXiv:2310.03046},
year={2023}
}

· 3 min read
Beibin Li

LMM Teaser

In Brief:

  • Introducing the Multimodal Conversable Agent and the LLaVA Agent to enhance LMM functionalities.
  • Users can input text and images simultaneously using the <img img_path> tag to specify image loading.
  • Demonstrated through the GPT-4V notebook.
  • Demonstrated through the LLaVA notebook.

Introduction

Large multimodal models (LMMs) augment large language models (LLMs) with the ability to process multi-sensory data.

This blog post and the latest AutoGen update concentrate on visual comprehension. Users can input images, pose questions about them, and receive text-based responses from these LMMs. We support the gpt-4-vision-preview model from OpenAI and LLaVA model from Microsoft now.

Here, we emphasize the Multimodal Conversable Agent and the LLaVA Agent due to their growing popularity. GPT-4V represents the forefront in image comprehension, while LLaVA is an efficient model, fine-tuned from LLama-2.

Installation

Incorporate the lmm feature during AutoGen installation:

pip install "pyautogen[lmm]"

Subsequently, import the Multimodal Conversable Agent or LLaVA Agent from AutoGen:

from autogen.agentchat.contrib.multimodal_conversable_agent import MultimodalConversableAgent  # for GPT-4V
from autogen.agentchat.contrib.llava_agent import LLaVAAgent # for LLaVA

Usage

A simple syntax has been defined to incorporate both messages and images within a single string.

Example of an in-context learning prompt:

prompt = """You are now an image classifier for facial expressions. Here are
some examples.

<img happy.jpg> depicts a happy expression.
<img http://some_location.com/sad.jpg> represents a sad expression.
<img obama.jpg> portrays a neutral expression.

Now, identify the facial expression of this individual: <img unknown.png>
"""

agent = MultimodalConversableAgent()
user = UserProxyAgent()
user.initiate_chat(agent, message=prompt)

The MultimodalConversableAgent interprets the input prompt, extracting images from local or internet sources.

Advanced Usage

Similar to other AutoGen agents, multimodal agents support multi-round dialogues with other agents, code generation, factual queries, and management via a GroupChat interface.

For example, the FigureCreator in our GPT-4V notebook and LLaVA notebook integrates two agents: a coder (an AssistantAgent) and critics (a multimodal agent). The coder drafts Python code for visualizations, while the critics provide insights for enhancement. Collaboratively, these agents aim to refine visual outputs. With human_input_mode=ALWAYS, you can also contribute suggestions for better visualizations.

Reference

Future Enhancements

For further inquiries or suggestions, please open an issue in the AutoGen repository or contact me directly at beibin.li@microsoft.com.

AutoGen will continue to evolve, incorporating more multimodal functionalities such as DALLE model integration, audio interaction, and video comprehension. Stay tuned for these exciting developments.

· 17 min read
Ricky Loynd

Teachable Agent Architecture

TL;DR:

  • We introduce Teachable Agents so that users can teach their LLM-based assistants new facts, preferences, and skills.
  • We showcase examples of teachable agents learning and later recalling facts, preferences, and skills in subsequent chats.

Introduction

Conversational assistants based on LLMs can remember the current chat with the user, and can also demonstrate in-context learning of user teachings during the conversation. But the assistant's memories and learnings are lost once the chat is over, or when a single chat grows too long for the LLM to handle effectively. Then in subsequent chats the user is forced to repeat any necessary instructions over and over.

Teachability addresses these limitations by persisting user teachings across chat boundaries in long-term memory implemented as a vector database. Instead of copying all of memory into the context window, which would eat up valuable space, individual memories (called memos) are retrieved into context as needed. This allows the user to teach frequently used facts and skills to the teachable agent just once, and have it recall them in later chats.

Any instantiated agent that inherits from ConversableAgent can be made teachable by instantiating a Teachability object and calling its add_to_agent(agent) method. In order to make effective decisions about memo storage and retrieval, the Teachability object calls an instance of TextAnalyzerAgent (another AutoGen agent) to identify and reformulate text as needed for remembering facts, preferences, and skills. Note that this adds extra LLM calls involving a relatively small number of tokens, which can add a few seconds to the time a user waits for each response.

Run It Yourself

AutoGen contains four code examples that use Teachability.

  1. Run chat_with_teachable_agent.py to converse with a teachable agent.

  2. Run test_teachable_agent.py for quick unit testing of a teachable agent.

  3. Use the Jupyter notebook agentchat_teachability.ipynb to step through examples discussed below.

  4. Use the Jupyter notebook agentchat_teachable_oai_assistants.ipynb to make arbitrary OpenAI Assistants teachable through GPTAssistantAgent.

Basic Usage of Teachability

  1. Install dependencies

Please install pyautogen with the [teachable] option before using Teachability.

pip install "pyautogen[teachable]"
  1. Import agents
from autogen import UserProxyAgent, config_list_from_json
from autogen.agentchat.contrib.capabilities.teachability import Teachability
from autogen import ConversableAgent # As an example
  1. Create llm_config
# Load LLM inference endpoints from an env variable or a file
# See https://microsoft.github.io/autogen/docs/FAQ#set-your-api-endpoints
# and OAI_CONFIG_LIST_sample
filter_dict = {"model": ["gpt-4"]} # GPT-3.5 is less reliable than GPT-4 at learning from user feedback.
config_list = config_list_from_json(env_or_file="OAI_CONFIG_LIST", filter_dict=filter_dict)
llm_config={"config_list": config_list, "timeout": 120}
  1. Create the agents

# Start by instantiating any agent that inherits from ConversableAgent, which we use directly here for simplicity.
teachable_agent = ConversableAgent(
name="teachable_agent", # The name can be anything.
llm_config=llm_config
)

# Instantiate a Teachability object. Its parameters are all optional.
teachability = Teachability(
reset_db=False, # Use True to force-reset the memo DB, and False to use an existing DB.
path_to_db_dir="./tmp/interactive/teachability_db" # Can be any path, but teachable agents in a group chat require unique paths.
)

# Now add teachability to the agent.
teachability.add_to_agent(teachable_agent)

# For this test, create a user proxy agent as usual.
user = UserProxyAgent("user", human_input_mode="ALWAYS")
  1. Chat with the teachable agent
# This function will return once the user types 'exit'.
teachable_agent.initiate_chat(user, message="Hi, I'm a teachable user assistant! What's on your mind?")

Example 1 - Learning user info

A user can teach the agent facts about themselves. (Note that due to their finetuning, LLMs can be reluctant to admit that they know personal information.)

Loading previous memory (if any) from disk.
teachable_agent (to user):

Greetings, I'm a teachable user assistant! What's on your mind today?

--------------------------------------------------------------------------------
Provide feedback to teachable_agent. Press enter to skip and use auto-reply, or type 'exit' to end the conversation: My name is Ricky
user (to teachable_agent):

My name is Ricky

--------------------------------------------------------------------------------

>>>>>>>> USING AUTO REPLY...
teachable_agent (to user):

Hello Ricky! How can I assist you today?

--------------------------------------------------------------------------------
Provide feedback to teachable_agent. Press enter to skip and use auto-reply, or type 'exit' to end the conversation: exit

In a later conversation, the user can check whether the teachable agent remembers their name. (For readability, the user prompts and some logged notices are not repeated below.)

--------------------------------------------------------------------------------
user (to teachable_agent):

what's my name?

--------------------------------------------------------------------------------
teachable_agent (to user):

Hi Ricky! How can I help you today?

--------------------------------------------------------------------------------

Example 2 - Learning new facts

A user can teach the agent more complex, related facts.

--------------------------------------------------------------------------------
user (to teachable_agent):

What is the Vicuna model?

--------------------------------------------------------------------------------
teachable_agent (to user):

I'm sorry, but I don't have information about the "Vicuna model" in my current database. Could you please provide some context, description, or details about this model so I can better assist you?

--------------------------------------------------------------------------------
user (to teachable_agent):

Vicuna is a 13B-parameter language model released by Meta.

--------------------------------------------------------------------------------
teachable_agent (to user):

Thank you for providing the information. The Vicuna model is a large-scale language model developed by Meta. It consists of 13 billion parameters, which is what makes it able to understand and generate human-like text across various topics. Parameters in a neural network are the weights and biases assigned to each connection between the neurons, and having more parameters enables the model to learn and capture more complex patterns and relationships in data. As an AI language model, Vicuna can be utilized in various language-related tasks, such as text completion, translation, summarization, and more.

--------------------------------------------------------------------------------
user (to teachable_agent):

What is the Orca model?

--------------------------------------------------------------------------------
teachable_agent (to user):

The Orca model is a yet-to-be-released language model by Meta, referring to the Research Preview API release of Meta's advanced language model. While I don't have specific details about the Orca model at this time, it is expected that the model would have an even larger number of parameters to further enhance its capabilities compared to its predecessor, Vicuna.

As AI language models evolve, they generally become more efficient at understanding and generating human-like text. This makes them more effective in completing a wide range of tasks, including text completion, translation, summarization, and Q&A, as well as providing more accurate and coherent results in various applications.

--------------------------------------------------------------------------------
user (to teachable_agent):

Orca is a 13B-parameter language model released by Microsoft. It outperforms Vicuna on most tasks.

--------------------------------------------------------------------------------
teachable_agent (to user):

Thank you for the clarification. Orca is a 13-billion-parameter language model developed by Microsoft. It is designed to outperform Vicuna, the language model released by Meta. With a large number of parameters, Orca is capable of understanding and generating human-like text across diverse subjects, making it highly useful in various applications.

Orca's advanced capabilities allow it to excel in tasks such as text completion, translation, summarization, Q&A, and more. Its performance improvement over Vicuna highlights the rapid advancements in natural language processing (NLP) research and the development of increasingly capable AI language models.

Then in a later chat the teachable agent can answer questions about the facts it has been taught. (Remember to first close the previous chat by typing 'exit'.)

--------------------------------------------------------------------------------
user (to teachable_agent):

How does the Vicuna model compare to the Orca model?

--------------------------------------------------------------------------------
teachable_agent (to user):

The Vicuna and Orca models are both 13B-parameter language models released by Meta and Microsoft, respectively. In terms of performance, Orca has been reported to outperform Vicuna on most tasks. However, without detailed information about specific tasks and benchmarks, it is difficult to provide a more comprehensive comparison. Generally speaking, both models are advanced language models that aim to provide high-quality natural language understanding and generation, but Orca appears to have an edge in terms of overall performance.

--------------------------------------------------------------------------------

Example 3 - Learning user preferences

A user can teach the agent how they prefer to have things done.

Be aware that a message like the next one cannot be entered as a single message through a command line because it contains a newline character. Such messages can be entered in a Jupyter notebook, or through some UI layer like that of ChatGPT.

--------------------------------------------------------------------------------
user (to teachable_agent):

Please summarize this abstract.

AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation
Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Beibin Li, Erkang Zhu, Li Jiang, Xiaoyun Zhang, Shaokun Zhang, Jiale Liu, Ahmed Hassan Awadallah, Ryen W White, Doug Burger, Chi Wang
AutoGen is an open-source framework that allows developers to build LLM applications via multiple agents that can converse with each other to accomplish tasks. AutoGen agents are customizable, conversable, and can operate in various modes that employ combinations of LLMs, human inputs, and tools. Using AutoGen, developers can also flexibly define agent interaction behaviors. Both natural language and computer code can be used to program flexible conversation patterns for different applications. AutoGen serves as a generic infrastructure to build diverse applications of various complexities and LLM capacities. Empirical studies demonstrate the effectiveness of the framework in many example applications, with domains ranging from mathematics, coding, question answering, operations research, online decision-making, entertainment, etc.

--------------------------------------------------------------------------------
teachable_agent (to user):

AutoGen is an open-source framework designed to enable developers to create LLM applications with multiple customizable agents that can converse with each other to complete tasks. These agents can operate using different combinations of LLMs, human inputs, and tools, allowing developers to define agent interaction behaviors flexibly. AutoGen supports programming flexible conversation patterns using both natural language and code, making it suitable for building diverse applications with varying complexities and LLM capacities. Its effectiveness has been demonstrated through empirical studies across various domains including mathematics, coding, operations research, decision-making, and entertainment.

--------------------------------------------------------------------------------
Please summarize this abstract. When I'm summarizing an abstract, I try to make the summary contain just three short bullet points: the title, the innovation, and the key empirical results.

AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation
Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Beibin Li, Erkang Zhu, Li Jiang, Xiaoyun Zhang, Shaokun Zhang, Jiale Liu, Ahmed Hassan Awadallah, Ryen W White, Doug Burger, Chi Wang
AutoGen is an open-source framework that allows developers to build LLM applications via multiple agents that can converse with each other to accomplish tasks. AutoGen agents are customizable, conversable, and can operate in various modes that employ combinations of LLMs, human inputs, and tools. Using AutoGen, developers can also flexibly define agent interaction behaviors. Both natural language and computer code can be used to program flexible conversation patterns for different applications. AutoGen serves as a generic infrastructure to build diverse applications of various complexities and LLM capacities. Empirical studies demonstrate the effectiveness of the framework in many example applications, with domains ranging from mathematics, coding, question answering, operations research, online decision-making, entertainment, etc.

--------------------------------------------------------------------------------
teachable_agent (to user):

- Title: AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation
- Innovation: Open-source framework for creating customizable LLM applications through agent conversations, supporting various modes and interaction behaviors.
- Key Empirical Results: Demonstrated effectiveness across diverse application domains, including mathematics, coding, question answering, and more.

--------------------------------------------------------------------------------

Then in later chats the teacher doesn't need to reiterate their detailed preferences.

--------------------------------------------------------------------------------
user (to teachable_agent):

Please summarize this abstract.

Sparks of Artificial General Intelligence: Early experiments with GPT-4
Sébastien Bubeck, Varun Chandrasekaran, Ronen Eldan, Johannes Gehrke, Eric Horvitz, Ece Kamar, Peter Lee, Yin Tat Lee, Yuanzhi Li, Scott Lundberg, Harsha Nori, Hamid Palangi, Marco Tulio Ribeiro, Yi Zhang
Artificial intelligence (AI) researchers have been developing and refining large language models (LLMs) that exhibit remarkable capabilities across a variety of domains and tasks, challenging our understanding of learning and cognition. The latest model developed by OpenAI, GPT-4, was trained using an unprecedented scale of compute and data. In this paper, we report on our investigation of an early version of GPT-4, when it was still in active development by OpenAI. We contend that (this early version of) GPT-4 is part of a new cohort of LLMs (along with ChatGPT and Google's PaLM for example) that exhibit more general intelligence than previous AI models. We discuss the rising capabilities and implications of these models. We demonstrate that, beyond its mastery of language, GPT-4 can solve novel and difficult tasks that span mathematics, coding, vision, medicine, law, psychology and more, without needing any special prompting. Moreover, in all of these tasks, GPT-4's performance is strikingly close to human-level performance, and often vastly surpasses prior models such as ChatGPT. Given the breadth and depth of GPT-4's capabilities, we believe that it could reasonably be viewed as an early (yet still incomplete) version of an artificial general intelligence (AGI) system. In our exploration of GPT-4, we put special emphasis on discovering its limitations, and we discuss the challenges ahead for advancing towards deeper and more comprehensive versions of AGI, including the possible need for pursuing a new paradigm that moves beyond next-word prediction. We conclude with reflections on societal influences of the recent technological leap and future research directions.

--------------------------------------------------------------------------------
teachable_agent (to user):

- Title: Sparks of Artificial General Intelligence: Early experiments with GPT-4
- Innovation: GPT-4, an LLM with remarkable capabilities, demonstrates human-level performance across various domains, like math, coding, vision, medicine, law, and psychology.
- Key results: GPT-4 significantly surpasses prior models, suggesting it may be an early version of AGI; limitations and challenges toward deeper AGI are also discussed.

--------------------------------------------------------------------------------

Example 4 - Learning new skills

Users can extend the teachable agent's capabilities by teaching it new skills for accomplishing challenging tasks. It usually works best to first describe the task, then (in the same turn) provide a hint or advice for approaching the task.

The Sparks of AGI paper evaluated GPT-4 on math problems like the following, which it could only solve 32% of the time. We first show a failure case, then teach the agent a strategy which lifts GPT-4's success rate above 95%.

--------------------------------------------------------------------------------
user (to teachable_agent):

Consider the identity:
9 * 4 + 6 * 6 = 72
Can you modify exactly one integer (and not more than that!) on the left hand side of the equation so the right hand side becomes 99?
-Let's think step-by-step, write down a plan, and then write down your solution as: "The solution is: A * B + C * D".

--------------------------------------------------------------------------------
Step 1: Identify the current equation and the target value on the right-hand side.
Current equation: 9 * 4 + 6 * 6 = 72
Target value: 99

Step 2: Determine what numbers can be changed on the left-hand side.
Possible changes: 9, 4, 6 (first occurrence), and 6 (second occurrence)

Step 3: Evaluate whether each change in the numbers on the left-hand side can result in the target value.
9 -> If changed, it would create a larger product.
4 -> If changed, the product would increase or decrease by more than the amount needed to reach 99.
6 (first occurrence) -> If increased to 7, then the equation would be 9 * 4 + 7 * 6 = 94, which is still not 99.
6 (second occurrence) -> If increased to 7, then the equation would be 9 * 4 + 6 * 7 = 99.

Step 4: Write down the solution.
The solution is: 9 * 4 + 6 * 7

--------------------------------------------------------------------------------
user (to teachable_agent):

Consider the identity:
9 * 4 + 6 * 6 = 72
Can you modify exactly one integer (and not more than that!) on the left hand side of the equation so the right hand side becomes 99?
-Let's think step-by-step, write down a plan, and then write down your solution as: "The solution is: A * B + C * D".

Here's some advice that may help:
1. Let E denote the original number on the right.
2. Let F denote the final number on the right.
3. Calculate the difference between the two, G = F - E.
4. Examine the numbers on the left one by one until finding one that divides evenly into G, where negative integers are allowed.
5. Calculate J = G / H. This is the number of times that H divides into G.
6. Verify that J is an integer, and that H * J = G.
7. Find the number on the left which is multiplied by H, and call it K.
8. Change K to K + J.
9. Recompute the value on the left, and verify that it equals F.
Finally, write down your solution as: "The solution is: A * B + C * D".

--------------------------------------------------------------------------------
teachable_agent (to user):

Step 1: Identify the original number (E) and the final number (F) on the right-hand side.
E = 72
F = 99

Step 2: Calculate the difference between the two, G = F - E.
G = 99 - 72 = 27

Step 3: Examine the numbers on the left one by one until finding one that divides evenly into G, where negative integers are allowed.
9 (divides evenly into 27)
4 (does not)
6 (first occurrence, does not)
6 (second occurrence, does not)

Step 4: Calculate J = G / H. This is the number of times that H divides into G.
H = 9 (since 9 divides evenly into 27)
J = 27 / 9 = 3

Step 5: Verify that J is an integer, and that H * J = G.
H * J = 9 * 3 = 27
J = 3 (which is an integer)

Step 6: Find the number on the left which is multiplied by H, and call it K.
Number on the leftside of the equation that is multiplied by 9: 4
Therefore, K = 4

Step 7: Change K to K + J.
K + J = 4 + 3 = 7

Step 8: Recompute the value on the left, and verify that it equals F.
9 * 7 + 6 * 6 = 63 + 36 = 99 (which is equal to F)

Step 9: Write down the solution.
The solution is: 9 * 7 + 6 * 6

--------------------------------------------------------------------------------

In a later chat the user doesn't need to repeat the detailed advice.

--------------------------------------------------------------------------------
user (to teachable_agent):

Consider the identity:
8 * 3 + 7 * 9 = 87
Can you modify exactly one integer (and not more than that!) on the left hand side of the equation so the right hand side becomes 59?
-Let's think step-by-step, write down a plan, and then write down your solution as: "The solution is: A * B + C * D".

--------------------------------------------------------------------------------
teachable_agent (to user):

Let's follow the advice to modify one integer on the left hand side of the equation to make the right hand side become 59:

1. Let E denote the original number on the right: E = 87.
2. Let F denote the final number on the right: F = 59.
3. Calculate the difference between the two, G = F - E: G = 59 - 87 = -28.
4. Examine the numbers on the left one by one to find one that divides evenly into G: 8, 3, 7, 9. Since 7 divides -28 evenly, we will work with 7 (H = 7).
5. Calculate J = G / H: J = -28 / 7 = -4.
6. Verify that J is an integer, and that H * J = G: J is an integer, and 7 * (-4) = -28.
7. Find the number on the left which is multiplied by H (7), and call it K: K = 9.
8. Change K to K + J: K_new = 9 + (-4) = 5.
9. Recompute the value on the left, and verify that it equals F: (8 * 3) + (7 * 5) = 24 + 35 = 59.

The solution is: 8 * 3 + 7 * 5

--------------------------------------------------------------------------------

Planned improvements

  • Understanding user instructions distributed over multiple turns.
  • Learning from the agent's own experience, to reduce dependence on explicit user teachings.
  • Learning skills built on top of previously learned skills.

Conclusion

Teachability is still under active research and development. For any problems you find or improvements you have in mind, please join our discussions in this repo and on our Discord channel. We look forward to seeing how you and the rest of the community can use and improve teachable agents in AutoGen!

· 10 min read
Li Jiang

Last update: April 4, 2024; AutoGen version: v0.2.21

RAG Architecture

TL;DR:

  • We introduce RetrieveUserProxyAgent and RetrieveAssistantAgent, RAG agents of AutoGen that allows retrieval-augmented generation, and its basic usage.
  • We showcase customizations of RAG agents, such as customizing the embedding function, the text split function and vector database.
  • We also showcase two advanced usage of RAG agents, integrating with group chat and building a Chat application with Gradio.

Introduction

Retrieval augmentation has emerged as a practical and effective approach for mitigating the intrinsic limitations of LLMs by incorporating external documents. In this blog post, we introduce RAG agents of AutoGen that allows retrieval-augmented generation. The system consists of two agents: a Retrieval-augmented User Proxy agent, called RetrieveUserProxyAgent, and a Retrieval-augmented Assistant agent, called RetrieveAssistantAgent, both of which are extended from built-in agents from AutoGen. The overall architecture of the RAG agents is shown in the figure above.

To use Retrieval-augmented Chat, one needs to initialize two agents including Retrieval-augmented User Proxy and Retrieval-augmented Assistant. Initializing the Retrieval-Augmented User Proxy necessitates specifying a path to the document collection. Subsequently, the Retrieval-Augmented User Proxy can download the documents, segment them into chunks of a specific size, compute embeddings, and store them in a vector database. Once a chat is initiated, the agents collaboratively engage in code generation or question-answering adhering to the procedures outlined below:

  1. The Retrieval-Augmented User Proxy retrieves document chunks based on the embedding similarity, and sends them along with the question to the Retrieval-Augmented Assistant.
  2. The Retrieval-Augmented Assistant employs an LLM to generate code or text as answers based on the question and context provided. If the LLM is unable to produce a satisfactory response, it is instructed to reply with “Update Context” to the Retrieval-Augmented User Proxy.
  3. If a response includes code blocks, the Retrieval-Augmented User Proxy executes the code and sends the output as feedback. If there are no code blocks or instructions to update the context, it terminates the conversation. Otherwise, it updates the context and forwards the question along with the new context to the Retrieval-Augmented Assistant. Note that if human input solicitation is enabled, individuals can proactively send any feedback, including Update Context”, to the Retrieval-Augmented Assistant.
  4. If the Retrieval-Augmented Assistant receives “Update Context”, it requests the next most similar chunks of documents as new context from the Retrieval-Augmented User Proxy. Otherwise, it generates new code or text based on the feedback and chat history. If the LLM fails to generate an answer, it replies with “Update Context” again. This process can be repeated several times. The conversation terminates if no more documents are available for the context.

Basic Usage of RAG Agents

  1. Install dependencies

Please install pyautogen with the [retrievechat] option before using RAG agents.

pip install "pyautogen[retrievechat]"

RetrieveChat can handle various types of documents. By default, it can process plain text and PDF files, including formats such as 'txt', 'json', 'csv', 'tsv', 'md', 'html', 'htm', 'rtf', 'rst', 'jsonl', 'log', 'xml', 'yaml', 'yml' and 'pdf'. If you install unstructured, additional document types such as 'docx', 'doc', 'odt', 'pptx', 'ppt', 'xlsx', 'eml', 'msg', 'epub' will also be supported.

  • Install unstructured in ubuntu
sudo apt-get update
sudo apt-get install -y tesseract-ocr poppler-utils
pip install unstructured[all-docs]

You can find a list of all supported document types by using autogen.retrieve_utils.TEXT_FORMATS.

  1. Import Agents
import autogen
from autogen.agentchat.contrib.retrieve_assistant_agent import RetrieveAssistantAgent
from autogen.agentchat.contrib.retrieve_user_proxy_agent import RetrieveUserProxyAgent
  1. Create an 'RetrieveAssistantAgent' instance named "assistant" and an 'RetrieveUserProxyAgent' instance named "ragproxyagent"
assistant = RetrieveAssistantAgent(
name="assistant",
system_message="You are a helpful assistant.",
llm_config=llm_config,
)

ragproxyagent = RetrieveUserProxyAgent(
name="ragproxyagent",
retrieve_config={
"task": "qa",
"docs_path": "https://raw.githubusercontent.com/microsoft/autogen/main/README.md",
},
)
  1. Initialize Chat and ask a question
assistant.reset()
ragproxyagent.initiate_chat(assistant, message=ragproxyagent.message_generator, problem="What is autogen?")

Output is like:

--------------------------------------------------------------------------------
assistant (to ragproxyagent):

AutoGen is a framework that enables the development of large language model (LLM) applications using multiple agents that can converse with each other to solve tasks. The agents are customizable, conversable, and allow human participation. They can operate in various modes that employ combinations of LLMs, human inputs, and tools.

--------------------------------------------------------------------------------
  1. Create a UserProxyAgent and ask the same question
assistant.reset()
userproxyagent = autogen.UserProxyAgent(name="userproxyagent")
userproxyagent.initiate_chat(assistant, message="What is autogen?")

Output is like:

--------------------------------------------------------------------------------
assistant (to userproxyagent):

In computer software, autogen is a tool that generates program code automatically, without the need for manual coding. It is commonly used in fields such as software engineering, game development, and web development to speed up the development process and reduce errors. Autogen tools typically use pre-programmed rules, templates, and data to create code for repetitive tasks, such as generating user interfaces, database schemas, and data models. Some popular autogen tools include Visual Studio's Code Generator and Unity's Asset Store.

--------------------------------------------------------------------------------

You can see that the output of UserProxyAgent is not related to our autogen since the latest info of autogen is not in ChatGPT's training data. The output of RetrieveUserProxyAgent is correct as it can perform retrieval-augmented generation based on the given documentation file.

Customizing RAG Agents

RetrieveUserProxyAgent is customizable with retrieve_config. There are several parameters to configure based on different use cases. In this section, we'll show how to customize embedding function, text split function and vector database.

Customizing Embedding Function

By default, Sentence Transformers and its pretrained models will be used to compute embeddings. It's possible that you want to use OpenAI, Cohere, HuggingFace or other embedding functions.

  • OpenAI
from chromadb.utils import embedding_functions

openai_ef = embedding_functions.OpenAIEmbeddingFunction(
api_key="YOUR_API_KEY",
model_name="text-embedding-ada-002"
)

ragproxyagent = RetrieveUserProxyAgent(
name="ragproxyagent",
retrieve_config={
"task": "qa",
"docs_path": "https://raw.githubusercontent.com/microsoft/autogen/main/README.md",
"embedding_function": openai_ef,
},
)
  • HuggingFace
huggingface_ef = embedding_functions.HuggingFaceEmbeddingFunction(
api_key="YOUR_API_KEY",
model_name="sentence-transformers/all-MiniLM-L6-v2"
)

More examples can be found here.

Customizing Text Split Function

Before we can store the documents into a vector database, we need to split the texts into chunks. Although we have implemented a flexible text splitter in autogen, you may still want to use different text splitters. There are also some existing text split tools which are good to reuse.

For example, you can use all the text splitters in langchain.

from langchain.text_splitter import RecursiveCharacterTextSplitter

recur_spliter = RecursiveCharacterTextSplitter(separators=["\n", "\r", "\t"])

ragproxyagent = RetrieveUserProxyAgent(
name="ragproxyagent",
retrieve_config={
"task": "qa",
"docs_path": "https://raw.githubusercontent.com/microsoft/autogen/main/README.md",
"custom_text_split_function": recur_spliter.split_text,
},
)

Customizing Vector Database

We are using chromadb as the default vector database, you can also replace it with any other vector database by simply overriding the function retrieve_docs of RetrieveUserProxyAgent.

For example, you can use Qdrant as below:

# Creating qdrant client
from qdrant_client import QdrantClient

client = QdrantClient(url="***", api_key="***")

# Wrapping RetrieveUserProxyAgent
from litellm import embedding as test_embedding
from autogen.agentchat.contrib.retrieve_user_proxy_agent import RetrieveUserProxyAgent
from qdrant_client.models import SearchRequest, Filter, FieldCondition, MatchText

class QdrantRetrieveUserProxyAgent(RetrieveUserProxyAgent):
def query_vector_db(
self,
query_texts: List[str],
n_results: int = 10,
search_string: str = "",
**kwargs,
) -> Dict[str, Union[List[str], List[List[str]]]]:
# define your own query function here
embed_response = test_embedding('text-embedding-ada-002', input=query_texts)

all_embeddings: List[List[float]] = []

for item in embed_response['data']:
all_embeddings.append(item['embedding'])

search_queries: List[SearchRequest] = []

for embedding in all_embeddings:
search_queries.append(
SearchRequest(
vector=embedding,
filter=Filter(
must=[
FieldCondition(
key="page_content",
match=MatchText(
text=search_string,
)
)
]
),
limit=n_results,
with_payload=True,
)
)

search_response = client.search_batch(
collection_name="{your collection name}",
requests=search_queries,
)

return {
"ids": [[scored_point.id for scored_point in batch] for batch in search_response],
"documents": [[scored_point.payload.get('page_content', '') for scored_point in batch] for batch in search_response],
"metadatas": [[scored_point.payload.get('metadata', {}) for scored_point in batch] for batch in search_response]
}

def retrieve_docs(self, problem: str, n_results: int = 20, search_string: str = "", **kwargs):
results = self.query_vector_db(
query_texts=[problem],
n_results=n_results,
search_string=search_string,
**kwargs,
)

self._results = results


# Use QdrantRetrieveUserProxyAgent
qdrantragagent = QdrantRetrieveUserProxyAgent(
name="ragproxyagent",
human_input_mode="NEVER",
max_consecutive_auto_reply=2,
retrieve_config={
"task": "qa",
},
)

qdrantragagent.retrieve_docs("What is Autogen?", n_results=10, search_string="autogen")

Advanced Usage of RAG Agents

Integrate with other agents in a group chat

To use RetrieveUserProxyAgent in a group chat is almost the same as you use it in a two agents chat. The only thing is that you need to initialize the chat with RetrieveUserProxyAgent. The RetrieveAssistantAgent is not necessary in a group chat.

However, you may want to initialize the chat with another agent in some cases. To leverage the best of RetrieveUserProxyAgent, you'll need to call it from a function.

boss = autogen.UserProxyAgent(
name="Boss",
is_termination_msg=termination_msg,
human_input_mode="TERMINATE",
system_message="The boss who ask questions and give tasks.",
)

boss_aid = RetrieveUserProxyAgent(
name="Boss_Assistant",
is_termination_msg=termination_msg,
system_message="Assistant who has extra content retrieval power for solving difficult problems.",
human_input_mode="NEVER",
max_consecutive_auto_reply=3,
retrieve_config={
"task": "qa",
},
code_execution_config=False, # we don't want to execute code in this case.
)

coder = autogen.AssistantAgent(
name="Senior_Python_Engineer",
is_termination_msg=termination_msg,
system_message="You are a senior python engineer. Reply `TERMINATE` in the end when everything is done.",
llm_config={"config_list": config_list, "timeout": 60, "temperature": 0},
)

pm = autogen.AssistantAgent(
name="Product_Manager",
is_termination_msg=termination_msg,
system_message="You are a product manager. Reply `TERMINATE` in the end when everything is done.",
llm_config={"config_list": config_list, "timeout": 60, "temperature": 0},
)

reviewer = autogen.AssistantAgent(
name="Code_Reviewer",
is_termination_msg=termination_msg,
system_message="You are a code reviewer. Reply `TERMINATE` in the end when everything is done.",
llm_config={"config_list": config_list, "timeout": 60, "temperature": 0},
)

def retrieve_content(
message: Annotated[
str,
"Refined message which keeps the original meaning and can be used to retrieve content for code generation and question answering.",
],
n_results: Annotated[int, "number of results"] = 3,
) -> str:
boss_aid.n_results = n_results # Set the number of results to be retrieved.
# Check if we need to update the context.
update_context_case1, update_context_case2 = boss_aid._check_update_context(message)
if (update_context_case1 or update_context_case2) and boss_aid.update_context:
boss_aid.problem = message if not hasattr(boss_aid, "problem") else boss_aid.problem
_, ret_msg = boss_aid._generate_retrieve_user_reply(message)
else:
_context = {"problem": message, "n_results": n_results}
ret_msg = boss_aid.message_generator(boss_aid, None, _context)
return ret_msg if ret_msg else message

for caller in [pm, coder, reviewer]:
d_retrieve_content = caller.register_for_llm(
description="retrieve content for code generation and question answering.", api_style="function"
)(retrieve_content)

for executor in [boss, pm]:
executor.register_for_execution()(d_retrieve_content)

groupchat = autogen.GroupChat(
agents=[boss, pm, coder, reviewer],
messages=[],
max_round=12,
speaker_selection_method="round_robin",
allow_repeat_speaker=False,
)

llm_config = {"config_list": config_list, "timeout": 60, "temperature": 0}
manager = autogen.GroupChatManager(groupchat=groupchat, llm_config=llm_config)

# Start chatting with the boss as this is the user proxy agent.
boss.initiate_chat(
manager,
message="How to use spark for parallel training in FLAML? Give me sample code.",
)

Build a Chat application with Gradio

Now, let's wrap it up and make a Chat application with AutoGen and Gradio.

RAG ChatBot with AutoGen

# Initialize Agents
def initialize_agents(config_list, docs_path=None):
...
return assistant, ragproxyagent

# Initialize Chat
def initiate_chat(config_list, problem, queue, n_results=3):
...
assistant.reset()
try:
ragproxyagent.a_initiate_chat(
assistant, problem=problem, silent=False, n_results=n_results
)
messages = ragproxyagent.chat_messages
messages = [messages[k] for k in messages.keys()][0]
messages = [m["content"] for m in messages if m["role"] == "user"]
print("messages: ", messages)
except Exception as e:
messages = [str(e)]
queue.put(messages)

# Wrap AutoGen part into a function
def chatbot_reply(input_text):
"""Chat with the agent through terminal."""
queue = mp.Queue()
process = mp.Process(
target=initiate_chat,
args=(config_list, input_text, queue),
)
process.start()
try:
messages = queue.get(timeout=TIMEOUT)
except Exception as e:
messages = [str(e) if len(str(e)) > 0 else "Invalid Request to OpenAI, please check your API keys."]
finally:
try:
process.terminate()
except:
pass
return messages

...

# Set up UI with Gradio
with gr.Blocks() as demo:
...
assistant, ragproxyagent = initialize_agents(config_list)

chatbot = gr.Chatbot(
[],
elem_id="chatbot",
bubble_full_width=False,
avatar_images=(None, (os.path.join(os.path.dirname(__file__), "autogen.png"))),
# height=600,
)

txt_input = gr.Textbox(
scale=4,
show_label=False,
placeholder="Enter text and press enter",
container=False,
)

with gr.Row():
txt_model = gr.Dropdown(
label="Model",
choices=[
"gpt-4",
"gpt-35-turbo",
"gpt-3.5-turbo",
],
allow_custom_value=True,
value="gpt-35-turbo",
container=True,
)
txt_oai_key = gr.Textbox(
label="OpenAI API Key",
placeholder="Enter key and press enter",
max_lines=1,
show_label=True,
value=os.environ.get("OPENAI_API_KEY", ""),
container=True,
type="password",
)
...

clear = gr.ClearButton([txt_input, chatbot])

...

if __name__ == "__main__":
demo.launch(share=True)

The online app and the source code are hosted in HuggingFace. Feel free to give it a try!

Read More

You can check out more example notebooks for RAG use cases:

· 3 min read
Jiale Liu

TL;DR: We demonstrate how to use autogen for local LLM application. As an example, we will initiate an endpoint using FastChat and perform inference on ChatGLMv2-6b.

Preparations

Clone FastChat

FastChat provides OpenAI-compatible APIs for its supported models, so you can use FastChat as a local drop-in replacement for OpenAI APIs. However, its code needs minor modification in order to function properly.

git clone https://github.com/lm-sys/FastChat.git
cd FastChat

Download checkpoint

ChatGLM-6B is an open bilingual language model based on General Language Model (GLM) framework, with 6.2 billion parameters. ChatGLM2-6B is its second-generation version.

Before downloading from HuggingFace Hub, you need to have Git LFS installed.

git clone https://huggingface.co/THUDM/chatglm2-6b

Initiate server

First, launch the controller

python -m fastchat.serve.controller

Then, launch the model worker(s)

python -m fastchat.serve.model_worker --model-path chatglm2-6b

Finally, launch the RESTful API server

python -m fastchat.serve.openai_api_server --host localhost --port 8000

Normally this will work. However, if you encounter error like this, commenting out all the lines containing finish_reason in fastchat/protocol/api_protocol.py and fastchat/protocol/openai_api_protocol.py will fix the problem. The modified code looks like:

class CompletionResponseChoice(BaseModel):
index: int
text: str
logprobs: Optional[int] = None
# finish_reason: Optional[Literal["stop", "length"]]

class CompletionResponseStreamChoice(BaseModel):
index: int
text: str
logprobs: Optional[float] = None
# finish_reason: Optional[Literal["stop", "length"]] = None

Interact with model using oai.Completion (requires openai<1)

Now the models can be directly accessed through openai-python library as well as autogen.oai.Completion and autogen.oai.ChatCompletion.

from autogen import oai

# create a text completion request
response = oai.Completion.create(
config_list=[
{
"model": "chatglm2-6b",
"base_url": "http://localhost:8000/v1",
"api_type": "openai",
"api_key": "NULL", # just a placeholder
}
],
prompt="Hi",
)
print(response)

# create a chat completion request
response = oai.ChatCompletion.create(
config_list=[
{
"model": "chatglm2-6b",
"base_url": "http://localhost:8000/v1",
"api_type": "openai",
"api_key": "NULL",
}
],
messages=[{"role": "user", "content": "Hi"}]
)
print(response)

If you would like to switch to different models, download their checkpoints and specify model path when launching model worker(s).

interacting with multiple local LLMs

If you would like to interact with multiple LLMs on your local machine, replace the model_worker step above with a multi model variant:

python -m fastchat.serve.multi_model_worker \
--model-path lmsys/vicuna-7b-v1.3 \
--model-names vicuna-7b-v1.3 \
--model-path chatglm2-6b \
--model-names chatglm2-6b

The inference code would be:

from autogen import oai

# create a chat completion request
response = oai.ChatCompletion.create(
config_list=[
{
"model": "chatglm2-6b",
"base_url": "http://localhost:8000/v1",
"api_type": "openai",
"api_key": "NULL",
},
{
"model": "vicuna-7b-v1.3",
"base_url": "http://localhost:8000/v1",
"api_type": "openai",
"api_key": "NULL",
}
],
messages=[{"role": "user", "content": "Hi"}]
)
print(response)

For Further Reading

· 8 min read
Yiran Wu

MathChat WorkFlow TL;DR:

  • We introduce MathChat, a conversational framework leveraging Large Language Models (LLMs), specifically GPT-4, to solve advanced mathematical problems.
  • MathChat improves LLM's performance on challenging math problem-solving, outperforming basic prompting and other strategies by about 6%. The improvement was especially notable in the Algebra category, with a 15% increase in accuracy.
  • Despite the advancement, GPT-4 still struggles to solve very challenging math problems, even with effective prompting strategies. Further improvements are needed, such as the development of more specific assistant models or the integration of new tools and prompts.

Recent Large Language Models (LLMs) like GTP-3.5 and GPT-4 have demonstrated astonishing abilities over previous models on various tasks, such as text generation, question answering, and code generation. Moreover, these models can communicate with humans through conversations and remember previous contexts, making it easier for humans to interact with them. These models play an increasingly important role in our daily lives assisting people with different tasks, such as writing emails, summarizing documents, and writing code.

In this blog post, we probe into the problem-solving capabilities of LLMs. Specifically, we are interested in their capabilities to solve advanced math problems, which could be representative of a broader class of problems that require precise reasoning and also have deterministic solutions.

We introduce MathChat, a conversational framework designed for solving challenging math problems with LLMs. This framework takes advantage of the chat-optimized feature of state-of-the-art LLMs, where a user proxy agent and an LLM assistant work together to tackle math problems. We also test previous prompting techniques for comparison.

The MathChat Framework

MathChat simulates a conversation between the LLM assistant and a user proxy agent. As the name indicates, the user proxy agent acts as a proxy for the user, which is responsible for communicating with the LLM assistant and continuing the conversation in a desired manner.

The proxy agent first presents a math problem to the LLM assistant, framed by an initial prompt that guides the assistant in how it should work collaboratively to solve the problem. We incorporate existing prompting methods such as CoT (chain-of-thought) and tool-using are incorporated in the initial prompt. Key components of the prompt include:

  • Tool-using Prompt: This guides the LLM assistant to use Python code in the correct format to address the problem, allowing the user proxy agent to parse the code and provide the appropriate results.

  • Problem-Solving Strategy Selection Prompt: The assistant is instructed to choose one of three potential problem-solving strategies, including:

    1. Writing a Python program to solve the problem directly.
    2. Solving the problem directly without using Python, which exercises LLMs' inherent reasoning capabilities.
    3. Solving the problem step by step with Python. This should be used when the first two strategies aren't suitable.
  • Final Answer Encapsulation Prompt: This part instructs the assistant to put the final answer in \boxed.

The prompt design in MathChat is flexible and can easily be refined to include other tools, such as Wolfram Alpha.

Let's take a look at an example between the User Proxy Agent and the LLM Assistant (GPT-4). The conversation focuses on how to solve inequality using Python. (The conversation is modified for readability.)

ParticipantDialogue
User Proxy Agent(Initial Prompt: Let's solve a math problem with Python...) Problem: Find all x that satisfy the inequality (2x+10)(x+3) < (3x+9)(x+8). Express your answer in interval notation.
LLM assistantThe key idea to solve this problem is to find the critical points, which are the points where the inequality is equal. Then, test the intervals between the critical points to find the interval for which the inequality holds.
LLM assistant (continued)Here's Python code to find the critical points: from sympy import symbols, Eq, solve; x = symbols("x"); lhs = (2*x + 10)*(x + 3); rhs = (3*x + 9)*(x + 8); equation = Eq(lhs, rhs); critical_points = solve(equation, x); critical_points
User Proxy AgentThe critical points are: [-14, -3]
LLM assistantWe found the critical points to be x = -14 and x = -3. Now we'll test the intervals between these critical points as well as points before and after the critical points. The intervals are: x < -14, -14 < x < -3, x > -3.
LLM assistant (continued)Here's Python code to test the inequality for each interval: from sympy import symbols; x = symbols("x"); inequality = (2*x + 10)*(x + 3) < (3*x + 9)*(x + 8); test_points = [-15, -5, 0]; intervals = []; for point in test_points: if inequality.subs(x, point): intervals.append(point); intervals
User Proxy AgentThe intervals where the inequality holds true are for test points: [-15, 0]
LLM assistantWe have tested the intervals and found that the inequality holds for: 1. x < -14 (test point: -15) 2. x > -3 (test point: 0). Therefore, the solution to the inequality (2x+10)(x+3) < (3x+9)(x+8) is x ∈ (-∞, -14) ∪ (-3, ∞)

Experiment Setup

We evaluate the improvement brought by MathChat.

For the experiment, we focus on the level-5 problems from the MATH dataset, which are composed of high school competition problems. These problems include the application of theorems and complex equation derivation and are challenging even for undergraduate students. We evaluate 6 of 7 categories from the dataset (excluding Geometry): Prealgebra, Algebra, Number Theory, Counting and Probability, Intermediate Algebra, and Precalculus.

We evaluate GPT-4 and use the default configuration of the OpenAI API. To access the final performance, we manually compare the final answer with the correct answer. For the vanilla prompt, Program Synthesis, and MathChat, we have GPT-4 enclose the final answer in \boxed, and we take the return of the function in PoT as the final answer.

We also evaluate the following methods for comparison:

  1. Vanilla prompting: Evaluates GPT-4's direct problem-solving capability. The prompt used is: " Solve the problem carefully. Put the final answer in \boxed".

  2. Program of Thoughts (PoT): Uses a zero-shot PoT prompt that requests the model to create a Solver function to solve the problem and return the final answer.

  3. Program Synthesis (PS) prompting: Like PoT, it prompts the model to write a program to solve the problem. The prompt used is: "Write a program that answers the following question: {Problem}".

Experiment Results

The accuracy on all the problems with difficulty level-5 from different categories of the MATH dataset with different methods is shown below:

Result

We found that compared to basic prompting, which demonstrates the innate capabilities of GPT-4, utilizing Python within the context of PoT or PS strategy improved the overall accuracy by about 10%. This increase was mostly seen in categories involving more numerical manipulations, such as Counting & Probability and Number Theory, and in more complex categories like Intermediate Algebra and Precalculus.

For categories like Algebra and Prealgebra, PoT and PS showed little improvement, and in some instances, even led to a decrease in accuracy. However, MathChat was able to enhance total accuracy by around 6% compared to PoT and PS, showing competitive performance across all categories. Remarkably, MathChat improved accuracy in the Algebra category by about 15% over other methods. Note that categories like Intermediate Algebra and Precalculus remained challenging for all methods, with only about 20% of problems solved accurately.

The code for experiments can be found at this repository. We now provide an implementation of MathChat using the interactive agents in AutoGen. See this notebook for example usage.

Future Directions

Despite MathChat's improvements over previous methods, the results show that complex math problem is still challenging for recent powerful LLMs, like GPT-4, even with help from external tools.

Further work can be done to enhance this framework or math problem-solving in general:

  • Although enabling the model to use tools like Python can reduce calculation errors, LLMs are still prone to logic errors. Methods like self-consistency (Sample several solutions and take a major vote on the final answer), or self-verification (use another LLM instance to check whether an answer is correct) might improve the performance.
  • Sometimes, whether the LLM can solve the problem depends on the plan it uses. Some plans require less computation and logical reasoning, leaving less room for mistakes.
  • MathChat has the potential to be adapted into a copilot system, which could assist users with math problems. This system could allow users to be more involved in the problem-solving process, potentially enhancing learning.

For Further Reading

Are you working on applications that involve math problem-solving? Would you appreciate additional research or support on the application of LLM-based agents for math problem-solving? Please join our Discord server for discussion.