Skip to main content

2 posts tagged with "RAG"

View All Tags

· 5 min read
Jieyu Zhang



  • Introducing the EcoAssistant, which is designed to solve user queries more accurately and affordably.
  • We show how to let the LLM assistant agent leverage external API to solve user query.
  • We show how to reduce the cost of using GPT models via Assistant Hierarchy.
  • We show how to leverage the idea of Retrieval-augmented Generation (RAG) to improve the success rate via Solution Demonstration.


In this blog, we introduce the EcoAssistant, a system built upon AutoGen with the goal of solving user queries more accurately and affordably.

Problem setup

Recently, users have been using conversational LLMs such as ChatGPT for various queries. Reports indicate that 23% of ChatGPT user queries are for knowledge extraction purposes. Many of these queries require knowledge that is external to the information stored within any pre-trained large language models (LLMs). These tasks can only be completed by generating code to fetch necessary information via external APIs that contain the requested information. In the table below, we show three types of user queries that we aim to address in this work.

DatasetAPIExample query
PlacesGoogle PlacesI’m looking for a 24-hour pharmacy in Montreal, can you find one for me?
WeatherWeather APIWhat is the current cloud coverage in Mumbai, India?
StockAlpha Vantage Stock APICan you give me the opening price of Microsoft for the month of January 2023?

Leveraging external APIs

To address these queries, we first build a two-agent system based on AutoGen, where the first agent is a LLM assistant agent (AssistantAgent in AutoGen) that is responsible for proposing and refining the code and the second agent is a code executor agent (UserProxyAgent in AutoGen) that would extract the generated code and execute it, forwarding the output back to the LLM assistant agent. A visualization of the two-agent system is shown below.


To instruct the assistant agent to leverage external APIs, we only need to add the API name/key dictionary at the beginning of the initial message. The template is shown below, where the red part is the information of APIs and black part is user query.


Importantly, we don't want to reveal our real API key to the assistant agent for safety concerns. Therefore, we use a fake API key to replace the real API key in the initial message. In particular, we generate a random token (e.g., 181dbb37) for each API key and replace the real API key with the token in the initial message. Then, when the code executor execute the code, the fake API key would be automatically replaced by the real API key.

Solution Demonstration

In most practical scenarios, queries from users would appear sequentially over time. Our EcoAssistant leverages past success to help the LLM assistants address future queries via Solution Demonstration. Specifically, whenever a query is deemed successfully resolved by user feedback, we capture and store the query and the final generated code snippet. These query-code pairs are saved in a specialized vector database. When new queries appear, EcoAssistant retrieves the most similar query from the database, which is then appended with the associated code to the initial prompt for the new query, serving as a demonstration. The new template of initial message is shown below, where the blue part corresponds to the solution demonstration.


We found that this utilization of past successful query-code pairs improves the query resolution process with fewer iterations and enhances the system's performance.

Assistant Hierarchy

LLMs usually have different prices and performance, for example, GPT-3.5-turbo is much cheaper than GPT-4 but also less accurate. Thus, we propose the Assistant Hierarchy to reduce the cost of using LLMs. The core idea is that we use the cheaper LLMs first and only use the more expensive LLMs when necessary. By this way, we are able to reduce the reliance on expensive LLMs and thus reduce the cost. In particular, given multiple LLMs, we initiate one assistant agent for each and start the conversation with the most cost-effective LLM assistant. If the conversation between the current LLM assistant and the code executor concludes without successfully resolving the query, EcoAssistant would then restart the conversation with the next more expensive LLM assistant in the hierarchy. We found that this strategy significantly reduces costs while still effectively addressing queries.

A Synergistic Effect

We found that the Assistant Hierarchy and Solution Demonstration of EcoAssistant have a synergistic effect. Because the query-code database is shared by all LLM assistants, even without specialized design, the solution from more powerful LLM assistant (e.g., GPT-4) could be later retrieved to guide weaker LLM assistant (e.g., GPT-3.5-turbo). Such a synergistic effect further improves the performance and reduces the cost of EcoAssistant.

Experimental Results

We evaluate EcoAssistant on three datasets: Places, Weather, and Stock. When comparing it with a single GPT-4 assistant, we found that EcoAssistant achieves a higher success rate with a lower cost as shown in the figure below. For more details about the experimental results and other experiments, please refer to our paper.


Further reading

Please refer to our paper and codebase for more details about EcoAssistant.

If you find this blog useful, please consider citing:

title={EcoAssistant: Using LLM Assistant More Affordably and Accurately},
author={Zhang, Jieyu and Krishna, Ranjay and Awadallah, Ahmed H and Wang, Chi},
journal={arXiv preprint arXiv:2310.03046},

· 10 min read
Li Jiang

Last update: April 4, 2024; AutoGen version: v0.2.21

RAG Architecture


  • We introduce RetrieveUserProxyAgent and RetrieveAssistantAgent, RAG agents of AutoGen that allows retrieval-augmented generation, and its basic usage.
  • We showcase customizations of RAG agents, such as customizing the embedding function, the text split function and vector database.
  • We also showcase two advanced usage of RAG agents, integrating with group chat and building a Chat application with Gradio.


Retrieval augmentation has emerged as a practical and effective approach for mitigating the intrinsic limitations of LLMs by incorporating external documents. In this blog post, we introduce RAG agents of AutoGen that allows retrieval-augmented generation. The system consists of two agents: a Retrieval-augmented User Proxy agent, called RetrieveUserProxyAgent, and a Retrieval-augmented Assistant agent, called RetrieveAssistantAgent, both of which are extended from built-in agents from AutoGen. The overall architecture of the RAG agents is shown in the figure above.

To use Retrieval-augmented Chat, one needs to initialize two agents including Retrieval-augmented User Proxy and Retrieval-augmented Assistant. Initializing the Retrieval-Augmented User Proxy necessitates specifying a path to the document collection. Subsequently, the Retrieval-Augmented User Proxy can download the documents, segment them into chunks of a specific size, compute embeddings, and store them in a vector database. Once a chat is initiated, the agents collaboratively engage in code generation or question-answering adhering to the procedures outlined below:

  1. The Retrieval-Augmented User Proxy retrieves document chunks based on the embedding similarity, and sends them along with the question to the Retrieval-Augmented Assistant.
  2. The Retrieval-Augmented Assistant employs an LLM to generate code or text as answers based on the question and context provided. If the LLM is unable to produce a satisfactory response, it is instructed to reply with “Update Context” to the Retrieval-Augmented User Proxy.
  3. If a response includes code blocks, the Retrieval-Augmented User Proxy executes the code and sends the output as feedback. If there are no code blocks or instructions to update the context, it terminates the conversation. Otherwise, it updates the context and forwards the question along with the new context to the Retrieval-Augmented Assistant. Note that if human input solicitation is enabled, individuals can proactively send any feedback, including Update Context”, to the Retrieval-Augmented Assistant.
  4. If the Retrieval-Augmented Assistant receives “Update Context”, it requests the next most similar chunks of documents as new context from the Retrieval-Augmented User Proxy. Otherwise, it generates new code or text based on the feedback and chat history. If the LLM fails to generate an answer, it replies with “Update Context” again. This process can be repeated several times. The conversation terminates if no more documents are available for the context.

Basic Usage of RAG Agents

  1. Install dependencies

Please install pyautogen with the [retrievechat] option before using RAG agents.

pip install "pyautogen[retrievechat]"

RetrieveChat can handle various types of documents. By default, it can process plain text and PDF files, including formats such as 'txt', 'json', 'csv', 'tsv', 'md', 'html', 'htm', 'rtf', 'rst', 'jsonl', 'log', 'xml', 'yaml', 'yml' and 'pdf'. If you install unstructured, additional document types such as 'docx', 'doc', 'odt', 'pptx', 'ppt', 'xlsx', 'eml', 'msg', 'epub' will also be supported.

  • Install unstructured in ubuntu
sudo apt-get update
sudo apt-get install -y tesseract-ocr poppler-utils
pip install unstructured[all-docs]

You can find a list of all supported document types by using autogen.retrieve_utils.TEXT_FORMATS.

  1. Import Agents
import autogen
from autogen.agentchat.contrib.retrieve_assistant_agent import RetrieveAssistantAgent
from autogen.agentchat.contrib.retrieve_user_proxy_agent import RetrieveUserProxyAgent
  1. Create an 'RetrieveAssistantAgent' instance named "assistant" and an 'RetrieveUserProxyAgent' instance named "ragproxyagent"
assistant = RetrieveAssistantAgent(
system_message="You are a helpful assistant.",

ragproxyagent = RetrieveUserProxyAgent(
"task": "qa",
"docs_path": "",
  1. Initialize Chat and ask a question
ragproxyagent.initiate_chat(assistant, message=ragproxyagent.message_generator, problem="What is autogen?")

Output is like:

assistant (to ragproxyagent):

AutoGen is a framework that enables the development of large language model (LLM) applications using multiple agents that can converse with each other to solve tasks. The agents are customizable, conversable, and allow human participation. They can operate in various modes that employ combinations of LLMs, human inputs, and tools.

  1. Create a UserProxyAgent and ask the same question
userproxyagent = autogen.UserProxyAgent(name="userproxyagent")
userproxyagent.initiate_chat(assistant, message="What is autogen?")

Output is like:

assistant (to userproxyagent):

In computer software, autogen is a tool that generates program code automatically, without the need for manual coding. It is commonly used in fields such as software engineering, game development, and web development to speed up the development process and reduce errors. Autogen tools typically use pre-programmed rules, templates, and data to create code for repetitive tasks, such as generating user interfaces, database schemas, and data models. Some popular autogen tools include Visual Studio's Code Generator and Unity's Asset Store.


You can see that the output of UserProxyAgent is not related to our autogen since the latest info of autogen is not in ChatGPT's training data. The output of RetrieveUserProxyAgent is correct as it can perform retrieval-augmented generation based on the given documentation file.

Customizing RAG Agents

RetrieveUserProxyAgent is customizable with retrieve_config. There are several parameters to configure based on different use cases. In this section, we'll show how to customize embedding function, text split function and vector database.

Customizing Embedding Function

By default, Sentence Transformers and its pretrained models will be used to compute embeddings. It's possible that you want to use OpenAI, Cohere, HuggingFace or other embedding functions.

  • OpenAI
from chromadb.utils import embedding_functions

openai_ef = embedding_functions.OpenAIEmbeddingFunction(

ragproxyagent = RetrieveUserProxyAgent(
"task": "qa",
"docs_path": "",
"embedding_function": openai_ef,
  • HuggingFace
huggingface_ef = embedding_functions.HuggingFaceEmbeddingFunction(

More examples can be found here.

Customizing Text Split Function

Before we can store the documents into a vector database, we need to split the texts into chunks. Although we have implemented a flexible text splitter in autogen, you may still want to use different text splitters. There are also some existing text split tools which are good to reuse.

For example, you can use all the text splitters in langchain.

from langchain.text_splitter import RecursiveCharacterTextSplitter

recur_spliter = RecursiveCharacterTextSplitter(separators=["\n", "\r", "\t"])

ragproxyagent = RetrieveUserProxyAgent(
"task": "qa",
"docs_path": "",
"custom_text_split_function": recur_spliter.split_text,

Customizing Vector Database

We are using chromadb as the default vector database, you can also replace it with any other vector database by simply overriding the function retrieve_docs of RetrieveUserProxyAgent.

For example, you can use Qdrant as below:

# Creating qdrant client
from qdrant_client import QdrantClient

client = QdrantClient(url="***", api_key="***")

# Wrapping RetrieveUserProxyAgent
from litellm import embedding as test_embedding
from autogen.agentchat.contrib.retrieve_user_proxy_agent import RetrieveUserProxyAgent
from qdrant_client.models import SearchRequest, Filter, FieldCondition, MatchText

class QdrantRetrieveUserProxyAgent(RetrieveUserProxyAgent):
def query_vector_db(
query_texts: List[str],
n_results: int = 10,
search_string: str = "",
) -> Dict[str, Union[List[str], List[List[str]]]]:
# define your own query function here
embed_response = test_embedding('text-embedding-ada-002', input=query_texts)

all_embeddings: List[List[float]] = []

for item in embed_response['data']:

search_queries: List[SearchRequest] = []

for embedding in all_embeddings:

search_response = client.search_batch(
collection_name="{your collection name}",

return {
"ids": [[ for scored_point in batch] for batch in search_response],
"documents": [[scored_point.payload.get('page_content', '') for scored_point in batch] for batch in search_response],
"metadatas": [[scored_point.payload.get('metadata', {}) for scored_point in batch] for batch in search_response]

def retrieve_docs(self, problem: str, n_results: int = 20, search_string: str = "", **kwargs):
results = self.query_vector_db(

self._results = results

# Use QdrantRetrieveUserProxyAgent
qdrantragagent = QdrantRetrieveUserProxyAgent(
"task": "qa",

qdrantragagent.retrieve_docs("What is Autogen?", n_results=10, search_string="autogen")

Advanced Usage of RAG Agents

Integrate with other agents in a group chat

To use RetrieveUserProxyAgent in a group chat is almost the same as you use it in a two agents chat. The only thing is that you need to initialize the chat with RetrieveUserProxyAgent. The RetrieveAssistantAgent is not necessary in a group chat.

However, you may want to initialize the chat with another agent in some cases. To leverage the best of RetrieveUserProxyAgent, you'll need to call it from a function.

boss = autogen.UserProxyAgent(
system_message="The boss who ask questions and give tasks.",

boss_aid = RetrieveUserProxyAgent(
system_message="Assistant who has extra content retrieval power for solving difficult problems.",
"task": "qa",
code_execution_config=False, # we don't want to execute code in this case.

coder = AssistantAgent(
system_message="You are a senior python engineer. Reply `TERMINATE` in the end when everything is done.",
llm_config={"config_list": config_list, "timeout": 60, "temperature": 0},

pm = autogen.AssistantAgent(
system_message="You are a product manager. Reply `TERMINATE` in the end when everything is done.",
llm_config={"config_list": config_list, "timeout": 60, "temperature": 0},

reviewer = autogen.AssistantAgent(
system_message="You are a code reviewer. Reply `TERMINATE` in the end when everything is done.",
llm_config={"config_list": config_list, "timeout": 60, "temperature": 0},

def retrieve_content(
message: Annotated[
"Refined message which keeps the original meaning and can be used to retrieve content for code generation and question answering.",
n_results: Annotated[int, "number of results"] = 3,
) -> str:
boss_aid.n_results = n_results # Set the number of results to be retrieved.
# Check if we need to update the context.
update_context_case1, update_context_case2 = boss_aid._check_update_context(message)
if (update_context_case1 or update_context_case2) and boss_aid.update_context:
boss_aid.problem = message if not hasattr(boss_aid, "problem") else boss_aid.problem
_, ret_msg = boss_aid._generate_retrieve_user_reply(message)
_context = {"problem": message, "n_results": n_results}
ret_msg = boss_aid.message_generator(boss_aid, None, _context)
return ret_msg if ret_msg else message

for caller in [pm, coder, reviewer]:
d_retrieve_content = caller.register_for_llm(
description="retrieve content for code generation and question answering.", api_style="function"

for executor in [boss, pm]:

groupchat = autogen.GroupChat(
agents=[boss, pm, coder, reviewer],

llm_config = {"config_list": config_list, "timeout": 60, "temperature": 0}
manager = autogen.GroupChatManager(groupchat=groupchat, llm_config=llm_config)

# Start chatting with the boss as this is the user proxy agent.
message="How to use spark for parallel training in FLAML? Give me sample code.",

Build a Chat application with Gradio

Now, let's wrap it up and make a Chat application with AutoGen and Gradio.

RAG ChatBot with AutoGen

# Initialize Agents
def initialize_agents(config_list, docs_path=None):
return assistant, ragproxyagent

# Initialize Chat
def initiate_chat(config_list, problem, queue, n_results=3):
assistant, problem=problem, silent=False, n_results=n_results
messages = ragproxyagent.chat_messages
messages = [messages[k] for k in messages.keys()][0]
messages = [m["content"] for m in messages if m["role"] == "user"]
print("messages: ", messages)
except Exception as e:
messages = [str(e)]

# Wrap AutoGen part into a function
def chatbot_reply(input_text):
"""Chat with the agent through terminal."""
queue = mp.Queue()
process = mp.Process(
args=(config_list, input_text, queue),
messages = queue.get(timeout=TIMEOUT)
except Exception as e:
messages = [str(e) if len(str(e)) > 0 else "Invalid Request to OpenAI, please check your API keys."]
return messages


# Set up UI with Gradio
with gr.Blocks() as demo:
assistant, ragproxyagent = initialize_agents(config_list)

chatbot = gr.Chatbot(
avatar_images=(None, (os.path.join(os.path.dirname(__file__), "autogen.png"))),
# height=600,

txt_input = gr.Textbox(
placeholder="Enter text and press enter",

with gr.Row():
txt_model = gr.Dropdown(
txt_oai_key = gr.Textbox(
label="OpenAI API Key",
placeholder="Enter key and press enter",
value=os.environ.get("OPENAI_API_KEY", ""),

clear = gr.ClearButton([txt_input, chatbot])


if __name__ == "__main__":

The online app and the source code are hosted in HuggingFace. Feel free to give it a try!

Read More

You can check out more example notebooks for RAG use cases: