Skip to main content

Use flaml.autogen for Local LLMs

ยท 3 min read
Jiale Liu

TL;DR: We demonstrate how to use flaml.autogen for local LLM application. As an example, we will initiate an endpoint using FastChat and perform inference on ChatGLMv2-6b.

Preparations

Clone FastChat

FastChat provides OpenAI-compatible APIs for its supported models, so you can use FastChat as a local drop-in replacement for OpenAI APIs. However, its code needs minor modification in order to function properly.

git clone https://github.com/lm-sys/FastChat.git
cd FastChat

Download checkpoint

ChatGLM-6B is an open bilingual language model based on General Language Model (GLM) framework, with 6.2 billion parameters. ChatGLM2-6B is its second-generation version.

Before downloading from HuggingFace Hub, you need to have Git LFS installed.

git clone https://huggingface.co/THUDM/chatglm2-6b

Initiate server

First, launch the controller

python -m fastchat.serve.controller

Then, launch the model worker(s)

python -m fastchat.serve.model_worker --model-path chatglm2-6b

Finally, launch the RESTful API server

python -m fastchat.serve.openai_api_server --host localhost --port 8000

Normally this will work. However, if you encounter error like this, commenting out all the lines containing finish_reason in fastchat/protocol/api_protocal.py and fastchat/protocol/openai_api_protocol.py will fix the problem. The modified code looks like:

class CompletionResponseChoice(BaseModel):
index: int
text: str
logprobs: Optional[int] = None
# finish_reason: Optional[Literal["stop", "length"]]

class CompletionResponseStreamChoice(BaseModel):
index: int
text: str
logprobs: Optional[float] = None
# finish_reason: Optional[Literal["stop", "length"]] = None

Interact with model using oai.Completion

Now the models can be directly accessed through openai-python library as well as flaml.oai.Completion and flaml.oai.ChatCompletion.

from flaml import oai

# create a text completion request
response = oai.Completion.create(
config_list=[
{
"model": "chatglm2-6b",
"api_base": "http://localhost:8000/v1",
"api_type": "open_ai",
"api_key": "NULL", # just a placeholder
}
],
prompt="Hi",
)
print(response)

# create a chat completion request
response = oai.ChatCompletion.create(
config_list=[
{
"model": "chatglm2-6b",
"api_base": "http://localhost:8000/v1",
"api_type": "open_ai",
"api_key": "NULL",
}
],
messages=[{"role": "user", "content": "Hi"}]
)
print(response)

If you would like to switch to different models, download their checkpoints and specify model path when launching model worker(s).

interacting with multiple local LLMs

If you would like to interact with multiple LLMs on your local machine, replace the model_worker step above with a multi model variant:

python -m fastchat.serve.multi_model_worker \
--model-path lmsys/vicuna-7b-v1.3 \
--model-names vicuna-7b-v1.3 \
--model-path chatglm2-6b \
--model-names chatglm2-6b

The inference code would be:

from flaml import oai

# create a chat completion request
response = oai.ChatCompletion.create(
config_list=[
{
"model": "chatglm2-6b",
"api_base": "http://localhost:8000/v1",
"api_type": "open_ai",
"api_key": "NULL",
},
{
"model": "vicuna-7b-v1.3",
"api_base": "http://localhost:8000/v1",
"api_type": "open_ai",
"api_key": "NULL",
}
],
messages=[{"role": "user", "content": "Hi"}]
)
print(response)

For Further Reading