TL;DR: We demonstrate how to use flaml.autogen for local LLM applications. As an example, we will set up an endpoint using FastChat and perform inference on ChatGLM2-6b.
Preparations
Clone FastChat
FastChat provides OpenAI-compatible APIs for its supported models, so you can use FastChat as a local drop-in replacement for OpenAI APIs. However, its code needs minor modification in order to function properly.
```bash
git clone https://github.com/lm-sys/FastChat.git
cd FastChat
```
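After cloning, install FastChat from source. The snippet below is a minimal sketch assuming a standard editable install is sufficient for your setup; check FastChat's README for the recommended optional extras (e.g. model worker dependencies):
```bash
# editable install from the cloned repository
pip3 install --upgrade pip
pip3 install -e .
```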
Download checkpoint
ChatGLM-6B is an open bilingual language model based on General Language Model (GLM) framework, with 6.2 billion parameters. ChatGLM2-6B is its second-generation version.
Before downloading from HuggingFace Hub, you need to have Git LFS installed.
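If Git LFS has not been set up yet, a typical setup looks like the following (assuming the git-lfs package is already installed via your system's package manager):
```bash
# enable the Git LFS hooks for your user account
git lfs install
```
With Git LFS ready, clone the model repository: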
```bash
git clone https://huggingface.co/THUDM/chatglm2-6b
```
Initiate server
First, launch the controller
```bash
python -m fastchat.serve.controller
```
Then, launch the model worker(s)
```bash
python -m fastchat.serve.model_worker --model-path chatglm2-6b
```
Finally, launch the RESTful API server
```bash
python -m fastchat.serve.openai_api_server --host localhost --port 8000
```
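As a quick sanity check (assuming the default host and port above), you can list the models registered with the OpenAI-compatible server:
```bash
# should return a JSON list that includes the served model, e.g. chatglm2-6b
curl http://localhost:8000/v1/models
```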
Normally this will work. However, if you encounter a validation error related to finish_reason, commenting out all the lines containing finish_reason in fastchat/protocol/api_protocal.py and fastchat/protocol/openai_api_protocol.py will fix the problem. The modified code looks like:
```python
class CompletionResponseChoice(BaseModel):
    index: int
    text: str
    logprobs: Optional[int] = None
    # finish_reason: Optional[Literal["stop", "length"]]


class CompletionResponseStreamChoice(BaseModel):
    index: int
    text: str
    logprobs: Optional[float] = None
    # finish_reason: Optional[Literal["stop", "length"]] = None
```
Interact with model using oai.Completion
Now the models can be directly accessed through the openai-python library as well as flaml.oai.Completion and flaml.oai.ChatCompletion.
```python
from flaml import oai

# create a text completion request
response = oai.Completion.create(
    config_list=[
        {
            "model": "chatglm2-6b",
            "api_base": "http://localhost:8000/v1",
            "api_type": "open_ai",
            "api_key": "NULL",  # just a placeholder
        }
    ],
    prompt="Hi",
)
print(response)

# create a chat completion request
response = oai.ChatCompletion.create(
    config_list=[
        {
            "model": "chatglm2-6b",
            "api_base": "http://localhost:8000/v1",
            "api_type": "open_ai",
            "api_key": "NULL",
        }
    ],
    messages=[{"role": "user", "content": "Hi"}],
)
print(response)
```
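Because the endpoint is OpenAI-compatible, the same server can also be queried with the openai-python library directly. Below is a minimal sketch assuming the 0.x openai SDK interface that was current at the time of writing:
```python
import openai

# point the client at the local FastChat endpoint
openai.api_base = "http://localhost:8000/v1"
openai.api_key = "NULL"  # placeholder; the local server does not check it

response = openai.ChatCompletion.create(
    model="chatglm2-6b",
    messages=[{"role": "user", "content": "Hi"}],
)
print(response)
```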
If you would like to switch to a different model, download its checkpoint and specify the model path when launching the model worker(s).
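For example, to serve Vicuna instead (the same model path is used in the multi-model example below), the worker launch would look like:
```bash
python -m fastchat.serve.model_worker --model-path lmsys/vicuna-7b-v1.3
```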
Interacting with multiple local LLMs
If you would like to interact with multiple LLMs on your local machine, replace the model_worker step above with a multi-model variant:
```bash
python -m fastchat.serve.multi_model_worker \
    --model-path lmsys/vicuna-7b-v1.3 \
    --model-names vicuna-7b-v1.3 \
    --model-path chatglm2-6b \
    --model-names chatglm2-6b
```
The inference code would be:
```python
from flaml import oai

# create a chat completion request
response = oai.ChatCompletion.create(
    config_list=[
        {
            "model": "chatglm2-6b",
            "api_base": "http://localhost:8000/v1",
            "api_type": "open_ai",
            "api_key": "NULL",
        },
        {
            "model": "vicuna-7b-v1.3",
            "api_base": "http://localhost:8000/v1",
            "api_type": "open_ai",
            "api_key": "NULL",
        },
    ],
    messages=[{"role": "user", "content": "Hi"}],
)
print(response)
```
For Further Reading
- Documentation about flaml.autogen
- Documentation about FastChat