These formats will be parsed by the 'unstructured' library, if installed.


def split_text_to_chunks(text: str,
                         max_tokens: int = 4000,
                         chunk_mode: str = "multi_lines",
                         must_break_at_empty_line: bool = True,
                         overlap: int = 0)

Split a long text into chunks of at most max_tokens tokens each.
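
As a rough illustration of line-based chunking under a token budget, the sketch below greedily packs consecutive lines into a chunk until the next line would exceed the budget. It is not the library's implementation: `count_tokens` is a hypothetical whitespace-based stand-in for a real tokenizer, `chunk_lines` is an illustrative name, and options such as chunk_mode, must_break_at_empty_line and overlap are not modelled.

```python
from typing import List


def count_tokens(text: str) -> int:
    # Hypothetical stand-in: counts whitespace-separated words, not model tokens.
    return len(text.split())


def chunk_lines(text: str, max_tokens: int = 4000) -> List[str]:
    """Greedily pack consecutive lines into chunks of at most max_tokens.

    A single line that exceeds the budget on its own becomes its own chunk.
    """
    chunks: List[str] = []
    current: List[str] = []
    used = 0
    for line in text.split("\n"):
        cost = count_tokens(line)
        # Start a new chunk when adding this line would exceed the budget.
        if current and used + cost > max_tokens:
            chunks.append("\n".join(current))
            current, used = [], 0
        current.append(line)
        used += cost
    if current:
        chunks.append("\n".join(current))
    return chunks


sample = "one two three\nfour five\nsix seven eight nine\nten"
print(chunk_lines(sample, max_tokens=5))
```

With a budget of 5 words, the four sample lines end up in two chunks of two lines each.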


def extract_text_from_pdf(file: str) -> str

Extract text from a PDF file.


def split_files_to_chunks(
    files: list,
    max_tokens: int = 4000,
    chunk_mode: str = "multi_lines",
    must_break_at_empty_line: bool = True,
    custom_text_split_function: Callable = None
) -> Tuple[List[str], List[dict]]

Split a list of files into chunks of at most max_tokens tokens each.


def get_files_from_dir(dir_path: Union[str, List[str]],
                       types: list = TEXT_FORMATS,
                       recursive: bool = True)

Return a list of all the files in a given directory, a url, a file path or a list of them.
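
To make the directory case concrete, here is a minimal sketch of collecting files by extension, with an optional recursive walk. It only covers a single local directory; URL and single-file inputs are left out. The name `list_files` and the extension-matching rule are assumptions for illustration, not the library's code.

```python
import os
import tempfile
from typing import List, Sequence


def list_files(dir_path: str, types: Sequence[str], recursive: bool = True) -> List[str]:
    """Return sorted paths under dir_path whose extension is in types."""
    wanted = {t.lower().lstrip(".") for t in types}
    found: List[str] = []
    for root, _dirs, files in os.walk(dir_path):
        for name in files:
            ext = os.path.splitext(name)[1].lower().lstrip(".")
            if ext in wanted:
                found.append(os.path.join(root, name))
        if not recursive:
            break  # os.walk yields the top directory first; stop before descending
    return sorted(found)


# Build a throwaway directory tree to exercise the helper.
with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, "sub"))
    for rel in ("a.txt", "b.md", "c.bin", os.path.join("sub", "d.txt")):
        with open(os.path.join(tmp, rel), "w") as fh:
            fh.write("x")
    deep = [os.path.relpath(p, tmp) for p in list_files(tmp, ["txt", "md"])]
    shallow = [os.path.relpath(p, tmp) for p in list_files(tmp, ["txt", "md"], recursive=False)]

print(deep)     # includes the file in "sub"
print(shallow)  # top-level files only
```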


def parse_html_to_markdown(html: str, url: str = None) -> str

Parse HTML to markdown.


def get_file_from_url(url: str, save_path: str = None) -> Tuple[str, str]

Download a file from a URL.


def is_url(string: str)

Return True if the string is a valid URL.
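
One common way to perform such a check with the standard library is to parse the string and require both a scheme and a network location. The sketch below uses a hypothetical name, `looks_like_url`, and may accept or reject edge cases differently from the library function.

```python
from urllib.parse import urlparse


def looks_like_url(string: str) -> bool:
    """Return True if the string parses with both a scheme and a network location."""
    try:
        parts = urlparse(string)
    except ValueError:
        return False
    return bool(parts.scheme and parts.netloc)


print(looks_like_url("https://example.com/docs/readme.md"))
print(looks_like_url("docs/readme.md"))
```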


def create_vector_db_from_dir(dir_path: Union[str, List[str]],
                              max_tokens: int = 4000,
                              client: API = None,
                              db_path: str = "tmp/chromadb.db",
                              collection_name: str = "all-my-documents",
                              get_or_create: bool = False,
                              chunk_mode: str = "multi_lines",
                              must_break_at_empty_line: bool = True,
                              embedding_model: str = "all-MiniLM-L6-v2",
                              embedding_function: Callable = None,
                              custom_text_split_function: Callable = None,
                              custom_text_types: List[str] = TEXT_FORMATS,
                              recursive: bool = True,
                              extra_docs: bool = False) -> API

Create a vector db from all the files in a given directory. The directory can also be a single file or a URL to a single file. We support chromadb-compatible APIs to create the vector db. This function is not required if you have prepared your own vector db.


  • dir_path Union[str, List[str]] - the path to the directory, file, url or a list of them.
  • max_tokens Optional, int - the maximum number of tokens per chunk. Default is 4000.
  • client Optional, API - the chromadb client. Default is None.
  • db_path Optional, str - the path to the chromadb. Default is "tmp/chromadb.db". The default was /tmp/chromadb.db for version <=0.2.24.
  • collection_name Optional, str - the name of the collection. Default is "all-my-documents".
  • get_or_create Optional, bool - whether to get or create the collection. Default is False. If True, the existing collection is returned when it already exists. If False and the collection already exists, a ValueError is raised.
  • chunk_mode Optional, str - the chunk mode. Default is "multi_lines".
  • must_break_at_empty_line Optional, bool - Whether to break at empty line. Default is True.
  • embedding_model Optional, str - the embedding model to use. Default is "all-MiniLM-L6-v2". Will be ignored if embedding_function is not None.
  • embedding_function Optional, Callable - the embedding function to use. Default is None, SentenceTransformer with the given embedding_model will be used. If you want to use OpenAI, Cohere, HuggingFace or other embedding functions, you can pass it here, follow the examples in
  • custom_text_split_function Optional, Callable - a custom function to split a string into a list of strings. Default is None, will use the default function in autogen.retrieve_utils.split_text_to_chunks.
  • custom_text_types Optional, List[str] - a list of file types to be processed. Default is TEXT_FORMATS.
  • recursive Optional, bool - whether to search documents recursively in the dir_path. Default is True.
  • extra_docs Optional, bool - whether to add more documents to the collection. Default is False.


Returns the chromadb client.
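
The custom_text_split_function argument is described above as a callable that splits a string into a list of strings. A minimal sketch of such a callable follows, assuming it receives the raw text as its only argument. The name `split_by_paragraph` is hypothetical, and the exact arguments the library passes to the callable should be checked against its source.

```python
import re
from typing import List


def split_by_paragraph(text: str) -> List[str]:
    """Split text on blank lines and drop empty pieces."""
    pieces = re.split(r"\n\s*\n", text)
    return [p.strip() for p in pieces if p.strip()]


doc = "First paragraph.\nStill first.\n\nSecond paragraph.\n\n\nThird."
print(split_by_paragraph(doc))
```

A function of this shape would be supplied as custom_text_split_function=split_by_paragraph.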


def query_vector_db(query_texts: List[str],
                    n_results: int = 10,
                    client: API = None,
                    db_path: str = "tmp/chromadb.db",
                    collection_name: str = "all-my-documents",
                    search_string: str = "",
                    embedding_model: str = "all-MiniLM-L6-v2",
                    embedding_function: Callable = None) -> QueryResult

Query a vector db. We support chromadb-compatible APIs. This function is not required if you have prepared your own vector db and query function.


  • query_texts List[str] - the list of strings which will be used to query the vector db.
  • n_results Optional, int - the number of results to return. Default is 10.
  • client Optional, API - the chromadb compatible client. Default is None, a chromadb client will be used.
  • db_path Optional, str - the path to the vector db. Default is "tmp/chromadb.db". The default was /tmp/chromadb.db for version <=0.2.24.
  • collection_name Optional, str - the name of the collection. Default is "all-my-documents".
  • search_string Optional, str - the search string. Only docs that contain an exact match of this string will be retrieved. Default is "".
  • embedding_model Optional, str - the embedding model to use. Default is "all-MiniLM-L6-v2". Will be ignored if embedding_function is not None.
  • embedding_function Optional, Callable - the embedding function to use. Default is None, SentenceTransformer with the given embedding_model will be used. If you want to use OpenAI, Cohere, HuggingFace or other embedding functions, you can pass it here, follow the examples in


Returns the query result, in the following format:

class QueryResult(TypedDict):
    ids: List[IDs]
    embeddings: Optional[List[List[Embedding]]]
    documents: Optional[List[List[Document]]]
    metadatas: Optional[List[List[Metadata]]]
    distances: Optional[List[List[float]]]
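
In this format the outer list of each field is indexed by query and the inner list by result, so the fields can be zipped together per query. The sketch below walks a hand-written dictionary in the QueryResult shape. The ids, documents, and distances are made-up illustrative values, `hits_per_query` is a hypothetical helper, and it assumes documents and distances were actually returned (both are Optional in the type).

```python
from typing import Dict, List, Tuple

# Hand-written data in the QueryResult shape; the values are illustrative only.
result: Dict[str, object] = {
    "ids": [["doc_3", "doc_7"], ["doc_1"]],
    "embeddings": None,
    "documents": [["chunk about cats", "chunk about dogs"], ["chunk about birds"]],
    "metadatas": [[{"source": "a.txt"}, {"source": "b.txt"}], [{"source": "c.txt"}]],
    "distances": [[0.12, 0.35], [0.08]],
}


def hits_per_query(res: dict) -> List[List[Tuple[str, str, float]]]:
    """Return (id, document, distance) tuples grouped by query position."""
    grouped = []
    for ids, docs, dists in zip(res["ids"], res["documents"], res["distances"]):
        grouped.append(list(zip(ids, docs, dists)))
    return grouped


for i, hits in enumerate(hits_per_query(result)):
    for doc_id, doc, dist in hits:
        print(f"query {i}: {doc_id} ({dist:.2f}) {doc}")
```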