retrieve_utils
UNSTRUCTURED_FORMATS
These formats will be parsed by the 'unstructured' library, if installed.
split_text_to_chunks
def split_text_to_chunks(text: str,
max_tokens: int = 4000,
chunk_mode: str = "multi_lines",
must_break_at_empty_line: bool = True,
overlap: int = 0)
Split a long text into chunks of max_tokens.
extract_text_from_pdf
def extract_text_from_pdf(file: str) -> str
Extract text from PDF files
split_files_to_chunks
def split_files_to_chunks(
files: list,
max_tokens: int = 4000,
chunk_mode: str = "multi_lines",
must_break_at_empty_line: bool = True,
custom_text_split_function: Callable = None
) -> Tuple[List[str], List[dict]]
Split a list of files into chunks of max_tokens.
get_files_from_dir
def get_files_from_dir(dir_path: Union[str, List[str]],
types: list = TEXT_FORMATS,
recursive: bool = True)
Return a list of all the files in a given directory, a url, a file path or a list of them.
parse_html_to_markdown
def parse_html_to_markdown(html: str, url: str = None) -> str
Parse HTML to markdown.
get_file_from_url
def get_file_from_url(url: str, save_path: str = None) -> Tuple[str, str]
Download a file from a URL.
is_url
def is_url(string: str)
Return True if the string is a valid URL.
create_vector_db_from_dir
def create_vector_db_from_dir(dir_path: Union[str, List[str]],
max_tokens: int = 4000,
client: API = None,
db_path: str = "tmp/chromadb.db",
collection_name: str = "all-my-documents",
get_or_create: bool = False,
chunk_mode: str = "multi_lines",
must_break_at_empty_line: bool = True,
embedding_model: str = "all-MiniLM-L6-v2",
embedding_function: Callable = None,
custom_text_split_function: Callable = None,
custom_text_types: List[str] = TEXT_FORMATS,
recursive: bool = True,
extra_docs: bool = False) -> API
Create a vector db from all the files in a given directory, the directory can also be a single file or a url to a single file. We support chromadb compatible APIs to create the vector db, this function is not required if you prepared your own vector db.
Arguments:
dir_path
Union[str, List[str]] - the path to the directory, file, url or a list of them.max_tokens
Optional, int - the maximum number of tokens per chunk. Default is 4000.client
Optional, API - the chromadb client. Default is None.db_path
Optional, str - the path to the chromadb. Default is "tmp/chromadb.db". The default was/tmp/chromadb.db
for version <=0.2.24.collection_name
Optional, str - the name of the collection. Default is "all-my-documents".get_or_create
Optional, bool - Whether to get or create the collection. Default is False. If True, the collection will be returned if it already exists. Will raise ValueError if the collection already exists and get_or_create is False.chunk_mode
Optional, str - the chunk mode. Default is "multi_lines".must_break_at_empty_line
Optional, bool - Whether to break at empty line. Default is True.embedding_model
Optional, str - the embedding model to use. Default is "all-MiniLM-L6-v2". Will be ignored if embedding_function is not None.embedding_function
Optional, Callable - the embedding function to use. Default is None, SentenceTransformer with the givenembedding_model
will be used. If you want to use OpenAI, Cohere, HuggingFace or other embedding functions, you can pass it here, follow the examples inhttps://docs.trychroma.com/guides/embeddings
.custom_text_split_function
Optional, Callable - a custom function to split a string into a list of strings. Default is None, will use the default function inautogen.retrieve_utils.split_text_to_chunks
.custom_text_types
Optional, List[str] - a list of file types to be processed. Default is TEXT_FORMATS.recursive
Optional, bool - whether to search documents recursively in the dir_path. Default is True.extra_docs
Optional, bool - whether to add more documents in the collection. Default is False
Returns:
The chromadb client.
query_vector_db
def query_vector_db(query_texts: List[str],
n_results: int = 10,
client: API = None,
db_path: str = "tmp/chromadb.db",
collection_name: str = "all-my-documents",
search_string: str = "",
embedding_model: str = "all-MiniLM-L6-v2",
embedding_function: Callable = None) -> QueryResult
Query a vector db. We support chromadb compatible APIs, it's not required if you prepared your own vector db and query function.
Arguments:
query_texts
List[str] - the list of strings which will be used to query the vector db.n_results
Optional, int - the number of results to return. Default is 10.client
Optional, API - the chromadb compatible client. Default is None, a chromadb client will be used.db_path
Optional, str - the path to the vector db. Default is "tmp/chromadb.db". The default was/tmp/chromadb.db
for version <=0.2.24.collection_name
Optional, str - the name of the collection. Default is "all-my-documents".search_string
Optional, str - the search string. Only docs that contain an exact match of this string will be retrieved. Default is "".embedding_model
Optional, str - the embedding model to use. Default is "all-MiniLM-L6-v2". Will be ignored if embedding_function is not None.embedding_function
Optional, Callable - the embedding function to use. Default is None, SentenceTransformer with the givenembedding_model
will be used. If you want to use OpenAI, Cohere, HuggingFace or other embedding functions, you can pass it here, follow the examples inhttps://docs.trychroma.com/guides/embeddings
.
Returns:
The query result. The format is:
class QueryResult(TypedDict):
ids: List[IDs]
embeddings: Optional[List[List[Embedding]]]
documents: Optional[List[List[Document]]]
metadatas: Optional[List[List[Metadata]]]
distances: Optional[List[List[float]]]