Indexing Architecture
Key Concepts
Knowledge Model
To support the GraphRAG system, the outputs of the indexing engine (in the Default Configuration Mode) are aligned to a knowledge model we call the GraphRAG Knowledge Model. This model is designed to be an abstraction over the underlying data storage technology and to provide a common interface for the GraphRAG system to interact with. In normal use cases, the outputs of the GraphRAG Indexer are loaded into a database system, and GraphRAG's Query Engine interacts with the database using the knowledge model data-store types.
Workflows
Because of the complexity of our data indexing tasks, we needed to be able to express our data pipeline as a series of multiple, interdependent workflows.
---
title: Sample Workflow DAG
---
stateDiagram-v2
[*] --> Prepare
Prepare --> Chunk
Chunk --> ExtractGraph
Chunk --> EmbedDocuments
ExtractGraph --> GenerateReports
ExtractGraph --> EmbedEntities
ExtractGraph --> EmbedGraph
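One way to read the sample diagram is as a dependency map: each workflow can run only after the workflows pointing to it have finished. The sketch below is illustrative only and is not GraphRAG's scheduler. It encodes the edges from the diagram in a plain dictionary (the `dependencies` name and the `execution_order` helper are assumptions made for this example) and uses Python's standard `graphlib` module to derive one valid execution order.

```python
from graphlib import TopologicalSorter

# Each workflow maps to the set of workflows it depends on,
# mirroring the edges of the sample diagram.
dependencies = {
    "Prepare": set(),
    "Chunk": {"Prepare"},
    "ExtractGraph": {"Chunk"},
    "EmbedDocuments": {"Chunk"},
    "GenerateReports": {"ExtractGraph"},
    "EmbedEntities": {"ExtractGraph"},
    "EmbedGraph": {"ExtractGraph"},
}


def execution_order(deps):
    """Return one valid order in which every workflow runs after its dependencies."""
    return list(TopologicalSorter(deps).static_order())


order = execution_order(dependencies)
print(order)
```

Several orders are valid for this graph (for example, EmbedDocuments may run before or after ExtractGraph), so the exact output can vary; the only guarantee is that a workflow never precedes one it depends on.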
LLM Caching
The GraphRAG library was designed with LLM interactions in mind, and a common setback when working with LLM APIs is errors caused by network latency, throttling, and similar transient failures. To guard against these, we've added a cache layer around LLM interactions. When a completion request is made with the same input set (prompt and tuning parameters), we return the cached result if one exists. This makes the indexer more resilient to network issues, lets it act idempotently, and gives end users a more efficient experience.
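The idea of keying a cache on the full input set can be sketched in a few lines. This is a minimal in-memory illustration, not GraphRAG's cache implementation: the `CachedLLM` class, its `cache_key` method, and the `fake_model` stand-in are all hypothetical names introduced for this example. The key is a hash of the prompt together with the tuning parameters, so a repeated request is served from the cache while a request with different parameters goes through to the model.

```python
import hashlib
import json


class CachedLLM:
    """Wraps a completion function with an in-memory cache (hypothetical helper)."""

    def __init__(self, complete):
        self._complete = complete  # callable(prompt, **params) -> str
        self._cache = {}
        self.calls = 0  # number of requests that reached the underlying model

    @staticmethod
    def cache_key(prompt, params):
        # Serialize deterministically so identical inputs hash identically.
        payload = json.dumps({"prompt": prompt, "params": params}, sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def __call__(self, prompt, **params):
        key = self.cache_key(prompt, params)
        if key not in self._cache:
            self.calls += 1
            self._cache[key] = self._complete(prompt, **params)
        return self._cache[key]


def fake_model(prompt, **params):
    # Stand-in for a real LLM API call, used only to make the sketch runnable.
    return f"echo: {prompt} (temperature={params.get('temperature')})"


llm = CachedLLM(fake_model)
first = llm("Summarize the report", temperature=0.0)
second = llm("Summarize the report", temperature=0.0)  # served from the cache
third = llm("Summarize the report", temperature=0.7)   # different parameters, new request
```

Because a rerun with identical inputs never reaches the model, a pipeline interrupted by a transient failure can be restarted without repeating completed requests.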
Providers & Factories
Several subsystems within GraphRAG use a factory pattern to register and retrieve provider implementations. This allows deep customization to support models, storage, and other components that you may use but that aren't built directly into GraphRAG.
The following subsystems use a factory pattern that allows you to register your own implementations:
- language model - implement your own `chat` and `embed` methods to use a model provider of choice beyond the built-in OpenAI/Azure support
- cache - create your own cache storage location in addition to the file, blob, and CosmosDB ones we provide
- logger - create your own log writing location in addition to the built-in file and blob storage
- storage - create your own storage provider (database, etc.) beyond the file, blob, and CosmosDB ones built in
- vector store - implement your own vector store beyond the built-in LanceDB, Azure AI Search, and CosmosDB ones
- pipeline + workflows - implement your own workflow steps with a custom `run_workflow` function, or register an entire pipeline (list of named workflows)
The links for each of these subsystems point to the source code of the factory, which includes registration of the default built-in implementations. In addition, we have a detailed discussion of language models, which includes an example of a custom provider, and a sample notebook that demonstrates a custom vector store.
All of these factories allow you to register an implementation under any string name you like, including the name of a built-in implementation, which overrides it directly.
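The registration pattern described above can be sketched generically. The code below does not use GraphRAG's factory classes or signatures; `ProviderFactory`, `register`, `create`, `FileCache`, and `InMemoryCache` are hypothetical names chosen for illustration. It shows the two behaviors the text describes: registering an implementation under an arbitrary string name, and overriding a built-in by reusing its name.

```python
from typing import Callable, Dict


class ProviderFactory:
    """A hypothetical registry mapping string names to provider constructors."""

    def __init__(self):
        self._registry: Dict[str, Callable[..., object]] = {}

    def register(self, name, creator):
        # Re-registering an existing name replaces it, so built-ins can be overridden.
        self._registry[name] = creator

    def create(self, name, **kwargs):
        if name not in self._registry:
            raise ValueError(f"Unknown provider: {name}")
        return self._registry[name](**kwargs)


class FileCache:
    """Stand-in for a built-in provider (assumed for this example)."""

    def __init__(self, root="cache"):
        self.root = root


class InMemoryCache:
    """Stand-in for a user-supplied provider (assumed for this example)."""

    def __init__(self, **_ignored):
        self.items = {}


cache_factory = ProviderFactory()
cache_factory.register("file", FileCache)        # a "built-in" registration
cache_factory.register("memory", InMemoryCache)  # a custom name
cache_factory.register("file", InMemoryCache)    # override the built-in by name

cache = cache_factory.create("file", root="ignored")
print(type(cache).__name__)
```

Keeping the lookup keyed on plain strings means a configuration file can select a provider by name without the calling code importing the implementation.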