Skip to content

Indexing Architecture

Key Concepts

Knowledge Model

In order to support the GraphRAG system, the outputs of the indexing engine (in the Default Configuration Mode) are aligned to a knowledge model we call the GraphRAG Knowledge Model. This model is designed to be an abstraction over the underlying data storage technology, and to provide a common interface for the GraphRAG system to interact with. In normal use-cases the outputs of the GraphRAG Indexer would be loaded into a database system, and the GraphRAG's Query Engine would interact with the database using the knowledge model data-store types.

Workflows

Because of the complexity of our data indexing tasks, we needed to be able to express our data pipeline as series of multiple, interdependent workflows.

---
title: Sample Workflow DAG
---
stateDiagram-v2
    [*] --> Prepare
    Prepare --> Chunk
    Chunk --> ExtractGraph
    Chunk --> EmbedDocuments
    ExtractGraph --> GenerateReports
    ExtractGraph --> EmbedEntities
    ExtractGraph --> EmbedGraph

LLM Caching

The GraphRAG library was designed with LLM interactions in mind, and a common setback when working with LLM APIs is various errors due to network latency, throttling, etc.. Because of these potential error cases, we've added a cache layer around LLM interactions. When completion requests are made using the same input set (prompt and tuning parameters), we return a cached result if one exists. This allows our indexer to be more resilient to network issues, to act idempotently, and to provide a more efficient end-user experience.

Providers & Factories

Several subsystems within GraphRAG use a factory pattern to register and retrieve provider implementations. This allows deep customization to support models, storage, and so on that you may use but isn't built directly into GraphRAG.

The following subsystems use a factory pattern that allows you to register your own implementations:

  • language model - implement your own chat and embed methods to use a model provider of choice beyond the built-in OpenAI/Azure support
  • cache - create your own cache storage location in addition to the file, blob, and CosmosDB ones we provide
  • logger - create your own log writing location in addition to the built-in file and blob storage
  • storage - create your own storage provider (database, etc.) beyond the file, blob, and CosmosDB ones built in
  • vector store - implement your own vector store other than the built-in lancedb, Azure AI Search, and CosmosDB ones built in
  • pipeline + workflows - implement your own workflow steps with a custom run_workflow function, or register an entire pipeline (list of named workflows)

The links for each of these subsystems point to the source code of the factory, which includes registration of the default built-in implementations. In addition, we have a detailed discussion of language models, which includes and example of a custom provider, and a sample notebook that demonstrates a custom vector store.

All of these factories allow you to register an impl using any string name you would like, even overriding built-in ones directly.