Indexing Dataflow

The GraphRAG Knowledge Model

The knowledge model is a specification for data outputs that conform to our data-model definition. You can find these definitions in the packages/graphrag/graphrag/data_model folder within the GraphRAG repository. The following entity types are provided. The fields here represent the fields that are text-embedded by default.

Document - An input document into the system. These either represent individual rows in a CSV or individual .txt files.
TextUnit - A chunk of text to analyze. The size of these chunks, their overlap, and whether they adhere to any data boundaries may be configured below.
Entity - An entity extracted from a TextUnit. These represent people, places, events, or some other entity-model that you provide.
Relationship - A relationship between two entities.
Covariate - Extracted claim information, which contains statements about entities which may be time-bound.
Community - Once the graph of entities and relationships is built, we perform hierarchical community detection on them to create a clustering structure.
Community Report - The contents of each community are summarized into a generated report, useful for human reading and downstream search.

The Default Configuration Workflow

Let's take a look at how the default-configuration workflow transforms text documents into the GraphRAG Knowledge Model. This page gives a general overview of the major steps in this process. To fully configure this workflow, check out the configuration documentation.

---
title: Dataflow Overview
---
flowchart TB
    subgraph phase1[Phase 1: Compose TextUnits]
    documents[Documents] --> chunk[Chunk]
    chunk --> textUnits[Text Units]
    end
    subgraph phase2[Phase 2: Document Processing]
    documents --> link_to_text_units[Link to TextUnits]
    textUnits --> link_to_text_units
    link_to_text_units --> document_outputs[Documents Table]
    end
    subgraph phase3[Phase 3 Graph Extraction]
    textUnits --> graph_extract[Entity & Relationship Extraction]
    graph_extract --> graph_summarize[Entity & Relationship Summarization]
    graph_summarize --> claim_extraction[Claim Extraction]
    claim_extraction --> graph_outputs[Graph Tables]
    end
    subgraph phase4[Phase 4: Graph Augmentation]
    graph_outputs --> community_detect[Community Detection]
    community_detect --> community_outputs[Communities Table]
    end
    subgraph phase5[Phase 5: Community Summarization]
    community_outputs --> summarized_communities[Community Summarization]
    summarized_communities --> community_report_outputs[Community Reports Table]
    end
    subgraph phase6[Phase 6: Text Embeddings]
    textUnits --> text_embed[Text Embedding]
    graph_outputs --> description_embed[Description Embedding]
    community_report_outputs --> content_embed[Content Embedding]
    end

Phase 1: Compose TextUnits

The first phase of the default-configuration workflow is to transform input documents into TextUnits. A TextUnit is a chunk of text that is used for our graph extraction techniques. They are also used as source-references by extracted knowledge items in order to empower breadcrumbs and provenance by concepts back to their original source text.

The chunk size (counted in tokens), is user-configurable. By default this is set to 1200 tokens. Larger chunks result in lower-fidelity output and less meaningful reference texts; however, using larger chunks can result in much faster processing time.

---
title: Documents into Text Chunks
---
flowchart LR
    doc1[Document 1] --> tu1[TextUnit 1]
    doc1 --> tu2[TextUnit 2]
    doc2[Document 2] --> tu3[TextUnit 3]
    doc2 --> tu4[TextUnit 4]

Phase 2: Document Processing

In this phase of the workflow, we create the Documents table for the knowledge model. Final documents are not used directly in GraphRAG, but this step links them to their constituent text units for provenance in your own applications.

---
title: Document Processing
---
flowchart LR
    aug[Augment] --> dp[Link to TextUnits] --> dg[Documents Table]

Link to TextUnits

In this step, we link each document to the text-units that were created in the first phase. This allows us to understand which documents are related to which text-units and vice-versa.

Documents Table

At this point, we can export the Documents table into the knowledge Model.

Phase 3: Graph Extraction

In this phase, we analyze each text unit and extract our graph primitives: Entities, Relationships, and Claims. Entities and Relationships are extracted at once in our extract_graph workflow, and claims are extracted in our extract_claims workflow. Results are then combined and passed into following phases of the pipeline.

---
title: Graph Extraction
---
flowchart LR
    tu[TextUnit] --> ge[Graph Extraction] --> gs[Graph Summarization]
    tu --> ce[Claim Extraction]

Note: if you are using the FastGraphRAG option, entity and relationship extraction will be performed using NLP to conserve LLM resources, and claim extraction will always be skipped.

Entity & Relationship Extraction

In this first step of graph extraction, we process each text-unit to extract entities and relationships out of the raw text using the LLM. The output of this step is a subgraph-per-TextUnit containing a list of entities with a title, type, and description, and a list of relationships with a source, target, and description.

These subgraphs are merged together - any entities with the same title and type are merged by creating an array of their descriptions. Similarly, any relationships with the same source and target are merged by creating an array of their descriptions.

Entity & Relationship Summarization

Now that we have a graph of entities and relationships, each with a list of descriptions, we can summarize these lists into a single description per entity and relationship. This is done by asking the LLM for a short summary that captures all of the distinct information from each description. This allows all of our entities and relationships to have a single concise description.

Claim Extraction (optional)

Finally, as an independent workflow, we extract claims from the source TextUnits. These claims represent positive factual statements with an evaluated status and time-bounds. These get exported as a primary artifact called Covariates.

Note: claim extraction is optional and turned off by default. This is because claim extraction generally requires prompt tuning to be useful.

Phase 4: Graph Augmentation

Now that we have a usable graph of entities and relationships, we want to understand their community structure. These give us explicit ways of understanding the organization of our graph.

---
title: Graph Augmentation
---
flowchart LR
    cd[Leiden Hierarchical Community Detection] --> ag[Graph Tables]

Community Detection

In this step, we generate a hierarchy of entity communities using the Hierarchical Leiden Algorithm. This method will apply a recursive community-clustering to our graph until we reach a community-size threshold. This will allow us to understand the community structure of our graph and provide a way to navigate and summarize the graph at different levels of granularity.

Graph Tables

Once our graph augmentation steps are complete, the final Entities, Relationships, and Communities tables are exported.

Phase 5: Community Summarization

---
title: Community Summarization
---
flowchart LR
    sc[Generate Community Reports] --> ss[Summarize Community Reports] --> co[Community Reports Table]

At this point, we have a functional graph of entities and relationships and a hierarchy of communities for the entities.

Now we want to build on the communities data and generate reports for each community. This gives us a high-level understanding of the graph at several points of graph granularity. For example, if community A is the top-level community, we'll get a report about the entire graph. If the community is lower-level, we'll get a report about a local cluster.

Generate Community Reports

In this step, we generate a summary of each community using the LLM. This will allow us to understand the distinct information contained within each community and provide a scoped understanding of the graph, from either a high-level or a low-level perspective. These reports contain an executive overview and reference the key entities, relationships, and claims within the community sub-structure.

Summarize Community Reports

In this step, each community report is then summarized via the LLM for shorthand use.

Community Reports Table

At this point, some bookkeeping work is performed and we export the Community Reports tables.

Phase 6: Text Embedding

For all artifacts that require downstream vector search, we generate text embeddings as a final step. These embeddings are written directly to a configured vector store. By default we embed entity descriptions, text unit text, and community report text.

---
title: Text Embedding Workflows
---
flowchart LR
    textUnits[Text Units] --> text_embed[Text Embedding]
    graph_outputs[Graph Tables] --> description_embed[Description Embedding]
    community_report_outputs[Community Reports] --> content_embed[Content Embedding]