GraphRAG Indexing 🤖

The GraphRAG indexing package is a data pipeline and transformation suite designed to extract meaningful, structured data from unstructured text using LLMs.

Indexing Pipelines are configurable. They are composed of workflows, standard and custom steps, prompt templates, and input/output adapters. Our standard pipeline is designed to:

  • extract entities, relationships and claims from raw text
  • perform community detection on the extracted entities
  • generate community summaries and reports at multiple levels of granularity
  • embed entities into a graph vector space
  • embed text chunks into a textual vector space
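To make the flow of the first two stages concrete, here is a toy sketch in plain Python. It is not the GraphRAG implementation: a regular expression that picks out capitalized words stands in for LLM-based extraction, co-occurrence within a chunk stands in for relationships, and connected components stand in for real community detection. All function names (`chunk_text`, `extract_entities`, `build_graph`, `detect_communities`) are hypothetical, and the summarization and embedding stages are omitted.

```python
import re
from collections import defaultdict
from itertools import combinations


def chunk_text(text, max_words=12):
    """Split text into chunks of at most max_words whitespace-separated words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def extract_entities(chunk):
    """Stand-in for the LLM extraction step: treat capitalized words as entities."""
    return sorted(set(re.findall(r"\b[A-Z][a-z]+\b", chunk)))


def build_graph(chunks):
    """Link entities that co-occur in the same chunk (a crude 'relationship')."""
    graph = defaultdict(set)
    for chunk in chunks:
        entities = extract_entities(chunk)
        for entity in entities:
            graph[entity]  # register entities that have no neighbours
        for a, b in combinations(entities, 2):
            graph[a].add(b)
            graph[b].add(a)
    return graph


def detect_communities(graph):
    """Group entities by connected component (swapped in for community detection)."""
    seen, communities = set(), []
    for start in sorted(graph):
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(graph[node] - component)
        seen |= component
        communities.append(sorted(component))
    return communities


if __name__ == "__main__":
    sample = "Alice met Bob in Paris. Carol emailed Dave about Rome."
    chunks = chunk_text(sample, max_words=5)
    print(detect_communities(build_graph(chunks)))
    # prints [['Alice', 'Bob', 'Paris'], ['Carol', 'Dave', 'Rome']]
```

In the real pipeline each of these stand-ins is replaced by a configurable workflow step, so the sketch is only useful for seeing how chunks feed entities, entities feed a graph, and the graph feeds communities.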

The outputs of the pipeline can be stored in a variety of formats, including JSON and Parquet, or they can be handled manually via the Python API.

Getting Started

Requirements

See the requirements section in Get Started for details on setting up a development environment.

The Indexing Engine can be used in either a default configuration mode or with a custom pipeline. To configure GraphRAG, see the configuration documentation. After you have a config file, you can run the pipeline using the CLI or the Python API.

Usage

CLI

# Via Poetry
poetry run poe cli --root <data_root> # default config mode
poetry run poe cli --config your_pipeline.yml # custom config mode

# Via Node
yarn run:index --root <data_root> # default config mode
yarn run:index --config your_pipeline.yml # custom config mode
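If you want to launch the Poetry commands above from a script, a thin wrapper around `subprocess` may be enough. This is a minimal sketch, not part of GraphRAG: `build_index_command` and `run_index` are hypothetical helper names, and the only flags used are the `--root` and `--config` options shown above.

```python
import subprocess


def build_index_command(root=None, config=None):
    """Return the Poetry command line for one of the two modes shown above."""
    # Exactly one of root (default config mode) or config (custom mode) is expected.
    if (root is None) == (config is None):
        raise ValueError("pass exactly one of root or config")
    command = ["poetry", "run", "poe", "cli"]
    if root is not None:
        return command + ["--root", str(root)]
    return command + ["--config", str(config)]


def run_index(root=None, config=None):
    """Run the indexer; raises CalledProcessError on a non-zero exit code."""
    return subprocess.run(build_index_command(root, config), check=True)
```

For example, `run_index(config="your_pipeline.yml")` would execute the custom config command from the repository directory, assuming Poetry and the project's dependencies are already installed.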

Python API

See the examples folder for a handful of functional pipelines that illustrate how to create and run a pipeline, either via a custom settings.yml or through custom Python scripts.

Further Reading