# Default Configuration Mode (using YAML/JSON)
The default configuration mode is configured with a `settings.yml` or `settings.json` file in the data project root. If a `.env` file is present alongside this config file, it will be loaded, and the environment variables defined there will be available for token replacement in your configuration document using `${ENV_VAR}` syntax. We initialize with YAML by default in `graphrag init`, but you may use the equivalent JSON form if preferred.
Many of these config values have defaults. Rather than replicate them here, please refer to the constants in the code directly.
For example:

```
# .env
GRAPHRAG_API_KEY=some_api_key
```

```yaml
# settings.yml
completion_models:
  default_completion_model:
    api_key: ${GRAPHRAG_API_KEY}
```
## Config Sections

## Language Model Setup

### models

This is a pair of dicts, one for completion model configurations and one for embedding model configurations. The dict keys are used to reference the model configuration elsewhere when a model instance is desired. In this way, you can specify as many different models as you need and reference them independently in the workflow steps.
For example:

```yaml
completion_models:
  default_completion_model:
    model_provider: openai
    model: gpt-4.1
    auth_method: api_key
    api_key: ${GRAPHRAG_API_KEY}
embedding_models:
  default_embedding_model:
    model_provider: openai
    model: text-embedding-3-large
    auth_method: api_key
    api_key: ${GRAPHRAG_API_KEY}
```
#### Fields

- `type` **litellm|mock** - The type of LLM provider to use. GraphRAG uses LiteLLM for calling language models.
- `model_provider` **str** - The model provider to use, e.g., openai, azure, anthropic, etc. LiteLLM is used under the hood, which supports calling 100+ models. View LiteLLM basic usage for details on how models are called (the `model_provider` is the portion prior to the `/`, while the `model` is the portion following the `/`). View Language Model Selection for more details and examples on using LiteLLM.
- `model` **str** - The model name.
- `call_args` **dict[str, Any]** - Default arguments to send with every model request. Example: `{"n": 5, "max_completion_tokens": 1000, "temperature": 1.5, "organization": "..."}`
- `api_key` **str|None** - The API key to use.
- `api_base` **str|None** - The API base URL to use.
- `api_version` **str|None** - The API version.
- `auth_method` **api_key|azure_managed_identity** - Indicate how you want to authenticate requests.
- `azure_deployment_name` **str|None** - The deployment name to use if your model is hosted on Azure. Note that if your deployment name on Azure matches the model name, this is unnecessary.
- `retry` **RetryConfig|None** - Retry settings. default=`None` (no retries). A configuration sketch with these nested settings follows this list.
  - `type` **exponential_backoff|immediate** - Type of retry approach. default=`exponential_backoff`
  - `max_retries` **int|None** - Max retries to take. default=7
  - `base_delay` **float|None** - Base delay when using `exponential_backoff`. default=2.0
  - `jitter` **bool|None** - Add jitter to retry delays when using `exponential_backoff`. default=`True`
  - `max_delay` **float|None** - Maximum retry delay. default=`None` (no max)
- `rate_limit` **RateLimitConfig|None** - Rate limit settings. default=`None` (no rate limiting)
  - `type` **sliding_window** - Type of rate limit approach. default=`sliding_window`
  - `period_in_seconds` **int|None** - Window size for `sliding_window` rate limiting. default=60 (limit requests per minute)
  - `requests_per_period` **int|None** - Maximum number of requests per period. default=`None`
  - `tokens_per_period` **int|None** - Maximum number of tokens per period. default=`None`
- `metrics` **MetricsConfig|None** - Metrics settings. default=`MetricsConfig()`. View the metrics notebook for more details on metrics.
  - `type` **default** - The type of `MetricsProcessor` service to use for processing request metrics. default=`default`
  - `store` **memory** - The type of `MetricsStore` service. default=`memory`
  - `writer` **log|file** - The type of `MetricsWriter` to use; writes out metrics at the end of the process. default=`log` (log metrics using Python standard logging at the end of the process)
  - `log_level` **int|None** - The log level when using the `log` writer. default=20 (log `INFO` messages for metrics)
  - `base_dir` **str|None** - The directory to write metrics to when using the `file` writer. default=`Path.cwd()`
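For example, a completion model with retries and rate limiting configured might look like the following. This is a minimal sketch: the nesting mirrors the field list above, and the numeric values are illustrative rather than recommended defaults.

```yaml
completion_models:
  default_completion_model:
    model_provider: openai
    model: gpt-4.1
    auth_method: api_key
    api_key: ${GRAPHRAG_API_KEY}
    retry:
      type: exponential_backoff
      max_retries: 7
      base_delay: 2.0
      jitter: true
    rate_limit:
      type: sliding_window
      period_in_seconds: 60
      requests_per_period: 500     # illustrative; match your provider quota
      tokens_per_period: 150000    # illustrative; match your provider quota
```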
## Input Files and Chunking

### input

Our pipeline can ingest .csv, .txt, or .json data from an input location. See the inputs page for more details and examples.

#### Fields

- `storage` **StorageConfig** - The storage configuration describing where to read input from.
  - `type` **file|memory|blob|cosmosdb** - The storage type to use. Default=`file`
  - `encoding` **str** - The encoding to use for file storage.
  - `base_dir` **str** - The base directory to read input from, relative to the root.
  - `connection_string` **str** - (blob/cosmosdb only) The Azure Storage connection string.
  - `container_name` **str** - (blob/cosmosdb only) The Azure Storage container name.
  - `account_url` **str** - (blob only) The storage account blob URL to use.
  - `database_name` **str** - (cosmosdb only) The database name to use.
- `type` **text|csv|json** - The type of input data to load. Default is `text`
- `encoding` **str** - The encoding of the input file. Default is `utf-8`
- `file_pattern` **str** - A regex to match input files. Default is `.*\.csv$`, `.*\.txt$`, or `.*\.json$` depending on the specified `type`, but you can customize it if needed.
- `id_column` **str** - The input ID column to use.
- `title_column` **str** - The input title column to use.
- `text_column` **str** - The input text column to use.
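As an illustration, a CSV input read from a local `input` directory might be configured like this. This is a sketch; the directory and column names are placeholders for whatever your data uses.

```yaml
input:
  storage:
    type: file
    base_dir: input
  type: csv
  file_pattern: '.*\.csv$'
  id_column: id        # hypothetical column names
  title_column: title
  text_column: text
```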
### chunking

These settings configure how we parse documents into text chunks. This is necessary because very large documents may not fit into a single context window, and graph extraction accuracy can be modulated by chunk size. Also note the metadata setting in the input document config, which will replicate document metadata into each chunk.

#### Fields

- `type` **tokens|sentence** - The chunking type to use.
- `encoding_model` **str** - The text encoding model to use for splitting on token boundaries.
- `size` **int** - The maximum chunk size in tokens.
- `overlap` **int** - The chunk overlap in tokens.
- `prepend_metadata` **list[str]** - Metadata fields from the source document to prepend to each chunk.
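For instance, token-based chunking with overlapping windows might be configured as follows (a sketch; the sizes are illustrative, not tuned recommendations):

```yaml
chunking:
  type: tokens
  size: 1200      # illustrative chunk size in tokens
  overlap: 100    # illustrative overlap in tokens
  prepend_metadata: [title]
```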
## Outputs and Storage

### output

This section controls the storage mechanism used by the pipeline for exporting output tables.

#### Fields

- `type` **file|memory|blob|cosmosdb** - The storage type to use. Default=`file`
- `encoding` **str** - The encoding to use for file storage.
- `base_dir` **str** - The base directory to write output artifacts to, relative to the root.
- `connection_string` **str** - (blob/cosmosdb only) The Azure Storage connection string.
- `container_name` **str** - (blob/cosmosdb only) The Azure Storage container name.
- `account_url` **str** - (blob only) The storage account blob URL to use.
- `database_name` **str** - (cosmosdb only) The database name to use.
### update_output_storage

This section defines a secondary storage location used during incremental indexing, so that your original outputs are preserved.

#### Fields

- `type` **file|memory|blob|cosmosdb** - The storage type to use. Default=`file`
- `encoding` **str** - The encoding to use for file storage.
- `base_dir` **str** - The base directory to write output artifacts to, relative to the root.
- `connection_string` **str** - (blob/cosmosdb only) The Azure Storage connection string.
- `container_name` **str** - (blob/cosmosdb only) The Azure Storage container name.
- `account_url` **str** - (blob only) The storage account blob URL to use.
- `database_name` **str** - (cosmosdb only) The database name to use.
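Putting these together, a file-based layout that keeps incremental-indexing results separate from the original outputs might look like this (a sketch; the directory names are illustrative):

```yaml
output:
  type: file
  base_dir: output
update_output_storage:
  type: file
  base_dir: update_output   # illustrative directory name
```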
### cache

This section controls the cache mechanism used by the pipeline. The cache stores LLM invocation results for faster performance when re-running the indexing process.

#### Fields

- `type` **json|memory|none** - The cache type to use. Default=`json`
- `storage` **StorageConfig** - The storage configuration for the cache.
  - `type` **file|memory|blob|cosmosdb** - The storage type to use. Default=`file`
  - `encoding` **str** - The encoding to use for file storage.
  - `base_dir` **str** - The base directory to write the cache to, relative to the root.
  - `connection_string` **str** - (blob/cosmosdb only) The Azure Storage connection string.
  - `container_name` **str** - (blob/cosmosdb only) The Azure Storage container name.
  - `account_url` **str** - (blob only) The storage account blob URL to use.
  - `database_name` **str** - (cosmosdb only) The database name to use.
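For example, a JSON cache written to a local directory might be configured like so (a sketch; the directory name is illustrative):

```yaml
cache:
  type: json
  storage:
    type: file
    base_dir: cache   # illustrative directory name
```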
### reporting

This section controls the reporting mechanism used by the pipeline for common events and error messages. The default is to write reports to a file in the output directory. However, you can also choose to write reports to an Azure Blob Storage container.

#### Fields

- `type` **file|blob** - The reporting type to use. Default=`file`
- `base_dir` **str** - The base directory to write reports to, relative to the root.
- `connection_string` **str** - (blob only) The Azure Storage connection string.
- `container_name` **str** - (blob only) The Azure Storage container name.
- `account_url` **str** - (blob only) The storage account blob URL to use.
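For example, blob-based reporting might look like the following sketch. The environment variable and container name here are hypothetical placeholders, not required values.

```yaml
reporting:
  type: blob
  connection_string: ${AZURE_STORAGE_CONNECTION_STRING}   # hypothetical env var
  container_name: reports                                 # hypothetical container
```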
### vector_store

Where to put all vectors for the system. Configured for `lancedb` by default. This is a dict, with the key used to identify individual store parameters (e.g., for text embedding).

#### Fields

- `type` **lancedb|azure_ai_search|cosmosdb** - Type of vector store. Default=`lancedb`
- `db_uri` **str** (lancedb only) - The database URI. Default=`storage.base_dir/lancedb`
- `url` **str** (AI Search only) - The AI Search endpoint to use.
- `api_key` **str** (optional; AI Search only) - The AI Search API key to use.
- `audience` **str** (AI Search only) - Audience for the managed identity token if managed identity authentication is used.
- `connection_string` **str** (cosmosdb only) - The Azure Storage connection string.
- `database_name` **str** (cosmosdb only) - Name of the database.
- `index_schema` **dict[str, dict[str, str]]** (optional) - Enables customization for each of your embeddings, keyed by supported embedding name.
  - `index_name` **str** (optional) - Name for the specific embedding index table.
  - `id_field` **str** (optional) - Field name to be used as the id. Default=`id`
  - `vector_field` **str** (optional) - Field name to be used as the vector. Default=`vector`
  - `vector_size` **int** (optional) - Vector size for the embeddings. Default=3072

The supported embeddings are:

- `text_unit_text`
- `entity_description`
- `community_full_content`
For example:

```yaml
vector_store:
  default_vector_store:
    type: lancedb
    db_uri: output/lancedb
    index_schema:
      text_unit_text:
        index_name: "text-unit-embeddings"
        id_field: "id_custom"
        vector_field: "vector_custom"
        vector_size: 3072
      entity_description:
        id_field: "id_custom"
```
## Workflow Configurations

These settings control each individual workflow as it executes.

### workflows

**list[str]** - This is a list of workflow names to run, in order. GraphRAG has built-in pipelines to configure this, but you can run exactly and only what you want by specifying the list here. This is useful if you have done part of the processing yourself.
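For example, to run only a subset of the pipeline (a sketch; the workflow names below are illustrative, so consult the built-in pipeline definitions for the exact names in your version):

```yaml
workflows:
  # illustrative names; see the built-in pipelines for exact values
  - create_base_text_units
  - extract_graph
```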
### embed_text

By default, the GraphRAG indexer will only export embeddings required for our query methods. However, the model has embeddings defined for all plaintext fields, and these can be customized by setting the `names` field.

Supported embedding names are:

- `text_unit_text`
- `entity_description`
- `community_full_content`

#### Fields

- `embedding_model_id` **str** - Name of the model definition to use for text embedding.
- `model_instance_name` **str** - Name of the model singleton instance. Default is "text_embedding". This primarily affects the cache storage partitioning.
- `batch_size` **int** - The maximum batch size to use.
- `batch_max_tokens` **int** - The maximum number of tokens per batch.
- `names` **list[str]** - List of the embedding names to run (must be in the supported list).
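For example, to embed all three supported fields with the embedding model defined earlier (a sketch; the batch size is illustrative):

```yaml
embed_text:
  embedding_model_id: default_embedding_model
  batch_size: 16            # illustrative
  names:
    - text_unit_text
    - entity_description
    - community_full_content
```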
### extract_graph

Tune the language-model-based graph extraction process.

#### Fields

- `completion_model_id` **str** - Name of the model definition to use for API calls.
- `model_instance_name` **str** - Name of the model singleton instance. Default is "extract_graph". This primarily affects the cache storage partitioning.
- `prompt` **str** - The prompt file to use.
- `entity_types` **list[str]** - The entity types to identify.
- `max_gleanings` **int** - The maximum number of gleaning cycles to use.
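For instance (a sketch; the entity types shown are illustrative and should reflect your domain):

```yaml
extract_graph:
  completion_model_id: default_completion_model
  entity_types: [organization, person, geo, event]   # illustrative
  max_gleanings: 1
```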
### summarize_descriptions

#### Fields

- `completion_model_id` **str** - Name of the model definition to use for API calls.
- `model_instance_name` **str** - Name of the model singleton instance. Default is "summarize_descriptions". This primarily affects the cache storage partitioning.
- `prompt` **str** - The prompt file to use.
- `max_length` **int** - The maximum number of output tokens per summarization.
- `max_input_length` **int** - The maximum number of tokens to collect for summarization (this will limit how many descriptions you send to be summarized for a given entity or relationship).
### extract_graph_nlp

Defines settings for NLP-based graph extraction methods.

#### Fields

- `normalize_edge_weights` **bool** - Whether to normalize the edge weights during graph construction. Default=`True`
- `concurrent_requests` **int** - The number of threads to use for the extraction process.
- `async_mode` **asyncio|threaded** - The async mode to use.
- `text_analyzer` **dict** - Parameters for the NLP model. A configuration sketch follows this list.
  - `extractor_type` **regex_english|syntactic_parser|cfg** - Default=`regex_english`
  - `model_name` **str** - Name of the NLP model (for SpaCy-based models).
  - `max_word_length` **int** - Longest word to allow. Default=15
  - `word_delimiter` **str** - Delimiter to split words. Default=' '
  - `include_named_entities` **bool** - Whether to include named entities in noun phrases. Default=`True`
  - `exclude_nouns` **list[str] | None** - List of nouns to exclude. If `None`, we use an internal stopword list.
  - `exclude_entity_tags` **list[str]** - List of entity tags to ignore.
  - `exclude_pos_tags` **list[str]** - List of part-of-speech tags to ignore.
  - `noun_phrase_tags` **list[str]** - List of noun phrase tags to ignore.
  - `noun_phrase_grammars` **dict[str, str]** - Noun phrase grammars for the model (cfg only).
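For example, a syntactic-parser analyzer might be configured as follows (a sketch; the SpaCy model name is illustrative and must be installed in your environment):

```yaml
extract_graph_nlp:
  normalize_edge_weights: true
  text_analyzer:
    extractor_type: syntactic_parser
    model_name: en_core_web_md   # illustrative SpaCy model name
    include_named_entities: true
```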
### prune_graph

Parameters for manual graph pruning. This can be used to optimize the modularity of your graph clusters by removing overly-connected or rare nodes.

#### Fields

- `min_node_freq` **int** - The minimum node frequency to allow.
- `max_node_freq_std` **float | None** - The maximum standard deviation of node frequency to allow.
- `min_node_degree` **int** - The minimum node degree to allow.
- `max_node_degree_std` **float | None** - The maximum standard deviation of node degree to allow.
- `min_edge_weight_pct` **float** - The minimum edge weight percentile to allow.
- `remove_ego_nodes` **bool** - Remove ego nodes.
- `lcc_only` **bool** - Only use the largest connected component.
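For example, to drop rare nodes and weak edges (a sketch; the thresholds are illustrative and should be tuned against your own graph):

```yaml
prune_graph:
  min_node_freq: 2          # illustrative thresholds
  min_node_degree: 1
  min_edge_weight_pct: 40
  remove_ego_nodes: false
  lcc_only: false
```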
### cluster_graph

These are the settings used for Leiden hierarchical clustering of the graph to create communities.

#### Fields

- `max_cluster_size` **int** - The maximum cluster size to export.
- `use_lcc` **bool** - Whether to only use the largest connected component.
- `seed` **int** - A randomization seed to provide if consistent run-to-run results are desired. We do provide a default in order to guarantee clustering stability.
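For example (a sketch; the values are illustrative):

```yaml
cluster_graph:
  max_cluster_size: 10   # illustrative
  use_lcc: true
  seed: 42               # illustrative; set for reproducible clustering
```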
### extract_claims

#### Fields

- `enabled` **bool** - Whether to enable claim extraction. Off by default, because claim prompts really need user tuning.
- `completion_model_id` **str** - Name of the model definition to use for API calls.
- `model_instance_name` **str** - Name of the model singleton instance. Default is "extract_claims". This primarily affects the cache storage partitioning.
- `prompt` **str** - The prompt file to use.
- `description` **str** - Describes the types of claims we want to extract.
- `max_gleanings` **int** - The maximum number of gleaning cycles to use.
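For example, to turn claim extraction on (a sketch; the description text is illustrative and should be tailored to your use case):

```yaml
extract_claims:
  enabled: true
  description: "Any claims or facts that could be relevant to information discovery."   # illustrative
  max_gleanings: 1
```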
### community_reports

#### Fields

- `completion_model_id` **str** - Name of the model definition to use for API calls.
- `model_instance_name` **str** - Name of the model singleton instance. Default is "community_reporting". This primarily affects the cache storage partitioning.
- `graph_prompt` **str | None** - The community report extraction prompt to use for graph-based summarization.
- `text_prompt` **str | None** - The community report extraction prompt to use for text-based summarization.
- `max_length` **int** - The maximum number of output tokens per report.
- `max_input_length` **int** - The maximum number of input tokens to use when generating reports.
### snapshots

#### Fields

- `embeddings` **bool** - Export embeddings snapshots to parquet.
- `graphml` **bool** - Export graph snapshots to GraphML.
- `raw_graph` **bool** - Export the raw extracted graph before merging.
## Query

### local_search

#### Fields

- `prompt` **str** - The prompt file to use.
- `completion_model_id` **str** - Name of the model definition to use for Chat Completion calls.
- `embedding_model_id` **str** - Name of the model definition to use for Embedding calls.
- `text_unit_prop` **float** - The text unit proportion.
- `community_prop` **float** - The community proportion.
- `conversation_history_max_turns` **int** - The maximum number of conversation history turns to include.
- `top_k_entities` **int** - The top k mapped entities.
- `top_k_relationships` **int** - The top k mapped relationships.
- `max_context_tokens` **int** - The maximum number of tokens to use when building the request context.
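For example (a sketch; the proportions and limits are illustrative, not tuned defaults):

```yaml
local_search:
  completion_model_id: default_completion_model
  embedding_model_id: default_embedding_model
  text_unit_prop: 0.5        # illustrative
  community_prop: 0.25       # illustrative
  top_k_entities: 10
  top_k_relationships: 10
  max_context_tokens: 12000  # illustrative
```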
### global_search

#### Fields

- `map_prompt` **str** - The global search mapper prompt to use.
- `reduce_prompt` **str** - The global search reducer prompt to use.
- `completion_model_id` **str** - Name of the model definition to use for Chat Completion calls.
- `knowledge_prompt` **str** - The knowledge prompt file to use.
- `data_max_tokens` **int** - The maximum number of tokens to use when constructing the final response from the reduce responses.
- `map_max_length` **int** - The maximum length to request for map responses, in words.
- `reduce_max_length` **int** - The maximum length to request for reduce responses, in words.
- `dynamic_search_threshold` **int** - Rating threshold to include a community report.
- `dynamic_search_keep_parent` **bool** - Keep the parent community if any of its child communities are relevant.
- `dynamic_search_num_repeats` **int** - Number of times to rate the same community report.
- `dynamic_search_use_summary` **bool** - Use the community summary instead of the full context.
- `dynamic_search_max_level` **int** - The maximum level of the community hierarchy to consider if none of the processed communities are relevant.
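For example, tuning the dynamic community selection (a sketch; the values are illustrative):

```yaml
global_search:
  completion_model_id: default_completion_model
  map_max_length: 1000            # illustrative, in words
  reduce_max_length: 2000         # illustrative, in words
  dynamic_search_threshold: 1     # illustrative rating threshold
  dynamic_search_keep_parent: false
```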
### drift_search

#### Fields

- `prompt` **str** - The prompt file to use.
- `reduce_prompt` **str** - The reducer prompt file to use.
- `completion_model_id` **str** - Name of the model definition to use for Chat Completion calls.
- `embedding_model_id` **str** - Name of the model definition to use for Embedding calls.
- `data_max_tokens` **int** - The maximum number of data tokens for the LLM.
- `reduce_max_tokens` **int** - The maximum tokens for the reduce phase. Only use with non-o-series models.
- `reduce_temperature` **float** - The temperature to use for token generation in the reduce phase.
- `reduce_max_completion_tokens` **int** - The maximum tokens for the reduce phase. Only use with o-series models.
- `concurrency` **int** - The number of concurrent requests.
- `drift_k_followups` **int** - The number of top global results to retrieve.
- `primer_folds` **int** - The number of folds for search priming.
- `primer_llm_max_tokens` **int** - The maximum number of tokens for the LLM in the primer.
- `n_depth` **int** - The number of drift search steps to take.
- `local_search_text_unit_prop` **float** - The proportion of search dedicated to text units.
- `local_search_community_prop` **float** - The proportion of search dedicated to community properties.
- `local_search_top_k_mapped_entities` **int** - The number of top K entities to map during local search.
- `local_search_top_k_relationships` **int** - The number of top K relationships to map during local search.
- `local_search_max_data_tokens` **int** - The maximum context size in tokens for local search.
- `local_search_temperature` **float** - The temperature to use for token generation in local search.
- `local_search_top_p` **float** - The top-p value to use for token generation in local search.
- `local_search_n` **int** - The number of completions to generate in local search.
- `local_search_llm_max_gen_tokens` **int** - The maximum number of generated tokens for the LLM in local search. Only use with non-o-series models.
- `local_search_llm_max_gen_completion_tokens` **int** - The maximum number of generated tokens for the LLM in local search. Only use with o-series models.
### basic_search

#### Fields

- `prompt` **str** - The prompt file to use.
- `completion_model_id` **str** - Name of the model definition to use for Chat Completion calls.
- `embedding_model_id` **str** - Name of the model definition to use for Embedding calls.
- `k` **int** - Number of text units to retrieve from the vector store for context building.
- `max_context_tokens` **int** - The maximum context size to create, in tokens.