Question Generation Configuration
This section provides an overview of the configuration schema for the question generation process, covering input data, sampling, encoding, and model settings. For details on configuring the LLM, see: LLM Configuration.
To create a template configuration file, use the `config init` command.
To generate synthetic queries using your configuration file, use the `autoq` command (see the CLI Reference below).
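For example, following the usage shown in the CLI Reference below (`settings.yaml` and `output/` are illustrative file names, not fixed requirements):

```shell
benchmark-qed autoq autoq settings.yaml output/
```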
For more information about the config init command, see: Config Init CLI
Classes and Fields
InputConfig
Configuration for the input data used in question generation.
| Field | Type | Default | Description |
|---|---|---|---|
| `dataset_path` | `Path` | required | Path to the input dataset file. |
| `input_type` | `InputDataType` | `CSV` | The type of the input data (e.g., CSV, JSON). |
| `text_column` | `str` | `"text"` | The column containing the text data. |
| `metadata_columns` | `list[str] \| None` | `None` | Optional list of columns containing metadata. |
| `file_encoding` | `str` | `"utf-8"` | Encoding of the input file. |
QuestionConfig
Configuration for generating standard questions.
| Field | Type | Default | Description |
|---|---|---|---|
| `num_questions` | `int` | `20` | Number of questions to generate per class. |
| `oversample_factor` | `float` | `2.0` | Factor to overgenerate questions before filtering. |
DataLinkedQuestionConfig
Extends QuestionConfig with additional fields for data-linked (multi-hop) question generation. Linked questions combine multiple local questions that share named entities.
| Field | Type | Default | Description |
|---|---|---|---|
| `min_questions_per_entity` | `int` | `2` | Minimum number of local questions required to form an entity group. |
| `max_questions_per_entity` | `int` | `3` | Maximum number of local questions to include per entity group. |
| `type_balance_weight` | `float` | `0.5` | Weight for type-balance penalty in MMR selection. 0=ignore, 1=strong type balancing. |
| `max_questions_to_generate` | `int` | `2` | Maximum number of questions to generate per entity group. |
| `entity_frequency_threshold` | `int` | `2` | Entity must appear in at least this many local questions to be considered. |
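The `type_balance_weight` field modifies a maximal-marginal-relevance (MMR) selection step. The sketch below is one plausible way a type-balance penalty could enter a greedy MMR score; the function name, signature, and exact formula are illustrative assumptions, not AutoQ's actual implementation.

```python
def mmr_select(candidates, relevance, similarity, k,
               type_balance_weight=0.5, lambda_=0.7):
    """Greedy MMR selection trading off relevance, diversity, and type balance.

    candidates: list of (question_id, question_type) pairs.
    relevance: dict mapping question_id -> float.
    similarity: function (id_a, id_b) -> float in [0, 1].
    """
    selected = []
    remaining = list(candidates)
    while remaining and len(selected) < k:
        # Count how often each question type already appears in the selection.
        type_counts = {}
        for _, qtype in selected:
            type_counts[qtype] = type_counts.get(qtype, 0) + 1

        def score(cand):
            cid, ctype = cand
            # Diversity term: similarity to the closest already-selected item.
            diversity = max((similarity(cid, sid) for sid, _ in selected),
                            default=0.0)
            # Penalize types that are over-represented so far.
            balance = type_counts.get(ctype, 0) / max(len(selected), 1)
            return (lambda_ * relevance[cid]
                    - (1 - lambda_) * diversity
                    - type_balance_weight * balance)

        best = max(remaining, key=score)
        remaining.remove(best)
        selected.append(best)
    return selected
```

With `type_balance_weight=0` the penalty vanishes and selection is plain MMR; at `1.0` a second question of an already-selected type is strongly disfavored.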
ActivityQuestionConfig
Extends QuestionConfig with additional fields for persona-based question generation.
| Field | Type | Default | Description |
|---|---|---|---|
| `num_personas` | `int` | `5` | Number of personas to generate questions for. |
| `num_tasks_per_persona` | `int` | `5` | Number of tasks per persona. |
| `num_entities_per_task` | `int` | `10` | Number of entities per task. |
EncodingModelConfig
Configuration for the encoding model used to chunk documents.
| Field | Type | Default | Description |
|---|---|---|---|
| `model_name` | `str` | `"o200k_base"` | Name of the encoding model. |
| `chunk_size` | `int` | `600` | Size of each text chunk. |
| `chunk_overlap` | `int` | `100` | Overlap between consecutive chunks. |
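`chunk_size` and `chunk_overlap` are counted in tokens of the named encoding. Assuming standard sliding-window behavior (the exact chunker is internal to AutoQ, and `chunk_tokens` below is an illustrative name), each new chunk starts `chunk_size - chunk_overlap` tokens after the previous one:

```python
def chunk_tokens(tokens, chunk_size=600, chunk_overlap=100):
    """Split a token sequence into overlapping windows (illustrative sketch)."""
    stride = chunk_size - chunk_overlap  # each new chunk starts 500 tokens later
    return [tokens[i:i + chunk_size]
            for i in range(0, max(len(tokens) - chunk_overlap, 1), stride)]

# A 1500-token document yields chunks covering tokens 0-599, 500-1099, 1000-1499.
chunks = chunk_tokens(list(range(1500)))
```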
SamplingConfig
Configuration for sampling data from clusters.
| Field | Type | Default | Description |
|---|---|---|---|
| `num_clusters` | `int` | `50` | Number of clusters to sample from. |
| `num_samples_per_cluster` | `int` | `10` | Number of samples per cluster. |
| `random_seed` | `int` | `42` | Seed for reproducibility. |
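With the defaults, sampling yields up to `num_clusters × num_samples_per_cluster` (50 × 10 = 500) chunks. A minimal sketch of seeded per-cluster sampling; `sample_clusters` is an illustrative helper, not the actual AutoQ API:

```python
import random

def sample_clusters(clusters, num_samples_per_cluster=10, random_seed=42):
    """Sample a fixed number of items from each cluster, reproducibly.

    clusters: dict mapping cluster_id -> list of items.
    """
    rng = random.Random(random_seed)  # fixed seed makes runs repeatable
    return {
        cid: rng.sample(items, min(num_samples_per_cluster, len(items)))
        for cid, items in clusters.items()
    }
```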
AssertionConfig
Configuration for assertion generation with separate settings for local, global, and linked questions.
| Field | Type | Default | Description |
|---|---|---|---|
| `local` | `LocalAssertionConfig` | (see below) | Configuration for local assertion generation. |
| `global` | `GlobalAssertionConfig` | (see below) | Configuration for global assertion generation. |
| `linked` | `LinkedAssertionConfig` | (see below) | Configuration for linked assertion generation. |
LocalAssertionConfig
Configuration for local assertion generation.
| Field | Type | Default | Description |
|---|---|---|---|
| `max_assertions` | `int \| None` | `20` | Maximum assertions per question. Set to 0 to disable, or None for unlimited. |
| `enable_validation` | `bool` | `True` | Whether to validate assertions against source data. |
| `min_validation_score` | `int` | `3` | Minimum score (1-5) for grounding, relevance, and verifiability. |
| `max_source_count` | `int` | `500` | Maximum deduplicated sources per assertion. Questions with assertions exceeding this are dropped. |
| `concurrent_llm_calls` | `int` | `8` | Concurrent LLM calls for validation. |
| `max_concurrent_questions` | `int \| None` | `8` | Questions to process in parallel. Set to 1 for sequential. |
GlobalAssertionConfig
Configuration for global assertion generation.
| Field | Type | Default | Description |
|---|---|---|---|
| `max_assertions` | `int \| None` | `20` | Maximum assertions per question. Set to 0 to disable, or None for unlimited. |
| `enable_validation` | `bool` | `True` | Whether to validate assertions against source data. |
| `min_validation_score` | `int` | `3` | Minimum score (1-5) for grounding, relevance, and verifiability. |
| `max_source_count` | `int` | `500` | Maximum deduplicated sources per assertion. Questions with assertions exceeding this are dropped. |
| `batch_size` | `int` | `100` | Batch size for map-reduce claim processing (used when semantic grouping is disabled). |
| `map_data_tokens` | `int` | `12000` | Maximum tokens per cluster in the map step when semantic grouping is enabled. |
| `reduce_data_tokens` | `int` | `32000` | Maximum input tokens for the reduce step. |
| `enable_semantic_grouping` | `bool` | `False` | Whether to group similar claims using embedding-based clustering before the map step. |
| `validate_map_assertions` | `bool` | `False` | Whether to validate map assertions before the reduce step. Filters low-quality assertions early. |
| `validate_reduce_assertions` | `bool` | `True` | Whether to validate final assertions after the reduce step. |
| `concurrent_llm_calls` | `int` | `8` | Concurrent LLM calls for batch processing and validation. |
| `max_concurrent_questions` | `int \| None` | `2` | Questions to process in parallel. Set to 1 for sequential. |
LinkedAssertionConfig
Configuration for linked assertion generation. Uses a direct claim-to-assertion pipeline (no map-reduce).
| Field | Type | Default | Description |
|---|---|---|---|
| `max_assertions` | `int \| None` | `20` | Maximum assertions per question. Set to 0 to disable, or None for unlimited. |
| `enable_validation` | `bool` | `True` | Whether to validate assertions against source data. |
| `min_validation_score` | `int` | `3` | Minimum score (1-5) for grounding, relevance, and verifiability. |
| `max_source_count` | `int` | `500` | Maximum deduplicated sources per assertion. Questions with assertions exceeding this are dropped. |
| `concurrent_llm_calls` | `int` | `8` | Concurrent LLM calls for batch processing and validation. |
| `max_concurrent_questions` | `int \| None` | `2` | Questions to process in parallel. Set to 1 for sequential. |
AssertionPromptConfig
Configuration for assertion generation prompts. Each prompt can be specified as a file path or direct text.
| Field | Type | Default | Description |
|---|---|---|---|
| `local_assertion_gen_prompt` | `PromptConfig` | (default file) | Prompt for generating assertions from local claims. |
| `global_assertion_map_prompt` | `PromptConfig` | (default file) | Prompt for the map step in global assertion generation. |
| `global_assertion_reduce_prompt` | `PromptConfig` | (default file) | Prompt for the reduce step in global assertion generation. |
| `local_validation_prompt` | `PromptConfig` | (default file) | Prompt for validating local assertions (fact-focused) against source data. |
| `global_validation_prompt` | `PromptConfig` | (default file) | Prompt for validating global assertions (theme-focused) against source data. |
QuestionGenerationConfig
Top-level configuration for the entire question generation process.
| Field | Type | Default | Description |
|---|---|---|---|
| `input` | `InputConfig` | required | Input data configuration. |
| `data_local` | `QuestionConfig` | `QuestionConfig()` | Local data question generation settings. |
| `data_global` | `DataGlobalQuestionConfig` | `DataGlobalQuestionConfig()` | Global data question generation settings. |
| `data_linked` | `DataLinkedQuestionConfig` | `DataLinkedQuestionConfig()` | Linked (multi-hop) data question generation settings. |
| `activity_local` | `ActivityQuestionConfig` | `ActivityQuestionConfig()` | Local activity question generation. |
| `activity_global` | `ActivityQuestionConfig` | `ActivityQuestionConfig()` | Global activity question generation. |
| `concurrent_requests` | `int` | `8` | Number of concurrent model requests. |
| `encoding` | `EncodingModelConfig` | `EncodingModelConfig()` | Encoding model configuration. |
| `sampling` | `SamplingConfig` | `SamplingConfig()` | Sampling configuration. |
| `chat_model` | `LLMConfig` | `LLMConfig()` | LLM configuration for chat. |
| `embedding_model` | `LLMConfig` | `LLMConfig()` | LLM configuration for embeddings. |
| `assertions` | `AssertionConfig` | `AssertionConfig()` | Assertion generation configuration. |
| `assertion_prompts` | `AssertionPromptConfig` | `AssertionPromptConfig()` | Assertion prompt configuration. |
YAML Example
Here is an example of how this configuration might look in a YAML file.
```yaml
## Input Configuration
input:
  dataset_path: ./input
  input_type: json
  text_column: body_nitf # The column in the dataset that contains the text to be processed. Modify this for your dataset
  metadata_columns: [headline, firstcreated] # Additional metadata columns to include in the input. Modify this for your dataset
  file_encoding: utf-8-sig

## Encoder Configuration
encoding:
  model_name: o200k_base
  chunk_size: 600
  chunk_overlap: 100

## Sampling Configuration
sampling:
  num_clusters: 20
  num_samples_per_cluster: 10
  random_seed: 42

## LLM Configuration
chat_model:
  auth_type: api_key
  model: gpt-4.1
  api_key: ${OPENAI_API_KEY}
  llm_provider: openai.chat
embedding_model:
  auth_type: api_key
  model: text-embedding-3-large
  api_key: ${OPENAI_API_KEY}
  llm_provider: openai.embedding

## Question Generation Sample Configuration
data_local:
  num_questions: 10
  oversample_factor: 2.0
data_global:
  num_questions: 10
  oversample_factor: 2.0
data_linked:
  num_questions: 10
  oversample_factor: 10.0
  min_questions_per_entity: 2
  max_questions_per_entity: 3
  type_balance_weight: 0.5
  max_questions_to_generate: 2
  entity_frequency_threshold: 2
activity_local:
  num_questions: 10
  oversample_factor: 2.0
  num_personas: 5
  num_tasks_per_persona: 2
  num_entities_per_task: 5
activity_global:
  num_questions: 10
  oversample_factor: 2.0
  num_personas: 5
  num_tasks_per_persona: 2
  num_entities_per_task: 5

## Assertion Generation Configuration
assertions:
  local:
    max_assertions: 20 # Set to 0 to disable, or null/None for unlimited
    enable_validation: true # Enable to filter low-quality assertions
    min_validation_score: 3 # Minimum score (1-5) to pass validation
    max_source_count: 500 # Max sources per assertion; exceeding drops the question
    concurrent_llm_calls: 8 # Concurrent LLM calls for validation
    max_concurrent_questions: 8 # Parallel questions for assertion generation. Set to 1 for sequential.
  global:
    max_assertions: 20
    enable_validation: true
    min_validation_score: 3
    max_source_count: 500 # Max sources per assertion; exceeding drops the question
    batch_size: 100 # Batch size for map-reduce processing (when semantic grouping disabled)
    map_data_tokens: 12000 # Max tokens per cluster in map step (when semantic grouping enabled)
    reduce_data_tokens: 32000 # Max tokens for reduce step
    enable_semantic_grouping: false # Set to true to group similar claims together
    validate_map_assertions: false # Set to true to validate map assertions before reduce step
    validate_reduce_assertions: true # Set to false to skip validation of final assertions
    concurrent_llm_calls: 8 # Concurrent LLM calls for batch processing/validation
    max_concurrent_questions: 2 # Parallel questions for assertion generation. Set to 1 for sequential.
  linked:
    max_assertions: 20 # Set to 0 to disable, or null/None for unlimited
    enable_validation: true # Enable to filter low-quality assertions
    min_validation_score: 3 # Minimum score (1-5) to pass validation
    max_source_count: 500 # Max sources per assertion; exceeding drops the question
    concurrent_llm_calls: 8 # Concurrent LLM calls for batch processing/validation
    max_concurrent_questions: 2 # Parallel questions for assertion generation. Set to 1 for sequential.
assertion_prompts:
  local_assertion_gen_prompt:
    prompt: prompts/data_questions/assertions/local_claim_assertion_gen_prompt.txt
  global_assertion_map_prompt:
    prompt: prompts/data_questions/assertions/global_claim_assertion_map_prompt.txt
  global_assertion_reduce_prompt:
    prompt: prompts/data_questions/assertions/global_claim_assertion_reduce_prompt.txt
  local_validation_prompt:
    prompt: prompts/data_questions/assertions/local_validation_prompt.txt
  global_validation_prompt:
    prompt: prompts/data_questions/assertions/global_validation_prompt.txt
```
💡 Note: The `api_key` field uses an environment variable reference, `${OPENAI_API_KEY}`. Make sure to define this variable in a `.env` file or your environment before running the application.
Assertion Generation
Assertions are testable factual statements derived from extracted claims that can be used as "unit tests" to evaluate the accuracy of RAG system answers. Each question can have multiple assertions that verify specific facts the answer should contain.
How Assertions Work
1. **Claim Extraction**: During question generation, claims (factual statements) are extracted from the source text.
2. **Assertion Generation**: Claims are transformed into testable assertions with clear pass/fail criteria.
3. **Optional Validation**: Assertions can be validated against source data to filter out low-quality assertions.
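The filtering step above can be sketched as follows; `Assertion` and `filter_assertions` are hypothetical names for illustration, not the BenchmarkQED API. Validation scores each assertion 1-5 on grounding, relevance, and verifiability, and only assertions meeting the threshold on every criterion survive, up to the configured cap:

```python
from dataclasses import dataclass

@dataclass
class Assertion:
    text: str
    scores: dict  # criterion name -> score (1-5)

def filter_assertions(assertions, min_validation_score=3, max_assertions=20):
    """Keep assertions meeting the threshold on all criteria, capped at max."""
    passing = [
        a for a in assertions
        if all(score >= min_validation_score for score in a.scores.values())
    ]
    return passing[:max_assertions]
```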
Assertion Types
- **Local Assertions**: Generated for `data_local` questions from claims extracted from individual text chunks.
- **Global Assertions**: Generated for `data_global` questions using a map-reduce approach across multiple source documents.
- **Linked Assertions**: Generated for `data_linked` questions using a direct claim-to-assertion pipeline (no map-reduce).
Validation
When enable_validation is set to true, each assertion is scored on three criteria (1-5 scale):
| Criterion | Description |
|---|---|
| Grounding | Is the assertion factually supported by the source data? |
| Relevance | Is the assertion relevant to the question being asked? |
| Verifiability | Can the assertion be objectively verified from an answer? |
Assertions must meet the min_validation_score threshold on all three criteria to be included.
Controlling Assertion Limits
To disable assertion generation entirely, set max_assertions: 0 for both local and global:
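```yaml
assertions:
  local:
    max_assertions: 0
  global:
    max_assertions: 0
```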
To generate unlimited assertions (no cap), set max_assertions: null:
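```yaml
assertions:
  local:
    max_assertions: null
  global:
    max_assertions: null
```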
Providing Prompts: File or Text
Prompts for question generation can be provided in two ways, as defined by the PromptConfig class:
- **As a file path**: Specify the path to a `.txt` file containing the prompt (recommended for most use cases).
- **As direct text**: Provide the prompt text directly in the configuration.
Only one of these options should be set for each prompt. If both are set, or neither is set, an error will be raised.
Example (File Path)
```yaml
activity_questions_prompt_config:
  activity_local_gen_system_prompt:
    prompt: prompts/activity_questions/local/activity_local_gen_system_prompt.txt
```
Example (Direct Text)
```yaml
activity_questions_prompt_config:
  activity_local_gen_system_prompt:
    prompt_text: |
      Generate a question about the following activity:
```
This applies to all prompt fields in QuestionGenerationConfig (including map/reduce, activity question generation, and data question generation prompt configs).
See the PromptConfig class for details.
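The exactly-one-of rule can be sketched in a few lines; `resolve_prompt` is a hypothetical helper illustrating the behavior, not the actual `PromptConfig` validator:

```python
from pathlib import Path

def resolve_prompt(prompt=None, prompt_text=None):
    """Return prompt text from exactly one of a file path or inline text."""
    # Error if both are set or neither is set, mirroring the rule above.
    if (prompt is None) == (prompt_text is None):
        raise ValueError("Set exactly one of 'prompt' (file path) or 'prompt_text'.")
    if prompt_text is not None:
        return prompt_text
    return Path(prompt).read_text()
```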
CLI Reference
This section documents the command-line interface of BenchmarkQED's AutoQ package.
autoq
No description available
Usage
```shell
autoq [OPTIONS] COMMAND [ARGS]...
```
Arguments
No arguments available
Options
| Name | Description | Required | Default |
|---|---|---|---|
| `--install-completion` | Install completion for the current shell. | No | - |
| `--show-completion` | Show completion for the current shell, to copy it or customize the installation. | No | - |
| `--help` | Show this message and exit. | No | - |
Commands
| Name | Description |
|---|---|
| `autoq` | Generate questions from the input data. |
| `assertion-stats` | Generate statistics for assertion files. |
| `generate-assertions` | Generate assertions for existing questions. |
Sub Commands
autoq autoq
Generate questions from the input data.
Usage
```shell
autoq autoq [OPTIONS] CONFIGURATION_PATH OUTPUT_DATA_PATH
```
Arguments
| Name | Description | Required |
|---|---|---|
| `CONFIGURATION_PATH` | The path to the file containing the configuration. | Yes |
| `OUTPUT_DATA_PATH` | The path to the output folder for the results. | Yes |
Options
| Name | Description | Required | Default |
|---|---|---|---|
| `--generation-types [data_local\|data_global\|data_linked\|activity_local\|activity_global]` | The source of the question generation. | No | - |
| `--print-model-usage / --no-print-model-usage` | Whether to print the model usage statistics after scoring. [default: no-print-model-usage] | No | - |
| `--help` | Show this message and exit. | No | - |
autoq assertion-stats
Generate statistics for assertion files. Computes and saves statistics including:

- Total assertions and questions
- Assertions per question (mean, std, min, max)
- Sources per assertion (mean, std, min, max)
- Supporting assertions per global assertion (mean, std, min, max)
- Score distribution

Examples:

```shell
# Generate stats for a single assertion file
benchmark-qed autoq assertion-stats output/assertions.json

# Generate stats for all assertion files in a directory
benchmark-qed autoq assertion-stats output/data_global_questions/

# Specify output path
benchmark-qed autoq assertion-stats assertions.json -o stats/my_stats.json
```
Usage
```shell
autoq assertion-stats [OPTIONS] ASSERTIONS_PATH
```
Arguments
| Name | Description | Required |
|---|---|---|
| `ASSERTIONS_PATH` | Path to assertion JSON file or directory containing assertion files. | Yes |
Options
| Name | Description | Required | Default |
|---|---|---|---|
| `-o, --output PATH` | Path to save stats JSON. If not specified, saves as {input}_stats.json. | No | - |
| `-t, --type TEXT` | Type of assertions: 'global', 'map', or 'local'. If not specified, inferred from filename. | No | - |
| `-q, --quiet` | Suppress printing stats to console. | No | - |
| `--help` | Show this message and exit. | No | - |
autoq generate-assertions
Generate assertions for existing questions. This command loads questions from a JSON file and generates assertions using the specified assertion type configuration from settings.yaml.

Examples:

```shell
# Generate local assertions for candidate questions
benchmark-qed autoq generate-assertions settings.yaml \
    output/data_local_questions/candidate_questions.json \
    output/data_local_questions/ --type local

# Generate global assertions
benchmark-qed autoq generate-assertions settings.yaml \
    output/data_global_questions/candidate_questions.json \
    output/data_global_questions/ --type global

# Generate linked assertions
benchmark-qed autoq generate-assertions settings.yaml \
    output/data_linked_questions/candidate_questions.json \
    output/data_linked_questions/ --type linked
```
Usage
```shell
autoq generate-assertions [OPTIONS] CONFIGURATION_PATH QUESTIONS_PATH OUTPUT_PATH
```
Arguments
| Name | Description | Required |
|---|---|---|
| `CONFIGURATION_PATH` | Path to the settings.yaml configuration file. | Yes |
| `QUESTIONS_PATH` | Path to questions JSON file (e.g., candidate_questions.json). | Yes |
| `OUTPUT_PATH` | Output directory to save questions with assertions. | Yes |
Options
| Name | Description | Required | Default |
|---|---|---|---|
| `-t, --type [local\|global\|linked]` | Type of assertions to generate: 'local', 'global', or 'linked'. [default: local] | No | - |
| `--print-model-usage / --no-print-model-usage` | Whether to print the model usage statistics. [default: no-print-model-usage] | No | - |
| `--help` | Show this message and exit. | No | - |