
Question Generation Configuration

This section provides an overview of the configuration schema for the question generation process, covering input data, sampling, encoding, and model settings. For details on configuring the LLM, see: LLM Configuration.

To create a template configuration file, run:

benchmark-qed config init autoq local/autoq_test/settings.yaml

To generate synthetic queries using your configuration file, run:

benchmark-qed autoq local/autoq_test/settings.yaml local/autoq_test/output

For more information about the config init command, see: Config Init CLI


Classes and Fields

InputConfig

Configuration for the input data used in question generation.

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| `dataset_path` | `Path` | required | Path to the input dataset file. |
| `input_type` | `InputDataType` | `CSV` | The type of the input data (e.g., CSV, JSON). |
| `text_column` | `str` | `"text"` | The column containing the text data. |
| `metadata_columns` | `list[str] \| None` | `None` | Optional list of columns containing metadata. |
| `file_encoding` | `str` | `"utf-8"` | Encoding of the input file. |

QuestionConfig

Configuration for generating standard questions.

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| `num_questions` | `int` | `20` | Number of questions to generate per class. |
| `oversample_factor` | `float` | `2.0` | Factor to overgenerate questions before filtering. |

DataLinkedQuestionConfig

Extends QuestionConfig with additional fields for data-linked (multi-hop) question generation. Linked questions combine multiple local questions that share named entities.

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| `min_questions_per_entity` | `int` | `2` | Minimum number of local questions required to form an entity group. |
| `max_questions_per_entity` | `int` | `3` | Maximum number of local questions to include per entity group. |
| `type_balance_weight` | `float` | `0.5` | Weight for type-balance penalty in MMR selection. 0 = ignore, 1 = strong type balancing. |
| `max_questions_to_generate` | `int` | `2` | Maximum number of questions to generate per entity group. |
| `entity_frequency_threshold` | `int` | `2` | Entity must appear in at least this many local questions to be considered. |
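The type_balance_weight field trades off a candidate's relevance against over-representing any single question type during selection. The following is an illustrative sketch of an MMR-style greedy selector with a type-balance penalty, not the package's actual implementation; the candidate tuples and scoring are hypothetical:

```python
from collections import Counter

def mmr_select(candidates, k, type_balance_weight=0.5):
    """Greedy MMR-style selection with a type-balance penalty.

    candidates: list of (question, question_type, relevance) tuples.
    A higher type_balance_weight (0..1) more strongly penalizes
    picking many questions of the same type.
    """
    selected = []
    type_counts = Counter()
    pool = list(candidates)
    while pool and len(selected) < k:
        def score(item):
            _, qtype, relevance = item
            # Penalty grows with how often this type was already chosen.
            penalty = type_counts[qtype] / (len(selected) + 1)
            return relevance - type_balance_weight * penalty
        best = max(pool, key=score)
        pool.remove(best)
        selected.append(best[0])
        type_counts[best[1]] += 1
    return selected

questions = [
    ("Q1", "comparison", 0.9),
    ("Q2", "comparison", 0.85),
    ("Q3", "causal", 0.8),
]
# With a strong penalty, the second pick skips the repeated type.
print(mmr_select(questions, k=2, type_balance_weight=1.0))  # → ['Q1', 'Q3']
```

With `type_balance_weight=0.0` the same call would pick the two most relevant candidates (`['Q1', 'Q2']`) regardless of type.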

ActivityQuestionConfig

Extends QuestionConfig with additional fields for persona-based question generation.

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| `num_personas` | `int` | `5` | Number of personas to generate questions for. |
| `num_tasks_per_persona` | `int` | `5` | Number of tasks per persona. |
| `num_entities_per_task` | `int` | `10` | Number of entities per task. |

EncodingModelConfig

Configuration for the encoding model used to chunk documents.

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| `model_name` | `str` | `"o200k_base"` | Name of the encoding model. |
| `chunk_size` | `int` | `600` | Size of each text chunk. |
| `chunk_overlap` | `int` | `100` | Overlap between consecutive chunks. |
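Chunking splits each document into overlapping windows of tokens. The sketch below shows the general pattern with these defaults, using a plain token list as a stand-in for the output of the o200k_base tokenizer (the real pipeline encodes text with the named encoding model first):

```python
def chunk_tokens(tokens, chunk_size=600, chunk_overlap=100):
    """Split a token sequence into overlapping windows.

    Consecutive chunks share `chunk_overlap` tokens, so each
    window advances by (chunk_size - chunk_overlap) tokens.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break
    return chunks

# 1500 tokens with 600-token windows stepping by 500 → 3 chunks.
print(len(chunk_tokens(list(range(1500)))))  # → 3
```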

SamplingConfig

Configuration for sampling data from clusters.

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| `num_clusters` | `int` | `50` | Number of clusters to sample from. |
| `num_samples_per_cluster` | `int` | `10` | Number of samples per cluster. |
| `random_seed` | `int` | `42` | Seed for reproducibility. |
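The sampling step draws a fixed number of items from each cluster, seeded so repeated runs select the same samples. A minimal sketch of this behavior (the clustering itself, e.g. over text embeddings, is assumed to have happened already):

```python
import random

def sample_clusters(clusters, num_samples_per_cluster=10, random_seed=42):
    """Draw up to num_samples_per_cluster items from each cluster, reproducibly."""
    rng = random.Random(random_seed)
    samples = {}
    for cluster_id, items in clusters.items():
        k = min(num_samples_per_cluster, len(items))
        samples[cluster_id] = rng.sample(items, k)
    return samples

clusters = {0: ["a", "b", "c"], 1: ["d", "e"]}
print(sample_clusters(clusters, num_samples_per_cluster=2))
```

Because the generator is seeded with random_seed, calling this twice with the same inputs yields identical samples.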

AssertionConfig

Configuration for assertion generation with separate settings for local, global, and linked questions.

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| `local` | `LocalAssertionConfig` | (see below) | Configuration for local assertion generation. |
| `global` | `GlobalAssertionConfig` | (see below) | Configuration for global assertion generation. |
| `linked` | `LinkedAssertionConfig` | (see below) | Configuration for linked assertion generation. |

LocalAssertionConfig

Configuration for local assertion generation.

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| `max_assertions` | `int \| None` | `20` | Maximum assertions per question. Set to 0 to disable, or None for unlimited. |
| `enable_validation` | `bool` | `True` | Whether to validate assertions against source data. |
| `min_validation_score` | `int` | `3` | Minimum score (1-5) for grounding, relevance, and verifiability. |
| `max_source_count` | `int` | `500` | Maximum deduplicated sources per assertion. Questions with assertions exceeding this are dropped. |
| `concurrent_llm_calls` | `int` | `8` | Concurrent LLM calls for validation. |
| `max_concurrent_questions` | `int \| None` | `8` | Questions to process in parallel. Set to 1 for sequential. |

GlobalAssertionConfig

Configuration for global assertion generation.

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| `max_assertions` | `int \| None` | `20` | Maximum assertions per question. Set to 0 to disable, or None for unlimited. |
| `enable_validation` | `bool` | `True` | Whether to validate assertions against source data. |
| `min_validation_score` | `int` | `3` | Minimum score (1-5) for grounding, relevance, and verifiability. |
| `max_source_count` | `int` | `500` | Maximum deduplicated sources per assertion. Questions with assertions exceeding this are dropped. |
| `batch_size` | `int` | `100` | Batch size for map-reduce claim processing (used when semantic grouping is disabled). |
| `map_data_tokens` | `int` | `12000` | Maximum tokens per cluster in the map step when semantic grouping is enabled. |
| `reduce_data_tokens` | `int` | `32000` | Maximum input tokens for the reduce step. |
| `enable_semantic_grouping` | `bool` | `False` | Whether to group similar claims using embedding-based clustering before the map step. |
| `validate_map_assertions` | `bool` | `False` | Whether to validate map assertions before the reduce step. Filters low-quality assertions early. |
| `validate_reduce_assertions` | `bool` | `True` | Whether to validate final assertions after the reduce step. |
| `concurrent_llm_calls` | `int` | `8` | Concurrent LLM calls for batch processing and validation. |
| `max_concurrent_questions` | `int \| None` | `2` | Questions to process in parallel. Set to 1 for sequential. |

LinkedAssertionConfig

Configuration for linked assertion generation. Uses a direct claim-to-assertion pipeline (no map-reduce).

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| `max_assertions` | `int \| None` | `20` | Maximum assertions per question. Set to 0 to disable, or None for unlimited. |
| `enable_validation` | `bool` | `True` | Whether to validate assertions against source data. |
| `min_validation_score` | `int` | `3` | Minimum score (1-5) for grounding, relevance, and verifiability. |
| `max_source_count` | `int` | `500` | Maximum deduplicated sources per assertion. Questions with assertions exceeding this are dropped. |
| `concurrent_llm_calls` | `int` | `8` | Concurrent LLM calls for batch processing and validation. |
| `max_concurrent_questions` | `int \| None` | `2` | Questions to process in parallel. Set to 1 for sequential. |
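The two concurrency fields bound work at different levels: max_concurrent_questions limits how many questions are processed in parallel, while concurrent_llm_calls limits the LLM requests in flight across them. A common way to implement such nested bounds, sketched here with asyncio semaphores and a stand-in for the real LLM call (this is illustrative, not the package's actual code):

```python
import asyncio

async def process_all(questions, max_concurrent_questions=2, concurrent_llm_calls=8):
    question_sem = asyncio.Semaphore(max_concurrent_questions)
    llm_sem = asyncio.Semaphore(concurrent_llm_calls)

    async def fake_llm_call(claim):
        # Stand-in for a real LLM request; bounded by concurrent_llm_calls.
        async with llm_sem:
            await asyncio.sleep(0)
            return f"assertion for {claim}"

    async def process_question(question, claims):
        # Bounded by max_concurrent_questions.
        async with question_sem:
            return await asyncio.gather(*(fake_llm_call(c) for c in claims))

    return await asyncio.gather(
        *(process_question(q, claims) for q, claims in questions.items())
    )

results = asyncio.run(process_all({"Q1": ["c1", "c2"], "Q2": ["c3"]}))
print(results)  # → [['assertion for c1', 'assertion for c2'], ['assertion for c3']]
```

Setting max_concurrent_questions to 1 reduces this to sequential question processing, as the field descriptions note.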

AssertionPromptConfig

Configuration for assertion generation prompts. Each prompt can be specified as a file path or direct text.

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| `local_assertion_gen_prompt` | `PromptConfig` | (default file) | Prompt for generating assertions from local claims. |
| `global_assertion_map_prompt` | `PromptConfig` | (default file) | Prompt for the map step in global assertion generation. |
| `global_assertion_reduce_prompt` | `PromptConfig` | (default file) | Prompt for the reduce step in global assertion generation. |
| `local_validation_prompt` | `PromptConfig` | (default file) | Prompt for validating local assertions (fact-focused) against source data. |
| `global_validation_prompt` | `PromptConfig` | (default file) | Prompt for validating global assertions (theme-focused) against source data. |

QuestionGenerationConfig

Top-level configuration for the entire question generation process.

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| `input` | `InputConfig` | required | Input data configuration. |
| `data_local` | `QuestionConfig` | `QuestionConfig()` | Local data question generation settings. |
| `data_global` | `DataGlobalQuestionConfig` | `DataGlobalQuestionConfig()` | Global data question generation settings. |
| `data_linked` | `DataLinkedQuestionConfig` | `DataLinkedQuestionConfig()` | Linked (multi-hop) data question generation settings. |
| `activity_local` | `ActivityQuestionConfig` | `ActivityQuestionConfig()` | Local activity question generation. |
| `activity_global` | `ActivityQuestionConfig` | `ActivityQuestionConfig()` | Global activity question generation. |
| `concurrent_requests` | `int` | `8` | Number of concurrent model requests. |
| `encoding` | `EncodingModelConfig` | `EncodingModelConfig()` | Encoding model configuration. |
| `sampling` | `SamplingConfig` | `SamplingConfig()` | Sampling configuration. |
| `chat_model` | `LLMConfig` | `LLMConfig()` | LLM configuration for chat. |
| `embedding_model` | `LLMConfig` | `LLMConfig()` | LLM configuration for embeddings. |
| `assertions` | `AssertionConfig` | `AssertionConfig()` | Assertion generation configuration. |
| `assertion_prompts` | `AssertionPromptConfig` | `AssertionPromptConfig()` | Assertion prompt configuration. |

YAML Example

Here is an example of how this configuration might look in a YAML file.

## Input Configuration
input:
  dataset_path: ./input
  input_type: json
  text_column: body_nitf # The column in the dataset that contains the text to be processed. Modify this for your dataset
  metadata_columns: [headline, firstcreated] # Additional metadata columns to include in the input. Modify this for your dataset
  file_encoding: utf-8-sig

## Encoder configuration
encoding:
  model_name: o200k_base
  chunk_size: 600
  chunk_overlap: 100

## Sampling Configuration
sampling:
  num_clusters: 20
  num_samples_per_cluster: 10
  random_seed: 42

## LLM Configuration
chat_model:
  auth_type: api_key
  model: gpt-4.1
  api_key: ${OPENAI_API_KEY}
  llm_provider: openai.chat
embedding_model:
  auth_type: api_key
  model: text-embedding-3-large
  api_key: ${OPENAI_API_KEY}
  llm_provider: openai.embedding

## Question Generation Sample Configuration
data_local:
  num_questions: 10
  oversample_factor: 2.0
data_global:
  num_questions: 10
  oversample_factor: 2.0
data_linked:
  num_questions: 10
  oversample_factor: 10.0
  min_questions_per_entity: 2
  max_questions_per_entity: 3
  type_balance_weight: 0.5
  max_questions_to_generate: 2
  entity_frequency_threshold: 2
activity_local:
  num_questions: 10
  oversample_factor: 2.0
  num_personas: 5
  num_tasks_per_persona: 2
  num_entities_per_task: 5
activity_global:
  num_questions: 10
  oversample_factor: 2.0
  num_personas: 5
  num_tasks_per_persona: 2
  num_entities_per_task: 5

## Assertion Generation Configuration
assertions:
  local:
    max_assertions: 20  # Set to 0 to disable, or null/None for unlimited
    enable_validation: true  # Enable to filter low-quality assertions
    min_validation_score: 3  # Minimum score (1-5) to pass validation
    max_source_count: 500  # Max sources per assertion; exceeding drops the question
    concurrent_llm_calls: 8  # Concurrent LLM calls for validation
    max_concurrent_questions: 8  # Parallel questions for assertion generation. Set to 1 for sequential.
  global:
    max_assertions: 20
    enable_validation: true
    min_validation_score: 3
    max_source_count: 500  # Max sources per assertion; exceeding drops the question
    batch_size: 100  # Batch size for map-reduce processing (when semantic grouping disabled)
    map_data_tokens: 12000  # Max tokens per cluster in map step (when semantic grouping enabled)
    reduce_data_tokens: 32000  # Max tokens for reduce step
    enable_semantic_grouping: false  # Set to true to group similar claims together
    validate_map_assertions: false  # Set to true to validate map assertions before reduce step
    validate_reduce_assertions: true  # Set to false to skip validation of final assertions
    concurrent_llm_calls: 8  # Concurrent LLM calls for batch processing/validation
    max_concurrent_questions: 2  # Parallel questions for assertion generation. Set to 1 for sequential.
  linked:
    max_assertions: 20  # Set to 0 to disable, or null/None for unlimited
    enable_validation: true  # Enable to filter low-quality assertions
    min_validation_score: 3  # Minimum score (1-5) to pass validation
    max_source_count: 500  # Max sources per assertion; exceeding drops the question
    concurrent_llm_calls: 8  # Concurrent LLM calls for batch processing/validation
    max_concurrent_questions: 2  # Parallel questions for assertion generation. Set to 1 for sequential.

assertion_prompts:
  local_assertion_gen_prompt:
    prompt: prompts/data_questions/assertions/local_claim_assertion_gen_prompt.txt
  global_assertion_map_prompt:
    prompt: prompts/data_questions/assertions/global_claim_assertion_map_prompt.txt
  global_assertion_reduce_prompt:
    prompt: prompts/data_questions/assertions/global_claim_assertion_reduce_prompt.txt
  local_validation_prompt:
    prompt: prompts/data_questions/assertions/local_validation_prompt.txt
  global_validation_prompt:
    prompt: prompts/data_questions/assertions/global_validation_prompt.txt
# .env file
OPENAI_API_KEY=your-secret-api-key-here

💡 Note: The api_key field uses an environment variable reference ${OPENAI_API_KEY}. Make sure to define this variable in a .env file or your environment before running the application.
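One simple way to resolve ${VAR} references like this when loading a configuration file is Python's os.path.expandvars, shown below as a sketch of the general technique (not necessarily how BenchmarkQED implements it; the placeholder value is hypothetical):

```python
import os

# Ensure the variable exists for this demo; in practice it comes
# from a .env file or the shell environment.
os.environ.setdefault("OPENAI_API_KEY", "your-secret-api-key-here")

raw_value = "${OPENAI_API_KEY}"           # as written in settings.yaml
resolved = os.path.expandvars(raw_value)  # substitutes the env var's value
print(resolved)
```

If the variable is undefined, os.path.expandvars leaves the `${...}` reference unchanged, which is why the variable must be set before running the application.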


Assertion Generation

Assertions are testable factual statements derived from extracted claims that can be used as "unit tests" to evaluate the accuracy of RAG system answers. Each question can have multiple assertions that verify specific facts the answer should contain.

How Assertions Work

  1. Claim Extraction: During question generation, claims (factual statements) are extracted from the source text.
  2. Assertion Generation: Claims are transformed into testable assertions with clear pass/fail criteria.
  3. Optional Validation: Assertions can be validated against source data to filter out low-quality assertions.

Assertion Types

  • Local Assertions: Generated for data_local questions from claims extracted from individual text chunks.
  • Global Assertions: Generated for data_global questions using a map-reduce approach across multiple source documents.
  • Linked Assertions: Generated for data_linked questions using a direct claim-to-assertion pipeline (no map-reduce).

Validation

When enable_validation is set to true, each assertion is scored on three criteria (1-5 scale):

| Criterion | Description |
| --- | --- |
| Grounding | Is the assertion factually supported by the source data? |
| Relevance | Is the assertion relevant to the question being asked? |
| Verifiability | Can the assertion be objectively verified from an answer? |

Assertions must meet the min_validation_score threshold on all three criteria to be included.
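This filtering rule reduces to keeping only assertions whose lowest criterion score clears the threshold. A minimal sketch (the score dictionaries are illustrative; the real pipeline obtains them from the validation LLM):

```python
def passes_validation(scores, min_validation_score=3):
    """scores: dict with 'grounding', 'relevance', 'verifiability' keys (1-5 each).

    An assertion passes only if ALL three criteria meet the threshold,
    i.e. its minimum score does.
    """
    return min(scores["grounding"], scores["relevance"], scores["verifiability"]) >= min_validation_score

assertions = [
    {"text": "A1", "scores": {"grounding": 5, "relevance": 4, "verifiability": 4}},
    {"text": "A2", "scores": {"grounding": 2, "relevance": 5, "verifiability": 5}},
]
kept = [a for a in assertions if passes_validation(a["scores"])]
print([a["text"] for a in kept])  # → ['A1'] (A2 fails on grounding)
```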

Controlling Assertion Limits

To disable assertion generation entirely, set max_assertions: 0 for each assertion type:

assertions:
  local:
    max_assertions: 0
  global:
    max_assertions: 0
  linked:
    max_assertions: 0

To generate unlimited assertions (no cap), set max_assertions: null:

assertions:
  local:
    max_assertions: null  # or omit to use default of 20
  global:
    max_assertions: null

Providing Prompts: File or Text

Prompts for question generation can be provided in two ways, as defined by the PromptConfig class:

  • As a file path: Specify the path to a .txt file containing the prompt (recommended for most use cases).
  • As direct text: Provide the prompt text directly in the configuration.

Only one of these options should be set for each prompt. If both are set, or neither is set, an error will be raised.

Example (File Path)

activity_questions_prompt_config:
  activity_local_gen_system_prompt:
    prompt: prompts/activity_questions/local/activity_local_gen_system_prompt.txt

Example (Direct Text)

activity_questions_prompt_config:
  activity_local_gen_system_prompt:
    prompt_text: |
      Generate a question about the following activity:

This applies to all prompt fields in QuestionGenerationConfig (including map/reduce, activity question generation, and data question generation prompt configs).

See the PromptConfig class for details.
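The mutual-exclusivity rule above can be sketched as a small resolver; the function name and signature here are illustrative, not the actual PromptConfig API:

```python
from pathlib import Path

def resolve_prompt(prompt=None, prompt_text=None):
    """Return the prompt text, enforcing that exactly one source is set.

    prompt: path to a .txt file containing the prompt.
    prompt_text: the prompt provided directly as a string.
    """
    # Both-set and neither-set are equally invalid.
    if (prompt is None) == (prompt_text is None):
        raise ValueError("Set exactly one of 'prompt' (file path) or 'prompt_text'.")
    if prompt_text is not None:
        return prompt_text
    return Path(prompt).read_text(encoding="utf-8")

print(resolve_prompt(prompt_text="Generate a question about the following activity:"))
```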


CLI Reference

This section documents the command-line interface of BenchmarkQED's AutoQ package.

autoq

Top-level command group for the AutoQ CLI.

Usage

autoq [OPTIONS] COMMAND [ARGS]...

Arguments

None (use one of the commands below).

Options

| Name | Description | Required | Default |
| --- | --- | --- | --- |
| `--install-completion` | Install completion for the current shell. | No | - |
| `--show-completion` | Show completion for the current shell, to copy it or customize the installation. | No | - |
| `--help` | Show this message and exit. | No | - |

Commands

| Name | Description |
| --- | --- |
| `autoq` | Generate questions from the input data. |
| `assertion-stats` | Generate statistics for assertion files. |
| `generate-assertions` | Generate assertions for existing questions. |

Sub Commands

autoq autoq

Generate questions from the input data.

Usage

autoq autoq [OPTIONS] CONFIGURATION_PATH OUTPUT_DATA_PATH

Arguments

| Name | Description | Required |
| --- | --- | --- |
| `CONFIGURATION_PATH` | The path to the file containing the configuration. | Yes |
| `OUTPUT_DATA_PATH` | The path to the output folder for the results. | Yes |

Options

| Name | Description | Required | Default |
| --- | --- | --- | --- |
| `--generation-types` | The source of the question generation: `data_local`, `data_global`, `data_linked`, `activity_local`, or `activity_global`. | No | - |
| `--print-model-usage` / `--no-print-model-usage` | Whether to print the model usage statistics after scoring. | No | `no-print-model-usage` |
| `--help` | Show this message and exit. | No | - |

autoq assertion-stats

Generate statistics for assertion files. Computes and saves statistics including:

  • Total assertions and questions
  • Assertions per question (mean, std, min, max)
  • Sources per assertion (mean, std, min, max)
  • Supporting assertions per global assertion (mean, std, min, max)
  • Score distribution

Examples


Generate stats for a single assertion file

benchmark-qed autoq assertion-stats output/assertions.json

Generate stats for all assertion files in a directory

benchmark-qed autoq assertion-stats output/data_global_questions/

Specify output path

benchmark-qed autoq assertion-stats assertions.json -o stats/my_stats.json

Usage

autoq assertion-stats [OPTIONS] ASSERTIONS_PATH

Arguments

| Name | Description | Required |
| --- | --- | --- |
| `ASSERTIONS_PATH` | Path to assertion JSON file or directory containing assertion files. | Yes |

Options

| Name | Description | Required | Default |
| --- | --- | --- | --- |
| `-o, --output PATH` | Path to save stats JSON. If not specified, saves as `{input}_stats.json`. | No | - |
| `-t, --type TEXT` | Type of assertions: 'global', 'map', or 'local'. If not specified, inferred from filename. | No | - |
| `-q, --quiet` | Suppress printing stats to console. | No | - |
| `--help` | Show this message and exit. | No | - |

autoq generate-assertions

Generate assertions for existing questions. This command loads questions from a JSON file and generates assertions using the specified assertion type configuration from settings.yaml.

Examples


Generate local assertions for candidate questions

benchmark-qed autoq generate-assertions settings.yaml \
    output/data_local_questions/candidate_questions.json \
    output/data_local_questions/ --type local

Generate global assertions

benchmark-qed autoq generate-assertions settings.yaml \
    output/data_global_questions/candidate_questions.json \
    output/data_global_questions/ --type global

Generate linked assertions

benchmark-qed autoq generate-assertions settings.yaml \
    output/data_linked_questions/candidate_questions.json \
    output/data_linked_questions/ --type linked

Usage

autoq generate-assertions [OPTIONS] CONFIGURATION_PATH QUESTIONS_PATH OUTPUT_PATH

Arguments

| Name | Description | Required |
| --- | --- | --- |
| `CONFIGURATION_PATH` | Path to the settings.yaml configuration file. | Yes |
| `QUESTIONS_PATH` | Path to questions JSON file (e.g., candidate_questions.json). | Yes |
| `OUTPUT_PATH` | Output directory to save questions with assertions. | Yes |

Options

| Name | Description | Required | Default |
| --- | --- | --- | --- |
| `-t, --type` | Type of assertions to generate: `local`, `global`, or `linked`. | No | `local` |
| `--print-model-usage` / `--no-print-model-usage` | Whether to print the model usage statistics. | No | `no-print-model-usage` |
| `--help` | Show this message and exit. | No | - |