
Question Generation Configuration

This section provides an overview of the configuration schema for the question generation process, covering input data, sampling, encoding, and model settings. For details on configuring the LLM, see: LLM Configuration.

To create a template configuration file, run:

benchmark-qed config init autoq local/autoq_test/settings.yaml

To generate synthetic queries using your configuration file, run:

benchmark-qed autoq local/autoq_test/settings.yaml local/autoq_test/output

For more information about the config init command, see: Config Init CLI


Classes and Fields

InputConfig

Configuration for the input data used in question generation.

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| `dataset_path` | `Path` | required | Path to the input dataset file. |
| `input_type` | `InputDataType` | `CSV` | The type of the input data (e.g., CSV, JSON). |
| `text_column` | `str` | `"text"` | The column containing the text data. |
| `metadata_columns` | `list[str] \| None` | `None` | Optional list of columns containing metadata. |
| `file_encoding` | `str` | `"utf-8"` | Encoding of the input file. |

QuestionConfig

Configuration for generating standard questions.

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| `num_questions` | `int` | `20` | Number of questions to generate per class. |
| `oversample_factor` | `float` | `2.0` | Factor to overgenerate questions before filtering. |

DataLinkedQuestionConfig

Extends QuestionConfig with additional fields for data-linked (multi-hop) question generation. Linked questions combine multiple local questions that share named entities.

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| `min_questions_per_entity` | `int` | `2` | Minimum number of local questions required to form an entity group. |
| `max_questions_per_entity` | `int` | `3` | Maximum number of local questions to include per entity group. |
| `type_balance_weight` | `float` | `0.5` | Weight for type-balance penalty in MMR selection. 0 = ignore, 1 = strong type balancing. |
| `max_questions_to_generate` | `int` | `2` | Maximum number of questions to generate per entity group. |
| `entity_frequency_threshold` | `int` | `2` | Entity must appear in at least this many local questions to be considered. |
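The type_balance_weight field trades off a candidate's relevance against over-representing any single question type during selection. The following is an illustrative sketch of an MMR-style greedy selector with a type-balance penalty, not the package's actual implementation; the candidate tuples and scoring are hypothetical:

```python
from collections import Counter

def mmr_select(candidates, k, type_balance_weight=0.5):
    """Greedy MMR-style selection with a type-balance penalty.

    candidates: list of (question, question_type, relevance) tuples.
    A higher type_balance_weight (0..1) more strongly penalizes
    picking many questions of the same type.
    """
    selected = []
    type_counts = Counter()
    pool = list(candidates)
    while pool and len(selected) < k:
        def score(item):
            _, qtype, relevance = item
            # Penalty grows with how often this type was already chosen.
            penalty = type_counts[qtype] / (len(selected) + 1)
            return relevance - type_balance_weight * penalty
        best = max(pool, key=score)
        pool.remove(best)
        selected.append(best[0])
        type_counts[best[1]] += 1
    return selected

questions = [
    ("Q1", "comparison", 0.9),
    ("Q2", "comparison", 0.85),
    ("Q3", "causal", 0.8),
]
# With a strong penalty, the second pick skips the repeated type.
print(mmr_select(questions, k=2, type_balance_weight=1.0))  # → ['Q1', 'Q3']
```

With `type_balance_weight=0.0` the same call would pick the two most relevant candidates (`['Q1', 'Q2']`) regardless of type.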

ActivityQuestionConfig

Extends QuestionConfig with additional fields for persona-based question generation.

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| `num_personas` | `int` | `5` | Number of personas to generate questions for. |
| `num_tasks_per_persona` | `int` | `5` | Number of tasks per persona. |
| `num_entities_per_task` | `int` | `10` | Number of entities per task. |

EncodingModelConfig

Configuration for the encoding model used to chunk documents.

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| `model_name` | `str` | `"o200k_base"` | Name of the encoding model. |
| `chunk_size` | `int` | `600` | Size of each text chunk. |
| `chunk_overlap` | `int` | `100` | Overlap between consecutive chunks. |
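Chunking splits each document into overlapping windows of tokens. The sketch below shows the general pattern with these defaults, using a plain token list as a stand-in for the output of the o200k_base tokenizer (the real pipeline encodes text with the named encoding model first):

```python
def chunk_tokens(tokens, chunk_size=600, chunk_overlap=100):
    """Split a token sequence into overlapping windows.

    Consecutive chunks share `chunk_overlap` tokens, so each
    window advances by (chunk_size - chunk_overlap) tokens.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break
    return chunks

# 1500 tokens with 600-token windows stepping by 500 → 3 chunks.
print(len(chunk_tokens(list(range(1500)))))  # → 3
```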

SamplingConfig

Configuration for sampling data from clusters.

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| `num_clusters` | `int` | `50` | Number of clusters to sample from. |
| `num_samples_per_cluster` | `int` | `10` | Number of samples per cluster. |
| `random_seed` | `int` | `42` | Seed for reproducibility. |
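The sampling step draws a fixed number of items from each cluster, seeded so repeated runs select the same samples. A minimal sketch of this behavior (the clustering itself, e.g. over text embeddings, is assumed to have happened already):

```python
import random

def sample_clusters(clusters, num_samples_per_cluster=10, random_seed=42):
    """Draw up to num_samples_per_cluster items from each cluster, reproducibly."""
    rng = random.Random(random_seed)
    samples = {}
    for cluster_id, items in clusters.items():
        k = min(num_samples_per_cluster, len(items))
        samples[cluster_id] = rng.sample(items, k)
    return samples

clusters = {0: ["a", "b", "c"], 1: ["d", "e"]}
print(sample_clusters(clusters, num_samples_per_cluster=2))
```

Because the generator is seeded with random_seed, calling this twice with the same inputs yields identical samples.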

AssertionConfig

Configuration for assertion generation with separate settings for local, global, and linked questions.

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| `local` | `LocalAssertionConfig` | (see below) | Configuration for local assertion generation. |
| `global` | `GlobalAssertionConfig` | (see below) | Configuration for global assertion generation. |
| `linked` | `LinkedAssertionConfig` | (see below) | Configuration for linked assertion generation. |

LocalAssertionConfig

Configuration for local assertion generation.

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| `max_assertions` | `int \| None` | `20` | Maximum assertions per question. Set to 0 to disable, or None for unlimited. |
| `enable_validation` | `bool` | `True` | Whether to validate assertions against source data. |
| `min_validation_score` | `int` | `3` | Minimum score (1-5) for grounding, relevance, and verifiability. |
| `max_source_count` | `int` | `500` | Maximum deduplicated sources per assertion. Questions with assertions exceeding this are dropped. |
| `concurrent_llm_calls` | `int` | `8` | Concurrent LLM calls for validation. |
| `max_concurrent_questions` | `int \| None` | `8` | Questions to process in parallel. Set to 1 for sequential. |

GlobalAssertionConfig

Configuration for global assertion generation.

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| `max_assertions` | `int \| None` | `20` | Maximum assertions per question. Set to 0 to disable, or None for unlimited. |
| `enable_validation` | `bool` | `True` | Whether to validate assertions against source data. |
| `min_validation_score` | `int` | `3` | Minimum score (1-5) for grounding, relevance, and verifiability. |
| `max_source_count` | `int` | `500` | Maximum deduplicated sources per assertion. Questions with assertions exceeding this are dropped. |
| `batch_size` | `int` | `100` | Batch size for map-reduce claim processing (used when semantic grouping is disabled). |
| `map_data_tokens` | `int` | `12000` | Maximum tokens per cluster in the map step when semantic grouping is enabled. |
| `reduce_data_tokens` | `int` | `32000` | Maximum input tokens for the reduce step. |
| `enable_semantic_grouping` | `bool` | `False` | Whether to group similar claims using embedding-based clustering before the map step. |
| `validate_map_assertions` | `bool` | `False` | Whether to validate map assertions before the reduce step. Filters low-quality assertions early. |
| `validate_reduce_assertions` | `bool` | `True` | Whether to validate final assertions after the reduce step. |
| `concurrent_llm_calls` | `int` | `8` | Concurrent LLM calls for batch processing and validation. |
| `max_concurrent_questions` | `int \| None` | `2` | Questions to process in parallel. Set to 1 for sequential. |

LinkedAssertionConfig

Configuration for linked assertion generation. Uses a direct claim-to-assertion pipeline (no map-reduce).

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| `max_assertions` | `int \| None` | `20` | Maximum assertions per question. Set to 0 to disable, or None for unlimited. |
| `enable_validation` | `bool` | `True` | Whether to validate assertions against source data. |
| `min_validation_score` | `int` | `3` | Minimum score (1-5) for grounding, relevance, and verifiability. |
| `max_source_count` | `int` | `500` | Maximum deduplicated sources per assertion. Questions with assertions exceeding this are dropped. |
| `concurrent_llm_calls` | `int` | `8` | Concurrent LLM calls for batch processing and validation. |
| `max_concurrent_questions` | `int \| None` | `2` | Questions to process in parallel. Set to 1 for sequential. |
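The two concurrency fields bound work at different levels: max_concurrent_questions limits how many questions are processed in parallel, while concurrent_llm_calls limits the LLM requests in flight across them. A common way to implement such nested bounds, sketched here with asyncio semaphores and a stand-in for the real LLM call (this is illustrative, not the package's actual code):

```python
import asyncio

async def process_all(questions, max_concurrent_questions=2, concurrent_llm_calls=8):
    question_sem = asyncio.Semaphore(max_concurrent_questions)
    llm_sem = asyncio.Semaphore(concurrent_llm_calls)

    async def fake_llm_call(claim):
        # Stand-in for a real LLM request; bounded by concurrent_llm_calls.
        async with llm_sem:
            await asyncio.sleep(0)
            return f"assertion for {claim}"

    async def process_question(question, claims):
        # Bounded by max_concurrent_questions.
        async with question_sem:
            return await asyncio.gather(*(fake_llm_call(c) for c in claims))

    return await asyncio.gather(
        *(process_question(q, claims) for q, claims in questions.items())
    )

results = asyncio.run(process_all({"Q1": ["c1", "c2"], "Q2": ["c3"]}))
print(results)  # → [['assertion for c1', 'assertion for c2'], ['assertion for c3']]
```

Setting max_concurrent_questions to 1 reduces this to sequential question processing, as the field descriptions note.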

AssertionPromptConfig

Configuration for assertion generation prompts. Each prompt can be specified as a file path or direct text.

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| `local_assertion_gen_prompt` | `PromptConfig` | (default file) | Prompt for generating assertions from local claims. |
| `global_assertion_map_prompt` | `PromptConfig` | (default file) | Prompt for the map step in global assertion generation. |
| `global_assertion_reduce_prompt` | `PromptConfig` | (default file) | Prompt for the reduce step in global assertion generation. |
| `local_validation_prompt` | `PromptConfig` | (default file) | Prompt for validating local assertions (fact-focused) against source data. |
| `global_validation_prompt` | `PromptConfig` | (default file) | Prompt for validating global assertions (theme-focused) against source data. |

QuestionGenerationConfig

Top-level configuration for the entire question generation process.

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| `input` | `InputConfig` | required | Input data configuration. |
| `data_local` | `QuestionConfig` | `QuestionConfig()` | Local data question generation settings. |
| `data_global` | `DataGlobalQuestionConfig` | `DataGlobalQuestionConfig()` | Global data question generation settings. |
| `data_linked` | `DataLinkedQuestionConfig` | `DataLinkedQuestionConfig()` | Linked (multi-hop) data question generation settings. |
| `activity_local` | `ActivityQuestionConfig` | `ActivityQuestionConfig()` | Local activity question generation. |
| `activity_global` | `ActivityQuestionConfig` | `ActivityQuestionConfig()` | Global activity question generation. |
| `concurrent_requests` | `int` | `8` | Number of concurrent model requests. |
| `encoding` | `EncodingModelConfig` | `EncodingModelConfig()` | Encoding model configuration. |
| `sampling` | `SamplingConfig` | `SamplingConfig()` | Sampling configuration. |
| `chat_model` | `LLMConfig` | `LLMConfig()` | LLM configuration for chat. |
| `embedding_model` | `LLMConfig` | `LLMConfig()` | LLM configuration for embeddings. |
| `assertions` | `AssertionConfig` | `AssertionConfig()` | Assertion generation configuration. |
| `assertion_prompts` | `AssertionPromptConfig` | `AssertionPromptConfig()` | Assertion prompt configuration. |

YAML Example

Here is an example of how this configuration might look in a YAML file.

## Input Configuration
input:
  dataset_path: ./input
  input_type: json
  text_column: body_nitf # The column in the dataset that contains the text to be processed. Modify this for your dataset
  metadata_columns: [headline, firstcreated] # Additional metadata columns to include in the input. Modify this for your dataset
  file_encoding: utf-8-sig

## Encoder configuration
encoding:
  model_name: o200k_base
  chunk_size: 600
  chunk_overlap: 100

## Sampling Configuration
sampling:
  num_clusters: 20
  num_samples_per_cluster: 10
  random_seed: 42

## LLM Configuration
chat_model:
  auth_type: api_key
  model: gpt-4.1
  api_key: ${OPENAI_API_KEY}
  llm_provider: openai.chat
embedding_model:
  auth_type: api_key
  model: text-embedding-3-large
  api_key: ${OPENAI_API_KEY}
  llm_provider: openai.embedding

## Question Generation Sample Configuration
data_local:
  num_questions: 10
  oversample_factor: 2.0
data_global:
  num_questions: 10
  oversample_factor: 2.0
data_linked:
  num_questions: 10
  oversample_factor: 10.0
  min_questions_per_entity: 2
  max_questions_per_entity: 3
  type_balance_weight: 0.5
  max_questions_to_generate: 2
  entity_frequency_threshold: 2
activity_local:
  num_questions: 10
  oversample_factor: 2.0
  num_personas: 5
  num_tasks_per_persona: 2
  num_entities_per_task: 5
activity_global:
  num_questions: 10
  oversample_factor: 2.0
  num_personas: 5
  num_tasks_per_persona: 2
  num_entities_per_task: 5

## Assertion Generation Configuration
assertions:
  local:
    max_assertions: 20  # Set to 0 to disable, or null/None for unlimited
    enable_validation: true  # Enable to filter low-quality assertions
    min_validation_score: 3  # Minimum score (1-5) to pass validation
    max_source_count: 500  # Max sources per assertion; exceeding drops the question
    concurrent_llm_calls: 8  # Concurrent LLM calls for validation
    max_concurrent_questions: 8  # Parallel questions for assertion generation. Set to 1 for sequential.
  global:
    max_assertions: 20
    enable_validation: true
    min_validation_score: 3
    max_source_count: 500  # Max sources per assertion; exceeding drops the question
    batch_size: 100  # Batch size for map-reduce processing (when semantic grouping disabled)
    map_data_tokens: 12000  # Max tokens per cluster in map step (when semantic grouping enabled)
    reduce_data_tokens: 32000  # Max tokens for reduce step
    enable_semantic_grouping: false  # Set to true to group similar claims together
    validate_map_assertions: false  # Set to true to validate map assertions before reduce step
    validate_reduce_assertions: true  # Set to false to skip validation of final assertions
    concurrent_llm_calls: 8  # Concurrent LLM calls for batch processing/validation
    max_concurrent_questions: 2  # Parallel questions for assertion generation. Set to 1 for sequential.
  linked:
    max_assertions: 20  # Set to 0 to disable, or null/None for unlimited
    enable_validation: true  # Enable to filter low-quality assertions
    min_validation_score: 3  # Minimum score (1-5) to pass validation
    max_source_count: 500  # Max sources per assertion; exceeding drops the question
    concurrent_llm_calls: 8  # Concurrent LLM calls for batch processing/validation
    max_concurrent_questions: 2  # Parallel questions for assertion generation. Set to 1 for sequential.

assertion_prompts:
  local_assertion_gen_prompt:
    prompt: prompts/data_questions/assertions/local_claim_assertion_gen_prompt.txt
  global_assertion_map_prompt:
    prompt: prompts/data_questions/assertions/global_claim_assertion_map_prompt.txt
  global_assertion_reduce_prompt:
    prompt: prompts/data_questions/assertions/global_claim_assertion_reduce_prompt.txt
  local_validation_prompt:
    prompt: prompts/data_questions/assertions/local_validation_prompt.txt
  global_validation_prompt:
    prompt: prompts/data_questions/assertions/global_validation_prompt.txt
# .env file
OPENAI_API_KEY=your-secret-api-key-here

💡 Note: The api_key field uses an environment variable reference ${OPENAI_API_KEY}. Make sure to define this variable in a .env file or your environment before running the application.
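One simple way to resolve ${VAR} references like this when loading a configuration file is Python's os.path.expandvars, shown below as a sketch of the general technique (not necessarily how BenchmarkQED implements it; the placeholder value is hypothetical):

```python
import os

# Ensure the variable exists for this demo; in practice it comes
# from a .env file or the shell environment.
os.environ.setdefault("OPENAI_API_KEY", "your-secret-api-key-here")

raw_value = "${OPENAI_API_KEY}"           # as written in settings.yaml
resolved = os.path.expandvars(raw_value)  # substitutes the env var's value
print(resolved)
```

If the variable is undefined, os.path.expandvars leaves the `${...}` reference unchanged, which is why the variable must be set before running the application.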


Assertion Generation

Assertions are testable factual statements derived from extracted claims that can be used as "unit tests" to evaluate the accuracy of RAG system answers. Each question can have multiple assertions that verify specific facts the answer should contain.

How Assertions Work

  1. Claim Extraction: During question generation, claims (factual statements) are extracted from the source text.
  2. Assertion Generation: Claims are transformed into testable assertions with clear pass/fail criteria.
  3. Optional Validation: Assertions can be validated against source data to filter out low-quality assertions.

Assertion Types

  • Local Assertions: Generated for data_local questions from claims extracted from individual text chunks.
  • Global Assertions: Generated for data_global questions using a map-reduce approach across multiple source documents.
  • Linked Assertions: Generated for data_linked questions using a direct claim-to-assertion pipeline (no map-reduce).

Validation

When enable_validation is set to true, each assertion is scored on three criteria (1-5 scale):

| Criterion | Description |
| --- | --- |
| Grounding | Is the assertion factually supported by the source data? |
| Relevance | Is the assertion relevant to the question being asked? |
| Verifiability | Can the assertion be objectively verified from an answer? |

Assertions must meet the min_validation_score threshold on all three criteria to be included.
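This filtering rule reduces to keeping only assertions whose lowest criterion score clears the threshold. A minimal sketch (the score dictionaries are illustrative; the real pipeline obtains them from the validation LLM):

```python
def passes_validation(scores, min_validation_score=3):
    """scores: dict with 'grounding', 'relevance', 'verifiability' keys (1-5 each).

    An assertion passes only if ALL three criteria meet the threshold,
    i.e. its minimum score does.
    """
    return min(scores["grounding"], scores["relevance"], scores["verifiability"]) >= min_validation_score

assertions = [
    {"text": "A1", "scores": {"grounding": 5, "relevance": 4, "verifiability": 4}},
    {"text": "A2", "scores": {"grounding": 2, "relevance": 5, "verifiability": 5}},
]
kept = [a for a in assertions if passes_validation(a["scores"])]
print([a["text"] for a in kept])  # → ['A1'] (A2 fails on grounding)
```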

Controlling Assertion Limits

To disable assertion generation entirely, set max_assertions: 0 for each assertion type:

assertions:
  local:
    max_assertions: 0
  global:
    max_assertions: 0
  linked:
    max_assertions: 0

To generate unlimited assertions (no cap), set max_assertions: null:

assertions:
  local:
    max_assertions: null  # or omit to use default of 20
  global:
    max_assertions: null

Providing Prompts: File or Text

Prompts for question generation can be provided in two ways, as defined by the PromptConfig class:

  • As a file path: Specify the path to a .txt file containing the prompt (recommended for most use cases).
  • As direct text: Provide the prompt text directly in the configuration.

Only one of these options should be set for each prompt. If both are set, or neither is set, an error will be raised.

Example (File Path)

activity_questions_prompt_config:
  activity_local_gen_system_prompt:
    prompt: prompts/activity_questions/local/activity_local_gen_system_prompt.txt

Example (Direct Text)

activity_questions_prompt_config:
  activity_local_gen_system_prompt:
    prompt_text: |
      Generate a question about the following activity:

This applies to all prompt fields in QuestionGenerationConfig (including map/reduce, activity question generation, and data question generation prompt configs).

See the PromptConfig class for details.
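The mutual-exclusivity rule above can be sketched as a small resolver; the function name and signature here are illustrative, not the actual PromptConfig API:

```python
from pathlib import Path

def resolve_prompt(prompt=None, prompt_text=None):
    """Return the prompt text, enforcing that exactly one source is set.

    prompt: path to a .txt file containing the prompt.
    prompt_text: the prompt provided directly as a string.
    """
    # Both-set and neither-set are equally invalid.
    if (prompt is None) == (prompt_text is None):
        raise ValueError("Set exactly one of 'prompt' (file path) or 'prompt_text'.")
    if prompt_text is not None:
        return prompt_text
    return Path(prompt).read_text(encoding="utf-8")

print(resolve_prompt(prompt_text="Generate a question about the following activity:"))
```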


CLI Reference

This section documents the command-line interface of BenchmarkQED's AutoQ package.

autoq

Top-level command group for the AutoQ CLI.

Usage

autoq [OPTIONS] COMMAND [ARGS]...

Arguments

None (use one of the commands below).

Options

| Name | Description | Required | Default |
| --- | --- | --- | --- |
| `--install-completion` | Install completion for the current shell. | No | - |
| `--show-completion` | Show completion for the current shell, to copy it or customize the installation. | No | - |
| `--help` | Show this message and exit. | No | - |

Commands

| Name | Description |
| --- | --- |
| `autoq` | Generate questions from the input data. |
| `assertion-stats` | Generate statistics for assertion files. |
| `generate-assertions` | Generate assertions for existing questions. |

Sub Commands

autoq autoq

Generate questions from the input data.

Usage

autoq autoq [OPTIONS] CONFIGURATION_PATH OUTPUT_DATA_PATH

Arguments

| Name | Description | Required |
| --- | --- | --- |
| `CONFIGURATION_PATH` | The path to the file containing the configuration. | Yes |
| `OUTPUT_DATA_PATH` | The path to the output folder for the results. | Yes |

Options

| Name | Description | Required | Default |
| --- | --- | --- | --- |
| `--generation-types` | The source of the question generation: `data_local`, `data_global`, `data_linked`, `activity_local`, or `activity_global`. | No | - |
| `--print-model-usage` / `--no-print-model-usage` | Whether to print the model usage statistics after scoring. | No | `no-print-model-usage` |
| `--help` | Show this message and exit. | No | - |

autoq assertion-stats

Generate statistics for assertion files. Computes and saves statistics including:

  • Total assertions and questions
  • Assertions per question (mean, std, min, max)
  • Sources per assertion (mean, std, min, max)
  • Supporting assertions per global assertion (mean, std, min, max)
  • Score distribution

Examples


Generate stats for a single assertion file

benchmark-qed autoq assertion-stats output/assertions.json

Generate stats for all assertion files in a directory

benchmark-qed autoq assertion-stats output/data_global_questions/

Specify output path

benchmark-qed autoq assertion-stats assertions.json -o stats/my_stats.json

Usage

autoq assertion-stats [OPTIONS] ASSERTIONS_PATH

Arguments

| Name | Description | Required |
| --- | --- | --- |
| `ASSERTIONS_PATH` | Path to assertion JSON file or directory containing assertion files. | Yes |

Options

| Name | Description | Required | Default |
| --- | --- | --- | --- |
| `-o, --output PATH` | Path to save stats JSON. If not specified, saves as `{input}_stats.json`. | No | - |
| `-t, --type TEXT` | Type of assertions: 'global', 'map', or 'local'. If not specified, inferred from filename. | No | - |
| `-q, --quiet` | Suppress printing stats to console. | No | - |
| `--help` | Show this message and exit. | No | - |

autoq generate-assertions

Generate assertions for existing questions. This command loads questions from a JSON file and generates assertions using the specified assertion type configuration from settings.yaml.

Examples


Generate local assertions for candidate questions

benchmark-qed autoq generate-assertions settings.yaml \
    output/data_local_questions/candidate_questions.json \
    output/data_local_questions/ --type local

Generate global assertions

benchmark-qed autoq generate-assertions settings.yaml \
    output/data_global_questions/candidate_questions.json \
    output/data_global_questions/ --type global

Generate linked assertions

benchmark-qed autoq generate-assertions settings.yaml \
    output/data_linked_questions/candidate_questions.json \
    output/data_linked_questions/ --type linked

Usage

autoq generate-assertions [OPTIONS] CONFIGURATION_PATH QUESTIONS_PATH OUTPUT_PATH

Arguments

| Name | Description | Required |
| --- | --- | --- |
| `CONFIGURATION_PATH` | Path to the settings.yaml configuration file. | Yes |
| `QUESTIONS_PATH` | Path to questions JSON file (e.g., candidate_questions.json). | Yes |
| `OUTPUT_PATH` | Output directory to save questions with assertions. | Yes |

Options

| Name | Description | Required | Default |
| --- | --- | --- | --- |
| `-t, --type` | Type of assertions to generate: `local`, `global`, or `linked`. | No | `local` |
| `--print-model-usage` / `--no-print-model-usage` | Whether to print the model usage statistics. | No | `no-print-model-usage` |
| `--help` | Show this message and exit. | No | - |