Question Generation Configuration
This section provides an overview of the configuration schema for the question generation process, covering input data, sampling, encoding, and model settings. For details on configuring the LLM, see: LLM Configuration.
To create a template configuration file, run:
To generate synthetic queries using your configuration file, run:
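```bash
# Minimal invocation (paths are placeholders); see the CLI Reference below for all options.
autoq ./settings.yaml ./output
```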
For more information about the config init command, see: Config Init CLI
Classes and Fields
InputConfig
Configuration for the input data used in question generation.
| Field | Type | Default | Description |
|---|---|---|---|
| `dataset_path` | `Path` | required | Path to the input dataset file. |
| `input_type` | `InputDataType` | `CSV` | The type of the input data (e.g., CSV, JSON). |
| `text_column` | `str` | `"text"` | The column containing the text data. |
| `metadata_columns` | `list[str] \| None` | `None` | Optional list of columns containing metadata. |
| `file_encoding` | `str` | `"utf-8"` | Encoding of the input file. |
QuestionConfig
Configuration for generating standard questions.
| Field | Type | Default | Description |
|---|---|---|---|
| `num_questions` | `int` | `20` | Number of questions to generate per class. |
| `oversample_factor` | `float` | `2.0` | Factor to overgenerate questions before filtering. |
ActivityQuestionConfig
Extends QuestionConfig with additional fields for persona-based question generation.
| Field | Type | Default | Description |
|---|---|---|---|
| `num_personas` | `int` | `5` | Number of personas to generate questions for. |
| `num_tasks_per_persona` | `int` | `5` | Number of tasks per persona. |
| `num_entities_per_task` | `int` | `10` | Number of entities per task. |
EncodingModelConfig
Configuration for the encoding model used to chunk documents.
| Field | Type | Default | Description |
|---|---|---|---|
| `model_name` | `str` | `"o200k_base"` | Name of the encoding model. |
| `chunk_size` | `int` | `600` | Size of each text chunk. |
| `chunk_overlap` | `int` | `100` | Overlap between consecutive chunks. |
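The `model_name` refers to a tokenizer encoding (`o200k_base` is a tiktoken encoding), and `chunk_size` and `chunk_overlap` are assumed to be measured in tokens of that encoding. The sketch below is illustrative only and is not AutoQ's internal chunker; it simply shows how overlapping token windows of this shape can be produced:

```python
# Illustrative sketch of token-window chunking with the "o200k_base" encoding.
# NOT AutoQ's internal chunker; it only shows how chunk_size and chunk_overlap
# are assumed to interact.
import tiktoken

def chunk_text(text: str, chunk_size: int = 600, chunk_overlap: int = 100) -> list[str]:
    enc = tiktoken.get_encoding("o200k_base")
    tokens = enc.encode(text)
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks: list[str] = []
    for start in range(0, len(tokens), step):
        chunks.append(enc.decode(tokens[start : start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break
    return chunks
```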
SamplingConfig
Configuration for sampling data from clusters.
| Field | Type | Default | Description |
|---|---|---|---|
| `num_clusters` | `int` | `50` | Number of clusters to sample from. |
| `num_samples_per_cluster` | `int` | `10` | Number of samples per cluster. |
| `random_seed` | `int` | `42` | Seed for reproducibility. |
AssertionConfig
Configuration for assertion generation with separate settings for local and global questions.
| Field | Type | Default | Description |
|---|---|---|---|
| `local` | `LocalAssertionConfig` | (see below) | Configuration for local assertion generation. |
| `global` | `GlobalAssertionConfig` | (see below) | Configuration for global assertion generation. |
LocalAssertionConfig
Configuration for local assertion generation.
| Field | Type | Default | Description |
|---|---|---|---|
| `max_assertions` | `int \| None` | `20` | Maximum assertions per question. Set to 0 to disable, or None for unlimited. |
| `enable_validation` | `bool` | `True` | Whether to validate assertions against source data. |
| `min_validation_score` | `int` | `3` | Minimum score (1-5) for grounding, relevance, and verifiability. |
| `concurrent_llm_calls` | `int` | `8` | Concurrent LLM calls for validation. |
| `max_concurrent_questions` | `int \| None` | `8` | Questions to process in parallel. Set to 1 for sequential. |
GlobalAssertionConfig
Configuration for global assertion generation.
| Field | Type | Default | Description |
|---|---|---|---|
| `max_assertions` | `int \| None` | `20` | Maximum assertions per question. Set to 0 to disable, or None for unlimited. |
| `enable_validation` | `bool` | `True` | Whether to validate assertions against source data. |
| `min_validation_score` | `int` | `3` | Minimum score (1-5) for grounding, relevance, and verifiability. |
| `batch_size` | `int` | `50` | Batch size for map-reduce claim processing. |
| `max_data_tokens` | `int` | `32000` | Maximum input tokens for the reduce step. |
| `concurrent_llm_calls` | `int` | `8` | Concurrent LLM calls for batch processing and validation. |
| `max_concurrent_questions` | `int \| None` | `2` | Questions to process in parallel. Set to 1 for sequential. |
AssertionPromptConfig
Configuration for assertion generation prompts. Each prompt can be specified as a file path or direct text.
| Field | Type | Default | Description |
|---|---|---|---|
| `local_assertion_gen_prompt` | `PromptConfig` | (default file) | Prompt for generating assertions from local claims. |
| `global_assertion_map_prompt` | `PromptConfig` | (default file) | Prompt for the map step in global assertion generation. |
| `global_assertion_reduce_prompt` | `PromptConfig` | (default file) | Prompt for the reduce step in global assertion generation. |
| `local_validation_prompt` | `PromptConfig` | (default file) | Prompt for validating local assertions (fact-focused) against source data. |
| `global_validation_prompt` | `PromptConfig` | (default file) | Prompt for validating global assertions (theme-focused) against source data. |
QuestionGenerationConfig
Top-level configuration for the entire question generation process.
| Field | Type | Default | Description |
|---|---|---|---|
| `input` | `InputConfig` | required | Input data configuration. |
| `data_local` | `QuestionConfig` | `QuestionConfig()` | Local data question generation settings. |
| `data_global` | `QuestionConfig` | `QuestionConfig()` | Global data question generation settings. |
| `activity_local` | `ActivityQuestionConfig` | `ActivityQuestionConfig()` | Local activity question generation. |
| `activity_global` | `ActivityQuestionConfig` | `ActivityQuestionConfig()` | Global activity question generation. |
| `concurrent_requests` | `int` | `8` | Number of concurrent model requests. |
| `encoding` | `EncodingModelConfig` | `EncodingModelConfig()` | Encoding model configuration. |
| `sampling` | `SamplingConfig` | `SamplingConfig()` | Sampling configuration. |
| `chat_model` | `LLMConfig` | `LLMConfig()` | LLM configuration for chat. |
| `embedding_model` | `LLMConfig` | `LLMConfig()` | LLM configuration for embeddings. |
| `assertions` | `AssertionConfig` | `AssertionConfig()` | Assertion generation configuration. |
| `assertion_prompts` | `AssertionPromptConfig` | `AssertionPromptConfig()` | Assertion prompt configuration. |
YAML Example
Here is an example of how this configuration might look in a YAML file.
```yaml
## Input Configuration
input:
  dataset_path: ./input
  input_type: json
  text_column: body_nitf # The column in the dataset that contains the text to be processed. Modify this for your dataset
  metadata_columns: [headline, firstcreated] # Additional metadata columns to include in the input. Modify this for your dataset
  file_encoding: utf-8-sig

## Encoder configuration
encoding:
  model_name: o200k_base
  chunk_size: 600
  chunk_overlap: 100

## Sampling Configuration
sampling:
  num_clusters: 20
  num_samples_per_cluster: 10
  random_seed: 42

## LLM Configuration
chat_model:
  auth_type: api_key
  model: gpt-4.1
  api_key: ${OPENAI_API_KEY}
  llm_provider: openai.chat

embedding_model:
  auth_type: api_key
  model: text-embedding-3-large
  api_key: ${OPENAI_API_KEY}
  llm_provider: openai.embedding

## Question Generation Sample Configuration
data_local:
  num_questions: 10
  oversample_factor: 2.0

data_global:
  num_questions: 10
  oversample_factor: 2.0

activity_local:
  num_questions: 10
  oversample_factor: 2.0
  num_personas: 5
  num_tasks_per_persona: 2
  num_entities_per_task: 5

activity_global:
  num_questions: 10
  oversample_factor: 2.0
  num_personas: 5
  num_tasks_per_persona: 2
  num_entities_per_task: 5

## Assertion Generation Configuration
assertions:
  local:
    max_assertions: 20 # Set to 0 to disable, or null/None for unlimited
    enable_validation: true # Enable to filter low-quality assertions
    min_validation_score: 3 # Minimum score (1-5) to pass validation
    concurrent_llm_calls: 8 # Concurrent LLM calls for validation
    max_concurrent_questions: 8 # Parallel questions for assertion generation. Set to 1 for sequential.
  global:
    max_assertions: 20
    enable_validation: true
    min_validation_score: 3
    batch_size: 50 # Batch size for map-reduce processing
    max_data_tokens: 32000 # Max tokens for reduce step
    concurrent_llm_calls: 8 # Concurrent LLM calls for batch processing/validation
    max_concurrent_questions: 2 # Parallel questions for assertion generation. Set to 1 for sequential.

assertion_prompts:
  local_assertion_gen_prompt:
    prompt: prompts/data_questions/assertions/local_claim_assertion_gen_prompt.txt
  global_assertion_map_prompt:
    prompt: prompts/data_questions/assertions/global_claim_assertion_map_prompt.txt
  global_assertion_reduce_prompt:
    prompt: prompts/data_questions/assertions/global_claim_assertion_reduce_prompt.txt
  local_validation_prompt:
    prompt: prompts/data_questions/assertions/local_validation_prompt.txt
  global_validation_prompt:
    prompt: prompts/data_questions/assertions/global_validation_prompt.txt
```
💡 Note: The `api_key` field uses an environment variable reference `${OPENAI_API_KEY}`. Make sure to define this variable in a `.env` file or your environment before running the application.
Assertion Generation
Assertions are testable factual statements derived from extracted claims that can be used as "unit tests" to evaluate the accuracy of RAG system answers. Each question can have multiple assertions that verify specific facts the answer should contain.
How Assertions Work
- Claim Extraction: During question generation, claims (factual statements) are extracted from the source text.
- Assertion Generation: Claims are transformed into testable assertions with clear pass/fail criteria (see the illustration after this list).
- Optional Validation: Assertions can be validated against source data to filter out low-quality assertions.
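For illustration only, a claim extracted from a source chunk might be turned into an assertion along these lines (a hypothetical example; the field names are not AutoQ's actual output schema):

```yaml
# Hypothetical illustration of the claim-to-assertion transformation;
# not AutoQ's actual output format.
claim: "Contoso reported a 12% increase in quarterly revenue in Q3 2024."
assertion: "The answer states that Contoso's quarterly revenue grew by 12% in Q3 2024."
```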
Assertion Types
- Local Assertions: Generated for `data_local` questions from claims extracted from individual text chunks.
- Global Assertions: Generated for `data_global` questions using a map-reduce approach across multiple source documents.
Validation
When enable_validation is set to true, each assertion is scored on three criteria (1-5 scale):
| Criterion | Description |
|---|---|
| Grounding | Is the assertion factually supported by the source data? |
| Relevance | Is the assertion relevant to the question being asked? |
| Verifiability | Can the assertion be objectively verified from an answer? |
Assertions must meet the min_validation_score threshold on all three criteria to be included.
Controlling Assertion Limits
To disable assertion generation entirely, set max_assertions: 0 for both local and global:
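```yaml
assertions:
  local:
    max_assertions: 0
  global:
    max_assertions: 0
```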
To generate unlimited assertions (no cap), set max_assertions: null:
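```yaml
assertions:
  local:
    max_assertions: null
  global:
    max_assertions: null
```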
Providing Prompts: File or Text
Prompts for question generation can be provided in two ways, as defined by the PromptConfig class:
- As a file path: Specify the path to a `.txt` file containing the prompt (recommended for most use cases).
- As direct text: Provide the prompt text directly in the configuration.
Only one of these options should be set for each prompt. If both are set, or neither is set, an error will be raised.
Example (File Path)
```yaml
activity_questions_prompt_config:
  activity_local_gen_system_prompt:
    prompt: prompts/activity_questions/local/activity_local_gen_system_prompt.txt
```
Example (Direct Text)
```yaml
activity_questions_prompt_config:
  activity_local_gen_system_prompt:
    prompt_text: |
      Generate a question about the following activity:
```
This applies to all prompt fields in QuestionGenerationConfig (including map/reduce, activity question generation, and data question generation prompt configs).
See the PromptConfig class for details.
CLI Reference
This section documents the command-line interface of BenchmarkQED's AutoQ package.
autoq
Generate questions from the input data.
Usage
```
autoq [OPTIONS] CONFIGURATION_PATH OUTPUT_DATA_PATH
```
Arguments
| Name | Description | Required |
|---|---|---|
| `CONFIGURATION_PATH` | The path to the file containing the configuration. | Yes |
| `OUTPUT_DATA_PATH` | The path to the output folder for the results. | Yes |
Options
| Name | Description | Required | Default |
|---|---|---|---|
| `--generation-types [data_local\|data_global\|activity_local\|activity_global]` | The source of the question generation. | No | - |
| `--print-model-usage / --no-print-model-usage` | Whether to print the model usage statistics after scoring. | No | `no-print-model-usage` |
| `--install-completion` | Install completion for the current shell. | No | - |
| `--show-completion` | Show completion for the current shell, to copy it or customize the installation. | No | - |
| `--help` | Show this message and exit. | No | - |
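For example, the following invocation restricts generation to `data_local` questions and prints model usage statistics (the configuration and output paths are placeholders):

```bash
autoq ./settings.yaml ./output --generation-types data_local --print-model-usage
```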
Commands
No commands available