Question Generation Configuration

This section provides an overview of the configuration schema for the question generation process, covering input data, sampling, encoding, and model settings. For details on configuring the LLM, see: LLM Configuration.

To create a template configuration file, run:

benchmark-qed config init autoq local/autoq_test/settings.yaml

To generate synthetic queries using your configuration file, run:

benchmark-qed autoq local/autoq_test/settings.yaml local/autoq_test/output

For more information about the config init command, see: Config Init CLI.


Classes and Fields

InputConfig

Configuration for the input data used in question generation.

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| dataset_path | Path | required | Path to the input dataset file. |
| input_type | InputDataType | CSV | The type of the input data (e.g., CSV, JSON). |
| text_column | str | "text" | The column containing the text data. |
| metadata_columns | list[str] \| None | None | Optional list of columns containing metadata. |
| file_encoding | str | "utf-8" | Encoding of the input file. |
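
For example, a CSV dataset whose text lives in a custom column might be configured as follows (the path and column names here are placeholders for your own data, not defaults):

input:
  dataset_path: ./data/articles.csv
  input_type: csv
  text_column: article_text
  metadata_columns: [title, published_date]
  file_encoding: utf-8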

QuestionConfig

Configuration for generating standard questions.

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| num_questions | int | 20 | Number of questions to generate per class. |
| oversample_factor | float | 2.0 | Factor by which to overgenerate candidate questions before filtering. |
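
For example, with the defaults num_questions: 20 and oversample_factor: 2.0, roughly 20 × 2.0 = 40 candidate questions are generated per class and then filtered down to the final 20.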

ActivityQuestionConfig

Extends QuestionConfig with additional fields for persona-based question generation.

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| num_personas | int | 5 | Number of personas to generate questions for. |
| num_tasks_per_persona | int | 5 | Number of tasks per persona. |
| num_entities_per_task | int | 10 | Number of entities per task. |
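
With the defaults above, generation is seeded from 5 personas × 5 tasks = 25 persona–task combinations, each drawing on up to 10 entities; the inherited num_questions and oversample_factor fields then control how many candidate questions are generated and kept.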

EncodingModelConfig

Configuration for the encoding model used to chunk documents.

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| model_name | str | "o200k_base" | Name of the encoding model. |
| chunk_size | int | 600 | Size of each text chunk. |
| chunk_overlap | int | 100 | Overlap between consecutive chunks. |
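
o200k_base is a tiktoken encoding name, so chunk_size and chunk_overlap are expressed in tokens rather than characters. With the defaults, consecutive chunks advance by 600 − 100 = 500 tokens, so each 100-token boundary region is shared by two adjacent chunks.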

SamplingConfig

Configuration for sampling data from clusters.

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| num_clusters | int | 50 | Number of clusters to sample from. |
| num_samples_per_cluster | int | 10 | Number of samples per cluster. |
| random_seed | int | 42 | Seed for reproducibility. |
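
With the defaults, sampling yields up to 50 × 10 = 500 chunks as input to question generation, and the fixed random_seed makes the selection reproducible across runs.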

QuestionGenerationConfig

Top-level configuration for the entire question generation process.

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| input | InputConfig | required | Input data configuration. |
| data_local | QuestionConfig | QuestionConfig() | Local data question generation settings. |
| data_global | QuestionConfig | QuestionConfig() | Global data question generation settings. |
| activity_local | ActivityQuestionConfig | ActivityQuestionConfig() | Local activity question generation settings. |
| activity_global | ActivityQuestionConfig | ActivityQuestionConfig() | Global activity question generation settings. |
| concurrent_requests | int | 8 | Number of concurrent model requests. |
| encoding | EncodingModelConfig | EncodingModelConfig() | Encoding model configuration. |
| sampling | SamplingConfig | SamplingConfig() | Sampling configuration. |
| chat_model | LLMConfig | LLMConfig() | LLM configuration for chat. |
| embedding_model | LLMConfig | LLMConfig() | LLM configuration for embeddings. |
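
Since input is the only required field, a minimal configuration can rely on the defaults for everything else (in practice you will usually also configure chat_model and embedding_model, as in the full example below):

input:
  dataset_path: ./input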

YAML Example

Here is an example of how this configuration might look in a YAML file.

## Input Configuration
input:
  dataset_path: ./input
  input_type: json
  text_column: body_nitf # The column in the dataset that contains the text to be processed. Modify this for your dataset
  metadata_columns: [headline, firstcreated] # Additional metadata columns to include in the input. Modify this for your dataset
  file_encoding: utf-8-sig

## Encoder configuration
encoding:
  model_name: o200k_base
  chunk_size: 600
  chunk_overlap: 100

## Sampling Configuration
sampling:
  num_clusters: 20
  num_samples_per_cluster: 10
  random_seed: 42

## LLM Configuration
chat_model:
  auth_type: api_key
  model: gpt-4.1
  api_key: ${OPENAI_API_KEY}
  llm_provider: openai.chat
embedding_model:
  auth_type: api_key
  model: text-embedding-3-large
  api_key: ${OPENAI_API_KEY}
  llm_provider: openai.embedding

## Question Generation Sample Configuration
data_local:
  num_questions: 10
  oversample_factor: 2.0
data_global:
  num_questions: 10
  oversample_factor: 2.0
activity_local:
  num_questions: 10
  oversample_factor: 2.0
  num_personas: 5
  num_tasks_per_persona: 2
  num_entities_per_task: 5
activity_global:
  num_questions: 10
  oversample_factor: 2.0
  num_personas: 5
  num_tasks_per_persona: 2
  num_entities_per_task: 5

And in a separate .env file:

# .env file
OPENAI_API_KEY=your-secret-api-key-here

💡 Note: The api_key field uses an environment variable reference ${OPENAI_API_KEY}. Make sure to define this variable in a .env file or your environment before running the application.

Providing Prompts: File or Text

Prompts for question generation can be provided in two ways, as defined by the PromptConfig class:

  • As a file path: Specify the path to a .txt file containing the prompt (recommended for most use cases).
  • As direct text: Provide the prompt text directly in the configuration.

Only one of these options should be set for each prompt. If both are set, or neither is set, an error will be raised.

Example (File Path)

activity_questions_prompt_config:
  activity_local_gen_system_prompt:
    prompt: prompts/activity_questions/local/activity_local_gen_system_prompt.txt

Example (Direct Text)

activity_questions_prompt_config:
  activity_local_gen_system_prompt:
    prompt_text: |
      Generate a question about the following activity:

This applies to all prompt fields in QuestionGenerationConfig (including map/reduce, activity question generation, and data question generation prompt configs).

See the PromptConfig class for details.


CLI Reference

This section documents the command-line interface of BenchmarkQED's AutoQ package.

autoq

Generate questions from the input data.

Usage

autoq [OPTIONS] CONFIGURATION_PATH OUTPUT_DATA_PATH

Arguments

| Name | Description | Required |
|------|-------------|----------|
| CONFIGURATION_PATH | The path to the file containing the configuration. | Yes |
| OUTPUT_DATA_PATH | The path to the output folder for the results. | Yes |

Options

| Name | Description | Required | Default |
|------|-------------|----------|---------|
| --generation-types [data_local\|data_global\|activity_local\|activity_global] | The source of the question generation. | No | - |
| --print-model-usage / --no-print-model-usage | Whether to print the model usage statistics after scoring. | No | no-print-model-usage |
| --install-completion | Install completion for the current shell. | No | - |
| --show-completion | Show completion for the current shell, to copy it or customize the installation. | No | - |
| --help | Show this message and exit. | No | - |
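
For example, to generate only local data questions and print model usage statistics afterwards (the paths here are illustrative):

benchmark-qed autoq local/autoq_test/settings.yaml local/autoq_test/output --generation-types data_local --print-model-usage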

Commands

No commands available