# Pairwise Scoring Configuration
This section describes the configuration schema for performing relative comparisons of RAG methods using the LLM-as-a-Judge approach. It includes definitions for conditions, evaluation criteria, and model configuration. For more information about how to configure the LLM, please refer to: LLM Configuration
To create a template configuration file, use the `config init` command; for details, refer to: Config Init CLI. To perform pairwise scoring with your configuration file, use the `autoe pairwise-scores` command described in the CLI Reference below.
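As an illustrative sketch (the configuration and output paths below are placeholders), a pairwise scoring run follows the `autoe pairwise-scores` usage documented in the CLI Reference:

```sh
# Illustrative only: score the conditions defined in your pairwise configuration file.
# Replace the paths with your own configuration file and output location.
autoe pairwise-scores pairwise_config.yaml output/pairwise_scores
```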
## Classes and Fields

### Condition

Represents a condition to be evaluated.

| Field | Type | Description |
|---|---|---|
| `name` | `str` | Name of the condition. |
| `answer_base_path` | `Path` | Path to the JSON file containing the answers for this condition. |
### Criteria

Defines a scoring criterion used to evaluate conditions.

| Field | Type | Description |
|---|---|---|
| `name` | `str` | Name of the criterion. |
| `description` | `str` | Detailed explanation of what the criterion means and how to apply it. |
### PairwiseConfig

Top-level configuration for scoring a set of conditions.

| Field | Type | Default | Description |
|---|---|---|---|
| `base` | `Condition \| None` | `None` | The base condition to compare others against. |
| `others` | `list[Condition]` | `[]` | List of other conditions to compare. |
| `question_sets` | `list[str]` | `[]` | List of question sets to use for scoring. |
| `criteria` | `list[Criteria]` | `pairwise_scores_criteria()` | List of criteria to use for scoring. |
| `llm_config` | `LLMConfig` | `LLMConfig()` | Configuration for the LLM used in scoring. |
| `trials` | `int` | `4` | Number of trials to run for each condition. |
## YAML Example
Below is an example showing how this configuration might be represented in a YAML file. The API key is referenced using an environment variable.
```yaml
base:
  name: vector_rag
  answer_base_path: input/vector_rag
others:
  - name: lazygraphrag
    answer_base_path: input/lazygraphrag
  - name: graphrag_global
    answer_base_path: input/graphrag_global
question_sets:
  - activity_global
  - activity_local
# Optional: Custom Evaluation Criteria
# You may define your own list of evaluation criteria here. If this section is omitted, the default criteria will be used.
# criteria:
#   - name: "criteria name"
#     description: "criteria description"
trials: 4
llm_config:
  auth_type: api_key
  model: gpt-4.1
  api_key: ${OPENAI_API_KEY}
  llm_provider: openai.chat
  concurrent_requests: 20
```
💡 **Note:** The `api_key` field uses an environment variable reference `${OPENAI_API_KEY}`. Make sure to define this variable in a `.env` file or your environment before running the application.
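For example, a minimal `.env` file needs nothing more than the key itself (the value below is a placeholder):

```sh
# .env — placeholder value; substitute your real OpenAI API key
OPENAI_API_KEY=sk-your-key-here
```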
## Providing Prompts: File or Text

Prompts for pairwise and reference-based scoring can be provided in two ways, as defined by the `PromptConfig` class:

- **As a file path:** Specify the path to a `.txt` file containing the prompt (recommended for most use cases).
- **As direct text:** Provide the prompt text directly in the configuration.
Only one of these options should be set for each prompt. If both are set, or neither is set, an error will be raised.
### Example (File Path)

```yaml
prompt_config:
  user_prompt:
    prompt: prompts/pairwise_user_prompt.txt
  system_prompt:
    prompt: prompts/pairwise_system_prompt.txt
```
### Example (Direct Text)

```yaml
prompt_config:
  user_prompt:
    prompt_text: |
      Please compare the following answers and select the better one.
  system_prompt:
    prompt_text: |
      You are an expert judge for answer quality.
```
This applies to both `PairwiseConfig` and `ReferenceConfig`. See the `PromptConfig` class for details.
# Reference-Based Scoring Configuration
This section explains how to configure reference-based scoring, where generated answers are evaluated against a reference set using the LLM-as-a-Judge approach. It covers the definitions for reference and generated conditions, scoring criteria, and model configuration. For details on LLM configuration, see: LLM Configuration
To create a template configuration file, use the `config init` command; for details, see: Config Init CLI. To perform reference-based scoring with your configuration file, use the `autoe reference-scores` command described in the CLI Reference below.
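As an illustrative sketch (placeholder paths), a reference-based scoring run looks like:

```sh
# Illustrative only: score generated answers against the reference condition.
autoe reference-scores reference_config.yaml output/reference_scores
```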
## Classes and Fields

### Condition

Represents a condition to be evaluated.

| Field | Type | Description |
|---|---|---|
| `name` | `str` | Name of the condition. |
| `answer_base_path` | `Path` | Path to the JSON file containing the answers for this condition. |
### Criteria

Defines a scoring criterion used to evaluate conditions.

| Field | Type | Description |
|---|---|---|
| `name` | `str` | Name of the criterion. |
| `description` | `str` | Detailed explanation of what the criterion means and how to apply it. |
### ReferenceConfig

Top-level configuration for scoring generated answers against a reference.

| Field | Type | Default | Description |
|---|---|---|---|
| `reference` | `Condition` | required | The condition containing the reference answers. |
| `generated` | `list[Condition]` | `[]` | List of conditions with generated answers to be scored. |
| `criteria` | `list[Criteria]` | `reference_scores_criteria()` | List of criteria to use for scoring. |
| `score_min` | `int` | `1` | Minimum score for each criterion. |
| `score_max` | `int` | `10` | Maximum score for each criterion. |
| `llm_config` | `LLMConfig` | `LLMConfig()` | Configuration for the LLM used in scoring. |
| `trials` | `int` | `4` | Number of trials to run for each condition. |
## YAML Example
Below is an example of how this configuration might be represented in a YAML file. The API key is referenced using an environment variable.
```yaml
reference:
  name: lazygraphrag
  answer_base_path: input/lazygraphrag/activity_global.json
generated:
  - name: vector_rag
    answer_base_path: input/vector_rag/activity_global.json
# Scoring scale
score_min: 1
score_max: 10
# Optional: Custom Evaluation Criteria
# You may define your own list of evaluation criteria here. If this section is omitted, the default criteria will be used.
# criteria:
#   - name: "criteria name"
#     description: "criteria description"
trials: 4
llm_config:
  model: "gpt-4.1"
  auth_type: "api_key"
  api_key: ${OPENAI_API_KEY}
  concurrent_requests: 4
  llm_provider: "openai.chat"
  init_args: {}
  call_args:
    temperature: 0.0
    seed: 42
```
💡 **Note:** The `api_key` field uses an environment variable reference `${OPENAI_API_KEY}`. Make sure to define this variable in a `.env` file or your environment before running the application.
# Assertion-Based Scoring Configuration
This section describes the configuration schema for evaluating generated answers against predefined assertions using the LLM-as-a-Judge approach. It includes definitions for generated conditions, assertions, and model configuration. For more information about how to configure the LLM, please refer to: LLM Configuration
To create a template configuration file, use the `config init` command; for details, refer to: Config Init CLI. To perform assertion-based scoring with your configuration file, use the `autoe assertion-scores` command described in the CLI Reference below.
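As an illustrative sketch (placeholder paths), an assertion-based scoring run looks like:

```sh
# Illustrative only: evaluate generated answers against the assertions file.
autoe assertion-scores assertion_config.yaml output/assertion_scores
```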
## Classes and Fields

### Condition

Represents a condition to be evaluated.

| Field | Type | Description |
|---|---|---|
| `name` | `str` | Name of the condition. |
| `answer_base_path` | `Path` | Path to the JSON file containing the answers for this condition. |
### Assertions

Defines the assertions to be evaluated.

| Field | Type | Description |
|---|---|---|
| `assertions_path` | `Path` | Path to the JSON file containing the assertions to evaluate. |
### AssertionConfig

Top-level configuration for scoring generated answers against assertions.

| Field | Type | Default | Description |
|---|---|---|---|
| `generated` | `Condition` | required | The condition containing the generated answers to be evaluated. |
| `assertions` | `Assertions` | required | The assertions to use for evaluation. |
| `pass_threshold` | `float` | `0.5` | Threshold for passing the assertion score. |
| `llm_config` | `LLMConfig` | `LLMConfig()` | Configuration for the LLM used in scoring. |
| `trials` | `int` | `4` | Number of trials to run for each assertion. |
## YAML Example
Below is an example showing how this configuration might be represented in a YAML file. The API key is referenced using an environment variable.
```yaml
generated:
  name: vector_rag
  answer_base_path: input/vector_rag/activity_global.json
assertions:
  assertions_path: input/assertions.json
# Pass threshold for assertions
pass_threshold: 0.5
trials: 4
llm_config:
  auth_type: api_key
  model: gpt-4.1
  api_key: ${OPENAI_API_KEY}
  llm_provider: openai.chat
  concurrent_requests: 20
```
💡 **Note:** The `api_key` field uses an environment variable reference `${OPENAI_API_KEY}`. Make sure to define this variable in a `.env` file or your environment before running the application.

📋 Assertions JSON example:

```json
[
  {
    "question_id": "abc123",
    "question_text": "What is the capital of France?",
    "assertions": [
      "The response should align with the following ground truth text: Paris is the capital of France.",
      "The response should be concise and directly answer the question and do not add any additional information."
    ]
  }
]
```
# CLI Reference

This section documents the command-line interface of BenchmarkQED's AutoE package.
## autoe

Evaluate Retrieval-Augmented Generation (RAG) methods.

### Usage

```
autoe [OPTIONS] COMMAND [ARGS]...
```
### Arguments

No arguments available.

### Options

| Name | Description | Required | Default |
|---|---|---|---|
| `--install-completion` | Install completion for the current shell. | No | - |
| `--show-completion` | Show completion for the current shell, to copy it or customize the installation. | No | - |
| `--help` | Show this message and exit. | No | - |
### Commands

| Name | Description |
|---|---|
| `pairwise-scores` | Generate scores for the different conditions provided in the JSON file. |
| `reference-scores` | Generate scores for the generated answers provided in the JSON file. |
| `assertion-scores` | Generate assertion scores for the generated answers provided in the JSON file. |
## Sub Commands

### autoe pairwise-scores

Generate scores for the different conditions provided in the JSON file.

#### Usage

```
autoe pairwise-scores [OPTIONS] COMPARISON_SPEC OUTPUT
```
#### Arguments

| Name | Description | Required |
|---|---|---|
| `COMPARISON_SPEC` | The path to the JSON file containing the conditions. | Yes |
| `OUTPUT` | The path to the output file for the scores. | Yes |
#### Options

| Name | Description | Required | Default |
|---|---|---|---|
| `--alpha FLOAT` | The p-value threshold for the significance test. | No | `0.05` |
| `--exclude-criteria TEXT` | The criteria to exclude from the scoring. | No | - |
| `--print-model-usage / --no-print-model-usage` | Whether to print the model usage statistics after scoring. | No | `no-print-model-usage` |
| `--include-score-id-in-prompt / --no-include-score-id-in-prompt` | Whether to include the score ID in the evaluation prompt for the LLM (might be useful to avoid cached scores). | No | `include-score-id-in-prompt` |
| `--question-id-key TEXT` | The key in the JSON file that contains the question ID. This is used to match questions across different conditions. | No | `question_id` |
| `--help` | Show this message and exit. | No | - |
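For example, a run that tightens the significance threshold and reports model usage might look like this (paths are placeholders):

```sh
autoe pairwise-scores pairwise_config.yaml output/pairwise_scores \
  --alpha 0.01 \
  --print-model-usage
```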
### autoe reference-scores

Generate scores for the generated answers provided in the JSON file.

#### Usage

```
autoe reference-scores [OPTIONS] COMPARISON_SPEC OUTPUT
```
#### Arguments

| Name | Description | Required |
|---|---|---|
| `COMPARISON_SPEC` | The path to the JSON file containing the configuration. | Yes |
| `OUTPUT` | The path to the output file for the scores. | Yes |
#### Options

| Name | Description | Required | Default |
|---|---|---|---|
| `--exclude-criteria TEXT` | The criteria to exclude from the scoring. | No | - |
| `--print-model-usage / --no-print-model-usage` | Whether to print the model usage statistics after scoring. | No | `no-print-model-usage` |
| `--include-score-id-in-prompt / --no-include-score-id-in-prompt` | Whether to include the score ID in the evaluation prompt for the LLM (might be useful to avoid cached scores). | No | `include-score-id-in-prompt` |
| `--question-id-key TEXT` | The key in the JSON file that contains the question ID. This is used to match questions across different conditions. | No | `question_id` |
| `--help` | Show this message and exit. | No | - |
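For example, to exclude a criterion from scoring and print model usage afterwards (the paths and criterion name are placeholders):

```sh
autoe reference-scores reference_config.yaml output/reference_scores \
  --exclude-criteria "conciseness" \
  --print-model-usage
```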
### autoe assertion-scores

Generate assertion scores for the generated answers provided in the JSON file.

#### Usage

```
autoe assertion-scores [OPTIONS] COMPARISON_SPEC OUTPUT
```
#### Arguments

| Name | Description | Required |
|---|---|---|
| `COMPARISON_SPEC` | The path to the JSON file containing the configuration. | Yes |
| `OUTPUT` | The path to the output file for the scores. | Yes |
#### Options

| Name | Description | Required | Default |
|---|---|---|---|
| `--print-model-usage / --no-print-model-usage` | Whether to print the model usage statistics after scoring. | No | `no-print-model-usage` |
| `--include-score-id-in-prompt / --no-include-score-id-in-prompt` | Whether to include the score ID in the evaluation prompt for the LLM (might be useful to avoid cached scores). | No | `include-score-id-in-prompt` |
| `--question-id-key TEXT` | The key in the JSON file that contains the question ID. This is used to match questions with assertions. | No | `question_id` |
| `--question-text-key TEXT` | The key in the JSON file that contains the question text. | No | `question_text` |
| `--answer-text-key TEXT` | The key in the JSON file that contains the answer text. | No | `answer` |
| `--assertions-key TEXT` | The key in the JSON file that contains the assertions. This should be a list of assertions for each question. | No | `assertions` |
| `--help` | Show this message and exit. | No | - |
Show this message and exit. | No | - |