Assertion Generation for Existing Questions¶
This notebook demonstrates how to generate assertions for existing data-local and data-global questions that were originally created without them (e.g., with `max_assertions=0` or with assertion generation disabled).
This is useful when you want to retroactively add assertion-based evaluation capabilities to existing question sets.
# Copyright (c) 2025 Microsoft Corporation.
import sys
sys.path.insert(1, "../../../")
%load_ext dotenv
%dotenv
import logging
import os
from pydantic import SecretStr
from benchmark_qed.autoq.io.question import (
load_questions,
save_questions,
)
from benchmark_qed.autoq.question_gen.data_questions.assertion_gen import (
AssertionValidator,
)
from benchmark_qed.config.llm_config import LLMConfig, LLMProvider
from benchmark_qed.llm.factory import ModelFactory
logging.basicConfig(level=logging.INFO)
# logging.getLogger never returns None, so no guard is needed; silence noisy httpx request logs
logging.getLogger("httpx").setLevel(logging.ERROR)
Configuration¶
# DATA CONFIGS
OUTPUT_QUESTIONS_PATH = "../../output/AP_news/questions"
# MODEL CONFIGS
API_KEY = SecretStr(os.getenv("OPENAI_API_KEY", ""))
LLM_MODEL = "gpt-4.1"
LLM_PARAMS = {
"temperature": 0.0,
"seed": 42,
} # adjust this based on your model. For example, some reasoning models do not support temperature settings
# CONCURRENCY CONFIGS
CONCURRENT_REQUESTS = 8 # Concurrent LLM calls. Adjust based on your model capacity.
# ASSERTION GENERATION CONFIGS
MAX_ASSERTIONS = 20 # Maximum number of assertions per question
BATCH_SIZE = 100 # Batch size for processing claims in global assertion generation
MAX_DATA_TOKENS = (
32000 # Maximum input data tokens for the reduce step in global assertions
)
ENABLE_VALIDATION = True # Set to True to validate assertions against sources
MIN_VALIDATION_SCORE = 3 # Minimum score (1-5) for grounding, relevance, verifiability
# Parallelism for assertion generation (adjust based on your model rate limits)
CONCURRENT_LOCAL_QUESTIONS = 8 # Questions to process in parallel for local assertions
CONCURRENT_GLOBAL_QUESTIONS = 2 # Questions to process in parallel for global assertions (lower due to internal parallelism, set to 1 for sequential)
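The caps above bound how many LLM calls are in flight at once. A common way to enforce such a limit is an `asyncio.Semaphore`; the sketch below only illustrates the pattern and is not BenchmarkQED's actual implementation (`process_all` and `fake_llm_call` are hypothetical names).

```python
import asyncio

CONCURRENT_REQUESTS = 8


async def process_all(items, worker, limit=CONCURRENT_REQUESTS):
    """Run `worker` over `items` with at most `limit` coroutines in flight."""
    semaphore = asyncio.Semaphore(limit)

    async def bounded(item):
        async with semaphore:
            return await worker(item)

    # gather preserves input order regardless of completion order
    return await asyncio.gather(*(bounded(i) for i in items))


async def fake_llm_call(x):
    await asyncio.sleep(0)  # stand-in for a real LLM request
    return x * 2


# In a notebook with a running event loop, use `await process_all(...)` instead
results = asyncio.run(process_all(range(5), fake_llm_call))
print(results)  # [0, 2, 4, 6, 8]
```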
llm = ModelFactory.create_chat_model(
model_config=LLMConfig(
model=LLM_MODEL,
api_key=API_KEY,
llm_provider=LLMProvider.OpenAIChat,
call_args=LLM_PARAMS,
)
)
# Create validator if validation is enabled
# Validator checks assertions for:
# - Grounding: Is the assertion supported by source texts?
# - Relevance: Is it useful for evaluating the question?
# - Verifiability: Is it clear and testable?
validator = (
AssertionValidator(
llm=llm,
llm_params=LLM_PARAMS,
min_criterion_score=MIN_VALIDATION_SCORE,
concurrent_validations=CONCURRENT_REQUESTS,
)
if ENABLE_VALIDATION
else None
)
if validator:
print(f"Validation enabled (min score: {MIN_VALIDATION_SCORE}/5)")
else:
print("Validation disabled")
Generate Assertions for Existing Data-Local Questions¶
Generate assertions for data-local questions that were previously created without assertions.
from benchmark_qed.autoq.question_gen.data_questions.assertion_gen.local_claim_assertion_gen import (
LocalClaimAssertionGenerator,
)
# Load existing data-local questions from disk
# Replace with your actual path to existing questions
existing_local_questions = load_questions(
f"{OUTPUT_QUESTIONS_PATH}/data_local_questions/selected_questions.json"
)
print(f"Loaded {len(existing_local_questions)} existing data-local questions")
# Filter questions that have claims
questions_with_claims = [
q
for q in existing_local_questions
if hasattr(q, "attributes") and q.attributes and "claims" in q.attributes
]
questions_without_claims = [
q
for q in existing_local_questions
if not (hasattr(q, "attributes") and q.attributes and "claims" in q.attributes)
]
print(
f"Questions with claims: {len(questions_with_claims)}, without claims: {len(questions_without_claims)}"
)
# Initialize local assertion generator with optional validator
local_assertion_generator = LocalClaimAssertionGenerator(
llm=llm,
max_assertions=MAX_ASSERTIONS,
validator=validator, # Pass validator for quality filtering (None to skip)
max_concurrent_questions=CONCURRENT_LOCAL_QUESTIONS, # Process questions in parallel
)
# Generate assertions for all questions with claims (parallel processing)
await local_assertion_generator.agenerate_assertions_for_questions(
questions_with_claims
)
# Combine back with questions that had no claims
updated_local_questions = questions_with_claims + questions_without_claims
# Save updated questions with assertions
save_questions(
updated_local_questions,
f"{OUTPUT_QUESTIONS_PATH}/data_local_questions/",
"selected_questions_with_assertions",
)
print(f"Saved {len(updated_local_questions)} data-local questions with assertions")
Generate Assertions for Existing Data-Global Questions¶
Generate assertions for data-global questions that were previously created without assertions. Global assertion generation uses a map-reduce approach, first generating local assertions from referenced questions, then consolidating them into global assertions.
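As a rough mental model of that map-reduce flow (toy data structures; the real generator consolidates candidates with an LLM rather than simple deduplication):

```python
def map_step(referenced_questions):
    # Map: collect candidate assertion statements from each referenced local question.
    return [a for q in referenced_questions for a in q.get("assertions", [])]


def reduce_step(candidates, max_assertions):
    # Reduce: deduplicate while preserving order, then cap at max_assertions.
    seen, consolidated = set(), []
    for statement in candidates:
        if statement not in seen:
            seen.add(statement)
            consolidated.append(statement)
    return consolidated[:max_assertions]


locals_ = [
    {"assertions": ["A mentions X", "A mentions Y"]},
    {"assertions": ["A mentions X", "A mentions Z"]},
]
print(reduce_step(map_step(locals_), max_assertions=3))
# ['A mentions X', 'A mentions Y', 'A mentions Z']
```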
from benchmark_qed.autoq.question_gen.data_questions.assertion_gen.global_claim_assertion_gen import (
GlobalClaimAssertionGenerator,
)
# Load existing data-global questions from disk
# Replace with your actual path to existing questions
existing_global_questions = load_questions(
f"{OUTPUT_QUESTIONS_PATH}/data_global_questions/selected_questions.json"
)
print(f"Loaded {len(existing_global_questions)} existing data-global questions")
# Filter questions that have claims
questions_with_claims = [
q
for q in existing_global_questions
if hasattr(q, "attributes") and q.attributes and "claims" in q.attributes
]
questions_without_claims = [
q
for q in existing_global_questions
if not (hasattr(q, "attributes") and q.attributes and "claims" in q.attributes)
]
print(
f"Questions with claims: {len(questions_with_claims)}, without claims: {len(questions_without_claims)}"
)
# Initialize global assertion generator with optional validator
global_assertion_generator = GlobalClaimAssertionGenerator(
llm=llm,
max_assertions=MAX_ASSERTIONS,
batch_size=BATCH_SIZE, # Batch size for processing multiple claims
max_data_tokens=MAX_DATA_TOKENS, # max input data tokens for the reduce step
concurrent_coroutines=CONCURRENT_REQUESTS,
validator=validator, # Pass validator for quality filtering (None to skip)
max_concurrent_questions=CONCURRENT_GLOBAL_QUESTIONS, # Process questions in parallel (lower due to internal parallelism)
)
# Generate assertions for all questions with claims (parallel processing)
await global_assertion_generator.agenerate_assertions_for_questions(
questions_with_claims
)
# Combine back with questions that had no claims
updated_global_questions = questions_with_claims + questions_without_claims
# Save updated questions with assertions
save_questions(
updated_global_questions,
f"{OUTPUT_QUESTIONS_PATH}/data_global_questions/",
"selected_questions_with_assertions",
)
print(f"Saved {len(updated_global_questions)} data-global questions with assertions")
Notes on Assertion Generation¶
When to use this approach:
- You have existing questions that were generated with `max_assertions=0` or without assertion generation
- You want to add evaluation capabilities to previously generated question sets
- You need to regenerate assertions with different parameters or improved prompts
Input Requirements:
- Questions must have `claims` in their `attributes` field
- For data-local questions: claims should be a list of claim dictionaries
- For data-global questions: claims can be in various formats (simple or complex)
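For reference, a question satisfying these requirements might look like the dict below; field names other than `attributes` and `claims` are illustrative, not the exact schema.

```python
example_question = {
    "question": "What did the company announce in Q3?",
    "attributes": {
        "claims": [
            {"claim": "Revenue grew 12% year over year.", "source_id": "doc_17"},
            {"claim": "A new product line was announced.", "source_id": "doc_17"},
        ]
    },
}

# The notebook's filter keeps questions whose attributes contain "claims":
has_claims = bool(example_question.get("attributes")) and "claims" in example_question["attributes"]
print(has_claims)  # True
```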
Output Format:
- Assertions are added to the question's `attributes.assertions` field
- Each assertion contains a `statement` that can be used for evaluation
- Questions without valid claims are left unchanged
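Continuing the illustrative dict shape from above, reading the generated statements back might look like this (only `attributes.assertions` and `statement` come from the bullets; the rest is hypothetical):

```python
updated = {
    "question": "What did the company announce in Q3?",
    "attributes": {
        "assertions": [
            {"statement": "The answer should mention 12% revenue growth."},
            {"statement": "The answer should mention the new product line."},
        ]
    },
}

# Pull out the evaluable statements for each assertion
statements = [a["statement"] for a in updated["attributes"]["assertions"]]
print(len(statements))  # 2
```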
Configuration Options:
- `MAX_ASSERTIONS`: Maximum number of assertions to generate per question (default: 20)
- `ENABLE_VALIDATION`: Set to `True` to validate assertions for quality (default: `True`)
- `MIN_VALIDATION_SCORE`: Minimum score (1-5) for validation criteria (default: 3)
- `BATCH_SIZE`: For global questions, controls how many claims are processed together
- `MAX_DATA_TOKENS`: For global questions, controls the max input data tokens in the reduce step
- `CONCURRENT_REQUESTS`: Controls parallel LLM calls for batch processing and validation
Parallelism Settings:
- `CONCURRENT_LOCAL_QUESTIONS`: Questions to process in parallel for local assertions (default: 8)
- `CONCURRENT_GLOBAL_QUESTIONS`: Questions to process in parallel for global assertions (default: 2, lower due to internal parallelism; set to `1` for sequential)
Validation:
When `ENABLE_VALIDATION=True`, each assertion is checked for:
- Grounding: Is the assertion factually supported by source texts?
- Relevance: Is the assertion useful for evaluating answers to the question?
- Verifiability: Is the assertion clear and objectively checkable?
Assertions must score at least `MIN_VALIDATION_SCORE` on all three criteria to pass validation and be included in the final assertion set.
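That pass rule amounts to a simple all-criteria predicate; the score dict below is illustrative, not the validator's internal representation:

```python
MIN_VALIDATION_SCORE = 3


def passes_validation(scores, min_score=MIN_VALIDATION_SCORE):
    """An assertion passes only if every criterion meets the minimum score."""
    return all(scores[c] >= min_score for c in ("grounding", "relevance", "verifiability"))


print(passes_validation({"grounding": 4, "relevance": 5, "verifiability": 3}))  # True
print(passes_validation({"grounding": 4, "relevance": 2, "verifiability": 5}))  # False
```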