Assertion Generation for Existing Questions¶
This notebook demonstrates how to generate assertions for existing data-local and data-global questions that were previously generated without assertions (e.g., when max_assertions=0 was used or assertions were disabled during question generation).
This is useful when you want to retroactively add assertion-based evaluation capabilities to existing question sets.
# Copyright (c) 2025 Microsoft Corporation.
import sys
sys.path.insert(1, "../../../")
%load_ext dotenv
%dotenv
import logging
import os
from pydantic import SecretStr
from benchmark_qed.autoq.io.question import (
load_questions,
save_questions,
)
from benchmark_qed.autoq.question_gen.data_questions.assertion_gen import (
AssertionValidator,
)
from benchmark_qed.config.llm_config import LLMConfig, LLMProvider
from benchmark_qed.llm.factory import ModelFactory
logging.basicConfig(level=logging.INFO)
logging.getLogger("httpx").setLevel(logging.ERROR)
Shared Configuration¶
Common settings for data path and LLM model used by both local and global assertion generation.
# =============================================================================
# SHARED CONFIGURATION - Used by both local and global assertion generation
# =============================================================================
# DATA PATH - where to load/save questions
OUTPUT_QUESTIONS_PATH = "../../local/ap_news/output"
# MODEL CONFIGS
API_KEY = SecretStr(os.getenv("OPENAI_API_KEY", ""))
LLM_MODEL = "gpt-5.2"
LLM_PARAMS = {
"temperature": 0.0,
"seed": 42,
}
# CONCURRENCY - adjust based on your model rate limits
CONCURRENT_REQUESTS = 32
# Create LLM instance
llm = ModelFactory.create_chat_model(
model_config=LLMConfig(
model=LLM_MODEL,
api_key=API_KEY,
llm_provider=LLMProvider.OpenAIChat,
call_args=LLM_PARAMS,
)
)
print(f"LLM configured: {LLM_MODEL}")
print(f"Data path: {OUTPUT_QUESTIONS_PATH}")
Data-Local Assertions¶
Generate assertions for data-local questions. These are factual, single-source questions that require strict grounding validation.
# =============================================================================
# DATA-LOCAL SETTINGS
# =============================================================================
# Maximum assertions per question
LOCAL_MAX_ASSERTIONS = 20
# Validation settings
LOCAL_ENABLE_VALIDATION = True
LOCAL_MIN_VALIDATION_SCORE = (
3 # Minimum score (1-5) for grounding, relevance, verifiability
)
# Parallelism - questions to process in parallel
LOCAL_CONCURRENT_QUESTIONS = 8
# Create local validator (uses local_validation_prompt - stricter, fact-focused)
local_validator = (
AssertionValidator(
llm=llm,
llm_params=LLM_PARAMS,
min_criterion_score=LOCAL_MIN_VALIDATION_SCORE,
concurrent_validations=CONCURRENT_REQUESTS,
)
if LOCAL_ENABLE_VALIDATION
else None
)
if local_validator:
print(f"Local validation enabled (min score: {LOCAL_MIN_VALIDATION_SCORE}/5)")
print(" - Uses local_validation_prompt (fact-focused, strict grounding)")
else:
print("Local validation disabled")
from benchmark_qed.autoq.question_gen.data_questions.assertion_gen.local_claim_assertion_gen import (
LocalClaimAssertionGenerator,
)
# Load existing data-local questions from disk
existing_local_questions = load_questions(
f"{OUTPUT_QUESTIONS_PATH}/data_local_questions/selected_questions.json"
)
print(f"Loaded {len(existing_local_questions)} existing data-local questions")
# Filter questions that have claims
questions_with_claims = [
q
for q in existing_local_questions
if hasattr(q, "attributes") and q.attributes and "claims" in q.attributes
]
questions_without_claims = [
q
for q in existing_local_questions
if not (hasattr(q, "attributes") and q.attributes and "claims" in q.attributes)
]
print(
f"Questions with claims: {len(questions_with_claims)}, without claims: {len(questions_without_claims)}"
)
# Initialize local assertion generator
local_assertion_generator = LocalClaimAssertionGenerator(
llm=llm,
max_assertions=LOCAL_MAX_ASSERTIONS,
validator=local_validator,
max_concurrent_questions=LOCAL_CONCURRENT_QUESTIONS,
)
# Generate assertions for all questions with claims
await local_assertion_generator.agenerate_assertions_for_questions(
questions_with_claims
)
# Combine back with questions that had no claims
updated_local_questions = questions_with_claims + questions_without_claims
# Save updated questions with assertions
save_questions(
updated_local_questions,
f"{OUTPUT_QUESTIONS_PATH}/data_local_questions/",
"selected_questions_with_assertions",
)
# Show summary
print("\n=== SUMMARY ===")
print(f"Processed {len(questions_with_claims)} questions with claims")
total_assertions = sum(
len(q.attributes.get("assertions", [])) for q in questions_with_claims
)
print(f"Total assertions generated: {total_assertions}")
print(
f"Average assertions per question: {total_assertions / max(len(questions_with_claims), 1):.1f}"
)
Data-Global Assertions¶
Generate assertions for data-global questions. These are thematic, cross-document questions that use a map-reduce approach:
- Map step: Generate factual assertions from claim batches (uses semantic grouping)
- Reduce step: Consolidate into high-level thematic assertions
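The map-reduce flow above can be sketched in plain Python. This is an illustrative stand-in, not the library's implementation: `map_step` and `reduce_step` are hypothetical placeholders for what are really LLM calls, and batching here is fixed-size (the semantic-grouping path is covered later in the notes).

```python
from itertools import islice


def batched(items: list[str], size: int) -> list[list[str]]:
    """Split claims into fixed-size batches (semantic grouping disabled)."""
    it = iter(items)
    return [list(chunk) for chunk in iter(lambda: list(islice(it, size)), [])]


def map_step(claim_batch: list[str]) -> list[str]:
    """Stand-in for the LLM map call: one factual assertion per claim."""
    return [f"Assertion grounded in: {claim}" for claim in claim_batch]


def reduce_step(map_assertions: list[str], max_assertions: int) -> list[str]:
    """Stand-in for the LLM reduce call: deduplicate and cap the output."""
    return list(dict.fromkeys(map_assertions))[:max_assertions]


claims = [f"claim {i}" for i in range(7)] + ["claim 0"]  # one duplicate claim
map_assertions = [a for batch in batched(claims, 3) for a in map_step(batch)]
final = reduce_step(map_assertions, max_assertions=5)
print(len(map_assertions), len(final))  # 8 map assertions reduced to 5 final
```

The key property the sketch preserves is that map output can be large and redundant, while reduce consolidates it down to at most `max_assertions` items.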
# =============================================================================
# DATA-GLOBAL SETTINGS
# =============================================================================
# Maximum assertions per question
GLOBAL_MAX_ASSERTIONS = 20
# Validation settings
GLOBAL_ENABLE_VALIDATION = True
GLOBAL_MIN_VALIDATION_SCORE = (
3 # Minimum score (1-5) for grounding, relevance, verifiability
)
# Map-reduce batching settings
GLOBAL_BATCH_SIZE = 100 # Batch size when semantic grouping is disabled
GLOBAL_MAP_DATA_TOKENS = (
8000 # Max tokens per cluster in map step (when semantic grouping enabled)
)
GLOBAL_REDUCE_DATA_TOKENS = 32000 # Max input tokens for reduce step
# Semantic grouping - groups similar claims together before map step
GLOBAL_ENABLE_SEMANTIC_GROUPING = True
# Validation phases
GLOBAL_VALIDATE_MAP_ASSERTIONS = (
True # Validate map assertions before reduce (increases LLM calls)
)
GLOBAL_VALIDATE_REDUCE_ASSERTIONS = True # Validate final assertions after reduce
# Parallelism - questions to process in parallel (lower due to internal parallelism)
GLOBAL_CONCURRENT_QUESTIONS = 2
# Create text embedder for semantic grouping
text_embedder = None
if GLOBAL_ENABLE_SEMANTIC_GROUPING:
from benchmark_qed.autod.data_processor.embedding import TextEmbedder
from benchmark_qed.config.llm_config import (
LLMConfig as EmbeddingConfig,
)
from benchmark_qed.config.llm_config import (
LLMProvider as EmbeddingProvider,
)
from benchmark_qed.llm.factory import ModelFactory as EmbeddingModelFactory
embedding_model = EmbeddingModelFactory.create_embedding_model(
model_config=EmbeddingConfig(
model="text-embedding-3-large",
api_key=API_KEY,
llm_provider=EmbeddingProvider.OpenAIEmbedding,
)
)
text_embedder = TextEmbedder(embedding_model)
print("Semantic grouping enabled - similar claims will be grouped together")
# Create map validator (fact-focused, for map assertions)
map_validator = (
AssertionValidator(
llm=llm,
llm_params=LLM_PARAMS,
min_criterion_score=GLOBAL_MIN_VALIDATION_SCORE,
concurrent_validations=CONCURRENT_REQUESTS,
)
if GLOBAL_ENABLE_VALIDATION and GLOBAL_VALIDATE_MAP_ASSERTIONS
else None
)
# Create reduce validator (thematic, for final assertions)
reduce_validator = None
if GLOBAL_ENABLE_VALIDATION and GLOBAL_VALIDATE_REDUCE_ASSERTIONS:
from pathlib import Path
from benchmark_qed.autoq.prompts import data_questions
from benchmark_qed.config.utils import load_template_file
global_validation_prompt = load_template_file(
Path(data_questions.__file__).parent
/ "assertions"
/ "global_validation_prompt.txt"
)
reduce_validator = AssertionValidator(
llm=llm,
llm_params=LLM_PARAMS,
min_criterion_score=GLOBAL_MIN_VALIDATION_SCORE,
concurrent_validations=CONCURRENT_REQUESTS,
validation_prompt=global_validation_prompt,
)
print("\nGlobal assertion settings:")
print(f" - Max assertions: {GLOBAL_MAX_ASSERTIONS}")
print(f" - Map data tokens: {GLOBAL_MAP_DATA_TOKENS}")
print(f" - Reduce data tokens: {GLOBAL_REDUCE_DATA_TOKENS}")
print(f" - Semantic grouping: {GLOBAL_ENABLE_SEMANTIC_GROUPING}")
print(
f" - Map validation: {GLOBAL_VALIDATE_MAP_ASSERTIONS} (min score: {GLOBAL_MIN_VALIDATION_SCORE}/5)"
)
print(
f" - Reduce validation: {GLOBAL_VALIDATE_REDUCE_ASSERTIONS} (min score: {GLOBAL_MIN_VALIDATION_SCORE}/5)"
)
from benchmark_qed.autoq.question_gen.data_questions.assertion_gen.global_claim_assertion_gen import (
GlobalClaimAssertionGenerator,
)
# Load existing data-global questions from disk
existing_global_questions = load_questions(
f"{OUTPUT_QUESTIONS_PATH}/data_global_questions/selected_questions.json"
)
print(f"Loaded {len(existing_global_questions)} existing data-global questions")
# Filter questions that have claims
questions_with_claims = [
q
for q in existing_global_questions
if hasattr(q, "attributes") and q.attributes and "claims" in q.attributes
]
questions_without_claims = [
q
for q in existing_global_questions
if not (hasattr(q, "attributes") and q.attributes and "claims" in q.attributes)
]
print(
f"Questions with claims: {len(questions_with_claims)}, without claims: {len(questions_without_claims)}"
)
# Initialize global assertion generator
global_assertion_generator = GlobalClaimAssertionGenerator(
llm=llm,
max_assertions=GLOBAL_MAX_ASSERTIONS,
batch_size=GLOBAL_BATCH_SIZE,
map_data_tokens=GLOBAL_MAP_DATA_TOKENS,
reduce_data_tokens=GLOBAL_REDUCE_DATA_TOKENS,
concurrent_coroutines=CONCURRENT_REQUESTS,
map_validator=map_validator,
reduce_validator=reduce_validator,
max_concurrent_questions=GLOBAL_CONCURRENT_QUESTIONS,
text_embedder=text_embedder,
enable_semantic_grouping=GLOBAL_ENABLE_SEMANTIC_GROUPING,
validate_map_assertions=GLOBAL_VALIDATE_MAP_ASSERTIONS,
validate_reduce_assertions=GLOBAL_VALIDATE_REDUCE_ASSERTIONS,
)
# Generate assertions for ALL questions with claims
await global_assertion_generator.agenerate_assertions_for_questions(
questions_with_claims
)
# Combine back with questions that had no claims
updated_global_questions = questions_with_claims + questions_without_claims
# Save updated questions with assertions
save_questions(
updated_global_questions,
f"{OUTPUT_QUESTIONS_PATH}/data_global_questions/",
"selected_questions_with_assertions",
)
# Show summary
print("\n=== SUMMARY ===")
print(f"Processed {len(questions_with_claims)} questions with claims")
total_assertions = sum(
len(q.attributes.get("assertions", [])) for q in questions_with_claims
)
total_map_assertions = sum(
len(q.attributes.get("map_assertions", [])) for q in questions_with_claims
)
print(f"Total map assertions generated: {total_map_assertions}")
print(f"Total final assertions generated: {total_assertions}")
print(
f"Average assertions per question: {total_assertions / max(len(questions_with_claims), 1):.1f}"
)
# Show per-question breakdown
print("\n=== PER-QUESTION BREAKDOWN ===")
for i, q in enumerate(questions_with_claims, 1):
assertions = q.attributes.get("assertions", [])
map_assertions = q.attributes.get("map_assertions", [])
print(
f"{i}. [{len(map_assertions)} map → {len(assertions)} final] {q.text[:60]}..."
)
Notes on Assertion Generation¶
When to use this approach:
- You have existing questions that were generated with max_assertions=0 or without assertion generation
- You want to add evaluation capabilities to previously generated question sets
- You need to regenerate assertions with different parameters or improved prompts
Input Requirements:
- Questions must have claims in their attributes field
- For data-local questions: claims should be a list of claim dictionaries
- For data-global questions: claims can be in various formats (simple or complex)
Output Format:
- Assertions are added to the question's attributes.assertions field
- Each assertion contains a statement that can be used for evaluation
- Questions without valid claims are left unchanged
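A hypothetical before/after record may make the shape concrete. Field names follow the notes above (attributes.claims in, attributes.assertions out); the example text and statements are invented for illustration.

```python
# Hypothetical question record before assertion generation.
question = {
    "text": "What did the mayor announce about the transit budget?",
    "attributes": {
        "claims": [
            {"statement": "The mayor announced a 5% transit budget increase."},
        ],
    },
}

# After assertion generation, each assertion carries a checkable statement.
question["attributes"]["assertions"] = [
    {"statement": "The answer states the transit budget increased by 5%."},
]

has_claims = bool(question["attributes"].get("claims"))
num_assertions = len(question["attributes"].get("assertions", []))
print(has_claims, num_assertions)  # True 1
```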
Configuration Options:
- MAX_ASSERTIONS: Maximum number of assertions to generate per question (default: 20)
- ENABLE_VALIDATION: Set to True to validate final assertions for quality (default: True)
- MIN_VALIDATION_SCORE: Minimum score (1-5) for validation criteria (default: 3)
- BATCH_SIZE: For global questions, controls how many claims are processed together (when semantic grouping is disabled)
- MAP_DATA_TOKENS: For global questions, max tokens per cluster in the map step (default: 12000, used when semantic grouping is enabled)
- REDUCE_DATA_TOKENS: For global questions, max input data tokens for the reduce step (default: 32000)
- CONCURRENT_REQUESTS: Controls parallel LLM calls for batch processing and validation
Semantic Claim Grouping (Global Questions Only):
- ENABLE_SEMANTIC_GROUPING: Set to True to group similar claims together before the map step
- When enabled, claims are embedded and clustered using ConstraintKMeans to ensure similar claims are processed together
- Clusters are constrained by MAP_DATA_TOKENS so that each batch fits within the token limit
- This reduces redundancy in map assertions by consolidating semantically similar claims
- Requires a text_embedder to be provided to the GlobalClaimAssertionGenerator
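The token-budget constraint on clusters can be sketched in isolation. This is a simplified stand-in: the real pipeline clusters by embedding similarity first, and a crude one-token-per-word estimate replaces a real tokenizer.

```python
def pack_by_token_budget(claims: list[str], max_tokens: int) -> list[list[str]]:
    """Greedily pack claims into groups whose estimated tokens fit the budget."""
    groups: list[list[str]] = []
    current: list[str] = []
    current_tokens = 0
    for claim in claims:
        tokens = len(claim.split())  # crude stand-in for a real tokenizer
        if current and current_tokens + tokens > max_tokens:
            groups.append(current)  # budget exceeded: start a new group
            current, current_tokens = [], 0
        current.append(claim)
        current_tokens += tokens
    if current:
        groups.append(current)
    return groups


claims = ["alpha beta gamma", "delta epsilon", "zeta", "eta theta iota kappa"]
groups = pack_by_token_budget(claims, max_tokens=5)
print([len(g) for g in groups])  # [2, 2]
```

In the real generator this budget corresponds to MAP_DATA_TOKENS, so every map call stays within the model's context window.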
Map Assertion Validation (Global Questions Only):
- VALIDATE_MAP_ASSERTIONS: Set to True to validate map assertions before the reduce step (default: False)
- When enabled, map assertions are validated using the same criteria as final assertions
- Low-quality map assertions are filtered out before being passed to the reduce step
- Trade-offs:
- Pro: Better quality input to reduce step, fewer low-quality assertions to consolidate
- Con: Increases LLM calls (validation for both map and final assertions)
- Recommended when you have many low-quality claims or want highest quality final assertions
Reduce Assertion Validation (Global Questions Only):
- VALIDATE_REDUCE_ASSERTIONS: Set to False to skip validation of final assertions (default: True)
- When enabled, final assertions are validated after the reduce step
- Useful to disable if you've already validated map assertions and want to save LLM calls
Parallelism Settings:
- CONCURRENT_LOCAL_QUESTIONS: Questions to process in parallel for local assertions (default: 8)
- CONCURRENT_GLOBAL_QUESTIONS: Questions to process in parallel for global assertions (default: 2, lower due to internal parallelism; set to 1 for sequential processing)
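A per-question concurrency limit like this is typically enforced with an asyncio.Semaphore; the sketch below shows the pattern with a stand-in `generate_one` in place of the real per-question LLM work (inside the notebook you would `await generate_all(...)` directly instead of calling asyncio.run).

```python
import asyncio

CONCURRENT_QUESTIONS = 2  # illustrative value, mirroring the global setting


async def generate_one(question_id: int, sem: asyncio.Semaphore) -> int:
    async with sem:  # at most CONCURRENT_QUESTIONS coroutines run this body
        await asyncio.sleep(0.01)  # stand-in for LLM calls
        return question_id


async def generate_all(n: int) -> list[int]:
    sem = asyncio.Semaphore(CONCURRENT_QUESTIONS)
    # gather preserves input order regardless of completion order
    return await asyncio.gather(*(generate_one(i, sem) for i in range(n)))


results = asyncio.run(generate_all(5))
print(results)  # [0, 1, 2, 3, 4]
```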
Validation:
When ENABLE_VALIDATION=True, each assertion is checked for:
- Grounding: Is the assertion factually supported by source texts?
- Relevance: Is the assertion useful for evaluating answers to the question?
- Verifiability: Is the assertion clear and objectively checkable?
Assertions must score at least MIN_VALIDATION_SCORE on all three criteria to pass validation and be included in the final assertion set.
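The pass/fail rule described above reduces to an all-criteria threshold check. A minimal sketch, with illustrative score dictionaries standing in for the validator's output:

```python
MIN_VALIDATION_SCORE = 3  # each criterion is scored 1-5

scored_assertions = [
    {"statement": "A", "grounding": 5, "relevance": 4, "verifiability": 4},
    {"statement": "B", "grounding": 2, "relevance": 5, "verifiability": 5},
    {"statement": "C", "grounding": 3, "relevance": 3, "verifiability": 3},
]

CRITERIA = ("grounding", "relevance", "verifiability")

# An assertion passes only if EVERY criterion meets the minimum score:
# "B" fails despite high relevance and verifiability, because grounding is 2.
passed = [
    a for a in scored_assertions
    if all(a[c] >= MIN_VALIDATION_SCORE for c in CRITERIA)
]
print([a["statement"] for a in passed])  # ['A', 'C']
```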
Important: Validator Prompts for Global Assertions
Global assertions use a map-reduce approach with different validation requirements:
- Map assertions are factual and specific → use local_validation_prompt (stricter grounding)
- Reduce assertions are thematic and synthesize across sources → use global_validation_prompt (allows synthesis)
The notebook creates separate validators:
- local_validator and map_validator: For data-local questions and map assertions (fact-focused)
- reduce_validator: For reduce/final assertions in data-global questions (thematic, built with global_validation_prompt)
Using the wrong validation prompt (e.g., local prompt for thematic assertions) will result in very low validation pass rates.