Assertion Generation for Existing Questions¶
This notebook demonstrates how to generate assertions for existing data-local and data-global questions that were originally created without them (e.g., with `max_assertions=0` or with assertion generation disabled).
This is useful when you want to retroactively add assertion-based evaluation capabilities to existing question sets.
# Copyright (c) 2025 Microsoft Corporation.
import sys
sys.path.insert(1, "../../../")
%load_ext dotenv
%dotenv
import logging
import os
from pydantic import SecretStr
from benchmark_qed.autoq.io.question import (
load_questions,
save_questions,
)
from benchmark_qed.autoq.question_gen.data_questions.assertion_gen import (
AssertionValidator,
)
from benchmark_qed.config.llm_config import LLMConfig, LLMProvider
from benchmark_qed.llm.factory import ModelFactory
logging.basicConfig(level=logging.INFO)
# logging.getLogger never returns None, so no guard is needed; silence noisy httpx request logs
logging.getLogger("httpx").setLevel(logging.ERROR)
Configuration¶
# DATA CONFIGS
OUTPUT_QUESTIONS_PATH = "../../output/AP_news/questions"
# MODEL CONFIGS
API_KEY = SecretStr(os.getenv("OPENAI_API_KEY", ""))
LLM_MODEL = "gpt-4.1"
LLM_PARAMS = {
"temperature": 0.0,
"seed": 42,
} # adjust this based on your model. For example, some reasoning models do not support temperature settings
# CONCURRENCY CONFIGS
CONCURRENT_REQUESTS = 8 # Concurrent LLM calls. Adjust based on your model capacity.
# ASSERTION GENERATION CONFIGS
MAX_ASSERTIONS = 20 # Maximum number of assertions per question
BATCH_SIZE = 100 # Batch size for processing claims in global assertion generation
MAX_DATA_TOKENS = (
32000 # Maximum input data tokens for the reduce step in global assertions
)
ENABLE_VALIDATION = True # Set to True to validate assertions against sources
MIN_VALIDATION_SCORE = 3 # Minimum score (1-5) for grounding, relevance, verifiability
# Parallelism for assertion generation (adjust based on your model rate limits)
CONCURRENT_LOCAL_QUESTIONS = 8 # Questions to process in parallel for local assertions
CONCURRENT_GLOBAL_QUESTIONS = 2 # Questions to process in parallel for global assertions (lower due to internal parallelism, set to 1 for sequential)
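The caps above bound how many LLM calls are in flight at once. A common way to enforce such a limit is an `asyncio.Semaphore`; the sketch below only illustrates the pattern and is not BenchmarkQED's actual implementation (`process_all` and `fake_llm_call` are hypothetical names).

```python
import asyncio

CONCURRENT_REQUESTS = 8


async def process_all(items, worker, limit=CONCURRENT_REQUESTS):
    """Run `worker` over `items` with at most `limit` coroutines in flight."""
    semaphore = asyncio.Semaphore(limit)

    async def bounded(item):
        async with semaphore:
            return await worker(item)

    # gather preserves input order regardless of completion order
    return await asyncio.gather(*(bounded(i) for i in items))


async def fake_llm_call(x):
    await asyncio.sleep(0)  # stand-in for a real LLM request
    return x * 2


# In a notebook with a running event loop, use `await process_all(...)` instead
results = asyncio.run(process_all(range(5), fake_llm_call))
print(results)  # [0, 2, 4, 6, 8]
```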
llm = ModelFactory.create_chat_model(
model_config=LLMConfig(
model=LLM_MODEL,
api_key=API_KEY,
llm_provider=LLMProvider.OpenAIChat,
call_args=LLM_PARAMS,
)
)
# Create validator if validation is enabled
# Validator checks assertions for:
# - Grounding: Is the assertion supported by source texts?
# - Relevance: Is it useful for evaluating the question?
# - Verifiability: Is it clear and testable?
validator = (
AssertionValidator(
llm=llm,
llm_params=LLM_PARAMS,
min_criterion_score=MIN_VALIDATION_SCORE,
concurrent_validations=CONCURRENT_REQUESTS,
)
if ENABLE_VALIDATION
else None
)
if validator:
print(f"Validation enabled (min score: {MIN_VALIDATION_SCORE}/5)")
else:
print("Validation disabled")
Generate Assertions for Existing Data-Local Questions¶
Generate assertions for data-local questions that were previously created without assertions.
from benchmark_qed.autoq.question_gen.data_questions.assertion_gen.local_claim_assertion_gen import (
LocalClaimAssertionGenerator,
)
# Load existing data-local questions from disk
# Replace with your actual path to existing questions
existing_local_questions = load_questions(
f"{OUTPUT_QUESTIONS_PATH}/data_local_questions/selected_questions.json"
)
print(f"Loaded {len(existing_local_questions)} existing data-local questions")
# Filter questions that have claims
questions_with_claims = [
q
for q in existing_local_questions
if hasattr(q, "attributes") and q.attributes and "claims" in q.attributes
]
questions_without_claims = [
q
for q in existing_local_questions
if not (hasattr(q, "attributes") and q.attributes and "claims" in q.attributes)
]
print(
f"Questions with claims: {len(questions_with_claims)}, without claims: {len(questions_without_claims)}"
)
# Initialize local assertion generator with optional validator
local_assertion_generator = LocalClaimAssertionGenerator(
llm=llm,
max_assertions=MAX_ASSERTIONS,
validator=validator, # Pass validator for quality filtering (None to skip)
max_concurrent_questions=CONCURRENT_LOCAL_QUESTIONS, # Process questions in parallel
)
# Generate assertions for all questions with claims (parallel processing)
await local_assertion_generator.agenerate_assertions_for_questions(
questions_with_claims
)
# Combine back with questions that had no claims
updated_local_questions = questions_with_claims + questions_without_claims
# Save updated questions with assertions
save_questions(
updated_local_questions,
f"{OUTPUT_QUESTIONS_PATH}/data_local_questions/",
"selected_questions_with_assertions",
)
print(f"Saved {len(updated_local_questions)} data-local questions with assertions")
Generate Assertions for Existing Data-Global Questions¶
Generate assertions for data-global questions that were previously created without assertions. Global assertion generation uses a map-reduce approach, first generating local assertions from referenced questions, then consolidating them into global assertions.
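As a rough mental model of that map-reduce flow (toy data structures; the real generator consolidates candidates with an LLM rather than simple deduplication):

```python
def map_step(referenced_questions):
    # Map: collect candidate assertion statements from each referenced local question.
    return [a for q in referenced_questions for a in q.get("assertions", [])]


def reduce_step(candidates, max_assertions):
    # Reduce: deduplicate while preserving order, then cap at max_assertions.
    seen, consolidated = set(), []
    for statement in candidates:
        if statement not in seen:
            seen.add(statement)
            consolidated.append(statement)
    return consolidated[:max_assertions]


locals_ = [
    {"assertions": ["A mentions X", "A mentions Y"]},
    {"assertions": ["A mentions X", "A mentions Z"]},
]
print(reduce_step(map_step(locals_), max_assertions=3))
# ['A mentions X', 'A mentions Y', 'A mentions Z']
```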
from benchmark_qed.autoq.question_gen.data_questions.assertion_gen.global_claim_assertion_gen import (
GlobalClaimAssertionGenerator,
)
# Load existing data-global questions from disk
# Replace with your actual path to existing questions
existing_global_questions = load_questions(
f"{OUTPUT_QUESTIONS_PATH}/data_global_questions/selected_questions.json"
)
print(f"Loaded {len(existing_global_questions)} existing data-global questions")
# Filter questions that have claims
questions_with_claims = [
q
for q in existing_global_questions
if hasattr(q, "attributes") and q.attributes and "claims" in q.attributes
]
questions_without_claims = [
q
for q in existing_global_questions
if not (hasattr(q, "attributes") and q.attributes and "claims" in q.attributes)
]
print(
f"Questions with claims: {len(questions_with_claims)}, without claims: {len(questions_without_claims)}"
)
# Initialize global assertion generator with optional validator
global_assertion_generator = GlobalClaimAssertionGenerator(
llm=llm,
max_assertions=MAX_ASSERTIONS,
batch_size=BATCH_SIZE, # Batch size for processing multiple claims
max_data_tokens=MAX_DATA_TOKENS, # max input data tokens for the reduce step
concurrent_coroutines=CONCURRENT_REQUESTS,
validator=validator, # Pass validator for quality filtering (None to skip)
max_concurrent_questions=CONCURRENT_GLOBAL_QUESTIONS, # Process questions in parallel (lower due to internal parallelism)
)
# Generate assertions for all questions with claims (parallel processing)
await global_assertion_generator.agenerate_assertions_for_questions(
questions_with_claims
)
# Combine back with questions that had no claims
updated_global_questions = questions_with_claims + questions_without_claims
# Save updated questions with assertions
save_questions(
updated_global_questions,
f"{OUTPUT_QUESTIONS_PATH}/data_global_questions/",
"selected_questions_with_assertions",
)
print(f"Saved {len(updated_global_questions)} data-global questions with assertions")
Notes on Assertion Generation¶
When to use this approach:
- You have existing questions that were generated with `max_assertions=0` or without assertion generation
- You want to add evaluation capabilities to previously generated question sets
- You need to regenerate assertions with different parameters or improved prompts
Input Requirements:
- Questions must have `claims` in their `attributes` field
- For data-local questions: claims should be a list of claim dictionaries
- For data-global questions: claims can be in various formats (simple or complex)
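For reference, a question satisfying these requirements might look like the dict below; field names other than `attributes` and `claims` are illustrative, not the exact schema.

```python
example_question = {
    "question": "What did the company announce in Q3?",
    "attributes": {
        "claims": [
            {"claim": "Revenue grew 12% year over year.", "source_id": "doc_17"},
            {"claim": "A new product line was announced.", "source_id": "doc_17"},
        ]
    },
}

# The notebook's filter keeps questions whose attributes contain "claims":
has_claims = bool(example_question.get("attributes")) and "claims" in example_question["attributes"]
print(has_claims)  # True
```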
Output Format:
- Assertions are added to the question's `attributes.assertions` field
- Each assertion contains a `statement` that can be used for evaluation
- Questions without valid claims are left unchanged
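Continuing the illustrative dict shape from above, reading the generated statements back might look like this (only `attributes.assertions` and `statement` come from the bullets; the rest is hypothetical):

```python
updated = {
    "question": "What did the company announce in Q3?",
    "attributes": {
        "assertions": [
            {"statement": "The answer should mention 12% revenue growth."},
            {"statement": "The answer should mention the new product line."},
        ]
    },
}

# Pull out the evaluable statements for each assertion
statements = [a["statement"] for a in updated["attributes"]["assertions"]]
print(len(statements))  # 2
```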
Configuration Options:
- `MAX_ASSERTIONS`: Maximum number of assertions to generate per question (default: 20)
- `ENABLE_VALIDATION`: Set to `True` to validate assertions for quality (default: `True`)
- `MIN_VALIDATION_SCORE`: Minimum score (1-5) for validation criteria (default: 3)
- `BATCH_SIZE`: For global questions, controls how many claims are processed together
- `MAX_DATA_TOKENS`: For global questions, controls the max input data tokens in the reduce step
- `CONCURRENT_REQUESTS`: Controls parallel LLM calls for batch processing and validation
Parallelism Settings:
- `CONCURRENT_LOCAL_QUESTIONS`: Questions to process in parallel for local assertions (default: 8)
- `CONCURRENT_GLOBAL_QUESTIONS`: Questions to process in parallel for global assertions (default: 2, lower due to internal parallelism; set to `1` for sequential)
Validation:
When `ENABLE_VALIDATION=True`, each assertion is checked for:
- Grounding: Is the assertion factually supported by source texts?
- Relevance: Is the assertion useful for evaluating answers to the question?
- Verifiability: Is the assertion clear and objectively checkable?
Assertions must score at least `MIN_VALIDATION_SCORE` on all three criteria to pass validation and be included in the final assertion set.
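That pass rule amounts to a simple all-criteria predicate; the score dict below is illustrative, not the validator's internal representation:

```python
MIN_VALIDATION_SCORE = 3


def passes_validation(scores, min_score=MIN_VALIDATION_SCORE):
    """An assertion passes only if every criterion meets the minimum score."""
    return all(scores[c] >= min_score for c in ("grounding", "relevance", "verifiability"))


print(passes_validation({"grounding": 4, "relevance": 5, "verifiability": 3}))  # True
print(passes_validation({"grounding": 4, "relevance": 2, "verifiability": 5}))  # False
```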