RHELM: Beyond Static Dialogues

Benchmarking Realistic, Heterogeneous, and Evolving Long-Horizon Memory

Han Zhang1,2,3,*, Zihao Tang3, Xin Yu3,†, Xiao Liu3, Yeyun Gong3, Haizhen Huang3, Yan Lu3, Weiwei Deng3, Feng Sun3, Qi Zhang3, Hanfang Yang1,2,†

1,2Renmin University of China   3Microsoft

*Work done during internships at Microsoft.  

RHELM Overview

RHELM provides realistic, heterogeneous, and temporally evolving memory sources, paired with challenging questions for long-horizon memory evaluation.

Abstract

RHELM is a comprehensive benchmark for evaluating long-horizon memory capabilities in AI systems. Unlike existing benchmarks built around static dialogues, RHELM introduces realistic, heterogeneous, and evolving memory challenges that better reflect real-world assistant scenarios. It pairs multi-source memory (conversations, emails, attachments) with questions requiring multi-hop reasoning, temporal synthesis, preference tracking, and hallucination detection. All characters, events, and personal details are fully synthetic.

Benchmark Construction

RHELM is generated by an iterative pipeline that enriches a persona profile, simulates evolving timelines, and produces multi-source memory together with challenging QA pairs across five dialogue categories.

Algorithm 1: RHELM construction workflow
Algorithm 1. Profile enrichment, timeline rollout / evolution / pruning, and dialogue generation.

Dataset Statistics

Item                     Count
Characters (personas)    10
QA pairs                 1,305
Conversation sessions    629
Emails                   625
Attachments              1,053

Question Type    Count
attachment       249
mixed            210
fact             207
hallucination    197
aggregation      192
temporal         185
misleading       65
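
As a quick consistency check, the per-type counts listed above can be summed and turned into percentage shares. The snippet below is only a small sketch: the counts are copied from the statistics above, while the variable names are illustrative and not part of the RHELM release.

```python
# Question-type counts copied from the dataset statistics above.
question_type_counts = {
    "attachment": 249,
    "mixed": 210,
    "fact": 207,
    "hallucination": 197,
    "aggregation": 192,
    "temporal": 185,
    "misleading": 65,
}

# The per-type counts should add up to the reported number of QA pairs.
total = sum(question_type_counts.values())

# Percentage share of each question type, rounded to one decimal place.
shares = {
    name: round(100 * count / total, 1)
    for name, count in question_type_counts.items()
}

for name, share in shares.items():
    print(f"{name:<14}{question_type_counts[name]:>5}{share:>7.1f}%")
print(f"{'total':<14}{total:>5}")
```

The total matches the 1,305 QA pairs reported above, and misleading questions are by far the smallest group.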

Challenge Taxonomy

RHELM organizes questions into 7 categories with 26 challenge characteristics across three QA domains: Dialogue History QA, External Source QA, and Hybrid Context QA. These cover multi-hop traversal, state-dependent attributes, temporal synthesis, memory-conditioned misleading queries, and more.

Table 3: Taxonomy of challenging questions


Main Results

We evaluate RAG baselines, long-context models, and dedicated memory frameworks, each under two settings: with and without external data sources. Even the strongest systems struggle with cross-source aggregation and hybrid-context reasoning.

Table 4: Detailed performance evaluation on RHELM
The best system (Claude Opus 4.5) reaches an average score of only 38.1. Adding external sources can even hurt performance on standard question types, and RAG collapses on mixed-type queries. All models score below 5% on misleading questions and near the floor on hallucination questions, although stronger reasoning models resist deceptive premises noticeably better.

Analysis

Figure 2: 10 worst-performing challenging characteristics
The hardest characteristics cluster in cross-source synthesis (Mixed, Aggregation) and realistic requests (Misleading, Hallucination): models confuse evidence origins, mishandle conflicting history, and fabricate facts.

Figure 3: Recall comparison of embedding models
Even at k=50, retrieved evidence stays limited and insufficient for precise query resolution across all embedding models.
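
For readers who want to run a similar retrieval check, recall at k can be sketched as below. This is a minimal illustration, not the paper's evaluation code: it assumes recall is the fraction of a question's supporting evidence that appears among the top-k retrieved items, and that both sides use evidence identifiers like those described in the QA Format section. The function name and the example identifiers other than "2024-05-26:5" are hypothetical.

```python
from typing import Sequence, Set


def recall_at_k(retrieved: Sequence[str], gold: Set[str], k: int) -> float:
    """Fraction of gold evidence ids found in the top-k retrieved ids.

    Assumption: `retrieved` is ordered from most to least relevant.
    Returns 0.0 when a question has no gold evidence.
    """
    if not gold:
        return 0.0
    top_k = set(retrieved[:k])
    return len(top_k & gold) / len(gold)


# Illustrative ranking and gold set (identifiers are made up for the example).
retrieved_ids = ["2024-05-26:5", "2024-06-01:2", "2024-05-26:7"]
gold_ids = {"2024-05-26:5", "2024-07-11:3"}

print(recall_at_k(retrieved_ids, gold_ids, k=2))
```

Under this definition, raising k cannot lower recall, but it also cannot recover evidence that the retriever never ranks at all, which is one way retrieved evidence can remain insufficient even at large k.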

QA Format

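Each QA file is in JSON Lines format, with one JSON object per question–answer pair. The record below is pretty-printed for readability; in a JSON Lines file each object occupies a single line: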

{
  "id": "fact_19130b",
  "question": "... what did I actually have for my first meal of the day?",
  "answer": "Leftover lentil soup",
  "question_date": "2024-10-28",
  "question_type": "fact",
  "supporting_evidence": ["2024-05-26:5"],
  "characteristics": ["State-Dependent Attribute"]
}

supporting_evidence uses the form "<session-date>:<turn-index>" for conversation evidence (e.g. turn 5 of the 2024-05-26 session), or a file/section reference for attachments.
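
A minimal loading sketch for this format is shown below. It relies only on the fields and the "<session-date>:<turn-index>" convention described above; the names EvidenceRef, parse_evidence, and load_qa are hypothetical helpers, not part of any RHELM tooling. Because the exact shape of attachment references is not specified here, anything that does not match the conversation form is kept as an unparsed string.

```python
import json
from dataclasses import dataclass
from datetime import date
from typing import Iterable, List, Optional


@dataclass
class EvidenceRef:
    raw: str                            # original reference string
    session_date: Optional[str] = None  # set only for conversation evidence
    turn_index: Optional[int] = None    # set only for conversation evidence


def _is_iso_date(text: str) -> bool:
    """Return True if text parses as an ISO date such as 2024-05-26."""
    try:
        date.fromisoformat(text)
        return True
    except ValueError:
        return False


def parse_evidence(ref: str) -> EvidenceRef:
    """Split "<session-date>:<turn-index>"; leave other references unparsed."""
    date_part, sep, turn_part = ref.rpartition(":")
    if sep and turn_part.isdecimal() and _is_iso_date(date_part):
        return EvidenceRef(ref, date_part, int(turn_part))
    # Assumption: anything else is an attachment (file/section) reference.
    return EvidenceRef(ref)


def load_qa(lines: Iterable[str]) -> List[dict]:
    """Parse JSON Lines input: one JSON object per non-empty line."""
    return [json.loads(line) for line in lines if line.strip()]


# Abbreviated copy of the record above, written on one line as in a QA file.
sample_line = (
    '{"id": "fact_19130b", "question_type": "fact", '
    '"supporting_evidence": ["2024-05-26:5"]}'
)
records = load_qa([sample_line, ""])
evidence = [parse_evidence(r) for r in records[0]["supporting_evidence"]]
print(evidence[0].session_date, evidence[0].turn_index)
```

Since load_qa accepts any iterable of lines, an open file handle for a QA file can be passed to it directly.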

BibTeX

@article{rhelm2026,
  title   = {Beyond Static Dialogues: Benchmarking Realistic, Heterogeneous, and Evolving Long-Horizon Memory},
  author  = {Han Zhang and Zihao Tang and Xin Yu and Xiao Liu and Yeyun Gong and Haizhen Huang and Yan Lu and Weiwei Deng and Feng Sun and Qi Zhang and Hanfang Yang},
  journal = {arXiv preprint arXiv:2605.31086},
  year    = {2026}
}