RHELM: Beyond Static Dialogues

Benchmarking Realistic, Heterogeneous, and Evolving Long-Horizon Memory

Han Zhang^1,2,3,*, Zihao Tang^3, Xin Yu^3,†, Xiao Liu^3, Yeyun Gong^3, Haizhen Huang^3, Yan Lu^3, Weiwei Deng^3, Feng Sun^3, Qi Zhang^3, Hanfang Yang^1,2,†

^1,2 Renmin University of China   ^3 Microsoft

^* Work done during internships at Microsoft.


RHELM provides realistic, heterogeneous, and temporally evolving memory sources, paired with challenging questions for long-horizon memory evaluation.

Abstract

RHELM is a comprehensive benchmark for evaluating long-horizon memory capabilities in AI systems. Unlike existing benchmarks built around static dialogues, RHELM introduces realistic, heterogeneous, and evolving memory challenges that better reflect real-world assistant scenarios. It pairs multi-source memory (conversations, emails, attachments) with questions requiring multi-hop reasoning, temporal synthesis, preference tracking, and hallucination detection. All characters, events, and personal details are fully synthetic.

Benchmark Construction

RHELM is generated by an iterative pipeline that enriches a persona profile, simulates evolving timelines, and produces multi-source memory together with challenging QA pairs across five dialogue categories.

Algorithm 1: RHELM construction workflow. Profile enrichment, timeline rollout / evolution / pruning, and dialogue generation.

Dataset Statistics

Item	Count
Characters (personas)	10
QA pairs	1,305
Conversation sessions	629
Emails	625
Attachments	1,053

Question Type	Count
attachment	249
mixed	210
fact	207
hallucination	197
aggregation	192
temporal	185
misleading	65
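
As a quick consistency check, the per-type counts above should add up to the 1,305 QA pairs reported in the first table. The short Python sketch below does that arithmetic and derives each type's share of the benchmark. The dictionary simply transcribes the table; the variable names are illustrative and not part of any released tooling.

```python
# Counts transcribed from the question-type table above.
question_type_counts = {
    "attachment": 249,
    "mixed": 210,
    "fact": 207,
    "hallucination": 197,
    "aggregation": 192,
    "temporal": 185,
    "misleading": 65,
}

total = sum(question_type_counts.values())

# Share of each question type, as a percentage rounded to one decimal place.
shares = {
    qtype: round(100 * count / total, 1)
    for qtype, count in question_type_counts.items()
}

for qtype, share in shares.items():
    print(f"{qtype:<14}{question_type_counts[qtype]:>5}{share:>7.1f}%")
print(f"{'total':<14}{total:>5}")
```

The total comes to 1,305, matching the QA pair count, and misleading is the smallest category at roughly 5% of all questions.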

Challenge Taxonomy

RHELM organizes questions into 7 categories with 26 challenge characteristics across three QA domains: Dialogue History QA, External Source QA, and Hybrid Context QA. These cover multi-hop traversal, state-dependent attributes, temporal synthesis, memory-conditioned misleading queries, and more.

Table 3: Taxonomy of challenging questions


Main Results

We evaluate RAG baselines, long-context models, and dedicated memory frameworks, each under two settings: with and without external data sources. Even the strongest systems struggle with cross-source aggregation and hybrid-context reasoning.

Table 4: Detailed performance evaluation on RHELM. The best system (Claude Opus 4.5) reaches an average score of only **38.1**. Adding external sources can even hurt performance on standard question types, and RAG collapses on mixed-type queries. All models score **<5%** on misleading questions and near the floor on hallucination questions, although stronger reasoning models resist deceptive premises noticeably better.

Analysis

Figure 2: The 10 worst-performing challenge characteristics. The hardest characteristics cluster in cross-source synthesis (*Mixed*, *Aggregation*) and realistic requests (*Misleading*, *Hallucination*): models confuse evidence origins, mishandle conflicting history, and fabricate facts.

Figure 3: Recall comparison of embedding models. Even at **k=50**, the retrieved evidence remains too limited for precise query resolution, and this holds across all embedding models.
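
For readers who want to run a similar retrieval check, a minimal sketch of recall at k follows. It assumes recall is the fraction of a question's gold evidence identifiers that appear among the top k retrieved items. That definition, the function name `recall_at_k`, and the toy identifiers are assumptions made for illustration; the exact metric behind Figure 3 is not specified here and may differ.

```python
from typing import Sequence


def recall_at_k(retrieved: Sequence[str], gold: Sequence[str], k: int) -> float:
    """Fraction of gold evidence ids found in the top-k retrieved ids."""
    gold_set = set(gold)
    if not gold_set:
        # No gold evidence: recall is undefined; return 0.0 by convention here.
        return 0.0
    top_k = set(retrieved[:k])
    return len(gold_set & top_k) / len(gold_set)


# Toy ranking with hypothetical evidence ids, written in the same
# "<session-date>:<turn-index>" form as the supporting_evidence field.
ranked = ["2024-05-26:5", "2024-06-01:2", "2024-07-14:9", "2024-05-26:7"]
gold = ["2024-05-26:5", "2024-05-26:7"]

print(recall_at_k(ranked, gold, k=2))  # 0.5
print(recall_at_k(ranked, gold, k=4))  # 1.0
```

In this toy case only one of the two gold items is ranked in the top 2, so recall is 0.5 at k=2 and reaches 1.0 only once k covers the whole ranking.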

QA Format

Each QA file is in JSON Lines format, with one JSON object per question-answer pair. A single record, pretty-printed here for readability:

{
  "id": "fact_19130b",
  "question": "... what did I actually have for my first meal of the day?",
  "answer": "Leftover lentil soup",
  "question_date": "2024-10-28",
  "question_type": "fact",
  "supporting_evidence": ["2024-05-26:5"],
  "characteristics": ["State-Dependent Attribute"]
}

The supporting_evidence field uses the form "<session-date>:<turn-index>" for conversation evidence (for example, "2024-05-26:5" is turn 5 of the 2024-05-26 session) and a file or section reference for attachment evidence.
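
A minimal sketch of reading such records in Python follows. It assumes one JSON object per line, as JSON Lines requires, and it treats any evidence string of the form `YYYY-MM-DD:<integer>` as a conversation reference and anything else as an attachment reference. The helper names `parse_evidence` and `load_qa` are hypothetical, and the sample record omits the question field because its text is elided in the example above.

```python
import json
import re
from typing import Iterable, Iterator, Optional, Tuple

# Conversation evidence looks like "2024-05-26:5" (session date, turn index).
_CONV_EVIDENCE = re.compile(r"^(\d{4}-\d{2}-\d{2}):(\d+)$")


def parse_evidence(ref: str) -> Optional[Tuple[str, int]]:
    """Return (session_date, turn_index) for conversation evidence, else None."""
    match = _CONV_EVIDENCE.match(ref)
    if match is None:
        return None  # Assumed to be a file/section reference for an attachment.
    return match.group(1), int(match.group(2))


def load_qa(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one QA record per non-empty line of a JSON Lines stream."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


# The record from the example above, collapsed onto a single line.
sample = (
    '{"id": "fact_19130b", "answer": "Leftover lentil soup", '
    '"question_date": "2024-10-28", "question_type": "fact", '
    '"supporting_evidence": ["2024-05-26:5"], '
    '"characteristics": ["State-Dependent Attribute"]}'
)

for record in load_qa([sample]):
    for ref in record["supporting_evidence"]:
        print(record["id"], parse_evidence(ref))
```

With a real file, an open text handle can be passed directly, for example `load_qa(open(path, encoding="utf-8"))`, where `path` points to wherever the QA file is stored.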

BibTeX

@article{rhelm2026,
  title   = {Beyond Static Dialogues: Benchmarking Realistic, Heterogeneous, and Evolving Long-Horizon Memory},
  author  = {Han Zhang and Zihao Tang and Xin Yu and Xiao Liu and Yeyun Gong and Haizhen Huang and Yan Lu and Weiwei Deng and Feng Sun and Qi Zhang and Hanfang Yang},
  journal = {arXiv preprint arXiv:2605.31086},
  year    = {2026}
}