TL;DR

⚡️ We introduce Gistify, a task that asks an agent to extract the gist of a repository: generate a single, executable, and self-contained file that faithfully reproduces the behavior of a given command (e.g., a test or entrypoint).
💡 Gistify evaluates an agent’s ability to reason over runtime execution, identify only the minimal required code, and avoid hallucinations.
🤯 Even strong LLMs/frameworks struggle, especially on long, multi-file traces; execution or global-context tools provide only modest gains, and preserving the original test/entrypoint early on is closely tied to correct repository understanding and is a strong indicator of eventual success.
Limitations of existing codebase-level understanding tasks
- Prior benchmarks often don’t require full runtime reasoning; many can be solved with heuristic shortcuts or localized patch retrieval.
- Many datasets are built from GitHub issues or pull requests, making them difficult to generalize to arbitrary repositories.
- As coding agents move into large, real-world codebases, we need automatically constructed, broadly applicable, and more challenging repository-level evaluations.
What is Gistify?

🔍 Given (1) a codebase (Dockerized for consistent evaluation) and (2) an entry command (e.g., pytest path::test_case), the agent must produce a single gistified file (a toy example follows the list) that:
- Self-contained: includes everything it needs and runs independently of the original repo
- Execution fidelity: produces the same outputs/behavior as the original codebase under the given command
- Minimality: retains only the code essential to execute the command
- Faithful preservation: contains only lines copied from the repo (no hallucinated code)
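To make these requirements concrete, here is a toy illustration (ours, not from the benchmark): suppose the entry command is pytest tests/test_stats.py::test_mean in a hypothetical repository whose helper lives in mylib/stats.py. A gistified file would inline exactly the lines that command needs:

```python
# gistified.py -- illustrative sketch for the hypothetical entry command
#   pytest tests/test_stats.py::test_mean
# Every line is copied from the (hypothetical) repo: the helper from
# mylib/stats.py and the test from tests/test_stats.py are inlined so the
# file runs without importing anything from the original package.

def mean(values):
    # originally mylib/stats.py
    if not values:
        raise ValueError("mean() of empty sequence")
    return sum(values) / len(values)


def test_mean():
    # originally tests/test_stats.py::test_mean
    assert mean([1, 2, 3]) == 2
```

Running pytest gistified.py::test_mean should then reproduce the original test's outcome without the rest of the repository.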
Why does this matter?
- Gistify directly tests whether models can reason at the codebase level, including runtime behavior and cross-file execution.
- It enables a lightweight, broadly applicable evaluation that can automatically generate challenging tasks for any repository, including private ones.
- The resulting gistified files are valuable artifacts: by compressing a specific behavior into a minimal, standalone script, they can support downstream tasks such as automated debugging and error localization.
How the task is evaluated
We align metrics with the requirements of Gistify:
- Execution Fidelity: Is the generated file executable, and does it produce the same output as the full codebase under the given command?
- Line Existence Rate: What fraction of lines in the generated file come from the original repository? (measures faithfulness, i.e., no hallucinated code)
- Line Execution Rate: What fraction of lines in the generated file are actually executed under that command? (measures minimality)
In short:
Does it work? Does it reuse original code instead of inventing new logic? And is it as lean as possible?
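As a rough sketch of how the two line-level metrics might be computed (our own simplification: the normalization, file filtering, and handling of blank lines below are assumptions, not the benchmark's official scorer):

```python
from pathlib import Path

def _normalize(line: str) -> str:
    """Strip indentation/trailing whitespace so matching ignores re-indentation."""
    return line.strip()

def line_existence_rate(gist_file: str, repo_root: str) -> float:
    """Fraction of non-blank gistified lines that appear somewhere in the repo."""
    repo_lines: set[str] = set()
    for path in Path(repo_root).rglob("*.py"):
        text = path.read_text(encoding="utf-8", errors="ignore")
        repo_lines.update(_normalize(l) for l in text.splitlines())
    gist = [_normalize(l) for l in Path(gist_file).read_text().splitlines()]
    gist = [l for l in gist if l]  # ignore blank lines
    return sum(l in repo_lines for l in gist) / len(gist) if gist else 0.0

def line_execution_rate(gist_file: str, executed_linenos: set[int]) -> float:
    """Fraction of non-blank gistified lines actually executed under the command.
    executed_linenos would come from a coverage/trace run of the gistified file."""
    lines = Path(gist_file).read_text().splitlines()
    candidates = [i for i, l in enumerate(lines, start=1) if l.strip()]
    return sum(i in executed_linenos for i in candidates) / len(candidates) if candidates else 0.0
```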
Experiments

📈 We evaluate Gistify across multiple agent frameworks (Mini-SWE-Agent, SWE-Agent, Copilot) and model families (GPT-5-mini, GPT-5, Claude-3.7-Sonnet, Claude-4).
- Even with strong models and frameworks, success remains limited: reliably producing a correct gistified file is still difficult, with Claude-4 performing the most robustly overall.
- Model scale matters: smaller/earlier models perform worse, though scaffolding can partially help.
- Different model families exhibit distinct strengths: Claude-4 better preserves original code (higher line existence rate), while GPT-5 produces more concise files (higher line execution rate).
- Execution tools offer limited improvement: adding execution tool access yields only modest gains—the core challenge remains tracing and extracting behavior spread across many files.
Effect of various strategies and tools

- Tool-assisted setups outperform prompt-only guidance: Global Info (Tool) and Execution (Tool) generally achieve higher performance than prompt-only variants (Prompted Strategies).
- Tracing > RepoGraph: Access to global context, especially gold execution traces, greatly improves runtime reasoning by helping the model pinpoint which files matter (one way such a trace might be collected is sketched below).
- More freedom isn’t always better: Full terminal access can hurt performance. Broad exploration often hits step limits (Max Steps Reached) and still fails to produce a well-constructed gistified file.
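The post does not pin down how a gold execution trace is produced; as one possible sketch (assuming a pytest entry command and using only the standard library), sys.settrace can record which files and lines a command actually touches:

```python
import sys
import pytest  # assumes the entry command is a pytest test

def collect_gold_trace(func, *args) -> dict[str, set[int]]:
    """Run func(*args) and return {filename: executed line numbers}.
    In practice the result would be filtered to files under the repository
    root before being exposed to the agent."""
    executed: dict[str, set[int]] = {}

    def tracer(frame, event, arg):
        if event == "line":
            executed.setdefault(frame.f_code.co_filename, set()).add(frame.f_lineno)
        return tracer

    sys.settrace(tracer)
    try:
        func(*args)
    finally:
        sys.settrace(None)
    return executed

# Hypothetical usage:
# gold = collect_gold_trace(pytest.main, ["tests/test_stats.py::test_mean", "-q"])
```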
What makes Gistify difficult?
Two axes correlate with lower success:
- Trace length: deeper call chains mean more code to inline, more state to track
- File coverage: spreading logic across many files increases the risk of missed edges, mocks, or import errors
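As a crude but concrete proxy (our own simplification), both axes can be read off a trace like the one sketched above: trace length as the number of executed repository lines, file coverage as the number of distinct repository files touched:

```python
def difficulty_stats(executed: dict[str, set[int]], repo_root: str) -> dict[str, int]:
    """Crude proxies for the two difficulty axes, given {filename: executed lines}
    from a trace; repo_root should be an absolute path to the repository."""
    in_repo = {f: lines for f, lines in executed.items() if f.startswith(repo_root)}
    return {
        "trace_length": sum(len(lines) for lines in in_repo.values()),  # executed lines
        "file_coverage": len(in_repo),                                  # distinct files
    }
```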
What about static coding LLMs?
- Agentic > Static: iterative read-think-act workflows consistently outperform one-shot/static generation.
- Static models can still score well on Line Existence when given the gold relevant code, since the task then reduces to reorganizing known snippets rather than discovering them.
What’s next?
🚀 Enhance Gistify performance: Explore richer execution traces, smarter navigation tools, and dynamic file-reading to improve runtime understanding.
🛠️ Leverage gistified files downstream: Apply gistified outputs to tasks such as debugging, code review, and feature extraction.
References
(1) Yang, John, et al. “SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering.” Advances in Neural Information Processing Systems 37, 2024.
(2) Ouyang, Siru, et al. “RepoGraph: Enhancing AI Software Engineering with Repository-Level Code Graph.” In Proceedings of ICLR 2025.