A Systematic Study of Model-Generated Agent Skills
1 Fudan University
2 Microsoft Research
3 Shanghai Jiao Tong University
* Equal contribution † Work done during an internship at MSRA ‡ Corresponding authors
Language agents improve by reusing skills—structured procedural artifacts distilled from past experience.
Among them, domain-level skills—which package a domain's recurring procedures into a single reusable artifact—have become a standard component in commercial agent platforms, but hand-crafting them cannot scale. Model-generated skills are the only path forward, making them the form most likely to shape real agent systems at deployment scale.
Do model-generated, domain-level skills reliably benefit downstream agents across targets, extractors, and domains?
Across the three lifecycle stages (experience, extraction, and consumption), what actually determines a skill's downstream utility?
Can our empirical findings be turned into a concrete, drop-in improvement to skill extraction itself?
We evaluate the full experience-to-skill lifecycle across three stages, systematically varying extractor and target models across five diverse domains.
Figure 1. Overview of our study design across the three lifecycle stages.
A large-scale evaluation across 5 domains × 6 targets × 5 extractors. We first introduce the three metrics that ground the study, then summarize three headline findings.
Δ: the skill-induced change in a target's score relative to its no-skill baseline (the Base column). It is the atomic unit of every cell in the table. Δ > 0 means the skill helped this (extractor, target) pair; Δ < 0 means negative transfer.
EE (avg_target Δ): for a fixed extractor, the average Δ across all targets, i.e., how reliably the extractor produces useful skills.
TE (avg_extractor Δ): for a fixed target, the average Δ across extractors that distill skills from its own trajectories, i.e., how much the target can improve from skills grounded in its own experience.

| Target | Base | GPT-5.4 | GPT-5.4-mini | Gem-3.1-Pro | Gem-3.1-FL | Qwen3.5-35B | TE |
|---|---|---|---|---|---|---|---|
| Embodied: ALFWorld | | | | | | | |
| GPT-5.4 | 68.66 | +1.49 | +6.47 | +7.46 | +4.98 | +4.23 | +4.93 |
| GPT-5.4-mini | 52.24 | +1.00 | +4.23 | +2.74 | +2.24 | +3.98 | +2.84 |
| Gem-3.1-Pro | 87.56 | +0.50 | +0.75 | +0.00 | −0.75 | −1.24 | −0.15 |
| Gem-3.1-FL | 51.99 | −2.49 | −1.24 | +1.49 | −2.49 | −3.23 | −1.59 |
| Qwen-35B | 57.21 | −1.99 | −3.48 | −0.75 | +0.50 | −1.00 | −1.34 |
| Qwen-9B | 36.07 | −2.49 | −2.99 | −1.24 | −1.99 | +0.25 | −1.69 |
| EE | | −0.66 | +0.62 | +1.62 | +0.42 | +0.50 | |
| Productivity: SpreadsheetBench | | | | | | | |
| GPT-5.4 | 37.17 | +4.33 | +9.00 | +14.00 | +14.66 | +6.33 | +9.66 |
| GPT-5.4-mini | 29.33 | +0.34 | +2.50 | +3.67 | +4.50 | +1.00 | +2.40 |
| Gem-3.1-Pro | 35.83 | −0.50 | −2.67 | +6.50 | +5.33 | +5.83 | +2.90 |
| Gem-3.1-FL | 25.00 | +2.67 | +1.83 | +1.50 | +6.17 | +7.33 | +3.90 |
| Qwen-35B | 23.83 | +2.00 | +5.50 | +0.17 | +3.34 | −3.50 | +1.50 |
| Qwen-9B | 13.67 | +1.16 | +3.16 | −1.17 | +1.16 | +3.00 | +1.46 |
| EE | | +1.67 | +3.22 | +4.11 | +5.86 | +3.33 | |
| Coding: SWE-bench-Verified | | | | | | | |
| GPT-5.4 | 68.40 | +4.67 | +1.33 | +2.00 | +4.00 | +2.27 | +2.85 |
| GPT-5.4-mini | 59.73 | +3.20 | +3.20 | +1.73 | +3.60 | +2.80 | +2.91 |
| Gem-3.1-Pro | 66.53 | +2.00 | +2.80 | +2.13 | +3.47 | −1.60 | +1.76 |
| Gem-3.1-FL | 55.47 | +2.67 | +3.33 | +2.93 | +3.47 | −0.93 | +2.29 |
| Qwen-35B | 52.93 | +3.20 | +2.00 | +2.53 | +2.93 | +2.00 | +2.53 |
| Qwen-9B | 33.33 | −1.07 | +2.40 | −1.60 | +1.20 | +0.93 | +0.37 |
| EE | | +2.45 | +2.51 | +1.62 | +3.11 | +0.91 | |
| Web Search: SEAL-0 | | | | | | | |
| GPT-5.4 | 51.24 | +6.47 | +4.23 | +7.71 | +1.74 | +1.74 | +4.38 |
| GPT-5.4-mini | 45.27 | −1.49 | +3.23 | −3.98 | +3.98 | −4.23 | −0.50 |
| Gem-3.1-Pro | 55.97 | −4.23 | −1.99 | +1.99 | +2.49 | −3.48 | −1.04 |
| Gem-3.1-FL | 14.93 | +9.45 | +8.21 | +2.99 | −1.24 | +7.21 | +5.32 |
| Qwen-35B | 40.55 | +1.74 | +6.47 | −3.73 | +4.73 | +2.24 | +2.29 |
| Qwen-9B | 33.83 | +10.70 | +8.96 | −5.72 | +5.97 | −2.99 | +3.38 |
| EE | | +3.77 | +4.85 | −0.12 | +2.95 | +0.08 | |
| Tool Calling: BFCL-v4 | | | | | | | |
| GPT-5.4 | 51.68 | +3.08 | +5.04 | +0.42 | +5.04 | −2.24 | +2.27 |
| GPT-5.4-mini | 53.50 | +3.92 | +6.16 | +7.56 | +5.18 | +2.94 | +5.15 |
| Gem-3.1-Pro | 51.82 | +5.32 | +5.88 | −4.34 | +0.14 | +6.02 | +2.60 |
| Gem-3.1-FL | 41.18 | +4.06 | +4.62 | +4.20 | +8.12 | +3.64 | +4.93 |
| Qwen-35B | 57.56 | −0.70 | +0.84 | +1.54 | +2.80 | +1.82 | +1.26 |
| Qwen-9B | 46.78 | +2.94 | +2.94 | +1.82 | −0.98 | −1.12 | +1.12 |
| EE | | +3.10 | +4.25 | +1.87 | +3.38 | +1.84 | |
Table 1. Skill-induced performance gain (Δ) across all five domains. Green: Δ > 0; Red: Δ < 0. Bold: best TE/EE; Underline: second best.
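To make the relationship between the three metrics concrete, the sketch below recomputes TE, EE, and the share of negative-transfer cells for the ALFWorld block of Table 1. The Δ values are copied from the table; the variable and function names are illustrative assumptions, not the study's own code.

```python
# Δ values copied from the ALFWorld block of Table 1.
# Rows are targets, columns are extractors (names are illustrative).
EXTRACTORS = ["GPT-5.4", "GPT-5.4-mini", "Gem-3.1-Pro", "Gem-3.1-FL", "Qwen3.5-35B"]
ALFWORLD_DELTA = {
    "GPT-5.4":      [+1.49, +6.47, +7.46, +4.98, +4.23],
    "GPT-5.4-mini": [+1.00, +4.23, +2.74, +2.24, +3.98],
    "Gem-3.1-Pro":  [+0.50, +0.75, +0.00, -0.75, -1.24],
    "Gem-3.1-FL":   [-2.49, -1.24, +1.49, -2.49, -3.23],
    "Qwen-35B":     [-1.99, -3.48, -0.75, +0.50, -1.00],
    "Qwen-9B":      [-2.49, -2.99, -1.24, -1.99, +0.25],
}

def target_te(delta):
    """TE: for each target (row), the mean Δ over extractors."""
    return {t: sum(row) / len(row) for t, row in delta.items()}

def extractor_ee(delta, extractors):
    """EE: for each extractor (column), the mean Δ over targets."""
    rows = list(delta.values())
    return {e: sum(r[j] for r in rows) / len(rows) for j, e in enumerate(extractors)}

def negative_share(delta):
    """Fraction of (extractor, target) cells with Δ < 0."""
    cells = [d for row in delta.values() for d in row]
    return sum(d < 0 for d in cells) / len(cells)

te = target_te(ALFWORLD_DELTA)
ee = extractor_ee(ALFWORLD_DELTA, EXTRACTORS)
neg = negative_share(ALFWORLD_DELTA)
print(round(te["GPT-5.4"], 2), round(ee["GPT-5.4"], 2), round(neg, 2))
```

The three printed values correspond to the TE entry for GPT-5.4 (+4.93), the EE entry for GPT-5.4 (−0.66), and the 14 of 30 ALFWorld cells with Δ < 0.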
Skills help in 75% of extractor–target pairs, yet 25% suffer negative transfer. The failure rate is domain-dependent: ALFWorld is the most fragile, with 47% of its pairs showing negative transfer.
On SpreadsheetBench, lightweight Gemini-3.1-FL leads on EE; the strongest executor, GPT-5.4, ranks last. Choosing an extractor is a compatibility problem, not a strength contest.
Fix the domain and the extractors, swap only the target: on ALFWorld, GPT-5.4 gains TE = +4.93 while Qwen-9B drops to −1.69. The same skill becomes useful or harmful depending on who consumes it.
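The "compatibility, not strength" point can be checked directly against the SpreadsheetBench block of Table 1. The sketch below ranks the five extractors once by their own no-skill score (Base) and once by EE. The numbers are copied from the table; treating the "Qwen-35B" target row and the "Qwen3.5-35B" extractor column as the same model is an assumption, and the helper name is illustrative.

```python
# Values copied from the SpreadsheetBench block of Table 1.
# BASE: the model's own no-skill score as a target.
# EE:   the model's average Δ across targets when it acts as the extractor.
# Assumption: target "Qwen-35B" and extractor "Qwen3.5-35B" are the same model.
BASE = {"GPT-5.4": 37.17, "GPT-5.4-mini": 29.33, "Gem-3.1-Pro": 35.83,
        "Gem-3.1-FL": 25.00, "Qwen-35B": 23.83}
EE = {"GPT-5.4": 1.67, "GPT-5.4-mini": 3.22, "Gem-3.1-Pro": 4.11,
      "Gem-3.1-FL": 5.86, "Qwen-35B": 3.33}

def ranked(scores):
    """Model names sorted from highest to lowest score."""
    return sorted(scores, key=scores.get, reverse=True)

by_strength = ranked(BASE)  # strongest executor first
by_ee = ranked(EE)          # most effective extractor first

for name in by_ee:
    print(f"{name:13s} EE rank {by_ee.index(name) + 1}, "
          f"Base rank {by_strength.index(name) + 1}")
```

On these numbers the two orderings disagree at both ends: the model with the highest Base score has the lowest EE, and the EE leader is fourth by Base.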
We dissect each lifecycle stage—experience, extraction, consumption—to understand what factors actually govern downstream gains.
Trajectory success ratio is not a knob with a single optimum—the right mix depends on the domain. Successful trajectories anchor the procedure, but well-chosen failures reveal where the agent breaks.
All-failure pools consistently produce the worst skills. Between the two ends, ALFWorld benefits from failure-heavy mixes, while SpreadsheetBench peaks with success-heavy ones.
Performance gain vs. success ratio in the experience pool. Optimal ratio is domain-specific, not universal.
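A success-ratio sweep of this kind needs experience pools whose mix of successful and failed trajectories is controlled. The following is a minimal sketch of one way to build such pools, assuming trajectories are already labeled as success or failure; `build_pool`, its parameters, and the placeholder trajectory IDs are hypothetical and do not describe the study's actual sampling procedure.

```python
import random

def build_pool(successes, failures, success_ratio, size, seed=0):
    """Sample `size` trajectories, round(size * success_ratio) of them successes."""
    if not 0.0 <= success_ratio <= 1.0:
        raise ValueError("success_ratio must be in [0, 1]")
    n_success = round(size * success_ratio)
    n_failure = size - n_success
    if n_success > len(successes) or n_failure > len(failures):
        raise ValueError("not enough trajectories for the requested mix")
    rng = random.Random(seed)  # fixed seed keeps the sweep reproducible
    pool = rng.sample(successes, n_success) + rng.sample(failures, n_failure)
    rng.shuffle(pool)          # avoid ordering successes before failures
    return pool

# Placeholder trajectory IDs; real pools would hold full agent trajectories.
successes = [f"s{i}" for i in range(20)]
failures = [f"f{i}" for i in range(20)]

# One pool per ratio, from all-failure (0.0) to all-success (1.0).
pools = {r: build_pool(successes, failures, r, size=8) for r in (0.0, 0.25, 0.5, 0.75, 1.0)}
```

Each pool would then be passed to the extractor, and the resulting skill scored by its Δ on the target, so that gain can be plotted against success ratio per domain.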
An LLM judge asked to compare two skills picks the better one only 46.4% of the time—worse than chance—and accuracy drops as the utility gap grows.
Format alone is also inert: rewriting the same skill into different surface formats yields statistically indistinguishable downstream gains (paired test, p > 0.34). What actually matters is concrete failure mechanisms with executable remedies—not generic advice.
Pairwise judge accuracy drops as the actual utility gap grows—the skill that reads better often performs worse.
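Pairwise judge accuracy as a function of the utility gap can be computed as sketched below. Each record pairs the measured Δ of two skills with the judge's pick, and records are bucketed by the absolute gap. The record layout, bucket edges, and the toy records are assumptions made up for illustration; they are not data from the study.

```python
from collections import defaultdict

def judge_accuracy_by_gap(records, edges):
    """records: (delta_a, delta_b, judge_pick) with judge_pick in {"a", "b"}.
    edges: ascending upper bounds for |delta_a - delta_b| buckets.
    Returns {bucket_upper_bound: accuracy}; pairs with equal utility are skipped."""
    hits, totals = defaultdict(int), defaultdict(int)
    for delta_a, delta_b, pick in records:
        if delta_a == delta_b:
            continue  # no ground-truth winner for a tie
        gap = abs(delta_a - delta_b)
        bucket = next((e for e in edges if gap <= e), float("inf"))
        truth = "a" if delta_a > delta_b else "b"
        totals[bucket] += 1
        hits[bucket] += (pick == truth)
    return {b: hits[b] / totals[b] for b in sorted(totals)}

# Synthetic toy records, NOT results from the study.
records = [
    (2.0, 1.0, "a"),  # gap 1.0, judge correct
    (1.0, 2.5, "a"),  # gap 1.5, judge wrong
    (6.0, 0.5, "b"),  # gap 5.5, judge wrong
    (0.0, 7.0, "b"),  # gap 7.0, judge correct
    (3.0, 3.0, "a"),  # tie, skipped
]
accuracy = judge_accuracy_by_gap(records, edges=[2.0, 5.0])
print(accuracy)
```

A judge that tracked real utility would show accuracy rising with the gap, so a flat or falling curve in such a breakdown is the signal described above.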
Transfer is highly target-dependent. Skills extracted from strong experience pools transfer robustly to other targets; skills from weak pools cause negative transfer on some.
Mechanistically, skills reshape the target's default policy rather than triggering explicit skill calls—so consumption ability is shaped by what the target can already act on.
Cross-model transfer matrix. Strong-pool skills help every target; weak-pool skills produce mixed or negative results.
Can we turn the analysis findings into a concrete, drop-in improvement to skill extraction?
An automated pipeline contrasts high- vs. low-utility skill pairs and surfaces three validated dimensions: failure-mechanism encoding, actionable specificity, and a high-risk action blacklist—each individually raising better-rate above 64%.
We compile the validated rubric into a compact meta-skill prepended to the extractor's prompt—no pipeline changes, no extra calls—turning diagnostic findings into a measurable downstream gain that is consistent across extractors and domains.
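Mechanically, "prepended to the extractor's prompt" amounts to string composition, as the sketch below shows. The meta-skill wording here is a hypothetical paraphrase of the three validated dimensions, not the text used in the study, and `with_meta_skill` and the example prompt are illustrative names.

```python
# Hypothetical meta-skill wording that paraphrases the three validated
# dimensions; the study's actual meta-skill text is not reproduced here.
META_SKILL = """\
When writing the skill, follow these rules:
1. Encode failure mechanisms: state what goes wrong and why, not only what to do.
2. Be actionable and specific: give concrete steps the agent can execute.
3. Include a blacklist of high-risk actions the agent must avoid.
"""

def with_meta_skill(extractor_prompt, meta_skill=META_SKILL):
    """Prepend the meta-skill to an existing extractor prompt, changing nothing else."""
    return f"{meta_skill.rstrip()}\n\n{extractor_prompt.lstrip()}"

# Illustrative extractor prompt; any existing prompt can be wrapped the same way.
base_prompt = "Read the trajectories below and distill a reusable domain-level skill."
prompt = with_meta_skill(base_prompt)
print(prompt)
```

Because the original prompt is passed through unchanged and only one string is built, the extraction pipeline and the number of model calls stay the same.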
@article{huang2026skilllens,
  title = {From Raw Experience to Skill Consumption: A Systematic Study of Model-Generated Agent Skills},
  author = {Zisu Huang and Jingwen Xu and Yifan Yang and Ziyang Gong and Qihao Yang and Muzhao Tian and Xiaohua Wang and Changze Lv and Xuemei Gao and Qi Dai and Bei Liu and Kai Qiu and Xue Yang and Dongdong Chen and Xiaoqing Zheng and Chong Luo},
  year = {2026},
  journal = {arXiv preprint arXiv:2605.23899},
  eprint = {2605.23899},
  archivePrefix = {arXiv},
  url = {https://arxiv.org/abs/2605.23899}
}