SkillLens

From Raw Experience to Skill Consumption

A Systematic Study of Model-Generated Agent Skills

Zisu Huang1,2,*,†   Jingwen Xu1,*   Yifan Yang2,‡   Ziyang Gong3   Qihao Yang3
Muzhao Tian1   Xiaohua Wang1   Changze Lv1   Xuemei Gao2   Qi Dai2   Bei Liu2   Kai Qiu2
Xue Yang3   Dongdong Chen2   Xiaoqing Zheng1,‡   Chong Luo2

1 Fudan University   2 Microsoft Research   3 Shanghai Jiao Tong University
* Equal contribution   † Work done during an internship at MSRA   ‡ Corresponding authors

Also check out SkillOpt · Controllable Text-Space Optimization for Agent Skills

Why study model-generated skills?

Language agents improve by reusing skills—structured procedural artifacts distilled from past experience.

Among them, domain-level skills—which package a domain's recurring procedures into a single reusable artifact—have become a standard component in commercial agent platforms, but hand-crafting them cannot scale. Model-generated skills are the only path forward, making them the form most likely to shape real agent systems at deployment scale.

This Study
We present a comprehensive study of domain-level, model-generated skills across the full lifecycle, organized around three research questions.
RQ1

Do they work?

Do model-generated, domain-level skills reliably benefit downstream agents across targets, extractors, and domains?

RQ2

What drives utility?

Across the three lifecycle stages, what actually determines a skill's downstream utility?

RQ3

Can we improve it?

Can our empirical findings be turned into a concrete, drop-in improvement to skill extraction itself?

The Skill Lifecycle

We evaluate the full experience-to-skill lifecycle across three stages, systematically varying extractor and target models across five diverse domains.

🎯
Experience Generation
Target agent rolls out tasks to form an experience pool.
⚗️
Skill Extraction
Extractor distills the pool into a reusable domain-level skill.
🚀
Skill Consumption
Skill is loaded back into the target and evaluated.
SkillLens overview figure

Figure 1. Overview of our study design across the three lifecycle stages.
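The three stages can be read as one loop that ends in a single performance delta for one (extractor, target) pair. The sketch below is illustrative only: `run_lifecycle`, its parameters, and the split into `pool_tasks` and `eval_tasks` are assumptions made for the example, not the study's actual interfaces.

```python
from typing import Callable, Optional, Sequence

# Hypothetical types: a trajectory is a plain dict, a skill is plain text.
Trajectory = dict
Skill = str


def run_lifecycle(
    pool_tasks: Sequence[str],
    eval_tasks: Sequence[str],
    rollout: Callable[[str, Optional[Skill]], Trajectory],
    extract: Callable[[Sequence[Trajectory]], Skill],
    score: Callable[[Sequence[Trajectory]], float],
) -> float:
    """Return the score change that one extracted skill gives one target."""
    # Stage 1: the target agent rolls out tasks without a skill.
    pool = [rollout(task, None) for task in pool_tasks]
    # Stage 2: the extractor distills the pool into one domain-level skill.
    skill = extract(pool)
    # Stage 3: the skill is loaded back into the target and evaluated
    # against the same target running without it.
    baseline = score([rollout(task, None) for task in eval_tasks])
    with_skill = score([rollout(task, skill) for task in eval_tasks])
    return with_skill - baseline
```

Passing `rollout`, `extract`, and `score` as callables keeps the target and the extractor independently swappable, which is what varying them across models requires.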

Are Model-Generated Skills Effective?

A large-scale evaluation across 5 domains × 6 targets × 5 extractors. We first introduce the three metrics that ground the study, then summarize three headline findings.

Δ Performance Delta

$\Delta = \text{Score}_{\text{with skill}} - \text{Score}_{\text{baseline}}$

The atomic unit of every cell in the table. Δ > 0 means the skill helped this (extractor, target) pair; Δ < 0 means negative transfer.

Per Extractor
EE Extraction Efficacy

For a fixed extractor, average Δ across all targets — how reliably the extractor produces useful skills.

$\text{EE} = \operatorname{avg}_{\text{target}} \Delta$
Per Target
TE Target Evolvability

For a fixed target, average Δ across extractors that distill skills from its own trajectories — how much the target can improve from skills grounded in its own experience.

$\text{TE} = \operatorname{avg}_{\text{extractor}} \Delta$
Target           Base       GPT-5.4  GPT-5.4-mini   Gem-3.1-Pro    Gem-3.1-FL   Qwen3.5-35B      TE

Embodied: ALFWorld
GPT-5.4         68.66         +1.49         +6.47         +7.46         +4.98         +4.23   +4.93
GPT-5.4-mini    52.24         +1.00         +4.23         +2.74         +2.24         +3.98   +2.84
Gem-3.1-Pro     87.56         +0.50         +0.75         +0.00         −0.75         −1.24   −0.15
Gem-3.1-FL      51.99         −2.49         −1.24         +1.49         −2.49         −3.23   −1.59
Qwen-35B        57.21         −1.99         −3.48         −0.75         +0.50         −1.00   −1.34
Qwen-9B         36.07         −2.49         −2.99         −1.24         −1.99         +0.25   −1.69
EE                            −0.66         +0.62         +1.62         +0.42         +0.50

Productivity: SpreadsheetBench
GPT-5.4         37.17         +4.33         +9.00        +14.00        +14.66         +6.33   +9.66
GPT-5.4-mini    29.33         +0.34         +2.50         +3.67         +4.50         +1.00   +2.40
Gem-3.1-Pro     35.83         −0.50         −2.67         +6.50         +5.33         +5.83   +2.90
Gem-3.1-FL      25.00         +2.67         +1.83         +1.50         +6.17         +7.33   +3.90
Qwen-35B        23.83         +2.00         +5.50         +0.17         +3.34         −3.50   +1.50
Qwen-9B         13.67         +1.16         +3.16         −1.17         +1.16         +3.00   +1.46
EE                            +1.67         +3.22         +4.11         +5.86         +3.33

Coding: SWE-bench-Verified
GPT-5.4         68.40         +4.67         +1.33         +2.00         +4.00         +2.27   +2.85
GPT-5.4-mini    59.73         +3.20         +3.20         +1.73         +3.60         +2.80   +2.91
Gem-3.1-Pro     66.53         +2.00         +2.80         +2.13         +3.47         −1.60   +1.76
Gem-3.1-FL      55.47         +2.67         +3.33         +2.93         +3.47         −0.93   +2.29
Qwen-35B        52.93         +3.20         +2.00         +2.53         +2.93         +2.00   +2.53
Qwen-9B         33.33         −1.07         +2.40         −1.60         +1.20         +0.93   +0.37
EE                            +2.45         +2.51         +1.62         +3.11         +0.91

Web Search: SEAL-0
GPT-5.4         51.24         +6.47         +4.23         +7.71         +1.74         +1.74   +4.38
GPT-5.4-mini    45.27         −1.49         +3.23         −3.98         +3.98         −4.23   −0.50
Gem-3.1-Pro     55.97         −4.23         −1.99         +1.99         +2.49         −3.48   −1.04
Gem-3.1-FL      14.93         +9.45         +8.21         +2.99         −1.24         +7.21   +5.32
Qwen-35B        40.55         +1.74         +6.47         −3.73         +4.73         +2.24   +2.29
Qwen-9B         33.83        +10.70         +8.96         −5.72         +5.97         −2.99   +3.38
EE                            +3.77         +4.85         −0.12         +2.95         +0.08

Tool Calling: BFCL-v4
GPT-5.4         51.68         +3.08         +5.04         +0.42         +5.04         −2.24   +2.27
GPT-5.4-mini    53.50         +3.92         +6.16         +7.56         +5.18         +2.94   +5.15
Gem-3.1-Pro     51.82         +5.32         +5.88         −4.34         +0.14         +6.02   +2.60
Gem-3.1-FL      41.18         +4.06         +4.62         +4.20         +8.12         +3.64   +4.93
Qwen-35B        57.56         −0.70         +0.84         +1.54         +2.80         +1.82   +1.26
Qwen-9B         46.78         +2.94         +2.94         +1.82         −0.98         −1.12   +1.12
EE                            +3.10         +4.25         +1.87         +3.38         +1.84

Table 1. Skill-induced performance gain (Δ) across all five domains. Green: Δ > 0; Red: Δ < 0. Bold: best TE/EE; Underline: second best.
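As a worked check on the metric definitions, the sketch below recomputes TE and EE from the SpreadsheetBench Δ values in Table 1. The function and variable names are illustrative assumptions, not the study's code; only the numbers come from the table.

```python
from statistics import mean

# Extractor order matches the columns of Table 1.
EXTRACTORS = ["GPT-5.4", "GPT-5.4-mini", "Gem-3.1-Pro", "Gem-3.1-FL", "Qwen3.5-35B"]

# SpreadsheetBench deltas from Table 1: one row per target,
# one entry per extractor (same order as EXTRACTORS).
SPREADSHEET_DELTA = {
    "GPT-5.4":      [4.33, 9.00, 14.00, 14.66, 6.33],
    "GPT-5.4-mini": [0.34, 2.50, 3.67, 4.50, 1.00],
    "Gem-3.1-Pro":  [-0.50, -2.67, 6.50, 5.33, 5.83],
    "Gem-3.1-FL":   [2.67, 1.83, 1.50, 6.17, 7.33],
    "Qwen-35B":     [2.00, 5.50, 0.17, 3.34, -3.50],
    "Qwen-9B":      [1.16, 3.16, -1.17, 1.16, 3.00],
}


def performance_delta(score_with_skill: float, score_baseline: float) -> float:
    """Delta for one (extractor, target) cell."""
    return score_with_skill - score_baseline


def target_evolvability(deltas: dict, target: str) -> float:
    """TE: average delta over extractors for a fixed target (row mean)."""
    return mean(deltas[target])


def extraction_efficacy(deltas: dict, extractor: str) -> float:
    """EE: average delta over targets for a fixed extractor (column mean)."""
    col = EXTRACTORS.index(extractor)
    return mean(row[col] for row in deltas.values())


# Rank extractors by EE, best first.
ranked = sorted(
    EXTRACTORS,
    key=lambda e: extraction_efficacy(SPREADSHEET_DELTA, e),
    reverse=True,
)
```

Rounded to two decimals, the row and column means reproduce the TE and EE entries of the SpreadsheetBench block, with `ranked[0]` being Gem-3.1-FL and `ranked[-1]` being GPT-5.4.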

Finding 01
75%/25%

Beneficial on average, but not guaranteed

Skills help in 75% of extractor–target pairs, yet 25% suffer negative transfer. The risk is domain-dependent: ALFWorld is the most fragile, with 47% of its pairs negative.
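The ALFWorld figure can be recovered by counting negative cells in that block of Table 1. This is a minimal sketch with an illustrative function name; a cell with Δ = 0.00 is counted as neither helped nor harmed.

```python
# ALFWorld deltas from Table 1: one row per target, one column per extractor.
ALFWORLD_DELTA = [
    [1.49, 6.47, 7.46, 4.98, 4.23],       # GPT-5.4
    [1.00, 4.23, 2.74, 2.24, 3.98],       # GPT-5.4-mini
    [0.50, 0.75, 0.00, -0.75, -1.24],     # Gem-3.1-Pro
    [-2.49, -1.24, 1.49, -2.49, -3.23],   # Gem-3.1-FL
    [-1.99, -3.48, -0.75, 0.50, -1.00],   # Qwen-35B
    [-2.49, -2.99, -1.24, -1.99, 0.25],   # Qwen-9B
]


def negative_transfer_rate(rows: list) -> float:
    """Fraction of (extractor, target) cells with a strictly negative delta."""
    cells = [delta for row in rows for delta in row]
    return sum(delta < 0 for delta in cells) / len(cells)


rate = negative_transfer_rate(ALFWORLD_DELTA)
print(f"{rate:.0%}")  # 14 of 30 cells are negative, printed as 47%
```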

Finding 02
Better Executor ≠ Better Extractor

Skill extraction is a distinct capability

On SpreadsheetBench, lightweight Gemini-3.1-FL leads on EE; the strongest executor, GPT-5.4, ranks last. Choosing an extractor is a compatibility problem, not a strength contest.

Finding 03
+4.93 vs −1.69

Skill utility is target-dependent

Fix the domain and the extractors, swap only the target: on ALFWorld, GPT-5.4 gains TE = +4.93 while Qwen-9B drops to −1.69. The same skill becomes useful or harmful depending on who consumes it.

What Drives Skill Utility?

We dissect each lifecycle stage—experience, extraction, consumption—to understand what factors actually govern downstream gains.

Stage 01 Raw Experience

The success–failure ratio of experience shapes skill quality.

Trajectory success ratio is not a knob with a single optimum—the right mix depends on the domain. Successful trajectories anchor the procedure, but well-chosen failures reveal where the agent breaks.

All-failure pools consistently produce the worst skills. Between the two ends, ALFWorld benefits from failure-heavy mixes, while SpreadsheetBench peaks with success-heavy ones.

success-heavy · SpreadsheetBench
failure-heavy · ALFWorld
Figure · Raw Experience
Experience ratio analysis

Performance gain vs. success ratio in the experience pool. Optimal ratio is domain-specific, not universal.
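One way to run this kind of ablation is to sample experience pools at a fixed size while varying only the share of successful trajectories. The sketch below assumes a hypothetical `Trajectory` record and `build_pool` helper; the study's actual sampling procedure is not described here.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Trajectory:
    """Hypothetical record for one rollout."""
    task_id: str
    success: bool


def build_pool(trajectories, success_ratio, size, seed=0):
    """Sample `size` trajectories with the requested share of successes."""
    if not 0.0 <= success_ratio <= 1.0:
        raise ValueError("success_ratio must lie in [0, 1]")
    rng = random.Random(seed)  # fixed seed keeps pools reproducible
    successes = [t for t in trajectories if t.success]
    failures = [t for t in trajectories if not t.success]
    n_success = round(size * success_ratio)
    n_failure = size - n_success
    if n_success > len(successes) or n_failure > len(failures):
        raise ValueError("not enough trajectories for the requested mix")
    # Sample each group without replacement, then mix the order.
    pool = rng.sample(successes, n_success) + rng.sample(failures, n_failure)
    rng.shuffle(pool)
    return pool
```

Holding `size` constant matters: otherwise a change in downstream gain could come from pool size rather than from the success-failure mix.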

Stage 02 Skill Extraction

Surface plausibility does not predict utility.

An LLM judge asked to compare two skills picks the better one only 46.4% of the time—worse than chance—and accuracy drops as the utility gap grows.

Format alone is also inert: rewriting the same skill into different surface formats yields statistically indistinguishable downstream gains (paired test, p > 0.34). What actually matters is concrete failure mechanisms with executable remedies—not generic advice.

46.4% judge accuracy
p > 0.34 · format change, no effect
Figure · Skill Extraction
Pairwise judge accuracy

Pairwise judge accuracy drops as the actual utility gap grows—the skill that reads better often performs worse.
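Pairwise judge accuracy can be scored against measured utility as follows. This is a sketch under stated assumptions: the `Comparison` record, the `judge_accuracy` function, and the gap-bucketing arguments are hypothetical names, and the judge's pick is taken as given rather than produced by any particular model call.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Comparison:
    """One judged pair of skills (hypothetical record)."""
    delta_a: float    # measured downstream delta of skill A
    delta_b: float    # measured downstream delta of skill B
    judge_pick: str   # "A" or "B", the skill the judge preferred


def judge_accuracy(comparisons, min_gap=0.0, max_gap=float("inf")):
    """Share of pairs where the judge picked the skill with the higher delta.

    Only pairs whose utility gap lies in (min_gap, max_gap] are scored,
    so accuracy can be reported per gap bucket. Ties are always skipped.
    """
    scored = [
        c for c in comparisons
        if min_gap < abs(c.delta_a - c.delta_b) <= max_gap
    ]
    if not scored:
        raise ValueError("no comparisons fall in the requested gap range")
    correct = sum(
        c.judge_pick == ("A" if c.delta_a > c.delta_b else "B")
        for c in scored
    )
    return correct / len(scored)
```

Calling `judge_accuracy` once per gap bucket gives the accuracy-versus-gap curve; a judge that tracked utility would score higher on large-gap buckets, not lower.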

Stage 03 Skill Consumption

The same skill, different consumers, opposite outcomes.

Transfer is highly target-dependent. Skills extracted from strong experience pools transfer robustly to other targets; skills from weak pools cause negative transfer on some.

Mechanistically, skills reshape the target's default policy rather than triggering explicit skill calls—so consumption ability is shaped by what the target can already act on.

target-dependent · transfer pattern
reshapes policy · mechanism of effect
Figure · Skill Consumption
Cross-model skill transfer

Cross-model transfer matrix. Strong-pool skills help every target; weak-pool skills produce mixed or negative results.

Meta-Skill Guided Extraction

Can we turn the analysis findings into a concrete, drop-in improvement to skill extraction?

Step 01 Diagnose

Three textual dimensions predict utility.

An automated pipeline contrasts high- vs. low-utility skill pairs and surfaces three validated dimensions: failure-mechanism encoding, actionable specificity, and a high-risk action blacklist—each individually raising better-rate above 64%.

46.4% → 73.8% pairwise-judge accuracy at picking the truly better skill
Step 02 Intervene

The rubric becomes a drop-in meta-skill.

We compile the validated rubric into a compact meta-skill prepended to the extractor's prompt—no pipeline changes, no extra calls—turning diagnostic findings into a measurable downstream gain that is consistent across extractors and domains.

+1.55 pp average gain over baseline
9 / 9 (domain × target) cells improved
Figure · Meta-Skill Effect
Meta-skill consistently lifts skill utility across domains and targets
Meta-skill slope improvement
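Because the meta-skill is simply prepended to the extractor's prompt, the intervention can be as small as a string concatenation. The sketch below is illustrative: the wording of `META_SKILL` is invented for the example (only its three dimensions come from the text above), and `build_extractor_prompt` is a hypothetical helper.

```python
# Hypothetical wording; only the three dimensions are taken from the study.
META_SKILL = """\
When writing the skill:
1. Failure-mechanism encoding: state how and why the agent fails.
2. Actionable specificity: give concrete, executable remedies.
3. High-risk action blacklist: list actions the agent must avoid.
"""


def build_extractor_prompt(base_prompt: str, meta_skill: str = "") -> str:
    """Prepend the meta-skill to the extractor prompt, if one is given."""
    if not meta_skill.strip():
        # No meta-skill: the baseline extractor prompt is used unchanged.
        return base_prompt
    return f"{meta_skill.strip()}\n\n{base_prompt}"


# Usage: the rest of the extraction pipeline stays the same.
prompt = build_extractor_prompt(
    "Distill the trajectories below into one domain-level skill.",
    META_SKILL,
)
```

Leaving the baseline prompt untouched when no meta-skill is supplied makes the with and without conditions directly comparable.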

Citation

@article{huang2026skilllens,
  title         = {From Raw Experience to Skill Consumption: A Systematic Study of Model-Generated Agent Skills},
  author        = {Zisu Huang and Jingwen Xu and Yifan Yang and Ziyang Gong and Qihao Yang and Muzhao Tian and Xiaohua Wang and Changze Lv and Xuemei Gao and Qi Dai and Bei Liu and Kai Qiu and Xue Yang and Dongdong Chen and Xiaoqing Zheng and Chong Luo},
  year          = {2026},
  journal       = {arXiv preprint arXiv:2605.23899},
  eprint        = {2605.23899},
  archivePrefix = {arXiv},
  url           = {https://arxiv.org/abs/2605.23899}
}