SkillLens — A Systematic Study of Model-Generated Agent Skills

Motivation

Why study model-generated skills?

Language agents improve by reusing skills—structured procedural artifacts distilled from past experience.

Among them, domain-level skills—which package a domain's recurring procedures into a single reusable artifact—have become a standard component in commercial agent platforms, but hand-crafting them cannot scale. Model-generated skills are the only path forward, making them the form most likely to shape real agent systems at deployment scale.

This Study

We present a comprehensive study of domain-level, model-generated skills across the full lifecycle, organized around three research questions.

RQ1

Do they work?

Do model-generated, domain-level skills reliably benefit downstream agents across targets, extractors, and domains?

RQ2

What drives utility?

Across the three lifecycle stages, what actually determines a skill's downstream utility?

RQ3

Can we improve it?

Can our empirical findings be turned into a concrete, drop-in improvement to skill extraction itself?

RQ1 · Main Results

Are Model-Generated Skills Effective?

A large-scale evaluation across 5 domains × 6 targets × 5 extractors. We first introduce the three metrics that ground the study, then summarize three headline findings.

Δ Performance Delta

Δ = Score_{with skill} − Score_{baseline}

The atomic unit of every cell in the table. Δ > 0 means the skill helped this (extractor, target) pair; Δ < 0 means negative transfer.

Per Extractor

EE Extraction Efficacy

EE = avg_{target} Δ

For a fixed extractor, the average Δ across all targets: how reliably the extractor produces useful skills.

Per Target

TE Target Evolvability

TE = avg_{extractor} Δ

For a fixed target, the average Δ across the extractors that distill skills from its own trajectories: how much the target can improve from skills grounded in its own experience.

Target	Base	GPT-5.4	GPT-5.4-mini	Gem-3.1-Pro	Gem-3.1-FL	Qwen3.5-35B	TE
Embodied: ALFWorld
GPT-5.4	68.66	+1.49	+6.47	+7.46	+4.98	+4.23	+4.93
GPT-5.4-mini	52.24	+1.00	+4.23	+2.74	+2.24	+3.98	+2.84
Gem-3.1-Pro	87.56	+0.50	+0.75	+0.00	−0.75	−1.24	−0.15
Gem-3.1-FL	51.99	−2.49	−1.24	+1.49	−2.49	−3.23	−1.59
Qwen-35B	57.21	−1.99	−3.48	−0.75	+0.50	−1.00	−1.34
Qwen-9B	36.07	−2.49	−2.99	−1.24	−1.99	+0.25	−1.69
EE		−0.66	+0.62	+1.62	+0.42	+0.50
Productivity: SpreadsheetBench
GPT-5.4	37.17	+4.33	+9.00	+14.00	+14.66	+6.33	+9.66
GPT-5.4-mini	29.33	+0.34	+2.50	+3.67	+4.50	+1.00	+2.40
Gem-3.1-Pro	35.83	−0.50	−2.67	+6.50	+5.33	+5.83	+2.90
Gem-3.1-FL	25.00	+2.67	+1.83	+1.50	+6.17	+7.33	+3.90
Qwen-35B	23.83	+2.00	+5.50	+0.17	+3.34	−3.50	+1.50
Qwen-9B	13.67	+1.16	+3.16	−1.17	+1.16	+3.00	+1.46
EE		+1.67	+3.22	+4.11	+5.86	+3.33
Coding: SWE-bench-Verified
GPT-5.4	68.40	+4.67	+1.33	+2.00	+4.00	+2.27	+2.85
GPT-5.4-mini	59.73	+3.20	+3.20	+1.73	+3.60	+2.80	+2.91
Gem-3.1-Pro	66.53	+2.00	+2.80	+2.13	+3.47	−1.60	+1.76
Gem-3.1-FL	55.47	+2.67	+3.33	+2.93	+3.47	−0.93	+2.29
Qwen-35B	52.93	+3.20	+2.00	+2.53	+2.93	+2.00	+2.53
Qwen-9B	33.33	−1.07	+2.40	−1.60	+1.20	+0.93	+0.37
EE		+2.45	+2.51	+1.62	+3.11	+0.91
Web Search: SEAL-0
GPT-5.4	51.24	+6.47	+4.23	+7.71	+1.74	+1.74	+4.38
GPT-5.4-mini	45.27	−1.49	+3.23	−3.98	+3.98	−4.23	−0.50
Gem-3.1-Pro	55.97	−4.23	−1.99	+1.99	+2.49	−3.48	−1.04
Gem-3.1-FL	14.93	+9.45	+8.21	+2.99	−1.24	+7.21	+5.32
Qwen-35B	40.55	+1.74	+6.47	−3.73	+4.73	+2.24	+2.29
Qwen-9B	33.83	+10.70	+8.96	−5.72	+5.97	−2.99	+3.38
EE		+3.77	+4.85	−0.12	+2.95	+0.08
Tool Calling: BFCL-v4
GPT-5.4	51.68	+3.08	+5.04	+0.42	+5.04	−2.24	+2.27
GPT-5.4-mini	53.50	+3.92	+6.16	+7.56	+5.18	+2.94	+5.15
Gem-3.1-Pro	51.82	+5.32	+5.88	−4.34	+0.14	+6.02	+2.60
Gem-3.1-FL	41.18	+4.06	+4.62	+4.20	+8.12	+3.64	+4.93
Qwen-35B	57.56	−0.70	+0.84	+1.54	+2.80	+1.82	+1.26
Qwen-9B	46.78	+2.94	+2.94	+1.82	−0.98	−1.12	+1.12
EE		+3.10	+4.25	+1.87	+3.38	+1.84

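Table 1. Skill-induced performance gain (Δ) across all five domains. Rows are targets and the five model columns are extractors. Base is the target's score without a skill, TE is the row average, and each EE row is the column average within its domain.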
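As a check on the metric definitions, the EE row and TE column of the SpreadsheetBench block can be recomputed from its Δ cells. The sketch below is illustrative only: the names `DELTA`, `extraction_efficacy`, and `target_evolvability` are hypothetical and not taken from the study's code; the numbers are copied from Table 1.

```python
from statistics import mean

EXTRACTORS = ["GPT-5.4", "GPT-5.4-mini", "Gem-3.1-Pro", "Gem-3.1-FL", "Qwen3.5-35B"]

# Δ cells of the SpreadsheetBench block: one row per target, one column per extractor.
DELTA = {
    "GPT-5.4":      [4.33, 9.00, 14.00, 14.66, 6.33],
    "GPT-5.4-mini": [0.34, 2.50, 3.67, 4.50, 1.00],
    "Gem-3.1-Pro":  [-0.50, -2.67, 6.50, 5.33, 5.83],
    "Gem-3.1-FL":   [2.67, 1.83, 1.50, 6.17, 7.33],
    "Qwen-35B":     [2.00, 5.50, 0.17, 3.34, -3.50],
    "Qwen-9B":      [1.16, 3.16, -1.17, 1.16, 3.00],
}


def target_evolvability(delta):
    """TE: for each target (row), the mean Δ over extractors."""
    return {target: mean(row) for target, row in delta.items()}


def extraction_efficacy(delta, extractors):
    """EE: for each extractor (column), the mean Δ over targets."""
    columns = zip(*delta.values())  # transpose rows into columns
    return {name: mean(col) for name, col in zip(extractors, columns)}


te = target_evolvability(DELTA)
ee = extraction_efficacy(DELTA, EXTRACTORS)

for name in EXTRACTORS:
    print(f"EE[{name}] = {ee[name]:+.2f}")
for target in DELTA:
    print(f"TE[{target}] = {te[target]:+.2f}")
```

Rounded to two decimals, the printed values agree with the EE row and TE column reported for SpreadsheetBench.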

Three Headline Findings

Finding 01

75% / 25%

Beneficial on average, but not guaranteed

Skills help in 75% of extractor–target pairs, yet 25% suffer negative transfer. The effect is domain-dependent: ALFWorld is the most fragile, with 47% of its pairs negative.

Finding 02

Better Executor ≠ Better Extractor

Skill extraction is a distinct capability

On SpreadsheetBench, the lightweight Gemini-3.1-FL leads on EE (+5.86), while the strongest executor, GPT-5.4, ranks last (+1.67). Choosing an extractor is a compatibility problem, not a strength contest.

Finding 03

+4.93 vs −1.69

Skill utility is target-dependent

Fix the domain and the extractors, swap only the target: on ALFWorld, GPT-5.4 gains TE = +4.93 while Qwen-9B drops to −1.69. The same skill becomes useful or harmful depending on who consumes it.
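The ALFWorld figures quoted in Findings 01 and 03 follow directly from the ALFWorld block of Table 1. A minimal sketch, with hypothetical variable names, that counts the negative cells and averages two target rows:

```python
from statistics import mean

# Δ cells of the ALFWorld block of Table 1: one row per target, one column per extractor.
ALFWORLD = {
    "GPT-5.4":      [1.49, 6.47, 7.46, 4.98, 4.23],
    "GPT-5.4-mini": [1.00, 4.23, 2.74, 2.24, 3.98],
    "Gem-3.1-Pro":  [0.50, 0.75, 0.00, -0.75, -1.24],
    "Gem-3.1-FL":   [-2.49, -1.24, 1.49, -2.49, -3.23],
    "Qwen-35B":     [-1.99, -3.48, -0.75, 0.50, -1.00],
    "Qwen-9B":      [-2.49, -2.99, -1.24, -1.99, 0.25],
}

# Finding 01: share of (extractor, target) cells with Δ < 0.
# The single +0.00 cell is not counted as negative.
cells = [d for row in ALFWORLD.values() for d in row]
negative = sum(d < 0 for d in cells)
negative_share = negative / len(cells)

# Finding 03: TE is the row mean for a fixed target.
te_gpt = mean(ALFWORLD["GPT-5.4"])
te_qwen9b = mean(ALFWORLD["Qwen-9B"])

print(f"negative transfer: {negative}/{len(cells)} = {negative_share:.0%}")
# negative transfer: 14/30 = 47%
print(f"TE GPT-5.4 = {te_gpt:+.2f}, TE Qwen-9B = {te_qwen9b:+.2f}")
# TE GPT-5.4 = +4.93, TE Qwen-9B = -1.69
```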

RQ2 · In-Depth Analysis

What Drives Skill Utility?

We dissect each lifecycle stage—experience, extraction, consumption—to understand what factors actually govern downstream gains.

Stage 01 Raw Experience

The success–failure ratio of experience shapes skill quality.

Trajectory success ratio is not a knob with a single optimum—the right mix depends on the domain. Successful trajectories anchor the procedure, but well-chosen failures reveal where the agent breaks.

All-failure pools consistently produce the worst skills. Between the two ends, ALFWorld benefits from failure-heavy mixes, while SpreadsheetBench peaks with success-heavy ones.

success-heavy: SpreadsheetBench

failure-heavy: ALFWorld

Figure · Raw Experience

Performance gain vs. success ratio in the experience pool. Optimal ratio is domain-specific, not universal.
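One way to run the kind of success-ratio sweep described above is to assemble experience pools with a controlled mix of successful and failed trajectories. The sketch below is an assumption about how such pools could be built, not the study's procedure; `build_experience_pool`, its parameters, the pool size of 8, and the trajectory IDs are all hypothetical.

```python
import random


def build_experience_pool(successes, failures, success_ratio, pool_size, seed=0):
    """Sample a pool of `pool_size` trajectories with the requested success ratio.

    Hypothetical helper: trajectories are opaque items (here, string IDs).
    """
    if not 0.0 <= success_ratio <= 1.0:
        raise ValueError("success_ratio must be in [0, 1]")
    n_success = round(success_ratio * pool_size)
    n_failure = pool_size - n_success
    if n_success > len(successes) or n_failure > len(failures):
        raise ValueError("not enough trajectories for the requested mix")
    rng = random.Random(seed)  # seeded so a sweep is reproducible
    pool = rng.sample(successes, n_success) + rng.sample(failures, n_failure)
    rng.shuffle(pool)  # avoid presenting all successes before all failures
    return pool


# Sweep from an all-failure pool to an all-success pool.
successes = [f"success-{i}" for i in range(20)]
failures = [f"failure-{i}" for i in range(20)]
for ratio in (0.0, 0.25, 0.5, 0.75, 1.0):
    pool = build_experience_pool(successes, failures, ratio, pool_size=8)
    n_ok = sum(t.startswith("success") for t in pool)
    print(f"ratio={ratio:.2f}: {n_ok} successes, {len(pool) - n_ok} failures")
```

Each pool would then be handed to the extractor, and the resulting skill's Δ compared across ratios per domain.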

Stage 02 Skill Extraction

Surface plausibility does not predict utility.

An LLM judge asked to compare two skills picks the better one only 46.4% of the time—worse than chance—and accuracy drops as the utility gap grows.

Format alone is also inert: rewriting the same skill into different surface formats yields statistically indistinguishable downstream gains (paired test, p > 0.34). What actually matters is concrete failure mechanisms with executable remedies—not generic advice.

46.4% judge accuracy

p > 0.34: format change, no effect

Figure · Skill Extraction

Pairwise judge accuracy drops as the actual utility gap grows—the skill that reads better often performs worse.
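Pairwise judge accuracy, and its breakdown by utility gap, can be computed as in the following sketch. The record layout, the function names, the 1.0-point gap threshold, and the six example records are all invented for illustration; they are not data from the study.

```python
def is_correct(delta_a, delta_b, judge_pick):
    """True if the judge picked the skill with the higher measured Δ."""
    truth = "a" if delta_a > delta_b else "b"
    return judge_pick == truth


def judge_accuracy(records):
    """Share of non-tied pairs where the judge's pick matches the measured winner."""
    scored = [r for r in records if r[0] != r[1]]  # ties carry no ground truth
    if not scored:
        raise ValueError("no non-tied pairs to score")
    return sum(is_correct(*r) for r in scored) / len(scored)


def accuracy_by_gap(records, threshold):
    """Accuracy on small-gap pairs (|Δa − Δb| < threshold) and on large-gap pairs."""
    small = [r for r in records if abs(r[0] - r[1]) < threshold]
    large = [r for r in records if abs(r[0] - r[1]) >= threshold]
    return judge_accuracy(small), judge_accuracy(large)


# Hypothetical records: (Δ of skill A, Δ of skill B, judge's pick).
records = [
    (2.0, 1.5, "a"),
    (0.5, 1.0, "b"),
    (1.0, 1.4, "a"),
    (6.0, 1.0, "b"),
    (-3.0, 2.0, "a"),
    (4.0, -1.0, "a"),
]

overall = judge_accuracy(records)
small_gap, large_gap = accuracy_by_gap(records, threshold=1.0)
print(f"overall={overall:.2f} small-gap={small_gap:.2f} large-gap={large_gap:.2f}")
```

In this toy data the judge is right on two of three small-gap pairs but only one of three large-gap pairs, mirroring the direction of the reported trend without reproducing its numbers.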

Stage 03 Skill Consumption

The same skill, different consumers, opposite outcomes.

Transfer is highly target-dependent. Skills extracted from strong experience pools transfer robustly to other targets; skills from weak pools cause negative transfer on some targets.

Mechanistically, skills reshape the target's default policy rather than triggering explicit skill calls—so consumption ability is shaped by what the target can already act on.

target-dependent: transfer pattern

reshapes policy: mechanism of effect

Figure · Skill Consumption

Cross-model transfer matrix. Strong-pool skills help every target; weak-pool skills produce mixed or negative results.

RQ3 · From Diagnosis to Intervention

Meta-Skill Guided Extraction

Can we turn the analysis findings into a concrete, drop-in improvement to skill extraction?

Step 01 Diagnose

Three textual dimensions predict utility.

An automated pipeline contrasts high- vs. low-utility skill pairs and surfaces three validated dimensions: failure-mechanism encoding, actionable specificity, and a high-risk action blacklist—each individually raising better-rate above 64%.

46.4% → 73.8% pairwise-judge accuracy at picking the truly better skill

Step 02 Intervene

The rubric becomes a drop-in meta-skill.

We compile the validated rubric into a compact meta-skill prepended to the extractor's prompt—no pipeline changes, no extra calls—turning diagnostic findings into a measurable downstream gain that is consistent across extractors and domains.

+1.55 pp average gain over baseline

9 / 9 (domain × target) cells improved

Figure · Meta-Skill Effect

Meta-skill consistently lifts skill utility across domains and targets.
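Because the meta-skill is only prepended to the extractor's prompt, the intervention amounts to string composition. A minimal sketch under that assumption follows; the rubric wording in `META_SKILL`, the function name `with_meta_skill`, and the example base prompt are hypothetical placeholders, and only the three dimension names come from the text above.

```python
# Hypothetical wording; only the three dimension names are taken from the study.
META_SKILL = """\
When writing the skill, follow this rubric:
1. Failure-mechanism encoding: state how and why past attempts failed.
2. Actionable specificity: give concrete, executable remedies, not generic advice.
3. High-risk action blacklist: list actions the agent should avoid.
"""


def with_meta_skill(extractor_prompt, meta_skill=META_SKILL):
    """Prepend the meta-skill to an existing extractor prompt.

    The original prompt text is passed through untouched, so no other part
    of the extraction pipeline needs to change.
    """
    return meta_skill.rstrip() + "\n\n" + extractor_prompt


base_prompt = "Distill a reusable domain-level skill from the trajectories below."
print(with_meta_skill(base_prompt))
```

The composed prompt is sent to the extractor in a single call, exactly as the base prompt would have been.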

From Raw Experience to Skill Consumption