SkillOpt’s core insight: optimizing natural-language skill documents follows the same structure as training neural networks.
┌─────────────────────────────────────────────────────────┐
│ Training Loop │
│ │
│ for epoch in epochs: │
│ for step in steps: │
│ 1. Rollout — Target executes tasks │
│ 2. Reflect — Optimizer analyzes trajectories │
│ 3. Aggregate — Hierarchical merge of patches │
│ 4. Select — Rank & clip edits (learning rate) │
│ 5. Update — Apply patches to skill doc │
│ 6. Gate — Validate & accept/reject │
│ │
│ Epoch Boundary: │
│ • Slow Update (longitudinal comparison & guidance) │
│ • Meta Skill (cross-epoch strategy memory) │
└─────────────────────────────────────────────────────────┘
The target model executes tasks using the current skill document as its prompt. Each task produces a trajectory and a score.
# Analogy: forward pass through the network
predictions = model(input, skill_document)
scores = evaluate(predictions, ground_truth)
The optimizer model analyzes failed trajectories and produces edit patches — structured suggestions for improving the skill document.
Two modes:
# Analogy: computing gradients
gradients = loss.backward() # → edit patches
Semantically similar edit patches are merged to avoid redundant edits.
Edits are ranked by relevance score. The learning_rate parameter caps how many edits are applied per step — just like gradient clipping prevents overshooting.
# Analogy: gradient clipping + optimizer step size
selected = top_k(edits, k=learning_rate)
The lr_scheduler adjusts this over training:
Selected edits are applied to the skill document, producing a new version.
The updated skill is evaluated on a selection split (analogous to a validation set). The update is only accepted if performance improves.
At the end of each epoch (starting from epoch 2), the system performs a longitudinal comparison: it rolls out both the previous epoch’s skill and the current skill on the same samples, categorizes items as improved/regressed/persistent_fail/stable_success, then generates high-level guidance that is injected into the skill document. This prevents catastrophic forgetting of earlier improvements.
A meta-skill memory accumulates high-level strategy notes across the entire training run. At the end of each epoch, the optimizer reflects on what changed between epochs and produces a compact memory that is provided as additional context during future reflection steps.