SkillOpt Documentation & Reproduction Guide
Microsoft Research

Train agent skills like you train neural networks — with epochs, (mini-)batch size, learning rates, and validation gates — but without touching any model weights.

This guide walks you from a clean checkout to a reproduced result and a full reference for every configuration knob and core function. It is generated from, and kept consistent with, the current state of the codebase.

1.1 What is SkillOpt

SkillOpt is a text-space optimizer that improves a frozen language agent by iteratively editing a natural-language skill document — never the model weights. The skill document is a Markdown file that conditions a target model as it executes tasks. SkillOpt treats this document as the "weights" and runs a training loop that mirrors deep-learning training: rollout (forward pass), reflect (backward pass / gradients), select & apply edits (optimizer step), and a validation gate (accept/reject).

Two roles split every model call:

  • Target — executes tasks using the current skill document (the agent being improved).
  • Optimizer — analyzes the target's trajectories and proposes edits to the skill document.

The same loop drives six benchmarks out of the box (QA, document QA, embodied agents, math, spreadsheet code generation, and tool-augmented QA).

1.2 Deep-Learning ↔ SkillOpt Analogy

Every concept below maps to a concrete code construct, so deep-learning intuitions transfer directly to hyperparameter tuning.

| Deep learning | SkillOpt | Where it lives |
| --- | --- | --- |
| Model weights | Skill document (Markdown) | skillopt/optimizer/skill.py |
| Forward pass | Rollout: target runs tasks | envs/<bench>/rollout.py |
| Loss / score | Task evaluator | envs/<bench>/evaluator.py |
| Backprop / gradients | Reflect → edit patches | gradient/reflect.py |
| Gradient aggregation | Hierarchical patch merge | gradient/aggregate.py |
| Gradient clipping | Rank & select top-k edits | optimizer/clip.py |
| Learning rate | optimizer.learning_rate (edits/step) | optimizer/scheduler.py |
| LR scheduler | lr_scheduler (cosine/linear/…) | optimizer/scheduler.py |
| Optimizer step | Apply patches to the document | optimizer/skill.py |
| Validation set | Selection split (valid_seen) | evaluation/gate.py |
| Early stopping / accept | Validation gate | evaluation/gate.py |
| Momentum | Slow update (epoch boundary) | optimizer/slow_update.py |
| Meta-learning | Meta skill (cross-epoch memory) | optimizer/meta_skill.py |
| Batch / minibatch | batch_size / minibatch_size | engine/trainer.py |
| Epoch | Epoch (+ slow update & meta skill) | engine/trainer.py |

What transfers from DL

Cosine schedule tends to beat constant; moderate learning rates (≈4–16 edits/step) beat very high/low; slow update curbs cross-epoch forgetting; meta-skill memory improves reflection quality. Conversely, bigger rollout batches and many epochs show diminishing returns — skills converge in ~2–4 epochs.

1.3 Key Features

Validation gating

Every candidate skill is scored on a held-out selection split. It is accepted only if it beats the current skill, and it becomes the new best only if it also beats the best skill so far.

Slow update

At each epoch boundary, a longitudinal comparison writes guidance into a protected region of the skill, acting as momentum against forgetting. The guidance is either force-injected or gated on the selection split.

Meta skill

Optimizer-side memory that reflects on what worked across epochs and feeds back into reflection.

Pluggable backends

OpenAI / Azure OpenAI, Anthropic Claude, local Qwen (vLLM), plus Codex/Claude-Code exec backends for the target.

Six benchmarks

SearchQA, DocVQA, ALFWorld, LiveMathematicianBench, SpreadsheetBench, OfficeQA — each a self-contained env module.

Auto-resume

Every run is checkpointed step-by-step; re-running the same command continues from the last completed step.

1.4 Repository Layout

# top level
configs/            # YAML configs (_base_ + per-benchmark)
scripts/            # train.py, eval_only.py CLIs
ckpt/               # packaged reference skills (e.g. gpt5.5_skill.md)
docs/               # this guide + mkdocs sources
skillopt/           # the package
 ├─ config.py        # YAML loading, _base_ inheritance, flatten
 ├─ engine/trainer.py # the training loop (ReflACTTrainer)
 ├─ gradient/        # reflect.py (analyst), aggregate.py (merge)
 ├─ optimizer/       # skill edits, scheduler, clip, slow_update, meta_skill
 ├─ evaluation/      # gate.py (accept/reject logic)
 ├─ model/           # backend clients + routing
 └─ envs/<benchmark>/ # adapter, dataloader, rollout, evaluator, reflect

2.1 Requirements

  • Python ≥ 3.10
  • Credentials for at least one model backend (Azure OpenAI, OpenAI-compatible, Anthropic, or a local Qwen server)
  • Benchmark datasets are not bundled — prepare your own splits (see §4)

2.2 Install the Package

Option A — from PyPI:

pip install skillopt

# Optional extras:
# (quoted so that shells such as zsh do not expand the brackets)
pip install "skillopt[alfworld]"   # ALFWorld benchmark
pip install "skillopt[webui]"      # Gradio monitoring dashboard
pip install "skillopt[claude]"     # Claude model backend

Option B — from source (for development):

git clone https://github.com/microsoft/SkillOpt.git
cd SkillOpt
pip install -e .

# Optional extras (install only what you need):
pip install -e ".[alfworld]"   # ALFWorld benchmark
pip install -e ".[claude]"     # Anthropic Claude backend
pip install -e ".[qwen]"       # local Qwen backend
pip install -e ".[webui]"      # monitoring dashboard

# ALFWorld also needs its data assets:
alfworld-download

2.3 Configure Credentials

Copy the template and fill in whichever backend you will use:

cp .env.example .env
# edit .env, then:
set -a; source .env; set +a

One env-var family for all OpenAI modes

SkillOpt reuses the AZURE_OPENAI_* variable names even for plain OpenAI — there is no separate OPENAI_API_KEY knob. AZURE_OPENAI_ENDPOINT is required for every OpenAI auth mode.

Azure OpenAI (default)

export AZURE_OPENAI_ENDPOINT="https://your-resource.openai.azure.com/"
export AZURE_OPENAI_API_VERSION="2024-12-01-preview"
# Auth option 1 — API key:
export AZURE_OPENAI_API_KEY="your-key"
# Auth option 2 — Azure CLI (no key; recommended on Azure VMs):
export AZURE_OPENAI_AUTH_MODE=azure_cli
# Auth option 3 — Managed Identity:
export AZURE_OPENAI_AUTH_MODE=managed_identity
export AZURE_OPENAI_MANAGED_IDENTITY_CLIENT_ID="your-client-id"

OpenAI-compatible endpoint

export AZURE_OPENAI_ENDPOINT="https://api.openai.com/v1"
export AZURE_OPENAI_API_KEY="sk-..."
export AZURE_OPENAI_AUTH_MODE=openai_compatible
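
Before launching a run, it can help to check that the variables for the chosen auth mode are actually set. The sketch below is not part of SkillOpt: `missing_openai_vars` and the `REQUIRED_BY_MODE` table are hypothetical, and the per-mode requirements are inferred from the examples above (including the assumption that an unset AZURE_OPENAI_AUTH_MODE means key-based auth).

```python
import os

# Assumption: the variables each mode needs are inferred from the examples in
# this section, not read from SkillOpt's source code.
REQUIRED_BY_MODE = {
    "api_key": ["AZURE_OPENAI_ENDPOINT", "AZURE_OPENAI_API_KEY"],
    "azure_cli": ["AZURE_OPENAI_ENDPOINT"],
    "managed_identity": ["AZURE_OPENAI_ENDPOINT"],
    "openai_compatible": ["AZURE_OPENAI_ENDPOINT", "AZURE_OPENAI_API_KEY"],
}


def missing_openai_vars(env=None):
    """Return the variable names that are unset or empty for the chosen mode."""
    env = os.environ if env is None else env
    # Assumption: an unset AZURE_OPENAI_AUTH_MODE is treated as key-based auth.
    mode = env.get("AZURE_OPENAI_AUTH_MODE") or "api_key"
    if mode not in REQUIRED_BY_MODE:
        raise ValueError(f"unknown AZURE_OPENAI_AUTH_MODE: {mode!r}")
    return [name for name in REQUIRED_BY_MODE[mode] if not env.get(name)]


# Example with an explicit mapping instead of the real environment:
example_env = {
    "AZURE_OPENAI_AUTH_MODE": "openai_compatible",
    "AZURE_OPENAI_ENDPOINT": "https://api.openai.com/v1",
}
print(missing_openai_vars(example_env))  # ['AZURE_OPENAI_API_KEY']
```

Calling `missing_openai_vars()` with no argument checks the current process environment, which is the state `source .env` leaves behind.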

Anthropic Claude / local Qwen

export ANTHROPIC_API_KEY="sk-ant-..."          # claude_chat backend

export QWEN_CHAT_BASE_URL="http://localhost:8000/v1" # local vLLM
export QWEN_CHAT_MODEL="Qwen/Qwen3.5-4B"

2.4 Verify Installation

python -c "import skillopt; print('SkillOpt ready!')"

3.1 Your First Demo

What ships in this repo: ready-to-use configs and pretrained skills (ckpt/) for six benchmarks, plus lightweight ID manifests under data/. The manifests pin exactly which examples each split uses but do not contain the example contents — so you materialize the data once before the first run.

Step 1 — materialize the SearchQA splits (one-time; downloads the ~6.5 GB source dataset). The manifest IDs match the key field of the lucadiliello/searchqa dataset:

pip install datasets
python - <<'PY'
import json, os
from datasets import load_dataset

# Index every example from every source split by its "key" field.
ds = load_dataset("lucadiliello/searchqa")
by_key = {r["key"]: r for split in ds.values() for r in split}

for split in ["train", "val", "test"]:
    # The manifest lists only the IDs that belong to this split.
    with open(f"data/searchqa_id_split/{split}/items.json") as f:
        ids = json.load(f)
    items = []
    for x in ids:
        r = by_key[x["id"]]
        items.append({"id": r["key"], "question": r["question"],
                      "context": r["context"], "answers": r["answers"]})
    os.makedirs(f"data/searchqa_split/{split}", exist_ok=True)
    with open(f"data/searchqa_split/{split}/items.json", "w") as f:
        json.dump(items, f)
    print(split, len(items))
PY

Step 2 — train (4 epochs × batch 40; see §3.2 for the CLI reference):

python scripts/train.py \
    --config configs/searchqa/default.yaml \
    --split_dir data/searchqa_split \
    --azure_openai_endpoint https://your-resource.openai.azure.com/ \
    --optimizer_model gpt-5.5 \
    --target_model gpt-5.5

Other benchmarks follow the same pattern — materialize from the raw source listed in data/README.md (it documents the lookup key per benchmark), then point --split_dir at the result. The one exception is ALFWorld, whose bundled data/alfworld_path_split works directly: just pip install -e ".[alfworld]" && alfworld-download and set $ALFWORLD_DATA.

To sanity-check your setup without training, evaluate a packaged pretrained skill instead (§3.3 uses ckpt/searchqa/gpt5.5_skill.md), or launch the monitoring WebUI (§8.4).

3.2 Train a Skill

# Minimal SearchQA run
python scripts/train.py \
    --config configs/searchqa/default.yaml \
    --split_dir /path/to/your/searchqa_split \
    --azure_openai_endpoint https://your-resource.openai.azure.com/ \
    --optimizer_model gpt-5.5 \
    --target_model gpt-5.5

Swap the config for another benchmark (e.g. configs/livemathematicianbench/default.yaml, configs/alfworld/default.yaml). Common CLI arguments:

| Argument | Description |
| --- | --- |
| --config | Benchmark config YAML (required) |
| --split_dir | Path to the data split directory |
| --azure_openai_endpoint | Azure OpenAI endpoint URL |
| --optimizer_model / --target_model | Deployment names for optimizer / target |
| --num_epochs / --batch_size | Epochs and rollout batch size |
| --out_root | Output directory |
| --cfg-options k=v ... | Override any config key (see §6.1) |

3.3 Evaluate a Skill

Evaluate any skill document (a packaged reference skill, or a trained run's best_skill.md) without training:

# Evaluate the packaged GPT-5.5 SearchQA skill on the test split
python scripts/eval_only.py \
  --config configs/searchqa/default.yaml \
  --skill ckpt/searchqa/gpt5.5_skill.md \
  --split valid_unseen \
  --split_dir /path/to/searchqa_split \
  --azure_openai_endpoint https://your-resource.openai.azure.com/

| --split | Meaning |
| --- | --- |
| valid_unseen | Test set (held-out) |
| valid_seen | Validation / selection set |
| train | Training set |
| all | All splits combined (default) |

3.4 Output Structure

outputs/<run_name>/
 ├─ config.json          # flattened runtime config
 ├─ history.json         # per-step training history
 ├─ runtime_state.json   # resume checkpoint
 ├─ best_skill.md        # best validated skill document
 ├─ skills/skill_vXXXX.md # skill snapshot per step
 ├─ steps/step_XXXX/     # per-step artifacts (patches, evals)
 ├─ slow_update/epoch_XX/ # slow-update logs & rollouts
 └─ meta_skill/epoch_XX/ # meta-skill logs

3.5 Auto-Resume

Each completed step persists its state to runtime_state.json and a steps/step_XXXX/ directory. Re-running the same command against the same out_root detects finished work and continues from the last completed step — including epoch-boundary slow-update and meta-skill stages.
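
The resume behaviour described above can be pictured with a small sketch. It is illustrative only: the layout of runtime_state.json is not documented here, so the `last_completed_step` field and both helper names are hypothetical.

```python
import json
from pathlib import Path


def last_completed_step(out_root):
    """Return the last completed step recorded in runtime_state.json, or 0."""
    state_path = Path(out_root) / "runtime_state.json"
    if not state_path.is_file():
        return 0  # fresh run: nothing to resume from
    with state_path.open(encoding="utf-8") as f:
        state = json.load(f)
    # "last_completed_step" is a hypothetical field name used for illustration.
    return int(state.get("last_completed_step", 0))


def steps_to_run(out_root, total_steps):
    """Return the 1-based steps that still need to run for this output directory."""
    return range(last_completed_step(out_root) + 1, total_steps + 1)
```

With a state file recording step 3 of 10, `steps_to_run` yields steps 4 through 10, so finished work is skipped rather than repeated.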

4.1 Split Directory Format

Bringing your own dataset takes three steps: (1) create a split directory with train/ val/ test/ item files in the format below; (2) make sure each item carries the fields the closest existing benchmark adapter expects (§4.2); (3) point --split_dir at it and train with that benchmark's config. If no existing adapter matches your task shape (different rollout or scoring logic), write a new benchmark adapter instead — see §7.2.

With env.split_mode: split_dir (the recommended, deterministic mode), SkillOpt reads a directory containing train/, val/, and test/ subfolders, each holding a JSON array of task items:

data/my_split/
 ├─ train/items.json   # used for rollout (the "train split")
 ├─ val/items.json     # selection split → validation gate (valid_seen)
 └─ test/items.json    # held-out final eval (valid_unseen)

Split naming

Internally the splits are referred to as train, valid_seen (validation/selection), and valid_unseen (test). The --split flag of eval_only.py uses these names.
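
To make the folder-to-name mapping concrete, here is a minimal loader sketch. It is not SkillOpt's dataloader; `load_split_dir` and `SPLIT_NAMES` are hypothetical names, and the only facts taken from this section are the three folder names, the items.json file, and the internal split names.

```python
import json
from pathlib import Path

# Folder name on disk -> internal split name (as used by eval_only.py --split).
SPLIT_NAMES = {"train": "train", "val": "valid_seen", "test": "valid_unseen"}


def load_split_dir(split_dir):
    """Load every items.json under split_dir, keyed by internal split name."""
    splits = {}
    for folder, internal_name in SPLIT_NAMES.items():
        path = Path(split_dir) / folder / "items.json"
        if not path.is_file():
            raise FileNotFoundError(f"missing {path}")
        with path.open(encoding="utf-8") as f:
            items = json.load(f)
        if not isinstance(items, list):
            raise ValueError(f"{path} must contain a JSON array of task items")
        splits[internal_name] = items
    return splits
```

Running this against a new split directory before training is a cheap way to catch a missing folder or a file that holds an object instead of an array.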

4.2 Item JSON Schema

Required fields depend on the benchmark; consult skillopt/envs/<benchmark>/dataloader.py for the exact contract. A SearchQA item, for example:

[
  {
    "id":       "unique_item_id",
    "question": "Who wrote the novel ...",
    "context":  "[DOC] relevant passage text ...",
    "answers":  ["expected answer"]
  }
]

Datasets not included

This repository ships no benchmark example contents: only the ID manifests under data/ (and the ALFWorld path split) are bundled. Prepare your own splits in the format above before training.
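
A quick pre-flight check on item files can save a failed run. The sketch below validates items against the SearchQA example shown in this section; `validate_searchqa_items` is a hypothetical helper, and treating ids as unique strings is an assumption based on the "unique_item_id" placeholder.

```python
# Field names and types follow the SearchQA example in this section; other
# benchmarks define their own contract in skillopt/envs/<benchmark>/dataloader.py.
SEARCHQA_FIELDS = {"id": str, "question": str, "context": str, "answers": list}


def validate_searchqa_items(items):
    """Return a list of human-readable problems; an empty list means no issues found."""
    problems = []
    seen_ids = set()
    for index, item in enumerate(items):
        for field, expected in SEARCHQA_FIELDS.items():
            if not isinstance(item.get(field), expected):
                problems.append(
                    f"item {index}: '{field}' is missing or not a {expected.__name__}"
                )
        item_id = item.get("id")
        if isinstance(item_id, str):
            # Assumption: ids must be unique within a split.
            if item_id in seen_ids:
                problems.append(f"item {index}: duplicate id {item_id!r}")
            seen_ids.add(item_id)
    return problems
```

The function collects every problem instead of stopping at the first one, so a single pass shows everything that needs fixing in a split.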

4.3 Split Modes

| env.split_mode | Behavior |
| --- | --- |
| split_dir | Use a pre-built directory with explicit train/val/test folders (set env.split_dir). Deterministic and reproducible. |
| ratio | Build a deterministic split on the fly from a single env.data_path, using split_seed (and a train:val:test ratio). Convenient for quick experiments. |
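
For intuition about the ratio mode, the following sketch shows one common way to build a seeded, reproducible split. It is not SkillOpt's splitting code: `ratio_split` is a hypothetical function and the 8:1:1 default ratio is an assumption chosen for illustration.

```python
import random


def ratio_split(items, ratio=(8, 1, 1), seed=42):
    """Shuffle with a fixed seed, then cut into train/val/test by ratio."""
    shuffled = list(items)
    # A private Random instance keeps the split independent of global RNG state.
    random.Random(seed).shuffle(shuffled)
    total = sum(ratio)
    n_train = len(shuffled) * ratio[0] // total
    n_val = len(shuffled) * ratio[1] // total
    return {
        "train": shuffled[:n_train],
        "val": shuffled[n_train:n_train + n_val],
        "test": shuffled[n_train + n_val:],  # remainder, so no item is dropped
    }
```

Because the shuffle depends only on the seed and the input order, the same dataset file and split_seed give the same three partitions on every run.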

5.1 The Training Loop

The loop lives in ReflACTTrainer (skillopt/engine/trainer.py). Each epoch runs a series of optimization steps over rollout batches, then performs two epoch-boundary stages.

for epoch in epochs:
    for step in steps:
        1. Rollout    # target executes a batch of tasks
        2. Reflect    # optimizer analyzes trajectories → edit patches
        3. Aggregate  # hierarchically merge similar patches
        4. Select     # rank & clip edits to the learning rate
        5. Update     # apply patches → candidate skill
        6. Gate       # score on selection split → accept / reject

    # epoch boundary (from epoch 2 onward)
    Slow update   # longitudinal comparison → protected guidance
    Meta skill    # cross-epoch optimizer memory
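
The outline above can be turned into a small runnable toy to show how the stages feed each other. This is a sketch under stated assumptions, not ReflACTTrainer: the `train` function and every callable passed to it are hypothetical stand-ins, the skill is a plain string, and the epoch-boundary stages are left out.

```python
def train(skill, batches, rollout, reflect, aggregate, select, apply_edits,
          gate_score, num_epochs=4, edit_budget=4):
    """Run the six per-step stages using caller-supplied callables."""
    current, best = skill, skill
    current_score = best_score = gate_score(skill)
    for _epoch in range(num_epochs):
        for batch in batches:
            trajectories = rollout(current, batch)    # 1. Rollout
            patches = reflect(trajectories)           # 2. Reflect
            patches = aggregate(patches)              # 3. Aggregate
            edits = select(patches, edit_budget)      # 4. Select
            candidate = apply_edits(current, edits)   # 5. Update
            score = gate_score(candidate)             # 6. Gate
            if score > current_score:                 # accept
                current, current_score = candidate, score
                if score > best_score:                # accept as new best
                    best, best_score = candidate, score
        # Slow update and meta skill (epoch boundary) are omitted in this toy.
    return best, best_score


# Toy task: a "task" is solved when its name already appears in the skill text.
tasks = ["alpha", "beta", "gamma"]
best, score = train(
    skill="",
    batches=[tasks],
    rollout=lambda skill, batch: [(t, t in skill) for t in batch],
    reflect=lambda trajs: [t for t, ok in trajs if not ok],
    aggregate=lambda patches: sorted(set(patches)),
    select=lambda patches, k: patches[:k],
    apply_edits=lambda skill, edits: skill + "".join(f"- {e}\n" for e in edits),
    gate_score=lambda skill: sum(t in skill for t in tasks) / len(tasks),
    num_epochs=2,
    edit_budget=2,
)
print(score)  # 1.0
```

With an edit budget of 2, the first epoch can only add two of the three missing rules, and the second epoch adds the last one. That is the sense in which the learning rate caps how much the document changes per step.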

5.2 The Six Per-Step Stages

| Stage | What happens | Source |
| --- | --- | --- |
| 1. Rollout | The target model runs each task in the batch with the current skill as context, producing trajectories and scores. | envs/<b>/rollout.py |
| 2. Reflect | The optimizer runs an error analyst (and optional success analyst) over minibatches of trajectories, emitting structured edit patches. Runs in parallel across analyst_workers. | gradient/reflect.py |
| 3. Aggregate | Semantically similar patches are merged hierarchically to remove redundancy. | gradient/aggregate.py (merge_patches) |
| 4. Select | Patches are ranked and clipped to the current learning rate (max edits this step), set by the scheduler. | optimizer/clip.py (rank_and_select) |
| 5. Update | Selected edits are applied to the skill document, producing a candidate skill (patch / rewrite modes). | optimizer/skill.py, update_modes.py |
| 6. Gate | The candidate is scored on the selection split and accepted only if it improves (see §5.3). | evaluation/gate.py (evaluate_gate) |

5.3 Validation Gate

evaluate_gate is a pure decision function. It compares the candidate's selection-set score against the current and best skills:

  • accept_new_best — candidate > current and candidate > best → becomes both current and best.
  • accept — candidate > current but ≤ best → becomes current only.
  • reject — candidate ≤ current → discarded; current/best unchanged.
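
The three outcomes above reduce to two comparisons. The sketch below restates them as code; `gate_decision` is a hypothetical name, and the real evaluate_gate may take different arguments or return richer information.

```python
def gate_decision(candidate, current, best):
    """Return 'accept_new_best', 'accept', or 'reject' for a candidate score."""
    if candidate <= current:
        return "reject"            # current and best stay unchanged
    if candidate > best:
        return "accept_new_best"   # becomes both current and best
    return "accept"                # becomes current only


print(gate_decision(0.62, current=0.60, best=0.65))  # accept
print(gate_decision(0.70, current=0.60, best=0.65))  # accept_new_best
print(gate_decision(0.60, current=0.60, best=0.65))  # reject
```

Note that a tie with the current skill is a rejection, so a candidate has to strictly improve the selection score to replace it.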

The comparison metric is configurable via evaluation.gate_metric:

| Metric | Score used |
| --- | --- |
| hard (default) | Exact-match / discrete score |
| soft | Partial-credit / continuous score |
| mixed | Weighted blend, controlled by gate_mixed_weight |

When to use soft/mixed

The soft/mixed metrics (contributed config configs/examples/soft_gate.yaml) help when the selection split is small and rewards are continuous, where a discrete hard gate may reject every candidate and stall training. Paper numbers use the default hard gate.
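
A short sketch shows why a blended score can unblock a stalled gate. The function name `select_score` is hypothetical, and the linear blend is an assumption: this guide only states that gate_mixed_weight is the weight on the soft score.

```python
def select_score(hard, soft, metric="hard", mixed_weight=0.5):
    """Pick the score the gate compares, given a hard and a soft score."""
    if metric == "hard":
        return hard
    if metric == "soft":
        return soft
    if metric == "mixed":
        # Assumed blend: mixed_weight on the soft score, the rest on the hard score.
        return mixed_weight * soft + (1.0 - mixed_weight) * hard
    raise ValueError(f"unknown gate metric: {metric!r}")


# A candidate that earns partial credit but no exact matches:
print(select_score(hard=0.0, soft=0.4, metric="hard"))   # 0.0
print(select_score(hard=0.0, soft=0.4, metric="mixed"))  # 0.2
```

Under the hard metric this candidate ties a skill that also scores 0.0 and is rejected, while the mixed metric registers the partial progress.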

5.4 Slow Update (Momentum)

At each epoch boundary (from epoch 2), the slow update rolls out both the previous epoch's skill and the current skill on the same sampled tasks, categorizes items (improved / regressed / persistent-fail / stable-success), and asks the optimizer to write a free-form guidance block. This guidance lands in a protected region of the skill that step-level edits cannot touch — only the slow update overwrites it. It is SkillOpt's analogue of momentum, countering cross-epoch forgetting.
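
The four categories can be illustrated with a pass/fail comparison of the two rollouts. This is a sketch only: `categorize` and `bucket_items` are hypothetical names, and reducing each task outcome to a boolean is an assumption (the actual comparison may use continuous scores).

```python
def categorize(prev_ok, curr_ok):
    """Classify one task from its outcome under the previous and current skill."""
    if curr_ok and not prev_ok:
        return "improved"
    if prev_ok and not curr_ok:
        return "regressed"
    if not prev_ok and not curr_ok:
        return "persistent_fail"
    return "stable_success"


def bucket_items(prev_results, curr_results):
    """Group task ids by category; both inputs map task id -> pass/fail."""
    buckets = {
        "improved": [], "regressed": [], "persistent_fail": [], "stable_success": [],
    }
    for task_id, prev_ok in prev_results.items():
        buckets[categorize(prev_ok, curr_results[task_id])].append(task_id)
    return buckets


prev = {"t1": False, "t2": True, "t3": False, "t4": True}
curr = {"t1": True, "t2": False, "t3": False, "t4": True}
print(bucket_items(prev, curr))
```

The regressed bucket is the one that matters most for forgetting: it lists tasks the previous epoch's skill solved and the current one no longer does.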

Acceptance has two modes, selected by optimizer.slow_update_gate_with_selection:

| Mode | Behavior |
| --- | --- |
| false (default): force-injected | Guidance is injected into both current and best skills unconditionally. The longitudinal guidance always persists; it is not gated by step-level selection scores. |
| true: gated | The slow-update candidate is scored on the selection split and accepted/rejected through the same validation gate as step-level updates. |

5.5 Meta Skill (Optimizer Memory)

The meta skill is optimizer-side memory — it never modifies the target skill document. At the end of each epoch (skipped for epoch 1), the optimizer compares the previous and current epoch's last-step skills on the same sampled tasks and writes a compact, evidence-based reflection on what kind of edits helped or hurt. That memory is then injected as extra context into the next epoch's reflect / merge / learning-rate / ranking stages, so the optimizer accumulates strategy across the run.

5.6 Skill Document Anatomy

A skill document is plain Markdown. Initial skills can be empty (learn from scratch) or seeded with domain knowledge via env.skill_init. During training the document accrues rules, patterns, and edge-case handling through accepted edit patches. A dedicated protected region holds the slow-update guidance, delimited by HTML-comment markers:

# Question Answering Skill

## Learned rules ...
- When the context contains multiple candidates, prefer ...

<!-- SLOW_UPDATE_START -->
# (epoch-level longitudinal guidance — only the slow update writes here)
<!-- SLOW_UPDATE_END -->

Helpers in optimizer/slow_update.py manage this region: inject_empty_slow_update_field (placeholder at epoch 1), extract_slow_update_field (read), and replace_slow_update_field (overwrite). Step-level edits are blocked from modifying anything inside the markers.
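
Reading and overwriting a marker-delimited region is straightforward with a regular expression. The sketch below is a stand-in written for this guide, not the code in optimizer/slow_update.py: `extract_region` and `replace_region` are hypothetical names, and appending the region when the markers are absent is an assumption.

```python
import re

START = "<!-- SLOW_UPDATE_START -->"
END = "<!-- SLOW_UPDATE_END -->"
# Non-greedy match of everything between the two markers, across lines.
_REGION = re.compile(re.escape(START) + r"(.*?)" + re.escape(END), re.DOTALL)


def extract_region(skill):
    """Return the guidance between the markers, or None if there is no region."""
    match = _REGION.search(skill)
    return match.group(1).strip() if match else None


def replace_region(skill, guidance):
    """Overwrite the protected region, appending one if the markers are absent."""
    block = f"{START}\n{guidance.strip()}\n{END}"
    if _REGION.search(skill):
        # A function replacement avoids backslash-escape handling in the guidance.
        return _REGION.sub(lambda _match: block, skill, count=1)
    return skill.rstrip("\n") + "\n\n" + block + "\n"


skill = f"# QA Skill\n\n- rule one\n\n{START}\nold guidance\n{END}\n"
updated = replace_region(skill, "Prefer the most specific candidate.")
print(extract_region(updated))  # Prefer the most specific candidate.
```

Everything outside the markers is left untouched, which mirrors the split between step-level edits and the slow update.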

6.1 Configuration System

Configs are structured YAML with section blocks (model, train, gradient, optimizer, evaluation, env) and _base_ inheritance. A benchmark config inherits the shared defaults and overrides only what differs:

# configs/searchqa/default.yaml
_base_: ../_base_/default.yaml
train:
  train_size: 400
  batch_size: 40
optimizer:
  learning_rate: 4
env:
  name: searchqa
  split_dir: data/searchqa_split

Override any key at the command line without editing files:

python scripts/train.py --config configs/searchqa/default.yaml \
  --cfg-options optimizer.learning_rate=16 optimizer.lr_scheduler=linear

Reading the tables below

Each section lists the key (relative to its YAML block), type, default (from configs/_base_/default.yaml), allowed values, and meaning. Defaults shown are the shipped base defaults.
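
The interaction between inherited defaults, a benchmark config, and command-line overrides can be sketched with plain dictionaries. This is illustrative only: `deep_merge`, `parse_value`, and `apply_overrides` here are hypothetical, operate on already-parsed dicts rather than YAML files, and parse override values as JSON with a string fallback, which is an assumption.

```python
import copy
import json


def deep_merge(base, override):
    """Return base with override applied section by section (override wins)."""
    merged = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


def parse_value(raw):
    """Parse '16' -> 16 and 'true' -> True; anything else stays a string."""
    try:
        return json.loads(raw)
    except ValueError:
        return raw


def apply_overrides(config, overrides):
    """Apply 'section.key=value' strings such as those given to --cfg-options."""
    result = copy.deepcopy(config)
    for item in overrides:
        dotted, raw = item.split("=", 1)
        *parents, leaf = dotted.split(".")
        node = result
        for name in parents:
            node = node.setdefault(name, {})
        node[leaf] = parse_value(raw)
    return result


base = {"train": {"num_epochs": 4, "batch_size": 40},
        "optimizer": {"learning_rate": 4, "lr_scheduler": "cosine"}}
child = {"train": {"train_size": 400}, "env": {"name": "searchqa"}}
cfg = apply_overrides(deep_merge(base, child),
                      ["optimizer.learning_rate=16", "optimizer.lr_scheduler=linear"])
print(cfg["optimizer"])  # {'learning_rate': 16, 'lr_scheduler': 'linear'}
```

The order matters: base defaults first, then the benchmark file, then command-line overrides, so the most specific source always wins.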

6.2 model.*

| Key | Type | Default | Description / options |
| --- | --- | --- | --- |
| backend | str | azure_openai | High-level backend label for the run. |
| optimizer | str | gpt-5.5 | Optimizer model deployment (writes skill edits). |
| target | str | gpt-5.5 | Target model deployment (executes tasks). |
| optimizer_backend | str | openai_chat | Client path for the optimizer: openai_chat or claude_chat. |
| target_backend | str | openai_chat | Client path for the target: openai_chat / claude_chat / qwen_chat / codex_exec / claude_code_exec. |
| reasoning_effort | str | medium | low / medium / high / xhigh / max (or empty). |
| rewrite_reasoning_effort | str | "" | Override effort for full-rewrite calls (empty = inherit). |
| rewrite_max_completion_tokens | int | 64000 | Token cap for full-rewrite optimizer calls. |
| azure_openai_endpoint | str | "" | Azure resource URL (or via AZURE_OPENAI_ENDPOINT). |
| azure_openai_api_version | str | 2024-12-01-preview | Azure API version header. |
| azure_openai_auth_mode | str | "" | api_key / azure_cli / managed_identity / openai_compatible (empty → env default). |

Separate optimizer / target endpoints

Every azure_openai_* key also has optimizer_azure_openai_* and target_azure_openai_* variants, letting you point the optimizer and target at different Azure resources. Exec backends (codex_exec, claude_code_exec) add their own codex_exec_* / claude_code_exec_* knobs (sandbox, reasoning effort, SDK mode, etc.).

6.3 train.*

| Key | Type | Default | DL analogy | Description |
| --- | --- | --- | --- | --- |
| num_epochs | int | 4 | Epochs | Number of training epochs. |
| train_size | int | 0 | Train-set size | 0 = derive from the dataset split. (Fixed by split size when using split_dir.) |
| batch_size | int | 40 | Batch size | Tasks rolled out per optimization step. |
| accumulation | int | 1 | Grad accumulation | Accumulation rounds per step. |
| seed | int | 42 | Random seed | Reproducibility seed. |

6.4 gradient.*

| Key | Type | Default | Description |
| --- | --- | --- | --- |
| minibatch_size | int | 8 | Trajectories per reflect minibatch. |
| merge_batch_size | int | 8 | Patches per merge batch during aggregation. |
| analyst_workers | int | 16 | Parallel reflection workers (data parallelism). |
| max_analyst_rounds | int | 3 | Max rounds of analyst reflection per step. |
| failure_only | bool | false | Reflect only on failed trajectories when true. |

6.5 optimizer.*

| Key | Type | Default | DL analogy | Description / options |
| --- | --- | --- | --- | --- |
| learning_rate | int | 4 | Learning rate | Max edit patches applied per step (the "edit budget"). |
| min_learning_rate | int | 2 | Min LR | Floor edit budget for decaying schedulers. |
| lr_scheduler | str | cosine | LR schedule | constant / linear / cosine / autonomous. |
| lr_control_mode | str | fixed | | fixed / autonomous / none. |
| skill_update_mode | str | patch | | patch / rewrite_from_suggestions / full_rewrite_minibatch. |
| use_slow_update | bool | true | Momentum | Enable epoch-boundary slow update. |
| slow_update_samples | int | 20 | | Tasks sampled for the longitudinal comparison. |
| slow_update_gate_with_selection | bool | false | | false = force-inject guidance; true = gate it on the selection split (see §5.4). |
| longitudinal_pair_policy | str | mixed | | mixed / changed / unchanged: which comparison pairs to keep. |
| use_meta_skill | bool | true | Meta-learning | Enable cross-epoch optimizer memory. |
| use_skill_aware_reflection | bool | false | | EmbodiSkill-style failure routing: SKILL_DEFECT (rule wrong/missing → gated body edit) vs EXECUTION_LAPSE (valid rule not followed → reminder appended to a protected appendix region that step-level edits never modify). Off = baseline-identical; resolved process-wide, works on every benchmark. Not supported with rewrite_from_suggestions / full-rewrite modes. |
| skill_aware_appendix_source | str | both | | both (success analyst may also re-emphasize rules) / failure_only (paper-faithful S_app: failure side only). |
| skill_aware_consolidate_threshold | int | 0 | | >0: LLM-compact the appendix once it exceeds N notes (experimental); 0 = off. |
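
To see how learning_rate, min_learning_rate, and lr_scheduler interact, here is a sketch of a decaying edit budget. It is not the code in optimizer/scheduler.py: `edit_budget` is a hypothetical function, the standard cosine-annealing formula and the rounding to whole edits are assumptions, and the autonomous mode is not modelled.

```python
import math


def edit_budget(step, total_steps, learning_rate=4, min_learning_rate=2,
                scheduler="cosine"):
    """Return the max number of edits for a 0-based step under a schedule."""
    if scheduler == "constant" or total_steps <= 1:
        return learning_rate
    progress = step / (total_steps - 1)  # 0.0 at the first step, 1.0 at the last
    if scheduler == "linear":
        value = learning_rate + (min_learning_rate - learning_rate) * progress
    elif scheduler == "cosine":
        span = learning_rate - min_learning_rate
        value = min_learning_rate + 0.5 * span * (1 + math.cos(math.pi * progress))
    else:
        raise ValueError(f"scheduler not covered by this sketch: {scheduler!r}")
    # Budgets are whole edits and never drop below the configured floor.
    return max(min_learning_rate, round(value))


print([edit_budget(s, 5, learning_rate=16, min_learning_rate=2) for s in range(5)])
# [16, 14, 9, 4, 2]
```

The cosine shape keeps the budget high early, when the skill is still missing broad rules, and shrinks it toward the floor as the document stabilizes.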

6.6 evaluation.*

| Key | Type | Default | Description / options |
| --- | --- | --- | --- |
| use_gate | bool | true | Validation gating is mandatory in this branch (must remain true). |
| gate_metric | str | hard | hard / soft / mixed: score used by the gate (see §5.3). |
| gate_mixed_weight | float | 0.5 | Weight on the soft score when gate_metric = mixed. |
| sel_env_num | int | 0 | Selection-split eval size (0 = use full split). |
| test_env_num | int | 0 | Test-split eval size (0 = use full split). |
| eval_test | bool | true | Run a final test evaluation after training. |

Gate is required

Setting evaluation.use_gate: false raises an error — validation gating cannot be disabled in this branch.

6.7 env.*

| Key | Type | Default | Description |
| --- | --- | --- | --- |
| name | str | "" | Benchmark name (searchqa, docvqa, alfworld, …). Selects the env module. |
| skill_init | str | "" | Path to a seed skill (empty = start from scratch). |
| split_mode | str | ratio | ratio or split_dir (see §4.3). |
| split_dir | str | "" | Pre-split directory (when split_mode = split_dir). |
| data_path | str | "" | Single dataset path (when split_mode = ratio). |
| split_seed | int | 42 | Seed for deterministic ratio splitting. |
| exec_timeout | int | 120 | Per-task target/code-agent timeout (seconds). |
| out_root | str | "" | Output directory for the run. |

Benchmark-specific env keys

Env blocks may carry extra benchmark-specific keys (e.g. max_turns, workers, max_completion_tokens, limit). Unmapped env keys are passed straight through to the benchmark adapter — check the relevant configs/<benchmark>/default.yaml.

7.1 Supported Benchmarks

| Benchmark | Type | Config |
| --- | --- | --- |
| SearchQA | Question answering | configs/searchqa/default.yaml |
| DocVQA | Document QA | configs/docvqa/default.yaml |
| ALFWorld | Embodied agent | configs/alfworld/default.yaml |
| LiveMathematicianBench | Math reasoning | configs/livemathematicianbench/default.yaml |
| SpreadsheetBench | Spreadsheet code generation | configs/spreadsheetbench/default.yaml |
| OfficeQA | Tool-augmented QA | configs/officeqa/default.yaml |

Each benchmark is a self-contained module under skillopt/envs/<benchmark>/ with an adapter.py, dataloader.py, rollout.py, and evaluator.py (some add a custom reflect.py). Packaged reference skills live in ckpt/<benchmark>/.

7.2 Add a New Benchmark

Use skillopt/envs/_template/ as a starting point. At minimum, implement:

  1. Dataloader — read your item JSON into the framework's item dicts (dataloader.py).
  2. Rollout — run the target on one item with the current skill and return a trajectory + score (rollout.py).
  3. Evaluator — score predictions against ground truth (evaluator.py).
  4. Adapter — wire the above into the trainer's expected interface and register the env name (adapter.py).

Then add a configs/<name>/default.yaml inheriting _base_/default.yaml and set env.name to your new benchmark.
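
The four pieces listed above might fit together roughly as follows. This is a toy sketch, not the interface defined by skillopt/envs/_template/: every name here (`Trajectory`, `load_items`, `evaluate`, `rollout`, `ToyAdapter`, `run_batch`) is hypothetical, and the target is modelled as a plain callable taking the skill and a question.

```python
from dataclasses import dataclass


@dataclass
class Trajectory:
    item_id: str
    prediction: str
    score: float


def load_items(raw_items):
    """Dataloader: keep only the fields this toy benchmark needs."""
    return [{"id": r["id"], "question": r["question"], "answers": r["answers"]}
            for r in raw_items]


def evaluate(prediction, answers):
    """Evaluator: case-insensitive exact match against any reference answer."""
    normalized = prediction.strip().lower()
    return 1.0 if any(normalized == a.strip().lower() for a in answers) else 0.0


def rollout(item, skill, target):
    """Rollout: call the target with the skill as context, then score the result."""
    prediction = target(skill, item["question"])
    return Trajectory(item["id"], prediction, evaluate(prediction, item["answers"]))


class ToyAdapter:
    """Adapter: bundles the pieces behind one object a trainer could call."""

    name = "toy_qa"  # hypothetical env name, as would be set in env.name

    def __init__(self, target):
        self.target = target

    def run_batch(self, raw_items, skill):
        return [rollout(item, skill, self.target) for item in load_items(raw_items)]
```

Keeping the evaluator a pure function of prediction and answers makes it easy to unit-test separately from any model call; check the template directory for the method names the trainer actually expects.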

8.1 Module Map

| Module | Responsibility |
| --- | --- |
| skillopt/config.py | Load structured YAML, resolve _base_ inheritance, flatten to the trainer's flat dict, apply CLI overrides. |
| skillopt/engine/trainer.py | ReflACTTrainer: orchestrates the whole loop, gating, slow update, meta skill, resume, and artifact writing. |
| skillopt/gradient/ | Reflection ("backward pass"): reflect.py analysts, aggregate.py patch merging. |
| skillopt/optimizer/ | The "optimizer": edit application, learning-rate scheduling, edit selection, slow update, meta skill, rewrite modes. |
| skillopt/evaluation/gate.py | Pure accept/reject decision and metric selection. |
| skillopt/model/ | Backend clients (OpenAI/Azure, Claude, Qwen, Codex/Claude-Code exec) and routing. |
| skillopt/envs/<b>/ | Per-benchmark dataloader, rollout, evaluator, adapter. |

8.2 Core Functions

| Function | File | Purpose |
| --- | --- | --- |
| load_config / flatten_config / apply_overrides | config.py | Load YAML with inheritance; flatten sections; apply key=value overrides. |
| run_minibatch_reflect | gradient/reflect.py | Run error/success analysts over trajectory minibatches → edit patches. |
| merge_patches | gradient/aggregate.py | Hierarchically merge semantically similar patches. |
| rank_and_select | optimizer/clip.py | Rank edits and clip to the learning-rate budget. |
| build_scheduler | optimizer/scheduler.py | Construct the LR (edit-budget) scheduler: constant/linear/cosine/autonomous. |
| decide_autonomous_learning_rate | optimizer/lr_autonomous.py | Let the optimizer pick the next learning rate (autonomous mode). |
| apply_patch / apply_edit | optimizer/skill.py | Apply edits to the skill document (respecting the protected region). |
| rewrite_skill_from_suggestions | optimizer/rewrite.py | Full-rewrite update mode from accumulated suggestions. |
| evaluate_gate / select_gate_score | evaluation/gate.py | Accept/reject decision; compute hard/soft/mixed score. |
| run_slow_update | optimizer/slow_update.py | Produce epoch-boundary longitudinal guidance. |
| replace_slow_update_field / extract_slow_update_field | optimizer/slow_update.py | Read/overwrite the protected guidance region. |
| run_meta_skill / format_meta_skill_context | optimizer/meta_skill.py | Generate cross-epoch optimizer memory and render it into reflection context. |

8.3 CLI Scripts

scripts/train.py

Runs a full training loop. Required: --config. Override config via --cfg-options section.key=value … or legacy flat flags (--num_epochs, --batch_size, --optimizer_model, --target_model, --lr_scheduler, --edit_budget, --split_dir, …).

scripts/eval_only.py

Evaluates a skill document without training. Required: --config and --skill. Use --split to choose train / valid_seen / valid_unseen / all.

python scripts/eval_only.py \
  --config configs/searchqa/default.yaml \
  --skill outputs/my_run/best_skill.md \
  --split valid_unseen

8.4 WebUI

An optional Gradio dashboard to configure parameters and monitor runs:

pip install -e ".[webui]"
python -m skillopt_webui.app          # http://localhost:7860
python -m skillopt_webui.app --share  # public share link

| Flag | Default | Description |
| --- | --- | --- |
| --port | 7860 | Server port. |
| --host | 0.0.0.0 | Bind address. |
| --share | off | Create a public Gradio share link. |