SkillOpt Documentation & Reproduction Guide
Train agent skills like you train neural networks — with epochs, (mini-)batch size, learning rates, and validation gates — but without touching any model weights.
This guide walks you from a clean checkout to a reproduced result and a full reference for every configuration knob and core function. It is generated from, and kept consistent with, the current state of the codebase.
1.1 What is SkillOpt
SkillOpt is a text-space optimizer that improves a frozen language agent by iteratively editing a natural-language skill document — never the model weights. The skill document is a Markdown file that conditions a target model as it executes tasks. SkillOpt treats this document as the "weights" and runs a training loop that mirrors deep-learning training: rollout (forward pass), reflect (backward pass / gradients), select & apply edits (optimizer step), and a validation gate (accept/reject).
Two roles split every model call:
- Target — executes tasks using the current skill document (the agent being improved).
- Optimizer — analyzes the target's trajectories and proposes edits to the skill document.
The same loop drives six benchmarks out of the box (QA, document QA, embodied agents, math, spreadsheet code generation, and tool-augmented QA).
1.2 Deep-Learning ↔ SkillOpt Analogy
Every concept below maps to a concrete code construct, so deep-learning intuitions transfer directly to hyperparameter tuning.
| Deep learning | SkillOpt | Where it lives |
|---|---|---|
| Model weights | Skill document (Markdown) | skillopt/optimizer/skill.py |
| Forward pass | Rollout — target runs tasks | envs/<bench>/rollout.py |
| Loss / score | Task evaluator | envs/<bench>/evaluator.py |
| Backprop / gradients | Reflect → edit patches | gradient/reflect.py |
| Gradient aggregation | Hierarchical patch merge | gradient/aggregate.py |
| Gradient clipping | Rank & select top-k edits | optimizer/clip.py |
| Learning rate | optimizer.learning_rate (edits/step) | optimizer/scheduler.py |
| LR scheduler | lr_scheduler (cosine/linear/…) | optimizer/scheduler.py |
| Optimizer step | Apply patches to the document | optimizer/skill.py |
| Validation set | Selection split (valid_seen) | evaluation/gate.py |
| Early stopping / accept | Validation gate | evaluation/gate.py |
| Momentum | Slow update (epoch boundary) | optimizer/slow_update.py |
| Meta-learning | Meta skill (cross-epoch memory) | optimizer/meta_skill.py |
| Batch / minibatch | batch_size / minibatch_size | engine/trainer.py |
| Epoch | Epoch (+ slow update & meta skill) | engine/trainer.py |
A cosine schedule tends to beat a constant one; moderate learning rates (≈4–16 edits/step) beat very high or very low ones; the slow update curbs cross-epoch forgetting; and meta-skill memory improves reflection quality. Conversely, larger rollout batches and additional epochs show diminishing returns: skills typically converge in about 2–4 epochs.
1.3 Key Features
Validation gating
Every candidate skill is scored on a held-out selection split and only accepted if it beats the current/best skill.
Slow update
Epoch-boundary longitudinal comparison writes guidance into a protected region — momentum against forgetting. Force-injected or selection-gated.
Meta skill
Optimizer-side memory that reflects on what worked across epochs and feeds back into reflection.
Pluggable backends
OpenAI / Azure OpenAI, Anthropic Claude, local Qwen (vLLM), plus Codex/Claude-Code exec backends for the target.
Six benchmarks
SearchQA, DocVQA, ALFWorld, LiveMathematicianBench, SpreadsheetBench, OfficeQA — each a self-contained env module.
Auto-resume
Every run is checkpointed step-by-step; re-running the same command continues from the last completed step.
1.4 Repository Layout
# top level
configs/ # YAML configs (_base_ + per-benchmark)
scripts/ # train.py, eval_only.py CLIs
ckpt/ # packaged reference skills (e.g. gpt5.5_skill.md)
docs/ # this guide + mkdocs sources
skillopt/ # the package
├─ config.py # YAML loading, _base_ inheritance, flatten
├─ engine/trainer.py # the training loop (ReflACTTrainer)
├─ gradient/ # reflect.py (analyst), aggregate.py (merge)
├─ optimizer/ # skill edits, scheduler, clip, slow_update, meta_skill
├─ evaluation/ # gate.py (accept/reject logic)
├─ model/ # backend clients + routing
└─ envs/<benchmark>/ # adapter, dataloader, rollout, evaluator, reflect
2.1 Requirements
- Python ≥ 3.10
- Credentials for at least one model backend (Azure OpenAI, OpenAI-compatible, Anthropic, or a local Qwen server)
- Benchmark datasets are not bundled — prepare your own splits (see §4)
2.2 Install the Package
Option A — from PyPI:
pip install skillopt
# Optional extras:
pip install skillopt[alfworld] # ALFWorld benchmark
pip install skillopt[webui] # Gradio monitoring dashboard
pip install skillopt[claude] # Claude model backend
Option B — from source (for development):
git clone https://github.com/microsoft/SkillOpt.git
cd SkillOpt
pip install -e .
# Optional extras (install only what you need):
pip install -e ".[alfworld]" # ALFWorld benchmark
pip install -e ".[claude]" # Anthropic Claude backend
pip install -e ".[qwen]" # local Qwen backend
pip install -e ".[webui]" # monitoring dashboard
# ALFWorld also needs its data assets:
alfworld-download
2.3 Configure Credentials
Copy the template and fill in whichever backend you will use:
cp .env.example .env
# edit .env, then:
set -a; source .env; set +a
SkillOpt reuses the AZURE_OPENAI_* variable names even for plain OpenAI — there is no separate OPENAI_API_KEY knob. AZURE_OPENAI_ENDPOINT is required for every OpenAI auth mode.
Azure OpenAI (default)
export AZURE_OPENAI_ENDPOINT="https://your-resource.openai.azure.com/"
export AZURE_OPENAI_API_VERSION="2024-12-01-preview"
# Auth option 1 — API key:
export AZURE_OPENAI_API_KEY="your-key"
# Auth option 2 — Azure CLI (no key; recommended on Azure VMs):
export AZURE_OPENAI_AUTH_MODE=azure_cli
# Auth option 3 — Managed Identity:
export AZURE_OPENAI_AUTH_MODE=managed_identity
export AZURE_OPENAI_MANAGED_IDENTITY_CLIENT_ID="your-client-id"
OpenAI-compatible endpoint
export AZURE_OPENAI_ENDPOINT="https://api.openai.com/v1"
export AZURE_OPENAI_API_KEY="sk-..."
export AZURE_OPENAI_AUTH_MODE=openai_compatible
Anthropic Claude / local Qwen
export ANTHROPIC_API_KEY="sk-ant-..." # claude_chat backend
export QWEN_CHAT_BASE_URL="http://localhost:8000/v1" # local vLLM
export QWEN_CHAT_MODEL="Qwen/Qwen3.5-4B"
2.4 Verify Installation
python -c "import skillopt; print('SkillOpt ready!')"
3.1 Your First Demo
What ships in this repo: ready-to-use configs and pretrained skills (ckpt/) for six benchmarks, plus lightweight ID manifests under data/. The manifests pin exactly which examples each split uses but do not contain the example contents, so you materialize the data once before the first run.
Step 1: materialize the SearchQA splits (one-time; downloads the ~6.5 GB source dataset). The manifest IDs match the key field of the lucadiliello/searchqa dataset:
pip install datasets
python - <<'PY'
import json, os
from datasets import load_dataset

# Index every record of the source dataset by its "key" field.
ds = load_dataset("lucadiliello/searchqa")
by_key = {r["key"]: r for split in ds.values() for r in split}

for split in ["train", "val", "test"]:
    # The manifest lists only the IDs that belong to this split.
    with open(f"data/searchqa_id_split/{split}/items.json") as f:
        ids = json.load(f)
    items = []
    for x in ids:
        r = by_key[x["id"]]
        items.append({"id": r["key"], "question": r["question"],
                      "context": r["context"], "answers": r["answers"]})
    # Write the full items where --split_dir expects them.
    os.makedirs(f"data/searchqa_split/{split}", exist_ok=True)
    with open(f"data/searchqa_split/{split}/items.json", "w") as f:
        json.dump(items, f)
    print(split, len(items))
PY
Step 2: train (4 epochs × batch 40; see §3.2 for the CLI reference):
python scripts/train.py \
--config configs/searchqa/default.yaml \
--split_dir data/searchqa_split \
--azure_openai_endpoint https://your-resource.openai.azure.com/ \
--optimizer_model gpt-5.5 \
--target_model gpt-5.5
Other benchmarks follow the same pattern: materialize from the raw source listed in data/README.md (it documents the lookup key per benchmark), then point --split_dir at the result. The one exception is ALFWorld, whose bundled data/alfworld_path_split works directly: just pip install -e ".[alfworld]" && alfworld-download and set $ALFWORLD_DATA.
To sanity-check your setup without training, evaluate a packaged pretrained skill instead (§3.3 uses ckpt/searchqa/gpt5.5_skill.md), or launch the monitoring WebUI (§8.4).
3.2 Train a Skill
# Minimal SearchQA run
python scripts/train.py \
--config configs/searchqa/default.yaml \
--split_dir /path/to/your/searchqa_split \
--azure_openai_endpoint https://your-resource.openai.azure.com/ \
--optimizer_model gpt-5.5 \
--target_model gpt-5.5
Swap the config for another benchmark (e.g. configs/livemathematicianbench/default.yaml, configs/alfworld/default.yaml). Common CLI arguments:
| Argument | Description |
|---|---|
| --config | Benchmark config YAML (required) |
| --split_dir | Path to the data split directory |
| --azure_openai_endpoint | Azure OpenAI endpoint URL |
| --optimizer_model / --target_model | Deployment names for optimizer / target |
| --num_epochs / --batch_size | Epochs and rollout batch size |
| --out_root | Output directory |
| --cfg-options k=v ... | Override any config key (see §6.1) |
3.3 Evaluate a Skill
Evaluate any skill document (a packaged reference skill, or a trained run's best_skill.md) without training:
# Evaluate the packaged GPT-5.5 SearchQA skill on the test split
python scripts/eval_only.py \
--config configs/searchqa/default.yaml \
--skill ckpt/searchqa/gpt5.5_skill.md \
--split valid_unseen \
--split_dir /path/to/searchqa_split \
--azure_openai_endpoint https://your-resource.openai.azure.com/
| --split | Meaning |
|---|---|
| valid_unseen | Test set (held-out) |
| valid_seen | Validation / selection set |
| train | Training set |
| all | All splits combined (default) |
3.4 Output Structure
outputs/<run_name>/
├─ config.json # flattened runtime config
├─ history.json # per-step training history
├─ runtime_state.json # resume checkpoint
├─ best_skill.md # best validated skill document
├─ skills/skill_vXXXX.md # skill snapshot per step
├─ steps/step_XXXX/ # per-step artifacts (patches, evals)
├─ slow_update/epoch_XX/ # slow-update logs & rollouts
└─ meta_skill/epoch_XX/ # meta-skill logs
3.5 Auto-Resume
Each completed step persists its state to runtime_state.json and a steps/step_XXXX/ directory. Re-running the same command against the same out_root detects finished work and continues from the last completed step — including epoch-boundary slow-update and meta-skill stages.
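To check how far a run got before re-launching it, you can inspect the steps/ directory yourself. The sketch below is a hypothetical helper, not SkillOpt's resume logic (which reads runtime_state.json, whose schema is not documented here). It assumes only the steps/step_XXXX/ naming shown in §3.4.

```python
import re
from pathlib import Path

# Assumption: step directories are named "step_" followed by digits (see §3.4).
STEP_DIR = re.compile(r"^step_(\d+)$")


def latest_step(dir_names):
    """Return the highest step index among names like 'step_0007', or None."""
    indices = [int(m.group(1)) for name in dir_names if (m := STEP_DIR.match(name))]
    return max(indices) if indices else None


def latest_step_on_disk(run_dir):
    """Scan <run_dir>/steps/ and report the highest step directory found."""
    steps = Path(run_dir) / "steps"
    if not steps.is_dir():
        return None
    return latest_step(p.name for p in steps.iterdir() if p.is_dir())
```

For example, latest_step_on_disk("outputs/my_run") returns None for a run that has not written any step directory yet.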
4.1 Split Directory Format
Bringing your own dataset takes three steps: (1) create a split directory with train/, val/, and test/ item files in the format below; (2) make sure each item carries the fields the closest existing benchmark adapter expects (§4.2); (3) point --split_dir at it and train with that benchmark's config. If no existing adapter matches your task shape (different rollout or scoring logic), write a new benchmark adapter instead (see §7.2).
With env.split_mode: split_dir (the recommended, deterministic mode), SkillOpt reads a directory containing train/, val/, and test/ subfolders, each holding a JSON array of task items:
data/my_split/
├─ train/items.json # used for rollout (the "train split")
├─ val/items.json # selection split → validation gate (valid_seen)
└─ test/items.json # held-out final eval (valid_unseen)
Internally the splits are referred to as train, valid_seen (validation/selection), and valid_unseen (test). The --split flag of eval_only.py uses these names.
4.2 Item JSON Schema
Required fields depend on the benchmark; consult skillopt/envs/<benchmark>/dataloader.py for the exact contract. A SearchQA item, for example:
[
  {
    "id": "unique_item_id",
    "question": "Who wrote the novel ...",
    "context": "[DOC] relevant passage text ...",
    "answers": ["expected answer"]
  }
]
This repository ships no benchmark example contents (only the ID manifests described in §3.1). Prepare your own splits in the format above before training.
4.3 Split Modes
| env.split_mode | Behavior |
|---|---|
| split_dir | Use a pre-built directory with explicit train/val/test folders (set env.split_dir). Deterministic and reproducible. |
| ratio | Build a deterministic split on the fly from a single env.data_path, using split_seed (and a train:val:test ratio). Convenient for quick experiments. |
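The idea behind the ratio mode is a seeded shuffle followed by fixed cuts. The sketch below illustrates that idea only: ratio_split is a hypothetical function, the 8:1:1 default is an assumed ratio chosen for the example, and SkillOpt's actual splitting code may order or round items differently.

```python
import random


def ratio_split(items, ratio=(8, 1, 1), split_seed=42):
    """Shuffle a copy of items with a fixed seed, then cut it into train/val/test.

    The ratio default is an assumption for illustration, not a SkillOpt default.
    """
    shuffled = list(items)
    # A private Random instance keeps the split independent of global RNG state.
    random.Random(split_seed).shuffle(shuffled)
    total = sum(ratio)
    n_train = len(shuffled) * ratio[0] // total
    n_val = len(shuffled) * ratio[1] // total
    return {
        "train": shuffled[:n_train],
        "val": shuffled[n_train:n_train + n_val],
        "test": shuffled[n_train + n_val:],  # remainder goes to the test split
    }
```

Because the shuffle is driven only by split_seed, two calls with the same items and seed yield identical splits, which is what makes this mode reproducible.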
5.1 The Training Loop
The loop lives in ReflACTTrainer (skillopt/engine/trainer.py). Each epoch runs a series of optimization steps over rollout batches, then performs two epoch-boundary stages.
for epoch in epochs:
    for step in steps:
        1. Rollout    # target executes a batch of tasks
        2. Reflect    # optimizer analyzes trajectories → edit patches
        3. Aggregate  # hierarchically merge similar patches
        4. Select     # rank & clip edits to the learning rate
        5. Update     # apply patches → candidate skill
        6. Gate       # score on selection split → accept / reject
    # epoch boundary (from epoch 2 onward)
    Slow update       # longitudinal comparison → protected guidance
    Meta skill        # cross-epoch optimizer memory
5.2 The Six Per-Step Stages
| Stage | What happens | Source |
|---|---|---|
| 1. Rollout | The target model runs each task in the batch with the current skill as context, producing trajectories and scores. | envs/<b>/rollout.py |
| 2. Reflect | The optimizer runs an error analyst (and optional success analyst) over minibatches of trajectories, emitting structured edit patches. Runs in parallel across analyst_workers. | gradient/reflect.py |
| 3. Aggregate | Semantically similar patches are merged hierarchically to remove redundancy. | gradient/aggregate.py → merge_patches |
| 4. Select | Patches are ranked and clipped to the current learning rate (max edits this step), set by the scheduler. | optimizer/clip.py → rank_and_select |
| 5. Update | Selected edits are applied to the skill document, producing a candidate skill (patch / rewrite modes). | optimizer/skill.py, update_modes.py |
| 6. Gate | The candidate is scored on the selection split and accepted only if it improves (see §5.3). | evaluation/gate.py → evaluate_gate |
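The Select stage (stage 4) can be pictured as a sort followed by a truncation to the current edit budget. The sketch below assumes each patch is a dict with a numeric score field; that field and the function name select_top_k are hypothetical, and the real ranking in optimizer/clip.py may use different criteria.

```python
def select_top_k(patches, budget):
    """Keep the `budget` highest-scoring patches, best first.

    Assumption: each patch is a dict carrying a numeric "score" key.
    """
    ranked = sorted(patches, key=lambda p: p["score"], reverse=True)
    return ranked[:max(budget, 0)]
```

With a budget of 4 (the default learning_rate), at most four patches reach the Update stage regardless of how many the Reflect stage produced.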
5.3 Validation Gate
evaluate_gate is a pure decision function. It compares the candidate's selection-set score against the current and best skills:
- accept_new_best — candidate > current and candidate > best → becomes both current and best.
- accept — candidate > current but ≤ best → becomes current only.
- reject — candidate ≤ current → discarded; current/best unchanged.
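The three outcomes above translate directly into a small pure function. This is a sketch of the rule as stated in the list, under the assumption that scores are plain numbers; gate_decision is an illustrative name and not the signature of evaluate_gate.

```python
def gate_decision(candidate, current, best):
    """Map three selection-set scores to accept_new_best / accept / reject."""
    if candidate > current and candidate > best:
        return "accept_new_best"  # becomes both current and best
    if candidate > current:
        return "accept"           # becomes current only
    return "reject"               # ties with current are rejected too
```

Note that the comparisons are strict, so a candidate that merely matches the current score is discarded.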
The comparison metric is configurable via evaluation.gate_metric:
| Metric | Score used |
|---|---|
| hard (default) | Exact-match / discrete score |
| soft | Partial-credit / continuous score |
| mixed | Weighted blend, controlled by gate_mixed_weight |
The soft/mixed metrics (contributed config configs/examples/soft_gate.yaml) help when the selection split is small and rewards are continuous, where a discrete hard gate may reject every candidate and stall training. Paper numbers use the default hard gate.
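To see why a soft or mixed metric can unblock a stalled gate, consider how the gate score might be chosen. The sketch below assumes the mixed metric is a linear blend, w * soft + (1 - w) * hard, with w = gate_mixed_weight; the text only says "weighted blend" with the weight on the soft score, so the exact formula is an assumption, and gate_score is a hypothetical name rather than select_gate_score itself.

```python
def gate_score(hard, soft, gate_metric="hard", gate_mixed_weight=0.5):
    """Pick the score the gate compares, given a hard and a soft score."""
    if gate_metric == "hard":
        return hard
    if gate_metric == "soft":
        return soft
    if gate_metric == "mixed":
        # Assumed linear blend; the weight applies to the soft score.
        return gate_mixed_weight * soft + (1.0 - gate_mixed_weight) * hard
    raise ValueError(f"unknown gate_metric: {gate_metric!r}")
```

If two skills both score 0.5 on the hard metric but 0.55 and 0.60 on the soft one, the hard gate sees a tie and rejects, while the soft and mixed gates can tell the candidates apart.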
5.4 Slow Update (Momentum)
At each epoch boundary (from epoch 2), the slow update rolls out both the previous epoch's skill and the current skill on the same sampled tasks, categorizes items (improved / regressed / persistent-fail / stable-success), and asks the optimizer to write a free-form guidance block. This guidance lands in a protected region of the skill that step-level edits cannot touch — only the slow update overwrites it. It is SkillOpt's analogue of momentum, countering cross-epoch forgetting.
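The four item categories can be derived from a pair of per-item outcomes. The sketch below assumes each rollout result reduces to a pass/fail boolean per item ID, which is a simplification; the function names and the dict-based input format are hypothetical and do not mirror run_slow_update.

```python
def categorize(prev_ok, curr_ok):
    """Classify one item from its outcome under the previous and current skill."""
    if curr_ok and not prev_ok:
        return "improved"
    if prev_ok and not curr_ok:
        return "regressed"
    return "stable_success" if curr_ok else "persistent_fail"


def bucket_items(prev_results, curr_results):
    """Group item IDs by category; both inputs map item_id -> bool (assumed)."""
    buckets = {"improved": [], "regressed": [],
               "persistent_fail": [], "stable_success": []}
    for item_id, prev_ok in prev_results.items():
        buckets[categorize(prev_ok, curr_results[item_id])].append(item_id)
    return buckets
```

The regressed and persistent_fail buckets are the natural evidence for guidance against forgetting, since they show what the newer skill lost or never fixed.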
Acceptance has two modes, selected by optimizer.slow_update_gate_with_selection:
| Mode | Behavior |
|---|---|
| false (default): force-injected | Guidance is injected into both current and best skills unconditionally. The longitudinal guidance always persists; it is not gated by step-level selection scores. |
| true: gated | The slow-update candidate is scored on the selection split and accepted/rejected through the same validation gate as step-level updates. |
5.5 Meta Skill (Optimizer Memory)
The meta skill is optimizer-side memory — it never modifies the target skill document. At the end of each epoch (skipped for epoch 1), the optimizer compares the previous and current epoch's last-step skills on the same sampled tasks and writes a compact, evidence-based reflection on what kind of edits helped or hurt. That memory is then injected as extra context into the next epoch's reflect / merge / learning-rate / ranking stages, so the optimizer accumulates strategy across the run.
5.6 Skill Document Anatomy
A skill document is plain Markdown. Initial skills can be empty (learn from scratch) or seeded with domain knowledge via env.skill_init. During training the document accrues rules, patterns, and edge-case handling through accepted edit patches. A dedicated protected region holds the slow-update guidance, delimited by HTML-comment markers:
# Question Answering Skill
## Learned rules ...
- When the context contains multiple candidates, prefer ...
<!-- SLOW_UPDATE_START -->
# (epoch-level longitudinal guidance — only the slow update writes here)
<!-- SLOW_UPDATE_END -->
Helpers in optimizer/slow_update.py manage this region: inject_empty_slow_update_field (placeholder at epoch 1), extract_slow_update_field (read), and replace_slow_update_field (overwrite). Step-level edits are blocked from modifying anything inside the markers.
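Reading and overwriting a marker-delimited region is straightforward with a regular expression. The sketch below is an illustrative re-implementation using only the two markers shown above; read_region and write_region are hypothetical names, and the package's own helpers may differ in whitespace handling and in what they do when the markers are missing.

```python
import re

START = "<!-- SLOW_UPDATE_START -->"
END = "<!-- SLOW_UPDATE_END -->"
# Non-greedy match so only the text between one marker pair is captured.
_REGION = re.compile(re.escape(START) + r"(.*?)" + re.escape(END), re.DOTALL)


def read_region(skill_md):
    """Return the guidance text between the markers, or None if absent."""
    match = _REGION.search(skill_md)
    return match.group(1).strip() if match else None


def write_region(skill_md, guidance):
    """Overwrite the protected region, appending it if the markers are missing."""
    block = f"{START}\n{guidance.strip()}\n{END}"
    if _REGION.search(skill_md):
        # A callable replacement avoids backslash-escape surprises in guidance.
        return _REGION.sub(lambda _m: block, skill_md, count=1)
    return skill_md.rstrip("\n") + "\n\n" + block + "\n"
```

Everything outside the markers is returned unchanged, which is the property that lets step-level edits and the slow update coexist in one file.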
6.1 Configuration System
Configs are structured YAML with section blocks (model, train, gradient, optimizer, evaluation, env) and _base_ inheritance. A benchmark config inherits the shared defaults and overrides only what differs:
# configs/searchqa/default.yaml
_base_: ../_base_/default.yaml
train:
  train_size: 400
  batch_size: 40
optimizer:
  learning_rate: 4
env:
  name: searchqa
  split_dir: data/searchqa_split
Override any key at the command line without editing files:
python scripts/train.py --config configs/searchqa/default.yaml \
--cfg-options optimizer.learning_rate=16 optimizer.lr_scheduler=linear
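Conceptually, _base_ inheritance is a recursive dictionary merge and a --cfg-options entry is a dotted-path assignment. The sketch below shows both on plain dicts so it runs without a YAML library. The function names are hypothetical (they are not load_config or apply_overrides), and parsing values with JSON rules is an assumption about how strings such as 16 or true become typed values.

```python
import copy
import json


def deep_merge(base, override):
    """Return base with override applied; nested sections merge key by key."""
    merged = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


def parse_value(raw):
    """Assumed typing rule: try JSON (16, 0.5, true), else keep the string."""
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        return raw


def apply_override(config, assignment):
    """Apply one 'section.key=value' assignment in place."""
    dotted, raw = assignment.split("=", 1)
    *parents, leaf = dotted.split(".")
    node = config
    for name in parents:
        node = node.setdefault(name, {})
    node[leaf] = parse_value(raw)
```

Merging a benchmark config over the base and then applying optimizer.learning_rate=16 leaves every untouched base key in place, which is why benchmark files only need to list what differs.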
Each section lists the key (relative to its YAML block), type, default (from configs/_base_/default.yaml), allowed values, and meaning. Defaults shown are the shipped base defaults.
6.2 model.*
| Key | Type | Default | Description / options |
|---|---|---|---|
| backend | str | azure_openai | High-level backend label for the run. |
| optimizer | str | gpt-5.5 | Optimizer model deployment (writes skill edits). |
| target | str | gpt-5.5 | Target model deployment (executes tasks). |
| optimizer_backend | str | openai_chat | Client path for the optimizer: openai_chat or claude_chat. |
| target_backend | str | openai_chat | Client path for the target: openai_chat / claude_chat / qwen_chat / codex_exec / claude_code_exec. |
| reasoning_effort | str | medium | low / medium / high / xhigh / max (or empty). |
| rewrite_reasoning_effort | str | "" | Override effort for full-rewrite calls (empty = inherit). |
| rewrite_max_completion_tokens | int | 64000 | Token cap for full-rewrite optimizer calls. |
| azure_openai_endpoint | str | "" | Azure resource URL (or via AZURE_OPENAI_ENDPOINT). |
| azure_openai_api_version | str | 2024-12-01-preview | Azure API version header. |
| azure_openai_auth_mode | str | "" | api_key / azure_cli / managed_identity / openai_compatible (empty → env default). |
Every azure_openai_* key also has optimizer_azure_openai_* and target_azure_openai_* variants, letting you point the optimizer and target at different Azure resources. Exec backends (codex_exec, claude_code_exec) add their own codex_exec_* / claude_code_exec_* knobs (sandbox, reasoning effort, SDK mode, etc.).
6.3 train.*
| Key | Type | Default | DL analogy | Description |
|---|---|---|---|---|
| num_epochs | int | 4 | Epochs | Number of training epochs. |
| train_size | int | 0 | Train-set size | 0 = derive from the dataset split. (Fixed by split size when using split_dir.) |
| batch_size | int | 40 | Batch size | Tasks rolled out per optimization step. |
| accumulation | int | 1 | Grad accumulation | Accumulation rounds per step. |
| seed | int | 42 | Random seed | Reproducibility seed. |
6.4 gradient.*
| Key | Type | Default | Description |
|---|---|---|---|
| minibatch_size | int | 8 | Trajectories per reflect minibatch. |
| merge_batch_size | int | 8 | Patches per merge batch during aggregation. |
| analyst_workers | int | 16 | Parallel reflection workers (data parallelism). |
| max_analyst_rounds | int | 3 | Max rounds of analyst reflection per step. |
| failure_only | bool | false | Reflect only on failed trajectories when true. |
6.5 optimizer.*
| Key | Type | Default | DL analogy | Description / options |
|---|---|---|---|---|
| learning_rate | int | 4 | Learning rate | Max edit patches applied per step (the "edit budget"). |
| min_learning_rate | int | 2 | Min LR | Floor edit budget for decaying schedulers. |
| lr_scheduler | str | cosine | LR schedule | constant / linear / cosine / autonomous. |
| lr_control_mode | str | fixed | - | fixed / autonomous / none. |
| skill_update_mode | str | patch | - | patch / rewrite_from_suggestions / full_rewrite_minibatch. |
| use_slow_update | bool | true | Momentum | Enable epoch-boundary slow update. |
| slow_update_samples | int | 20 | - | Tasks sampled for the longitudinal comparison. |
| slow_update_gate_with_selection | bool | false | - | false = force-inject guidance; true = gate it on the selection split (see §5.4). |
| longitudinal_pair_policy | str | mixed | - | mixed / changed / unchanged: which comparison pairs to keep. |
| use_meta_skill | bool | true | Meta-learning | Enable cross-epoch optimizer memory. |
| use_skill_aware_reflection | bool | false | - | EmbodiSkill-style failure routing: SKILL_DEFECT (rule wrong/missing → gated body edit) vs EXECUTION_LAPSE (valid rule not followed → reminder appended to a protected appendix region that step-level edits never modify). Off = baseline-identical; resolved process-wide, works on every benchmark. Not supported with rewrite_from_suggestions / full-rewrite modes. |
| skill_aware_appendix_source | str | both | - | both (success analyst may also re-emphasize rules) / failure_only (paper-faithful S_app: failure side only). |
| skill_aware_consolidate_threshold | int | 0 | - | >0: LLM-compact the appendix once it exceeds N notes (experimental); 0 = off. |
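To build intuition for how learning_rate, min_learning_rate, and lr_scheduler interact, here is a sketch of a cosine edit-budget schedule. It uses the standard cosine-annealing formula and rounds to a whole number of edits; whether build_scheduler anneals per step or per epoch, and how it rounds, is not stated here, so treat cosine_edit_budget and its step-based indexing as assumptions.

```python
import math


def cosine_edit_budget(step, total_steps, learning_rate=4, min_learning_rate=2):
    """Anneal the per-step edit budget from learning_rate down to min_learning_rate."""
    if total_steps <= 1:
        return learning_rate
    # Clamp so out-of-range steps reuse the first or last budget.
    progress = min(max(step, 0), total_steps - 1) / (total_steps - 1)
    span = learning_rate - min_learning_rate
    value = min_learning_rate + 0.5 * span * (1 + math.cos(math.pi * progress))
    # Budgets are whole edits, never below the configured floor.
    return max(min_learning_rate, round(value))
```

With learning_rate=16 and min_learning_rate=2, the budget starts at 16 edits and ends at 2, so early steps can restructure the skill broadly while late steps make only a few targeted changes.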
6.6 evaluation.*
| Key | Type | Default | Description / options |
|---|---|---|---|
| use_gate | bool | true | Validation gating is mandatory in this branch (must remain true). |
| gate_metric | str | hard | hard / soft / mixed: score used by the gate (see §5.3). |
| gate_mixed_weight | float | 0.5 | Weight on the soft score when gate_metric = mixed. |
| sel_env_num | int | 0 | Selection-split eval size (0 = use full split). |
| test_env_num | int | 0 | Test-split eval size (0 = use full split). |
| eval_test | bool | true | Run a final test evaluation after training. |
Setting evaluation.use_gate: false raises an error — validation gating cannot be disabled in this branch.
6.7 env.*
| Key | Type | Default | Description |
|---|---|---|---|
| name | str | "" | Benchmark name (searchqa, docvqa, alfworld, …). Selects the env module. |
| skill_init | str | "" | Path to a seed skill (empty = start from scratch). |
| split_mode | str | ratio | ratio or split_dir (see §4.3). |
| split_dir | str | "" | Pre-split directory (when split_mode = split_dir). |
| data_path | str | "" | Single dataset path (when split_mode = ratio). |
| split_seed | int | 42 | Seed for deterministic ratio splitting. |
| exec_timeout | int | 120 | Per-task target/code-agent timeout (seconds). |
| out_root | str | "" | Output directory for the run. |
Env blocks may carry extra benchmark-specific keys (e.g. max_turns, workers, max_completion_tokens, limit). Unmapped env keys are passed straight through to the benchmark adapter — check the relevant configs/<benchmark>/default.yaml.
7.1 Supported Benchmarks
| Benchmark | Type | Config |
|---|---|---|
| SearchQA | Question answering | configs/searchqa/default.yaml |
| DocVQA | Document QA | configs/docvqa/default.yaml |
| ALFWorld | Embodied agent | configs/alfworld/default.yaml |
| LiveMathematicianBench | Math reasoning | configs/livemathematicianbench/default.yaml |
| SpreadsheetBench | Spreadsheet code generation | configs/spreadsheetbench/default.yaml |
| OfficeQA | Tool-augmented QA | configs/officeqa/default.yaml |
Each benchmark is a self-contained module under skillopt/envs/<benchmark>/ with an adapter.py, dataloader.py, rollout.py, and evaluator.py (some add a custom reflect.py). Packaged reference skills live in ckpt/<benchmark>/.
7.2 Add a New Benchmark
Use skillopt/envs/_template/ as a starting point. At minimum, implement:
- Dataloader: read your item JSON into the framework's item dicts (dataloader.py).
- Rollout: run the target on one item with the current skill and return a trajectory + score (rollout.py).
- Evaluator: score predictions against ground truth (evaluator.py).
- Adapter: wire the above into the trainer's expected interface and register the env name (adapter.py).
Then add a configs/<name>/default.yaml inheriting _base_/default.yaml and set env.name to your new benchmark.
8.1 Module Map
| Module | Responsibility |
|---|---|
| skillopt/config.py | Load structured YAML, resolve _base_ inheritance, flatten to the trainer's flat dict, apply CLI overrides. |
| skillopt/engine/trainer.py | ReflACTTrainer: orchestrates the whole loop, gating, slow update, meta skill, resume, and artifact writing. |
| skillopt/gradient/ | Reflection ("backward pass"): reflect.py analysts, aggregate.py patch merging. |
| skillopt/optimizer/ | The "optimizer": edit application, learning-rate scheduling, edit selection, slow update, meta skill, rewrite modes. |
| skillopt/evaluation/gate.py | Pure accept/reject decision and metric selection. |
| skillopt/model/ | Backend clients (OpenAI/Azure, Claude, Qwen, Codex/Claude-Code exec) and routing. |
| skillopt/envs/<b>/ | Per-benchmark dataloader, rollout, evaluator, adapter. |
8.2 Core Functions
| Function | File | Purpose |
|---|---|---|
| load_config / flatten_config / apply_overrides | config.py | Load YAML with inheritance; flatten sections; apply key=value overrides. |
| run_minibatch_reflect | gradient/reflect.py | Run error/success analysts over trajectory minibatches → edit patches. |
| merge_patches | gradient/aggregate.py | Hierarchically merge semantically similar patches. |
| rank_and_select | optimizer/clip.py | Rank edits and clip to the learning-rate budget. |
| build_scheduler | optimizer/scheduler.py | Construct the LR (edit-budget) scheduler: constant/linear/cosine/autonomous. |
| decide_autonomous_learning_rate | optimizer/lr_autonomous.py | Let the optimizer pick the next learning rate (autonomous mode). |
| apply_patch / apply_edit | optimizer/skill.py | Apply edits to the skill document (respecting the protected region). |
| rewrite_skill_from_suggestions | optimizer/rewrite.py | Full-rewrite update mode from accumulated suggestions. |
| evaluate_gate / select_gate_score | evaluation/gate.py | Accept/reject decision; compute hard/soft/mixed score. |
| run_slow_update | optimizer/slow_update.py | Produce epoch-boundary longitudinal guidance. |
| replace_slow_update_field / extract_slow_update_field | optimizer/slow_update.py | Read/overwrite the protected guidance region. |
| run_meta_skill / format_meta_skill_context | optimizer/meta_skill.py | Generate cross-epoch optimizer memory and render it into reflection context. |
8.3 CLI Scripts
scripts/train.py
Runs a full training loop. Required: --config. Override config via --cfg-options section.key=value … or legacy flat flags (--num_epochs, --batch_size, --optimizer_model, --target_model, --lr_scheduler, --edit_budget, --split_dir, …).
scripts/eval_only.py
Evaluates a skill document without training. Required: --config and --skill. Use --split to choose train / valid_seen / valid_unseen / all.
python scripts/eval_only.py \
--config configs/searchqa/default.yaml \
--skill outputs/my_run/best_skill.md \
--split valid_unseen
8.4 WebUI
An optional Gradio dashboard to configure parameters and monitor runs:
pip install -e ".[webui]"
python -m skillopt_webui.app # http://localhost:7860
python -m skillopt_webui.app --share # public share link
| Flag | Default | Description |
|---|---|---|
| --port | 7860 | Server port. |
| --host | 0.0.0.0 | Bind address. |
| --share | off | Create a public Gradio share link. |