Microsoft Research · documentation hub

SkillOpt Documentation & Reproduction Guide

Improve frozen agents by optimizing the Markdown skills that guide them—using reflective updates and held-out validation instead of weight training.

How this guide stays accurate: This page is a stable, concise entry point. Detailed commands, defaults, and APIs live in the versioned Markdown documentation beside the code. For exact behavior in a checkout, the command's --help, the selected YAML config, and the code are authoritative.

Overview

SkillOpt treats a natural-language skill document as the trainable state of an agent. A target model executes tasks, an optimizer reflects on the resulting trajectories, bounded edits form a candidate skill, and a validation gate decides whether to keep it.

Research engine

Run reproducible training and evaluation over benchmark splits. Six released benchmark configurations cover QA, document QA, embodied agents, math, spreadsheets, and tool-augmented QA.

SkillOpt-Sleep preview

Harvest supported coding-agent sessions, mine replayable tasks, and stage proposed memory or skill updates for review. It is a separate, evolving deployment companion—not the paper's benchmark runner.

The optimizer and target are separate roles and may use different backends. Validation gating is the research default and the paper-style setting; deliberately disabling it force-accepts candidates and changes the experiment semantics. SkillOpt-Sleep stages updates by default; automatic adoption is opt-in.
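
The gating behavior described above can be sketched in a few lines. This is a minimal illustration, not SkillOpt's implementation: `gated_step`, `propose`, and `score` are hypothetical names, and the acceptance rule (a strictly higher held-out score) is an assumption made for the example.

```python
from typing import Callable


def gated_step(
    skill: str,
    propose: Callable[[str], str],
    score: Callable[[str], float],
    gate: bool = True,
) -> tuple[str, bool]:
    """Return (skill_to_keep, accepted) for one optimization step."""
    # `propose` stands in for rollout, reflection, and bounded edits.
    candidate = propose(skill)
    if not gate:
        # Gating disabled: the candidate is force-accepted without validation.
        return candidate, True
    # Assumed rule: keep the candidate only if it scores strictly higher
    # on held-out validation tasks.
    accepted = score(candidate) > score(skill)
    return (candidate if accepted else skill), accepted


# Toy stand-ins: the "optimizer" appends a rule, the "validator" penalizes length.
def propose(skill: str) -> str:
    return skill + "\n- extra rule"


def score(skill: str) -> float:
    return -float(len(skill))


kept, accepted = gated_step("# Skill", propose, score)
forced, forced_accepted = gated_step("# Skill", propose, score, gate=False)
```

With the gate on, the longer candidate is rejected and the original skill is kept. With the gate off, the same candidate is adopted, which is why disabling the gate changes what an experiment measures.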

Choose the right workflow

Goal	Start with
Reproduce paper-style benchmark training	Research first experiment
Evaluate an existing skill without training	Evaluation CLI reference
Add a benchmark adapter	New benchmark guide
Connect another model provider	Backend guide
Improve a coding-agent skill from local sessions	SkillOpt-Sleep

Install

SkillOpt requires Python 3.10 or newer.

# Published package
python -m pip install skillopt

# Latest source and development workflow
git clone https://github.com/microsoft/SkillOpt.git
cd SkillOpt
python -m pip install -e .

# Install only the extras you need
python -m pip install -e ".[searchqa]"  # SearchQA materialization
python -m pip install -e ".[alfworld]"  # ALFWorld
python -m pip install -e ".[claude]"    # optional Claude agent SDK support
python -m pip install -e ".[webui]"     # Gradio dashboard
python -m pip install -e ".[dev]"       # tests and linting

Release boundary: This guide tracks main, while PyPI currently serves 0.2.0. Until the next release, four features require a source install from main: the generic research openai_compatible backend, Sleep handoff, SkillOpt-Sleep support for non-Azure OpenAI-compatible endpoints, and the Sleep --preferences flag.

See the installation guide for platform notes and dependency boundaries.

Credentials and endpoint families

Copy .env.example, fill only the backend you use, and load it into your shell. Do not commit the resulting .env.

cp .env.example .env
set -a
source .env
set +a

Azure OpenAI

export AZURE_OPENAI_ENDPOINT="https://your-resource.openai.azure.com/"
export AZURE_OPENAI_API_VERSION="2024-12-01-preview"
export AZURE_OPENAI_AUTH_MODE="api_key"
export AZURE_OPENAI_API_KEY="your-key"

For keyless Azure authentication, use azure_cli or managed_identity and follow the configuration guide. Setting an API key without AZURE_OPENAI_AUTH_MODE=api_key does not change the default authentication mode.
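
A small preflight check can catch the mismatch described above before a run starts. The sketch below is an illustration for the research Azure OpenAI settings only: `check_azure_auth` is a hypothetical helper, not part of SkillOpt, and it inspects nothing beyond the variable names shown in this section.

```python
import os
from typing import Mapping


def check_azure_auth(env: Mapping[str, str]) -> list[str]:
    """Return warnings about the Azure OpenAI variables in `env`."""
    warnings = []
    mode = env.get("AZURE_OPENAI_AUTH_MODE")
    has_key = bool(env.get("AZURE_OPENAI_API_KEY"))
    if not env.get("AZURE_OPENAI_ENDPOINT"):
        warnings.append("AZURE_OPENAI_ENDPOINT is not set")
    if has_key and mode != "api_key":
        # A key alone does not change the default authentication mode.
        warnings.append("AZURE_OPENAI_API_KEY is set but AZURE_OPENAI_AUTH_MODE is not 'api_key'")
    if mode == "api_key" and not has_key:
        warnings.append("AZURE_OPENAI_AUTH_MODE is 'api_key' but AZURE_OPENAI_API_KEY is not set")
    return warnings


if __name__ == "__main__":
    for warning in check_azure_auth(os.environ):
        print("warning:", warning)
```

The function takes a mapping instead of reading `os.environ` directly so it can be exercised with plain dictionaries.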

Generic OpenAI-compatible research backend

export OPENAI_COMPATIBLE_BASE_URL="https://api.example.com/v1"
export OPENAI_COMPATIBLE_API_KEY="your-key"
export OPENAI_COMPATIBLE_MODEL="provider-model"

python scripts/train.py --config configs/searchqa/default.yaml \
  --cfg-options \
  model.optimizer_backend=openai_compatible \
  model.target_backend=openai_compatible \
  model.optimizer=provider-model \
  model.target=provider-model

This provider-neutral backend is distinct from Azure OpenAI. Per-role overrides use the OPTIMIZER_OPENAI_COMPATIBLE_* and TARGET_OPENAI_COMPATIBLE_* variables. The train and eval scripts apply the role models from the YAML config (model.optimizer and model.target) after backend initialization, so those values override any model name set through environment variables.
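
The precedence described above can be pictured as a small lookup. This sketch rests on two assumptions not confirmed here: that the role-specific variables use the same suffixes as the shared family (BASE_URL, API_KEY, MODEL), and that a missing role-specific value falls back to the shared one. `resolve` is a hypothetical helper, not a SkillOpt function.

```python
from typing import Mapping, Optional


def resolve(
    role: str,
    suffix: str,
    env: Mapping[str, str],
    yaml_value: Optional[str] = None,
) -> Optional[str]:
    """Resolve one setting for a role ("optimizer" or "target")."""
    # The YAML role model is applied last, so it wins for the model name.
    if suffix == "MODEL" and yaml_value:
        return yaml_value
    role_key = f"{role.upper()}_OPENAI_COMPATIBLE_{suffix}"  # assumed naming
    shared_key = f"OPENAI_COMPATIBLE_{suffix}"
    # Assumed fallback: role-specific value first, then the shared value.
    return env.get(role_key) or env.get(shared_key)
```

For example, with only `OPENAI_COMPATIBLE_BASE_URL` set, both roles resolve to the same endpoint. Adding `TARGET_OPENAI_COMPATIBLE_BASE_URL` changes the target alone.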

OpenAI-compatible endpoints in SkillOpt-Sleep

For backward compatibility, the Sleep CLI exposes OpenAI-compatible endpoints through its azure_openai backend, so it reads the AZURE_OPENAI_* environment variables rather than the OPENAI_COMPATIBLE_* family:

export AZURE_OPENAI_ENDPOINT="https://api.example.com/v1"
export AZURE_OPENAI_API_KEY="your-key"
export AZURE_OPENAI_AUTH_MODE="openai_compatible"

skillopt-sleep run --backend azure_openai --model provider-model

Do not mix this mode with Azure CLI or managed-identity settings. See the dedicated Sleep endpoint guide.

Research engine: first experiment

The repository ships deterministic ID manifests, not the benchmark examples themselves. Materialize the SearchQA examples once, then run its checked-in config:

python -m pip install -e ".[searchqa]"
python scripts/materialize_searchqa.py

# Load model credentials first, then:
python scripts/train.py --config configs/searchqa/default.yaml

The run directory contains best_skill.md, runtime_state.json, history.json, versioned files under skills/, and step-level artifacts. Re-running with the same output root resumes from persisted state.
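
Before evaluating or resuming, it can help to confirm that a run directory holds the artifacts listed above. The sketch below checks only for their presence. `missing_artifacts` is a hypothetical helper, and the contents of the JSON files are deliberately not inspected because their schema is not described here.

```python
from pathlib import Path

# Names taken from the artifact list above.
EXPECTED_FILES = ("best_skill.md", "runtime_state.json", "history.json")
EXPECTED_DIRS = ("skills",)


def missing_artifacts(run_dir: Path) -> list[str]:
    """Return the expected artifacts that are absent from `run_dir`."""
    missing = [name for name in EXPECTED_FILES if not (run_dir / name).is_file()]
    missing += [f"{name}/" for name in EXPECTED_DIRS if not (run_dir / name).is_dir()]
    return missing
```

An empty list means every listed artifact is present. It says nothing about whether the run finished or whether the persisted state is valid.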

python scripts/eval_only.py \
  --config configs/searchqa/default.yaml \
  --skill outputs/<run>/best_skill.md \
  --split valid_unseen

Reproduction boundary: Use the released train/validation/test manifests and record the exact model deployment, config, seed, and source revision. Provider behavior can change independently of this repository.
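
One way to record those four items is a small JSON-serializable record stored next to the run outputs. This is a sketch under stated assumptions: `reproduction_record` and `source_revision` are hypothetical helpers, the field names are illustrative, and the config hash is an extra safeguard rather than something SkillOpt is known to write.

```python
import hashlib
import subprocess
from pathlib import Path


def source_revision(repo: Path = Path(".")) -> str:
    """Return the current commit hash of the checkout at `repo`."""
    result = subprocess.run(
        ["git", "rev-parse", "HEAD"],
        cwd=repo, check=True, capture_output=True, text=True,
    )
    return result.stdout.strip()


def reproduction_record(config_path: Path, model: str, seed: int, revision: str) -> dict:
    """Collect the details needed to rerun an experiment later."""
    return {
        "model_deployment": model,
        "config": str(config_path),
        # Hashing the config detects later edits to the same file path.
        "config_sha256": hashlib.sha256(config_path.read_bytes()).hexdigest(),
        "seed": seed,
        "source_revision": revision,
    }
```

Passing the result to `json.dumps` and saving it alongside the run directory keeps the record with the artifacts it describes.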

Continue with the first-experiment guide and dataset manifest documentation.

Research model backends

Backend	Optimizer	Target	Notes
`openai_chat`	Yes	Yes	Azure OpenAI plus its explicit authentication modes.
`openai_compatible`	Yes	Yes	Provider-neutral chat-completions endpoint.
`claude_chat`	Yes	Yes	Runs an installed, authenticated Claude Code CLI via `claude -p`; not a direct Anthropic API client.
`qwen_chat`	Yes	Yes	Local or hosted Qwen-compatible server.
`minimax_chat`	Yes	Yes	MiniMax chat endpoint.
`codex_exec`	Yes	Supported adapters only	Executes Codex for optimizer calls and as a target agent where supported.
`claude_code_exec`	No	Supported adapters only	Executes Claude Code as a target agent.

Prefer the structured model.optimizer_backend and model.target_backend settings. Legacy --backend aliases do not expose every role-specific combination. Exec backends are not generic chat replacements and require adapter support.
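
The role constraints in the table above can be encoded as data and checked before launching a run. The sketch below mirrors the table only. `check_roles` is a hypothetical helper, and it cannot tell whether a particular benchmark adapter actually supports an exec backend.

```python
# Role support copied from the table above: (optimizer, target).
# "adapter" means the role works only with supported benchmark adapters.
ROLE_SUPPORT = {
    "openai_chat": ("yes", "yes"),
    "openai_compatible": ("yes", "yes"),
    "claude_chat": ("yes", "yes"),
    "qwen_chat": ("yes", "yes"),
    "minimax_chat": ("yes", "yes"),
    "codex_exec": ("yes", "adapter"),
    "claude_code_exec": ("no", "adapter"),
}


def check_roles(optimizer_backend: str, target_backend: str) -> list[str]:
    """Return problems with a proposed optimizer/target backend pair."""
    problems = []
    roles = (("optimizer", optimizer_backend, 0), ("target", target_backend, 1))
    for role, name, index in roles:
        support = ROLE_SUPPORT.get(name)
        if support is None:
            problems.append(f"unknown {role} backend: {name}")
        elif support[index] == "no":
            problems.append(f"{name} cannot serve as {role}")
        elif support[index] == "adapter":
            problems.append(f"{name} as {role} requires adapter support")
    return problems
```

The pair used in the earlier command, `openai_compatible` for both roles, passes with no problems reported.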

Research documentation map

Reference	Use it for
Configuration	Authentication, structured YAML, role-specific backends, and overrides.
CLI	Current train/eval entry points and exact output paths.
Config reference	Supported sections, defaults, and validation constraints.
Training loop	Rollout, reflection, edit selection, gating, slow update, and meta skill.
Skill document	Skill structure and protected regions.
Python API	Stable public imports and low-level/internal boundaries.
Changelog	Recently merged capabilities, fixes, and contributor credits.

SkillOpt-Sleep: safe first run

SkillOpt-Sleep is a preview deployment companion. Its default mock backend is useful for testing control flow without API spend; it is not evidence that a real model's quality improved.

# Deterministic engine proof; no model credentials required
python -m skillopt_sleep.experiments.run_experiment \
  --persona researcher --assert-improves

# Inspect local session handling without adopting any update
skillopt-sleep dry-run \
  --project "$PWD" \
  --source auto \
  --backend mock

# A real run: explicitly identify the skill to evolve
skillopt-sleep run \
  --project "$PWD" \
  --target-skill-path path/to/SKILL.md \
  --source auto \
  --backend claude

skillopt-sleep status --project "$PWD"
skillopt-sleep adopt --project "$PWD"

--project scopes collection but does not automatically choose a project's skill file. Use --target-skill-path when you intend to evolve a particular SKILL.md. Transcript source (claude, codex, or auto) and replay backend are independent settings.

For subscription-based workflows that should not launch an API or model subprocess, use --backend handoff and follow the generated prompt/answer loop. Read the complete Sleep guide before a real run.

Agent integrations

Agent	Integration status	Guide
Claude Code	Shared-engine plugin and handoff command	README
Codex	Shared-engine skill shell	README
GitHub Copilot	Shared-engine Sleep MCP plus a separate research MCP	README
Devin	Shared-engine MCP with Devin transcript conversion	README
OpenClaw	Independent community/reference adaptation; review locally before use	README

The plugin overview records which integrations use the shared engine and which require local adaptation.

Advanced Sleep controls

The main CLI exposes project/source selection, backend/model selection, bounded task and edit counts, preferences, reviewed task files, and staged adoption. Additional JSON configuration fields include:

Field	Default	Status
`dream_rollouts`	1	Single rollout by default; values above 1 enable experimental contrastive replay.
`dream_factor`	0	Synthetic task variants are off by default.
`recall_k`	0	Historical associative recall is off by default.

These are configuration fields, not current skillopt-sleep run flags. Treat multi-rollout, recall, synthetic dreaming, and experimental reward/budget controls as advanced features that require task-specific validation. The reported experiments and their exact settings are in RESULTS.md.
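
Because these fields change what a run does, it can be useful to list which advanced features a given JSON configuration turns on. The sketch assumes the three fields sit at the top level of a JSON object and that any positive `dream_factor` or `recall_k` enables the feature. Neither the file layout nor `experimental_features` comes from the SkillOpt-Sleep code.

```python
import json

# Defaults copied from the table above.
DEFAULTS = {"dream_rollouts": 1, "dream_factor": 0, "recall_k": 0}


def experimental_features(config_text: str) -> list[str]:
    """List the advanced features a JSON config would enable."""
    config = {**DEFAULTS, **json.loads(config_text)}
    enabled = []
    if config["dream_rollouts"] > 1:
        enabled.append("contrastive replay")
    if config["dream_factor"] > 0:  # assumed: any positive value enables it
        enabled.append("synthetic task variants")
    if config["recall_k"] > 0:  # assumed: any positive value enables it
        enabled.append("associative recall")
    return enabled
```

An empty configuration yields an empty list, which corresponds to the single-rollout default with recall and synthetic variants off.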

Data, privacy, and adoption safety

Real Sleep backends may send session-derived prompts, mined tasks, trajectories, and candidate edits to the selected provider. Review the source data and provider policy before use.
Secret redaction for persisted diagnostics is defense in depth; it is not a guarantee that every outbound model prompt is free of sensitive content. In particular, do not treat raw coding-agent transcripts as pre-sanitized.
Updates are staged for review by default. Use --auto-adopt only when you have an independent rollback and validation process.
A held-out gate reduces regressions on its measured tasks; it is not a security boundary or a proof of general improvement.
Use a temporary clone and synthetic transcripts when validating a new backend or plugin integration.
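
As a first pass over source data before a real run, a transcript can be scanned for obvious secret-like strings. The sketch below is illustrative and far from exhaustive: the patterns and `scan_transcript` are assumptions made for this example, not SkillOpt's redaction logic, and a clean result does not show that a transcript is safe to send.

```python
import re

# Illustrative patterns only; real secrets take many more forms.
PATTERNS = {
    "private_key": re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),
    "bearer_token": re.compile(r"\bbearer\s+[A-Za-z0-9._\-]{20,}", re.IGNORECASE),
    "key_assignment": re.compile(
        r"\b[A-Z0-9_]*(?:API_KEY|SECRET|TOKEN|PASSWORD)\s*[=:]\s*\S+",
        re.IGNORECASE,
    ),
}


def scan_transcript(text: str) -> list[tuple[int, str]]:
    """Return (line_number, pattern_label) for each suspicious line."""
    hits = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        for label, pattern in PATTERNS.items():
            if pattern.search(line):
                hits.append((line_number, label))
    return hits
```

Reporting line numbers instead of matched text keeps the scanner's own output from repeating the secret.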

Contributing and extending

Before proposing a change, run the focused tests for the affected area, then the full suite where practical. Documentation changes should pass a strict MkDocs build and should be checked against actual CLI --help output.

python -m pip install -e ".[dev,docs]"
python -m pytest -q
python -m mkdocs build --strict
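
Checking documentation against --help output can be partly automated by comparing the long flags each one mentions. This is a minimal sketch: `flags_missing_from_help` is a hypothetical helper, and it compares flag names only, not their arguments or semantics.

```python
import re

# Matches long options such as --project or --auto-adopt, but not the
# middle of a longer token.
FLAG = re.compile(r"(?<![\w-])--[a-z][a-z0-9-]*")


def flags_missing_from_help(doc_text: str, help_text: str) -> set[str]:
    """Return flags mentioned in the docs that the help text does not list."""
    return set(FLAG.findall(doc_text)) - set(FLAG.findall(help_text))
```

Here `help_text` would be the captured output of the relevant command's --help, and `doc_text` the Markdown page under review. A non-empty result points to flags that are stale, misspelled, or documented on the wrong command.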

See CONTRIBUTING.md, the documentation workflow, and the focused guides for benchmarks and model backends.