Evaluating Custom Agents

VS Code custom agents are specialized AI personas defined in .agent.md files. They extend GitHub Copilot in VS Code with custom instructions, tool constraints, and behavioral guidelines. Unlike general skills, agents can restrict which tools they are allowed to use, declare a preferred model, and coordinate with other agents through handoffs.

For details on creating custom agents in VS Code, see the VS Code custom agents documentation.

Waza treats .agent.md files as first-class evaluation targets, just like SKILL.md. The key differences:

| Aspect | SKILL.md | .agent.md |
| --- | --- | --- |
| Purpose | Define reusable Copilot skills | Define custom VS Code agent personas |
| Frontmatter fields | name, description, triggers | name, description, tools, model, handoffs, mcp-servers, agents |
| Tool usage | No built-in constraint | Declares allowed tools in tools: field |
| Auto constraints | Manual tool_constraint grader needed | Implicit tool_constraint auto-injected if tools: field present |
| When both exist | SKILL.md takes priority in same directory | .agent.md evaluated only if SKILL.md absent |

Waza includes a complete security-reviewer agent example:

waza run examples/custom-agent/eval.yaml -v

This evaluates a custom security-reviewer agent with tasks covering code review, vulnerability detection, and compliance checking.

Create an eval.yaml:

name: my-agent-eval
description: Evaluating my custom agent
skill: my-agent  # Points to my-agent.agent.md in the same directory
version: "1.0"
config:
  model: claude-sonnet-4.6
  timeout_seconds: 300
graders:
  - type: text
    name: identifies_issues
    config:
      regex_match:
        - "(?i)(bug|issue|security|error)"
tasks:
  - "tasks/*.yaml"

Then run it:

waza run eval.yaml -v

Waza discovers your my-agent.agent.md, parses its frontmatter, and auto-injects a tool_constraint grader if your agent declares a tools: field.
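Before running the eval, it can help to sanity-check a regex_match pattern against sample agent output. The sketch below uses Python's re module purely as a stand-in: waza's own regex engine may behave differently in edge cases, and the helper name `matches` and the sample outputs are made up for illustration.

```python
import re

# Pattern copied from the identifies_issues grader above.
PATTERN = r"(?i)(bug|issue|security|error)"


def matches(pattern: str, output: str) -> bool:
    """Return True if the pattern is found anywhere in the output."""
    return re.search(pattern, output) is not None


# Hypothetical agent outputs, for illustration only.
sample_outputs = [
    "Found a potential SECURITY problem in login.py",
    "Everything looks fine.",
]

for text in sample_outputs:
    print(matches(PATTERN, text), "-", text)
```

The inline (?i) flag makes the match case-insensitive, so "SECURITY" satisfies the pattern while the second sample does not.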

Here’s a complete annotated custom agent:

---
name: security-reviewer
description: |
  A specialized agent that reviews code for security vulnerabilities,
  compliance issues, and best practices. Flags suspicious patterns
  and suggests hardening strategies.
model: claude-opus-4.6
tools:
  - codeSearch
  - fileRead
  - fileWrite
  - runCommand
mcp-servers:
  - url: "sse://internal-tools.example.com"
    name: "compliance-checker"
handoffs:
  - name: legal-review
    when: "sensitive compliance issues found"
  - name: performance-analyst
    when: "security patch impacts performance"
agents:
  - name: code-analyzer
    scope: "async dependency analysis"
---
# Security Review Agent

You are an expert security reviewer specializing in code analysis, vulnerability detection, and compliance validation.

## Your role

Analyze code submissions for:
- Common security vulnerabilities (injection, XSS, CSRF, etc.)
- Cryptographic issues
- Data handling compliance (PII, sensitive data)
- Third-party dependency risks
- Infrastructure-as-code misconfigurations

## Tool usage guidelines

- Use `codeSearch` to identify similar patterns across the codebase
- Use `fileRead` to examine context and dependencies
- Use `runCommand` to check for known vulnerability patterns (SAST integration)
- Use `fileWrite` only to create reports, never modify source directly

## Response format

Provide:
1. **Risk Level:** Critical / High / Medium / Low
2. **Issue Description:** What was found and why it matters
3. **Affected Code:** File path and line numbers
4. **Remediation:** Concrete fix or best practice
5. **References:** CWE/CVE links if applicable

| Field | Type | Required | Waza behavior |
| --- | --- | --- | --- |
| name | string | Yes | Agent identifier; must match filename (security-reviewer for security-reviewer.agent.md) |
| description | string | Yes | Summary of agent purpose; used in discovery and trigger tests |
| model | string | No | Preferred model; waza uses config.model from eval.yaml (field is informational) |
| tools | array | No | Allowed tool names; waza auto-injects tool_constraint grader to validate only these tools are used |
| mcp-servers | array | No | Model Context Protocol servers; parsed but not yet evaluated (P2 roadmap) |
| handoffs | array | No | Handoff definitions; parsed but not yet evaluated (P2 roadmap) |
| agents | array | No | Coordinating agents; parsed but not yet evaluated (P2 roadmap) |

The markdown body becomes the agent’s system message. Waza injects it into the execution context to guide the agent’s behavior during evaluation tasks.
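To picture how an .agent.md file divides into frontmatter and system message, here is a minimal sketch. It is not waza's parser: the function names are hypothetical, the `name:` lookup is a naive line scan rather than real YAML parsing, and it assumes the file opens with a `---` line that a later `---` line closes.

```python
from pathlib import Path
from typing import Optional


def split_agent_file(text: str) -> tuple[str, str]:
    """Split .agent.md content into (frontmatter, body)."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return "", text  # no frontmatter: everything is body
    for i in range(1, len(lines)):
        if lines[i].strip() == "---":
            frontmatter = "\n".join(lines[1:i])
            body = "\n".join(lines[i + 1:]).strip()
            return frontmatter, body
    raise ValueError("frontmatter is not closed with '---'")


def declared_name(frontmatter: str) -> Optional[str]:
    """Read the top-level 'name:' value (naive scan, not a YAML parser)."""
    for line in frontmatter.splitlines():
        if line.startswith("name:"):  # nested '- name:' entries are indented
            return line.split(":", 1)[1].strip().strip("\"'")
    return None


def name_matches_filename(path: str, frontmatter: str) -> bool:
    """Check the rule from the table: name must equal the filename stem."""
    expected = Path(path).name.removesuffix(".agent.md")
    return declared_name(frontmatter) == expected


example = """---
name: security-reviewer
tools:
  - codeSearch
---
# Security Review Agent
You are an expert security reviewer.
"""

frontmatter, body = split_agent_file(example)
print(name_matches_filename("agents/security-reviewer.agent.md", frontmatter))
print(body.splitlines()[0])
```

Everything after the closing `---` is what would be passed along as the system message; a mismatch between `name` and the filename is worth catching before running an eval.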

Waza searches for .agent.md files in the same locations as SKILL.md:

  • Current directory
  • ./agents/ subdirectory
  • ./skills/ subdirectory
  • Paths specified in eval.yaml

# Discovers any .agent.md or SKILL.md in the current directory
waza run eval.yaml
# Explicitly target an agent in a subdirectory
waza run my-evals/eval.yaml --context-dir ./agents
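The search locations listed above can be pictured with a short sketch. This is an illustration under assumptions, not waza's discovery code: `find_targets` is a hypothetical name, `extra_dirs` stands in for paths set in eval.yaml, and the order in which waza actually visits these locations is not documented here.

```python
from pathlib import Path
from typing import Iterable


def find_targets(root: str, extra_dirs: Iterable[str] = ()) -> list[Path]:
    """List SKILL.md and *.agent.md files in the documented search locations."""
    base = Path(root)
    # Current directory, ./agents/, ./skills/, then any extra paths.
    search_dirs = [base, base / "agents", base / "skills"]
    search_dirs.extend(Path(d) for d in extra_dirs)

    found: list[Path] = []
    for directory in search_dirs:
        if not directory.is_dir():
            continue  # missing locations are simply skipped
        skill = directory / "SKILL.md"
        if skill.is_file():
            found.append(skill)
        found.extend(sorted(directory.glob("*.agent.md")))
    return found


if __name__ == "__main__":
    for path in find_targets("."):
        print(path)
```

Running it from a project root gives a quick view of which files are candidates for evaluation before you invoke waza.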

When your .agent.md declares a tools: field, waza automatically adds a tool_constraint grader to validate that only those tools were called during task execution.

Before (without agent’s tools field):

graders:
  - type: text
    name: code_quality
    config:
      regex_match: ["(?i)(refactored|improved)"]

After (if agent has tools: [codeSearch, fileRead, runCommand]):

graders:
  - type: tool_constraint  # 🔄 Auto-injected
    name: allowed_tools
    config:
      allow: [codeSearch, fileRead, runCommand]
  - type: text
    name: code_quality
    config:
      regex_match: ["(?i)(refactored|improved)"]
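Conceptually, an allow-list constraint passes only when every tool the agent called appears in the list. The sketch below shows that check in isolation; `disallowed_calls` is a hypothetical helper, and how waza records tool calls or reports violations is an assumption.

```python
def disallowed_calls(called_tools: list[str], allow: list[str]) -> list[str]:
    """Return called tools that are not in the allow list.

    Results keep call order and list each offending tool once.
    """
    allowed = set(allow)
    violations: list[str] = []
    for tool in called_tools:
        if tool not in allowed and tool not in violations:
            violations.append(tool)
    return violations


# Allow list from the auto-injected grader above.
allow = ["codeSearch", "fileRead", "runCommand"]

# Hypothetical tool-call trace from one task run.
trace = ["codeSearch", "fileRead", "fileWrite", "fileRead"]

violations = disallowed_calls(trace, allow)
print("PASS" if not violations else f"FAIL: {violations}")
```

In this trace the fileWrite call is outside the allow list, so the check fails even though every other call was permitted.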

To apply a different tool policy, or to opt out of the auto-injected one, declare your own tool_constraint grader in eval.yaml:

graders:
  - type: tool_constraint
    name: my_tool_policy
    config:
      allow: [codeSearch, fileRead]  # Stricter than agent declares
      reject: [fileDelete, runCommand]
  - type: text
    name: code_quality
    config:
      regex_match: ["(?i)(refactored|improved)"]

When you declare a tool_constraint grader, waza skips auto-injection for that agent.
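The injection rule can be summarized in a few lines. This is a sketch of the behavior described here, not waza's implementation: `with_auto_constraint` is a hypothetical name, graders are modeled as plain dictionaries, and placing the injected grader first simply mirrors the "After" example.

```python
from typing import Optional


def with_auto_constraint(
    graders: list[dict], agent_tools: Optional[list[str]]
) -> list[dict]:
    """Return the effective grader list for an agent.

    A tool_constraint grader is added only when the agent declares tools
    and eval.yaml does not already define one.
    """
    has_explicit = any(g.get("type") == "tool_constraint" for g in graders)
    if not agent_tools or has_explicit:
        return list(graders)
    injected = {
        "type": "tool_constraint",
        "name": "allowed_tools",
        "config": {"allow": list(agent_tools)},
    }
    return [injected, *graders]


text_grader = {"type": "text", "name": "code_quality"}
own_policy = {"type": "tool_constraint", "name": "my_tool_policy"}

# Agent declares tools, eval.yaml has no constraint: one is injected.
print([g["name"] for g in with_auto_constraint([text_grader], ["codeSearch"])])

# eval.yaml declares its own constraint: nothing is injected.
print([g["name"] for g in with_auto_constraint([own_policy, text_grader], ["codeSearch"])])
```

The two calls cover both cases from this section: injection when only the agent declares tools, and no injection once eval.yaml carries its own tool_constraint grader.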

Target an agent by name or file path:

name: security-agent-eval
description: Evaluate the security-reviewer agent
skill: security-reviewer  # Resolves to security-reviewer.agent.md
version: "1.0"
config:
  model: claude-sonnet-4.6
  timeout_seconds: 300
  trials_per_task: 1
# Graders: define what success looks like
graders:
  - type: text
    name: identifies_vulnerability
    config:
      regex_match:
        - "(?i)(vulnerability|bug|issue)"
        - "(?i)(risk level|severity)"
  - type: code
    name: suggests_fix
    config:
      assertions:
        - "output_length > 200"
tasks:
  - "tasks/*.yaml"

Key points:

  • Use the skill: field with the agent name to target an agent
  • All other eval.yaml structure is identical
  • Discovery and grading work the same way
  • The agent’s tools: field auto-injects a constraint (unless you override it)

If both SKILL.md and an .agent.md file are present in the same directory, SKILL.md takes priority. Waza evaluates the skill and ignores the agent.

Workaround: If you want to evaluate the agent instead:

  • Rename or move the SKILL.md file temporarily, or
  • Place the .agent.md in a different directory and reference it explicitly in eval.yaml

This behavior preserves backward compatibility: existing SKILL.md-based projects work unchanged.
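The priority rule for a single directory can be sketched as follows. `pick_target` is a hypothetical helper rather than part of waza, and picking the alphabetically first .agent.md when several exist is an assumption about ordering.

```python
from pathlib import Path
from typing import Optional


def pick_target(directory: str) -> Optional[Path]:
    """Choose the evaluation target in one directory.

    SKILL.md wins when present; otherwise fall back to an .agent.md file.
    """
    d = Path(directory)
    skill = d / "SKILL.md"
    if skill.is_file():
        return skill
    # Assumption: with several agents, take the first in sorted order.
    agents = sorted(d.glob("*.agent.md"))
    return agents[0] if agents else None


if __name__ == "__main__":
    target = pick_target(".")
    print(target if target else "no SKILL.md or .agent.md found")
```

This also shows why the workaround works: once SKILL.md is moved or renamed, the fallback branch selects the agent file.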

The waza coverage command now reports both SKILL.md and .agent.md files:

waza coverage
# Output:
# Skills found: 3
# ✓ code-explainer (SKILL.md)
# ✓ documentation-writer (SKILL.md)
# ✓ security-reviewer (.agent.md)
#
# Evaluations found: 3
# ✓ evals/code-explainer.yaml
# ✓ evals/documentation-writer.yaml
# ✓ evals/security-reviewer.yaml

Each agent is counted alongside skills. Use this to audit your evaluation coverage.

The examples/custom-agent/ directory contains a complete, runnable security-reviewer agent with an evaluation suite.

examples/custom-agent/
├── security-reviewer.agent.md   # Agent definition with tools
├── eval.yaml                    # Evaluation spec
├── tasks/
│   ├── 01-injection-detection.yaml
│   ├── 02-crypto-review.yaml
│   ├── 03-data-handling.yaml
│   └── 04-infrastructure-scan.yaml
└── fixtures/
    ├── vulnerable-sql.py
    ├── weak-crypto.js
    └── pii-exposure.tf

cd examples/custom-agent
waza run eval.yaml -v
# Output:
# Running: security-reviewer eval
# Task: injection-detection ... ✓ PASS
# Task: crypto-review ... ✓ PASS
# Task: data-handling ... ⚠ PARTIAL (4/5 validators)
# Task: infrastructure-scan ... ✓ PASS
#
# Results saved to results.json

| Task | Purpose | Validates |
| --- | --- | --- |
| injection-detection | SQL/command injection vulnerabilities | Tool constraint (codeSearch, fileRead only) |
| crypto-review | Cryptographic weaknesses | Identifies weak algorithms and recommends alternatives |
| data-handling | PII and sensitive data exposure | Detects hardcoded secrets, unencrypted fields |
| infrastructure-scan | Terraform/IaC misconfigurations | Checks for open security groups, missing logging |

The auto-injected tool_constraint grader validates that only the tools declared in the agent’s frontmatter, [codeSearch, fileRead, runCommand], are called.

name: injection-detection
description: Agent identifies SQL injection vulnerabilities
triggers:
  - "Review this Python code for security issues"
fixtures:
  - vulnerable-sql.py
expected_outcome: |
  - Identifies the SQL injection vulnerability on line 8
  - Explains the risk
  - Suggests parameterized queries as mitigation

Known limitations:

  • handoffs field: Parsed but not yet evaluated. Handoff testing, which validates agent coordination, is a P2 feature planned for a future release.
  • mcp-servers field: Parsed but not yet evaluated. Validating MCP server integration is a P2 feature.
  • Multiple .agent.md files: If a directory contains multiple .agent.md files, only the first is discovered. Workaround: use subdirectories or rename to avoid collisions.

See GitHub issue #225 for status.