Terminal-native web agents

A terminal is all you need for web agents.

Webwright gives the model a terminal, a local workspace, and the freedom to write code that launches, inspects, and discards browser sessions. The output is not just a completed task, but a reusable program.

3 core modules
~1K lines of harness code
86.7% Online-Mind2Web accuracy
60.1% Odysseys score

Paradigm shift

In Webwright, the agent can launch multiple browser sessions from the terminal.

Traditional web agents keep one browser session alive and predict the next click, type, or scroll. Webwright separates the agent from that session: the browser can be launched, inspected, and discarded, while code, logs, screenshots, and outputs persist in the local workspace.

Disposable browsers

The agent can spawn fresh browser sessions, capture screenshots only when useful, inspect failures, and rerun scripts without being trapped in a single stateful page.
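One way to make sessions disposable is to run each exploratory script in its own interpreter process, so a crashed or hung browser kills only that subprocess while the agent loop survives to inspect the failure. The sketch below illustrates the idea, not Webwright's actual implementation; `run_probe` is a hypothetical helper, and the trivial stand-in script replaces what would be a Playwright-backed probe.

```python
import subprocess
import sys

def run_probe(script_source: str, timeout: int = 120) -> tuple[int, str, str]:
    """Run an exploratory script in its own interpreter process.

    A crashed or hung browser session only takes down this subprocess;
    the agent can read stderr, then rerun or discard the script.
    """
    proc = subprocess.run(
        [sys.executable, "-c", script_source],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return proc.returncode, proc.stdout, proc.stderr

# A real probe would launch a browser via Playwright; a print keeps
# this sketch self-contained.
code, out, err = run_probe("print('page title: Example Domain')")
```

The same isolation is what lets the agent spawn many sessions in parallel or in sequence without any one of them becoming load-bearing state.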

Code composes actions

Date selection, form filling, filtering, comparison, and extraction can become loops and functions instead of long chains of primitive browser actions.
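Date selection is a good example of the difference: instead of one primitive browser action per candidate date, a loop tries dates until one is available. The sketch below is illustrative; `click`, `is_available`, and the selector format are hypothetical stand-ins for what would be Playwright calls and DOM checks in a real script.

```python
from datetime import date, timedelta

# Hypothetical primitive action; a real script would call Playwright's
# page.click() here.
def click(log: list, selector: str) -> None:
    log.append(f"click {selector}")

def is_available(d: date) -> bool:
    # Stand-in availability check; a real script would inspect the DOM.
    return d.weekday() < 5  # weekdays only

def select_first_available(log: list, start: date, days: int = 7):
    """A loop replaces a long chain of per-date primitive actions."""
    for offset in range(days):
        d = start + timedelta(days=offset)
        if is_available(d):
            click(log, f"[data-date='{d.isoformat()}']")
            return d
    return None

log: list[str] = []
chosen = select_first_available(log, date(2026, 1, 3))  # starts on a Saturday
```

The same pattern turns form filling into a loop over field/value pairs and comparison into a function mapped over search results.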

Artifacts survive

The durable output is a workspace: exploratory scripts, action logs, screenshots, final outputs, and eventually a reusable task program.

Minimal harness

One loop, three modules, no orchestration tower.

The implementation is deliberately small: a Runner, a Model Endpoint, and a terminal Environment. Each is a single module, totaling roughly 1K lines of harness code, with no multi-agent orchestration or complex planning hierarchy.

  1. Send context: The runner sends the task, workspace state, and recent observations to the model.
  2. Emit bash: The model returns a thinking block and a shell command, often writing Playwright-backed scripts to explore pages and collect data.
  3. Return observations: The environment runs the command and returns terminal output, logs, screenshots, files, or error tracebacks.
  4. Refine and finish: The loop continues until the agent produces a final script, reruns it in a fresh folder, and passes self-reflection.
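The four steps above reduce to a single loop. The sketch below stubs the model endpoint with a scripted function; `ModelReply`, the `"done"` stop token, and the omission of the done gate are all simplifications, not Webwright's actual interfaces.

```python
import subprocess
from dataclasses import dataclass

@dataclass
class ModelReply:
    thinking: str
    command: str  # a bash command, or "done" to stop (gate omitted here)

def run_in_env(command: str) -> str:
    """Step 3: execute the command and return the observation."""
    proc = subprocess.run(command, shell=True, capture_output=True, text=True)
    return proc.stdout + proc.stderr

def agent_loop(model, task: str, max_steps: int = 50) -> list[str]:
    history: list[str] = [f"task: {task}"]
    for _ in range(max_steps):
        reply = model(history)       # steps 1-2: send context, get bash back
        if reply.command == "done":  # step 4: finish
            break
        history.append(run_in_env(reply.command))
    return history

# Scripted stand-in for the model endpoint.
replies = iter([
    ModelReply("inspect workspace", "echo hello-workspace"),
    ModelReply("finished", "done"),
])
trace = agent_loop(lambda h: next(replies), "demo task")
```

Keeping the loop this small is what lets the whole harness fit in roughly three modules: everything interesting happens in the scripts the model writes, not in the orchestration.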
workspace/run
$ python final_script.py
open browser
search live web pages
capture screenshots
write action log

$ python -m webwright.tools.self_reflection
evaluate critical points
Status: success

$ ls final_runs/run_1
final_script.py
final_script_log.txt
screenshots/
self_reflect_result.json

Workspace trace

Watch a long web task turn into files, commands, and a verified final run.

The trace below makes the terminal-native loop visible. The left panel shows the workspace growing as the agent creates plans, scripts, logs, screenshots, and final-run artifacts; the terminal transcript shows the generated command and command_output that produced each observation.

Generated CLI tools / skills

Four domain workflows demonstrate reusable task programs that can be packaged, selected, and compared against baseline runs.
Flights with skill

Skill-guided Google Flights comparison for a Hong Kong to Jeju trip, showing a generated flight skill being selected and reused as a task-specific tool.

Challenges handled

Open-ended terminal actions need verification, memory control, and reusable outputs.

Giving an agent a terminal is powerful, but it creates new failure modes. Webwright keeps the harness small while adding just enough structure around completion, context, and reuse.

Premature done gate

The agent must generate a final script, rerun it in a fresh folder, save logs and screenshots, and pass a self-reflection judgement before done is accepted.
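A minimal version of such a gate reruns the candidate script in a fresh directory and only accepts done if the rerun exits cleanly and a reflection check passes. The helper names and the success criterion below are a hedged sketch, assuming a file-based workspace; the real gate's self-reflection is a model judgement, stubbed here as a trivial log-file check.

```python
import subprocess
import sys
import tempfile
from pathlib import Path

def reflection_passes(workdir: Path) -> bool:
    # Stand-in for a self-reflection judgement over logs and screenshots;
    # here we only require that the rerun produced a log file.
    return (workdir / "final_script_log.txt").exists()

def done_gate(final_script: str) -> bool:
    """Rerun the final script in a fresh folder before accepting done."""
    with tempfile.TemporaryDirectory() as tmp:
        workdir = Path(tmp)
        (workdir / "final_script.py").write_text(final_script)
        proc = subprocess.run(
            [sys.executable, "final_script.py"],
            cwd=workdir, capture_output=True, text=True,
        )
        (workdir / "final_script_log.txt").write_text(proc.stdout + proc.stderr)
        return proc.returncode == 0 and reflection_passes(workdir)

accepted = done_gate("print('task output')")   # clean exit passes the gate
rejected = done_gate("raise SystemExit(1)")    # nonzero exit is rejected
```

Running in a fresh folder matters: it catches scripts that only worked because of leftover state from earlier exploration.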

Context compaction

Long coding trajectories can exceed context limits, so history is periodically compacted into summaries while the workspace keeps the concrete artifacts.
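A simple shape for this compaction: once the in-context history exceeds a budget, fold the oldest entries into a summary and keep only the recent ones. The character budget and the stubbed summarizer below are illustrative assumptions; the real system would have the model write the summary, and the workspace retains the raw logs and screenshots regardless.

```python
def summarize(entries: list[str]) -> str:
    # Stand-in summarizer; the real system would ask the model to
    # compact these steps, since the workspace keeps the raw artifacts.
    return f"[summary of {len(entries)} earlier steps]"

def compact(history: list[str], budget: int, keep_recent: int = 2) -> list[str]:
    """Fold old history entries into a summary once over budget."""
    while sum(len(e) for e in history) > budget and len(history) > keep_recent + 1:
        old, recent = history[:-keep_recent], history[-keep_recent:]
        history = [summarize(old)] + recent
    return history

history = [f"step {i}: " + "x" * 40 for i in range(10)]
compacted = compact(history, budget=120)  # ten long entries become three
```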

Reusable tools

Once solved, a task script can be parameterized, exported as a CLI, shared with coding agents, and reused instead of rediscovered from scratch.
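Parameterizing a solved task script is mostly a matter of lifting its hard-coded values into command-line flags. The sketch below imagines that step for a flight-comparison script; the tool name and every flag are hypothetical, chosen to match the flights example above, not part of Webwright's interface.

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    # Hypothetical flags for a flight-comparison task program: the
    # values hard-coded in the one-off script become parameters.
    parser = argparse.ArgumentParser(prog="flight-compare")
    parser.add_argument("--origin", required=True)
    parser.add_argument("--dest", required=True)
    parser.add_argument("--depart", required=True, help="YYYY-MM-DD")
    parser.add_argument("--max-stops", type=int, default=1)
    return parser

args = build_parser().parse_args(
    ["--origin", "HKG", "--dest", "CJU", "--depart", "2026-03-01"]
)
```

Exported this way, the same program can be invoked by other coding agents or rerun on a new trip without rediscovering the site's flow.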

Reported results

A small harness, competitive long-horizon performance.

The report evaluates Webwright on live, long-horizon web benchmarks while preserving the simple terminal interface. The same pipeline also records critical-point screenshots, action logs, and reusable command-line tools.

Odysseys long-horizon score plot
Odysseys — long-horizon browsing
Online-Mind2Web accuracy plot
Online-Mind2Web — accuracy on 300 live tasks
60.1%

Odysseys

Long-horizon browsing score, a 35.1% relative improvement over the previous reported SOTA.

86.7%

Online-Mind2Web

GPT-5.4 accuracy on 300 live tasks across 136 sites with a 100-step budget.

$2.37

Average cost

Average GPT-5.4 cost per Online-Mind2Web task in the report's cost analysis.

66.2%

Small model tools

Qwen3.5-9B on the hard split of Online-Mind2Web when augmented with crafted reusable tools.

Citation

Cite this work.

If you use Webwright in your research or build on it, please cite the repository:

@misc{webwright2026,
  title        = {Webwright: A terminal is all you need for web agents},
  author       = {Lu, Yadong and Xu, Lingrui and Huang, Chao and Awadallah, Ahmed},
  year         = {2026},
  howpublished = {\url{https://github.com/microsoft/Webwright}},
  note         = {GitHub repository}
}