MM-WebAgent: A Hierarchical Multimodal Web Agent for Webpage Generation

Yan Li1,* Zezi Zeng2,* Yifan Yang4,† Yuqing Yang4 Ning Liao1 Weiwei Guo3 Lili Qiu4 Mingxi Cheng4 Qi Dai4 Zhendong Wang4 Zhengyuan Yang4 Xue Yang1,† Ji Li4 Lijuan Wang4 Chong Luo4
1Shanghai Jiao Tong University 2Xi'an Jiaotong University 3Tongji University 4Microsoft Corporation
* Equal contribution. This work was done during their internship at Microsoft. † Corresponding authors: Yifan Yang <yifanyang@microsoft.com>, Xue Yang <yangxue-2019-sjtu@sjtu.edu.cn>.

Abstract

The rapid progress of Artificial Intelligence Generated Content (AIGC) tools enables images, videos, and visualizations to be created on demand for webpage design, offering a flexible and increasingly adopted paradigm for modern UI/UX. However, directly integrating such tools into automated webpage generation often leads to style inconsistency and poor global coherence, as elements are generated in isolation. We propose MM-WebAgent, a hierarchical agentic framework for multimodal webpage generation that coordinates AIGC-based element generation through hierarchical planning and iterative self-reflection. MM-WebAgent jointly optimizes global layout, local multimodal content, and their integration, producing coherent and visually consistent webpages. We further introduce MM-WebGEN-Bench and a multi-level evaluation protocol for systematic assessment. Experiments demonstrate that MM-WebAgent outperforms code-generation and agent-based baselines, especially on multimodal element generation and integration.

Method

An overview of the proposed framework MM-WebAgent. The framework generates webpages through four key steps: task planning, hierarchical generation, multi-level evaluation, and iterative reflection.

Hierarchical Planning and Generation

Global Layout Planning.
  • Define section hierarchy, ordering, and spatial organization.
  • Specify section content, page-level style attributes, and element layout.
  • Introduce placeholders for multimodal elements with positions, sizes, and constraints.
Local Element Planning.
  • Construct a local plan for each multimodal element in the global layout.
  • Include context information, meta attributes, and the generation tool.
  • Execute the global layout and local generators, then insert the generated assets into the webpage.
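The two planning levels above can be pictured as a pair of small data structures. The following is a minimal sketch, not the authors' implementation: the class names, the `{{element_id}}` marker syntax, and the tool names in `TOOL_FOR_KIND` are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Placeholder:
    """A slot reserved in the global layout for one multimodal element."""
    element_id: str
    kind: str    # "image", "video", or "chart"
    width: int   # size constraint in pixels (assumed unit)
    height: int


@dataclass
class LocalPlan:
    """Per-element plan: context information and the generation tool."""
    placeholder: Placeholder
    context: str  # text describing the surrounding section
    tool: str     # hypothetical tool name


@dataclass
class GlobalPlan:
    """Page-level plan: ordered sections, style attributes, placeholders."""
    sections: list[str]
    style: dict[str, str]
    placeholders: list[Placeholder] = field(default_factory=list)


# Hypothetical mapping from element kind to a generation tool.
TOOL_FOR_KIND = {
    "image": "image_generator",
    "video": "video_generator",
    "chart": "chart_renderer",
}


def plan_local(global_plan: GlobalPlan, context_by_id: dict[str, str]) -> list[LocalPlan]:
    """Build one local plan for each placeholder in the global layout."""
    return [
        LocalPlan(p, context_by_id.get(p.element_id, ""), TOOL_FOR_KIND[p.kind])
        for p in global_plan.placeholders
    ]


def insert_assets(html: str, assets: dict[str, str]) -> str:
    """Replace each {{element_id}} marker in the layout HTML with asset markup."""
    for element_id, markup in assets.items():
        html = html.replace("{{" + element_id + "}}", markup)
    return html
```

In this sketch, a layout such as `<section id="hero">{{hero_img}}</section>` would be filled by calling `insert_assets` with a mapping from `hero_img` to the generated `<img>` markup. Keeping the placeholder's size in the global plan is one way to let the local generator respect the layout's constraints.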

Hierarchical Self-Reflection

Local Refinement.
  • Improve the intrinsic quality of each multimodal element.
Context Refinement.
  • Refine surrounding HTML snippets to resolve misalignment, clipping, and spacing issues.
Global Refinement.
  • Use HTML code and rendered screenshots to enforce layout and style consistency across sections.
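The three refinement levels can be read as passes in an iterative loop. The sketch below is one possible arrangement under stated assumptions: the local, then context, then global ordering, the fixed-point stopping rule, and the `max_rounds` budget are illustrative choices, and each refiner stands in for a model call that is not specified here.

```python
from typing import Callable

# A refiner takes the current page HTML and returns a (possibly) revised page.
Refiner = Callable[[str], str]


def hierarchical_reflect(
    page: str,
    local: Refiner,
    context: Refiner,
    global_: Refiner,
    max_rounds: int = 3,
) -> tuple[str, int]:
    """Apply local -> context -> global refinement until the page stops changing.

    Returns the final page and the number of rounds that changed it.
    """
    for round_idx in range(max_rounds):
        revised = global_(context(local(page)))
        if revised == page:  # fixed point: no level proposed a change
            return page, round_idx
        page = revised
    return page, max_rounds
```

With toy string-replacing refiners, the loop changes the page in the first round and stops in the second, when no level has anything left to fix.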

MM-WebGEN-Bench

Overview of MM-WebGEN-Bench. (a) Dataset construction process, including data generation controlled by layout complexity, visual style, semantic intent, and multimodal elements, followed by a filtering pipeline with automatic format validation and manual quality control. (b) Statistical summary of the final evaluation set, consisting of 120 webpages spanning 11 scene categories and 11 visual styles, and featuring diverse multimodal compositions, including 4 types of videos, 8 types of images, and 17 types of charts.
Benchmark statistics: 120 webpages; 11 scene categories; 11 visual styles; 4 video types; 8 image types; 17 chart types.
Evaluation Dataset.
  • Generate prompts by sampling layout complexity, visual style, multimodal elements, and semantic intent.
  • Apply automatic format validation and manual inspection to filter implausible layouts and inconsistent styles.
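The construction steps above can be sketched as a sampler followed by an automatic check. This is only an illustration: the option pools below are hypothetical placeholders, since the benchmark's actual category lists are not given here, and `is_valid` covers only the automatic format check, not the manual inspection.

```python
import random

# Hypothetical option pools for the four control axes.
LAYOUT_COMPLEXITY = ["simple", "medium", "complex"]
VISUAL_STYLES = ["minimalist", "retro", "corporate"]
SEMANTIC_INTENTS = ["product landing page", "personal portfolio", "data dashboard"]
ELEMENT_KINDS = ["image", "video", "chart"]

REQUIRED_KEYS = {"layout_complexity", "visual_style", "semantic_intent", "elements"}


def sample_spec(rng: random.Random) -> dict:
    """Sample one webpage specification along the four control axes."""
    n_elements = rng.randint(1, len(ELEMENT_KINDS))
    return {
        "layout_complexity": rng.choice(LAYOUT_COMPLEXITY),
        "visual_style": rng.choice(VISUAL_STYLES),
        "semantic_intent": rng.choice(SEMANTIC_INTENTS),
        "elements": sorted(rng.sample(ELEMENT_KINDS, n_elements)),
    }


def is_valid(spec: dict) -> bool:
    """Automatic format check: all keys present and at least one element."""
    return REQUIRED_KEYS.issubset(spec) and len(spec["elements"]) > 0


def to_prompt(spec: dict) -> str:
    """Render a specification as a natural-language generation prompt."""
    return (
        f"Create a {spec['visual_style']} {spec['semantic_intent']} "
        f"with a {spec['layout_complexity']} layout that includes: "
        f"{', '.join(spec['elements'])}."
    )
```

Passing a seeded `random.Random` keeps the sampled set reproducible, which matters when the filtered prompts become a fixed evaluation set.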
Multi-level Evaluation.
  • Global-level evaluation: layout correctness, style coherence, and aesthetic quality.
  • Local-level evaluation: images, videos, and charts.
  • Missing or incomplete elements implied by the user prompt are treated as critical failures at the local level.
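A small sketch of how such scores might be aggregated follows. Two assumptions are made here and are not confirmed by the text: a missing element is scored as 0 within its modality, and the overall score is the unweighted mean of the three global and three local metrics. The second assumption is at least consistent with the Average column of the results table in the Experiments section, for example the row with Average 0.75.

```python
from statistics import mean

GLOBAL_METRICS = ("layout", "style", "aesthetics")
LOCAL_METRICS = ("image", "video", "chart")


def local_score(expected: int, element_scores: list[float]) -> float:
    """Average per-element scores for one modality.

    Each element the prompt implies but the page lacks is treated as a
    critical failure and contributes 0 (an assumption of this sketch).
    """
    if expected < 1:
        raise ValueError("expected must be at least 1")
    missing = max(expected - len(element_scores), 0)
    return mean(list(element_scores) + [0.0] * missing)


def overall(scores: dict[str, float]) -> float:
    """Unweighted mean over the six global and local metrics (assumed)."""
    return mean(scores[m] for m in GLOBAL_METRICS + LOCAL_METRICS)


# Metric values taken from the results-table row whose Average is 0.75.
row = {
    "layout": 0.83, "style": 0.54, "aesthetics": 0.97,
    "image": 0.88, "video": 0.75, "chart": 0.54,
}
average = round(overall(row), 2)
```

Under this rule a page that produces two of three requested images, scored 0.9 and 0.6, receives an image score of 0.5 rather than 0.75, so omissions are penalized instead of ignored.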

Experiments

Main Results

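Paradigm comparison on MM-WebGEN-Bench. With GPT-5.1 as the backbone, MM-WebAgent (group III) achieves the best average score (0.75). Its best scores exceed the best baseline scores on every global metric (layout, style, aesthetics) and every local metric (image, video, chart), with the largest margins on image (0.88 vs. 0.48) and video (0.75 vs. 0.05).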

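| Method | Global: Layout | Global: Style | Global: Aesthetics | Local: Image | Local: Video | Local: Chart | Average |
|---|---|---|---|---|---|---|---|
| (I) Code-only One-shot | | | | | | | |
| Qwen2.5-Coder-7B-Instruct | 0.01 | 0.00 | 0.78 | 0.41 | 0.00 | 0.24 | 0.24 |
| Qwen2.5-Coder-32B-Instruct | 0.09 | 0.03 | 0.84 | 0.39 | 0.02 | 0.28 | 0.27 |
| Qwen3-Coder-30B-A3B-Instruct | 0.13 | 0.15 | 0.57 | 0.08 | 0.00 | 0.25 | 0.20 |
| Qwen2.5-72B-Instruct | 0.10 | 0.02 | 0.82 | 0.40 | 0.00 | 0.25 | 0.27 |
| Gemini-2.5-Pro | 0.57 | 0.24 | 0.94 | 0.43 | 0.00 | 0.45 | 0.44 |
| OpenAI-GPT-4o | 0.02 | 0.05 | 0.48 | 0.06 | 0.00 | 0.02 | 0.11 |
| OpenAI-GPT-5mini | 0.63 | 0.40 | 0.95 | 0.21 | 0.00 | 0.50 | 0.45 |
| OpenAI-GPT-5 | 0.78 | 0.40 | 0.96 | 0.14 | 0.02 | 0.52 | 0.47 |
| OpenAI-GPT-5.1 | 0.73 | 0.44 | 0.96 | 0.05 | 0.00 | 0.35 | 0.42 |
| (II) Code-only Agents: i) Bolt.diy | | | | | | | |
| Qwen2.5-Coder-7B-Instruct | 0.02 | 0.03 | 0.77 | 0.36 | 0.00 | 0.23 | 0.23 |
| Qwen2.5-Coder-32B-Instruct | 0.08 | 0.02 | 0.85 | 0.48 | 0.02 | 0.31 | 0.29 |
| Qwen3-Coder-30B-A3B-Instruct | 0.12 | 0.07 | 0.71 | 0.15 | 0.00 | 0.32 | 0.23 |
| Qwen2.5-72B-Instruct | 0.07 | 0.03 | 0.83 | 0.31 | 0.05 | 0.30 | 0.26 |
| Gemini-2.5-Pro | 0.63 | 0.24 | 0.93 | 0.38 | 0.00 | 0.50 | 0.45 |
| OpenAI-GPT-4o | 0.04 | 0.02 | 0.85 | 0.21 | 0.00 | 0.12 | 0.21 |
| OpenAI-GPT-5mini | 0.67 | 0.36 | 0.95 | 0.12 | 0.00 | 0.48 | 0.43 |
| OpenAI-GPT-5 | 0.77 | 0.43 | 0.95 | 0.06 | 0.00 | 0.50 | 0.45 |
| OpenAI-GPT-5.1 | 0.74 | 0.39 | 0.96 | 0.30 | 0.00 | 0.36 | 0.46 |
| (II) Code-only Agents: ii) OpenHands | | | | | | | |
| Gemini-2.5-Pro | 0.43 | 0.21 | 0.93 | 0.31 | 0.00 | 0.47 | 0.39 |
| OpenAI-GPT-4o | 0.03 | 0.02 | 0.83 | 0.11 | 0.00 | 0.04 | 0.17 |
| OpenAI-GPT-5mini | 0.60 | 0.31 | 0.94 | 0.05 | 0.00 | 0.47 | 0.39 |
| OpenAI-GPT-5 | 0.76 | 0.41 | 0.95 | 0.02 | 0.00 | 0.49 | 0.44 |
| OpenAI-GPT-5.1 | 0.61 | 0.33 | 0.91 | 0.00 | 0.00 | 0.36 | 0.37 |
| (III) Multimodal Web Agents | | | | | | | |
| Gemini-2.5-Pro | 0.68 | 0.35 | 0.96 | 0.81 | 0.57 | 0.43 | 0.63 |
| OpenAI-GPT-4o | 0.16 | 0.10 | 0.86 | 0.42 | 0.29 | 0.32 | 0.36 |
| OpenAI-GPT-5mini | 0.73 | 0.42 | 0.95 | 0.84 | 0.63 | 0.50 | 0.68 |
| OpenAI-GPT-5 | 0.85 | 0.53 | 0.97 | 0.86 | 0.52 | 0.54 | 0.71 |
| OpenAI-GPT-5.1 | 0.83 | 0.54 | 0.97 | 0.88 | 0.75 | 0.54 | 0.75 |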

Qualitative Results

MM-WebAgent generates webpages with more coherent layouts, more consistent visual styles, and better-aligned multimodal content than representative baselines.

More rendered webpage examples generated by MM-WebAgent and baseline methods on MM-WebGEN-Bench.
Visualization of the hierarchical reflection process. Examples include global layout refinement, context refinement (first row), local element refinement (second row), and local-to-global correction (third row).

Cite

If you find this work useful, please cite:

@misc{li2026mmwebagent,
  title={MM-WebAgent: A Hierarchical Multimodal Web Agent for Webpage Generation},
  author={Yan Li and Zezi Zeng and Yifan Yang and Yuqing Yang and
          Ning Liao and Weiwei Guo and Lili Qiu and Mingxi Cheng and
          Qi Dai and Zhendong Wang and Zhengyuan Yang and Xue Yang and
          Ji Li and Lijuan Wang and Chong Luo},
  year={2026},
  eprint={2604.15309},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  doi={10.48550/arXiv.2604.15309},
  url={https://arxiv.org/abs/2604.15309}
}