BizGenEval: A Systematic Benchmark for Commercial Visual Content Generation

Yan Li* · Zezi Zeng* · Ziwei Zhou* · Xin Gao* · Muzhao Tian* · Yifan Yang · Mingxi Cheng · Qi Dai · Yuqing Yang · Lili Qiu · Zhendong Wang · Zhengyuan Yang · Xue Yang · Lijuan Wang · Ji Li · Chong Luo
* Equal contribution. This work was done during their internship at Microsoft. · Corresponding authors: Yifan Yang, Xue Yang.
5 Content Domains
4 Capability Dimensions
400 Curated Samples
8K Verified Questions
26 Evaluated Models

Results

Leaderboard

Performance rankings across the overall average, five content domains, and four capability dimensions. Each chart shows Hard (upper) and Easy (lower) scores per model.

By Content Domain

By Capability Dimension

Legend: Hard subset (deeper tint) · Easy subset (lighter tint)

Methodology

Construction & Evaluation Pipeline

We introduce BizGenEval, a systematic benchmark for commercial visual content generation. The benchmark spans five representative document types—slides, charts, webpages, posters, and scientific figures—and evaluates four key capability dimensions: text rendering, layout control, attribute binding, and knowledge-based reasoning, forming 20 diverse evaluation tasks. BizGenEval contains 400 carefully curated prompts and 8,000 human-verified checklist questions to rigorously assess whether generated images satisfy complex visual and semantic constraints.
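The counts above can be checked with a short sketch. It enumerates the 5 × 4 task grid from the domains and capabilities named in the paragraph and derives per-task and per-prompt averages. The variable names are illustrative, and the even split of prompts across tasks is an assumption; the text gives only the totals.

```python
from itertools import product

# Domains and capability dimensions as listed in the paragraph above.
DOMAINS = ["slides", "charts", "webpages", "posters", "scientific figures"]
CAPABILITIES = [
    "text rendering",
    "layout control",
    "attribute binding",
    "knowledge-based reasoning",
]

TOTAL_PROMPTS = 400
TOTAL_QUESTIONS = 8000

# Each (domain, capability) pair is one evaluation task.
tasks = list(product(DOMAINS, CAPABILITIES))

# Averages only: a uniform split across tasks is an assumption, not a stated fact.
avg_prompts_per_task = TOTAL_PROMPTS / len(tasks)
avg_questions_per_prompt = TOTAL_QUESTIONS / TOTAL_PROMPTS

print(len(tasks), avg_prompts_per_task, avg_questions_per_prompt)
```

On average this works out to 20 prompts per task and 20 checklist questions per prompt.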

Figure: Overview of the construction and evaluation pipeline. The system converts real-world references and domain knowledge into structured prompts, and evaluates generated images using rigorous checklists.
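As a rough illustration of checklist-based evaluation, the sketch below scores one generated image as the fraction of its checklist questions judged satisfied, then averages those scores per domain or per capability. This aggregation rule is an assumption made for illustration, as are the names `PromptResult`, `prompt_score`, and `scores_by`; the page does not specify how verdicts are produced or combined.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class PromptResult:
    """Hypothetical record: checklist verdicts for one generated image."""
    domain: str
    capability: str
    verdicts: list[bool]  # one yes/no verdict per checklist question


def prompt_score(result: PromptResult) -> float:
    """Fraction of checklist questions judged satisfied (assumed scoring rule)."""
    return sum(result.verdicts) / len(result.verdicts)


def scores_by(results: list[PromptResult], key: str) -> dict[str, float]:
    """Average prompt scores grouped by 'domain' or 'capability'."""
    groups: dict[str, list[float]] = {}
    for r in results:
        groups.setdefault(getattr(r, key), []).append(prompt_score(r))
    return {name: mean(values) for name, values in groups.items()}


# Toy verdicts for illustration only; these are not benchmark results.
results = [
    PromptResult("slides", "text rendering", [True, True, False, True]),
    PromptResult("slides", "layout control", [True, False, False, False]),
    PromptResult("charts", "text rendering", [True, False]),
]
by_domain = scores_by(results, "domain")
by_capability = scores_by(results, "capability")
```

Grouping by a field name keeps the same function usable for both leaderboard views (by content domain and by capability dimension).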
Qualitative Analysis

Qualitative Results

Side-by-side qualitative comparisons reveal where current models succeed and fail. Correct regions are highlighted in blue, incorrect regions in red.

Figure: Qualitative evaluation across five content domains. Columns represent domains; rows show model outputs. Correct and incorrect regions are highlighted with blue and red boxes, respectively.
Figure: Qualitative evaluation across four capability dimensions: Layout Control, Attribute Binding, Text Rendering, and Knowledge-based Reasoning.
Benchmark Design

Summary

Figure: BizGenEval example overview.

Examples

Figure: Examples by content domain: webpage, scientific figure, slides, chart, poster.
Figure: Examples by capability dimension: knowledge, attribute, layout, text.
Citation

Cite this Work

If BizGenEval is useful for your research, please consider citing our paper.

@misc{li2026bizgeneval,
  title         = {BizGenEval: A Systematic Benchmark for Commercial Visual Content Generation},
  author        = {Yan Li and Zezi Zeng and Ziwei Zhou and Xin Gao and Muzhao Tian and Yifan Yang and Mingxi Cheng and Qi Dai and Yuqing Yang and Lili Qiu and Zhendong Wang and Zhengyuan Yang and Xue Yang and Lijuan Wang and Ji Li and Chong Luo},
  year          = {2026},
  eprint        = {2603.25732},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CV},
  url           = {https://arxiv.org/abs/2603.25732},
}