AVGen-Bench: A Task-Driven Benchmark for Multi-Granular Evaluation of Text-to-Audio-Video Generation

Framework Overview

AVGen-Bench evaluates T2AV systems across three granularities: basic uni-modal quality, cross-modal alignment, and fine-grained semantic controllability.

Benchmark Comparison

Compared with prior benchmarks, AVGen-Bench emphasizes joint AV evaluation, richer fine-grained metrics, and realistic complex prompts.

Comparison of AVGen-Bench with prior benchmarks

Main Quantitative Results

The results follow the paper's reporting structure: visual and audio quality, audio-video synchronization, controllability over text, faces, music, and speech, physical plausibility, and holistic semantic alignment. `AV` and `Lip` are reported as complementary synchronization measurements, while `Lo-Phy` and `Hi-Phy` are weighted as separate fine-grained dimensions.

Model	Components	Vis	Aud (PQ)	AV	Lip	Text	Face	Music	Speech	Lo-Phy	Hi-Phy	Holistic	Total
Seedance 2.0	Seedance 2.0	0.945	7.15	0.15	4.14	74.83	60.95	28.12	94.09	3.89	83.16	89.61	72.07
Veo 3.1-fast	Veo 3.1-fast	0.960	6.64	0.21	2.39	75.10	52.77	3.13	94.53	3.68	67.43	86.27	67.87
Veo 3.1-quality	Veo 3.1-quality	0.954	6.77	0.24	3.59	76.53	52.90	5.00	96.09	3.74	68.53	84.10	66.28
Sora-2	Sora-2	0.848	5.91	0.25	4.50	74.84	51.17	7.81	88.63	4.05	78.95	88.89	64.16
Wan2.6	Wan2.6	0.959	7.15	0.30	4.32	76.95	49.27	1.75	89.33	3.69	66.92	80.98	62.97
Seedance-1.5 Pro	Seedance-1.5 Pro	0.970	7.48	0.26	3.43	38.28	54.42	1.88	93.45	3.72	66.88	77.38	62.55
Kling-V2.6	Kling-V2.6	0.906	6.93	0.21	2.30	14.52	57.33	5.00	89.62	3.84	63.92	76.74	61.82
LTX-2.3	LTX-2.3	0.858	7.11	0.36	2.00	54.17	45.06	1.38	86.66	3.99	64.31	65.22	59.97
NanoBanana2 + MOVA	NanoBanana2 MOVA	0.890	6.71	0.44	2.70	68.26	41.33	0.59	82.45	3.91	60.95	72.48	58.10
LTX-2	LTX-2	0.828	6.84	0.23	4.76	24.76	48.53	5.75	87.07	4.05	60.20	66.59	56.62
Emu3.5 + MOVA	Emu3.5 MOVA	0.911	6.80	0.38	4.83	64.72	48.44	0.62	81.74	3.89	55.85	66.55	56.12
Wan2.2 + HunyuanVideo-Foley	Wan2.2 HunyuanVideo-Foley	0.936	6.60	0.23	5.38	48.46	36.23	3.44	53.40	3.90	54.11	60.63	53.29
Ovi	Ovi	0.839	6.31	0.37	5.40	41.36	49.05	11.25	76.49	3.93	52.92	57.45	52.02

Metric direction: higher is better for Vis, Aud (PQ), Text, Face, Music, Speech, Lo-Phy, Hi-Phy, and Holistic; lower is better for AV and Lip.

Models are sorted by Total in descending order. Bold marks the best score, and italics mark the second-best score in each metric. Orange tags indicate proprietary components, while blue tags indicate open-source components.
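As a reading aid, the metric directions and the Total ordering can be checked programmatically. The sketch below copies a subset of columns (Vis, AV, Lip, Music, Total) from the table. The names `ROWS`, `LOWER_IS_BETTER`, `ranked`, and `top_two` are illustrative and not part of any AVGen-Bench tooling, and the sketch assumes Total is higher-is-better, which the descending sort implies but the metric-direction note does not state.

```python
# Subset of the results table: (model, Vis, AV, Lip, Music, Total).
COLUMNS = ("Vis", "AV", "Lip", "Music", "Total")
ROWS = [
    ("Seedance 2.0", 0.945, 0.15, 4.14, 28.12, 72.07),
    ("Veo 3.1-fast", 0.960, 0.21, 2.39, 3.13, 67.87),
    ("Veo 3.1-quality", 0.954, 0.24, 3.59, 5.00, 66.28),
    ("Sora-2", 0.848, 0.25, 4.50, 7.81, 64.16),
    ("Wan2.6", 0.959, 0.30, 4.32, 1.75, 62.97),
    ("Seedance-1.5 Pro", 0.970, 0.26, 3.43, 1.88, 62.55),
    ("Kling-V2.6", 0.906, 0.21, 2.30, 5.00, 61.82),
    ("LTX-2.3", 0.858, 0.36, 2.00, 1.38, 59.97),
    ("NanoBanana2 + MOVA", 0.890, 0.44, 2.70, 0.59, 58.10),
    ("LTX-2", 0.828, 0.23, 4.76, 5.75, 56.62),
    ("Emu3.5 + MOVA", 0.911, 0.38, 4.83, 0.62, 56.12),
    ("Wan2.2 + HunyuanVideo-Foley", 0.936, 0.23, 5.38, 3.44, 53.29),
    ("Ovi", 0.839, 0.37, 5.40, 11.25, 52.02),
]

# Per the metric-direction note: lower is better only for AV and Lip.
LOWER_IS_BETTER = {"AV", "Lip"}


def ranked(metric):
    """Return model names ordered from best to worst on one metric."""
    col = COLUMNS.index(metric) + 1  # +1 skips the model-name field
    higher_is_better = metric not in LOWER_IS_BETTER
    # sorted() is stable, so tied scores keep their table order.
    ordered = sorted(ROWS, key=lambda row: row[col], reverse=higher_is_better)
    return [row[0] for row in ordered]


def top_two(metric):
    """Best and second-best model on a metric (ties broken by table order)."""
    return ranked(metric)[:2]


def totals_are_descending():
    """Check that the rows appear in descending order of Total."""
    totals = [row[-1] for row in ROWS]
    return totals == sorted(totals, reverse=True)


if __name__ == "__main__":
    for metric in ("Vis", "AV", "Lip", "Music"):
        print(metric, top_two(metric))
    print("sorted by Total:", totals_are_descending())
```

Note that ties are possible: Veo 3.1-fast and Kling-V2.6 share the second-lowest AV score (0.21), and this sketch simply reports whichever appears first in the table.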

Fine-grained Evaluation Cases

Fine-grained evaluation modules: detailed workflow of the six fine-grained evaluation modules.

Failure examples across fine-grained dimensions: representative failure modes revealed by AVGen-Bench.

Failure Demo Videos

Multi-model qualitative failures from Appendix A. Each case shows the original prompt and side-by-side outputs from Veo 3.1 Fast, Ovi, LTX-2, and Kling 2.6.

Case 1: Prompted Text Rendering ("Your customers are talking")

Original Prompt

A single wind-up chattering teeth toy clacks continuously against a solid teal background. The scene cuts to a blue screen displaying the white text "Your customers are talking," abruptly followed by rows of multi-colored chattering teeth toys all moving at once, creating a loud chaotic mechanical clatter. A green screen appears with the text "Are you listening?" before cutting to a generic product logo and a "Try it free" button on a white background as the noise ceases.

Model outputs (videos): Veo 3.1 Fast, Ovi, LTX-2, Kling 2.6

Case 2: Trailer Title Rendering ("EIGHTY-SEVEN SECONDS")

Original Prompt

Four-shot high-tempo teaser with clean sync hits. Shot 1: Inside a bank vault, fluorescent hum and distant alarms; a timer on a device beeps faster as a thief whispers, "Eighty-seven seconds, move." Shot 2: Close-up of a glass cutter scoring a pane with a sharp scratch, then a suction cup pops as the circle lifts free, landing on a bass hit. Shot 3: Smash cut to a getaway car; engine revs, tires chirp, and the car fishtails out of a tight alley with gravel spraying and rattling off the chassis. Shot 4: A final slow-motion shot of a duffel bag hitting the pavement with a heavy thud as sirens surge; the title EIGHTY-SEVEN SECONDS slams onto black with a metallic logo sting.

Model outputs (videos): Veo 3.1 Fast, Ovi, LTX-2, Kling 2.6

Case 3: Physical Plausibility (Chladni Plate)

Original Prompt

A top-down view of a black square metal plate sprinkled evenly with fine white sand as a tone generator plays a pure sine wave that sweeps upward in pitch. As the plate begins to vibrate, the rising tone makes the sand suddenly jitter and chatter across the metal, then fall quiet as grains slide into crisp geometric nodal lines that sharpen and rearrange each time the pitch crosses a new resonance.

Model outputs (videos): Veo 3.1 Fast, Ovi, LTX-2, Kling 2.6

Case 4: Physical Plausibility (Briggs-Rauscher)

Original Prompt

A high-speed time-lapse shows a beaker on a magnetic stirrer, the stir plate motor making a steady whir as a stir bar spins. The beaker contains a Briggs-Rauscher mixture (hydrogen peroxide, potassium iodate, malonic acid, and a metal-ion catalyst with starch indicator). While the vortex turns, the liquid repeatedly cycles through several distinct visible states in a rhythmic pattern, switching abruptly and then returning again and again as the stirring continues.

Model outputs (videos): Veo 3.1 Fast, Ovi, LTX-2, Kling 2.6

Case 5: Semantic Misalignment (Vacation Ad)

Original Prompt

A young boy hits a beach ball as a group of children runs past him and jumps into a swimming pool with loud splashes, while a voiceover states, "We went on vacation with a toe dipper." The camera follows the kids underwater as bubbles roar and feet kick past the lens, and the voiceover finishes, "and left with a cannonballer." Finally, the view resurfaces to show a laughing girl in the water as on-screen text reads "Book your family home now."

Model outputs (videos): Veo 3.1 Fast, Ovi, LTX-2, Kling 2.6

Case 6: Music Pitch Accuracy (Single Note A4)

Original Prompt

A zoomed-in tutorial shot of a clean-tone electric guitar fretboard and picking hand. The player frets a single note A4 and plucks it four times with even timing, letting each note ring briefly. The pitch stays stable (no bend, no vibrato), and no other strings ring.

Model outputs (videos): Veo 3.1 Fast, Ovi, LTX-2, Kling 2.6
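To make the A4 target in the Case 6 prompt concrete, the following minimal sketch converts a note to its frequency and measures how far a detected pitch deviates from it in cents. It assumes standard concert pitch (A4 = 440 Hz, MIDI note 69) and 12-tone equal temperament. The function names and the 50-cent tolerance are illustrative choices, not the benchmark's actual Music metric, and the sketch does not cover pitch detection itself.

```python
import math

A4_HZ = 440.0   # assumption: standard concert pitch
A4_MIDI = 69    # MIDI note number conventionally assigned to A4


def note_to_hz(midi_note):
    """Frequency of a MIDI note in 12-tone equal temperament."""
    return A4_HZ * 2 ** ((midi_note - A4_MIDI) / 12)


def cents_off(measured_hz, target_hz):
    """Signed deviation of a measured pitch from a target, in cents.

    100 cents equal one semitone; positive values mean the pitch is sharp.
    """
    return 1200 * math.log2(measured_hz / target_hz)


def matches_note(measured_hz, midi_note, tolerance_cents=50.0):
    """True if the measured pitch lies within the tolerance of the note.

    The 50-cent default (half a semitone) is a hypothetical threshold.
    """
    deviation = cents_off(measured_hz, note_to_hz(midi_note))
    return abs(deviation) <= tolerance_cents


if __name__ == "__main__":
    # A slightly sharp A4 still matches; a pitch near A#4 does not.
    print(matches_note(442.0, A4_MIDI))
    print(matches_note(466.16, A4_MIDI))
```

A check of this kind would be applied to each of the four plucks separately, since the prompt asks for a stable pitch with no bend or vibrato.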

Citation

If you find AVGen-Bench useful, please cite:

@misc{zhou2026avgenbenchtaskdrivenbenchmarkmultigranular,
      title={AVGen-Bench: A Task-Driven Benchmark for Multi-Granular Evaluation of Text-to-Audio-Video Generation}, 
      author={Ziwei Zhou and Zeyuan Lai and Rui Wang and Yifan Yang and Zhen Xing and Yuqing Yang and Qi Dai and Lili Qiu and Chong Luo},
      year={2026},
      eprint={2604.08540},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2604.08540}, 
}