¹USTC, ²SJTU, ³PKUSZ, ⁴Tencent, ⁵BFA
*Corresponding Author
The rapid advancement of video generation has rendered existing evaluation systems inadequate for assessing state-of-the-art models, primarily due to simple prompts that cannot showcase a model's capabilities, fixed evaluation operators that struggle with Out-of-Distribution (OOD) cases, and misalignment between computed metrics and human preferences. To bridge this gap, we propose VideoGen-Eval, an agent-based evaluation system that integrates LLM-based content structuring, MLLM-based content judgment, and patch tools designed for temporally dense dimensions, to achieve dynamic, flexible, and expandable video generation evaluation. Additionally, we introduce a video generation benchmark to evaluate existing cutting-edge models and verify the effectiveness of our evaluation system. It comprises 700 structured, content-rich prompts (both T2V and I2V) and over 12,000 videos generated by 20+ models; among them, 8 cutting-edge models are selected for quantitative evaluation by both the agent system and human annotators. Extensive experiments validate that our agent-based evaluation system aligns strongly with human preferences and completes evaluations reliably, and they also confirm the diversity and richness of the benchmark.
Key Features:
The agent-based evaluation system is composed of three main parts: an LLM-based content structurer, an MLLM-based content judger, and patch tools. The content structurer parses the input prompt into dimension-specific content and sends it, along with the generated video, to the MLLM-based content judger. Leveraging the MLLM's fundamental understanding capabilities and externally invoked temporally dense tools, the system assesses whether the multiple dimensions of the input are accurately generated. The resulting scores and feedback are used for ranking, evaluation, and potentially for supporting post-training.
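As a rough illustration of how these three components fit together, the Python sketch below wires a structurer, a judger, and patch tools into one evaluation call. All class and method names (ContentStructurer, ContentJudger, parse_dimensions, extract, assess) are hypothetical placeholders for this sketch, not the released implementation.

```python
# Minimal sketch of the agent evaluation pipeline (illustrative only).
# ContentStructurer, ContentJudger, and the patch-tool interface are
# hypothetical names; the released system may be organized differently.
from dataclasses import dataclass


@dataclass
class DimensionResult:
    dimension: str   # e.g. "subject", "motion", "camera"
    score: float     # normalized score
    feedback: str    # natural-language judgment from the MLLM


class ContentStructurer:
    """LLM-based: parse a prompt into dimension-specific content to check."""

    def __init__(self, llm):
        self.llm = llm

    def structure(self, prompt: str) -> dict[str, str]:
        # Ask the LLM to split the prompt into per-dimension requirements,
        # e.g. {"subject": "...", "motion": "...", "camera": "..."}.
        return self.llm.parse_dimensions(prompt)


class ContentJudger:
    """MLLM-based: judge whether each dimension is faithfully generated."""

    def __init__(self, mllm, patch_tools):
        self.mllm = mllm
        self.patch_tools = patch_tools  # temporally dense helpers (e.g. tracking)

    def judge(self, video_path: str, requirements: dict[str, str]) -> list[DimensionResult]:
        results = []
        for dim, requirement in requirements.items():
            # Temporal-dense dimensions are patched with external tool outputs;
            # the MLLM then judges the video against the requirement.
            evidence = self.patch_tools.extract(video_path, dim)
            score, feedback = self.mllm.assess(video_path, requirement, evidence)
            results.append(DimensionResult(dim, score, feedback))
        return results


def evaluate(prompt: str, video_path: str, structurer, judger) -> list[DimensionResult]:
    requirements = structurer.structure(prompt)
    return judger.judge(video_path, requirements)
```

The per-dimension scores and feedback returned by such a pipeline can then be aggregated for model ranking or reused as reward-style signals for post-training.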
We employ film and television professionals to annotate the videos generated by 8 cutting-edge models according to established rules, in order to verify the reliability of the agent system. The figure above shows the information provided during the human annotation process, as well as the annotation instructions and example results.
(a) The distribution of scores given by humans and the agent system for each dimension of the 8 models.
(b) Alignment ratios of the agent evaluation with the human evaluation for different models across multiple dimensions.
Comparison of rankings produced by VBench operators, our agent system, and human evaluators on several evaluation dimensions.
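For readers who want to reproduce this kind of comparison, below is a small, self-contained sketch of two quantities such a study typically reports: a per-dimension alignment ratio (here simply the fraction of videos on which the agent's score matches the human score after rounding) and a Kendall-tau correlation between the model rankings produced by two evaluators. The exact definitions used by VideoGen-Eval may differ; the agreement criterion and function names here are illustrative assumptions.

```python
# Illustrative sketch: comparing agent scores/rankings with human ones.
# The rounding-based agreement criterion and the use of Kendall's tau are
# assumptions for this example, not the paper's exact definitions.
from scipy.stats import kendalltau


def alignment_ratio(agent_scores: list[float], human_scores: list[float]) -> float:
    """Fraction of videos where agent and human give the same (rounded) score."""
    assert len(agent_scores) == len(human_scores) and agent_scores
    matches = sum(round(a) == round(h) for a, h in zip(agent_scores, human_scores))
    return matches / len(agent_scores)


def ranking_agreement(scores_a: dict[str, float], scores_b: dict[str, float]) -> float:
    """Kendall-tau correlation between two model rankings (e.g. agent vs. human)."""
    models = sorted(scores_a)  # shared model list, fixed order
    tau, _ = kendalltau([scores_a[m] for m in models],
                        [scores_b[m] for m in models])
    return tau


if __name__ == "__main__":
    # Toy example with made-up numbers (not results from the benchmark).
    agent = {"model_A": 0.82, "model_B": 0.74, "model_C": 0.69}
    human = {"model_A": 0.85, "model_B": 0.70, "model_C": 0.72}
    print(ranking_agreement(agent, human))
```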
We will continue to update the results with model releases and version updates.
If there are any results that you would like to showcase, feel free to reach out to: