T2VEval: Benchmark dataset and objective evaluation method for T2V-generated videos

IF 3.4 | Zone 2, Engineering and Technology | Q1 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE

Displays Pub Date : 2025-08-14 DOI:10.1016/j.displa.2025.103178

Zelu Qi , Ping Shi , Shuqi Wang , Chaoyang Zhang , Fei Zhao , Zefeng Ying , Da Pan , Xi Yang , Zheqi He , Teng Dai

Displays, Volume 91, Article 103178 (Journal Article). Publisher page: https://www.sciencedirect.com/science/article/pii/S014193822500215X

Citations: 0

Abstract

Recent advances in text-to-video (T2V) technology, as demonstrated by models such as Runway Gen-3, Pika, Sora, and Kling, have significantly broadened the applicability and popularity of the technology. This progress has created a growing demand for accurate quality assessment metrics to evaluate the perceptual quality of T2V-generated videos and to optimize video generation models. However, assessing the quality of text-to-video outputs remains challenging due to the presence of highly complex distortions, such as unnatural actions and phenomena that defy human cognition. To address these challenges, we constructed T2VEval-Bench, a multi-dimensional benchmark dataset for text-to-video quality evaluation, which contains 148 textual prompts and 1,783 videos generated by 13 T2V models. To ensure a comprehensive evaluation, we scored each video in the subjective experiment on four dimensions: overall impression, text–video consistency, realness, and technical quality. Based on T2VEval-Bench, we developed T2VEval, a multi-branch fusion scheme for T2V quality evaluation. T2VEval assesses videos across three branches: text–video consistency, realness, and technical quality. Using an attention-based fusion module, T2VEval integrates features from each branch and predicts scores with the aid of a large language model. Additionally, we implemented a divide-and-conquer training strategy, enabling each branch to learn targeted knowledge while maintaining synergy with the others. Experimental results demonstrate that T2VEval achieves state-of-the-art performance across multiple metrics.
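
The abstract does not specify how the attention-based fusion module is built, so the following is only a minimal sketch of one common way to fuse per-branch features: scaled dot-product attention with a single query vector. The function name `attention_fuse`, the four-dimensional toy features, and the `query` vector are all hypothetical and chosen for illustration; this is not the authors' implementation, and the large-language-model scoring step is not modelled here.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def attention_fuse(branch_features, query):
    """Fuse branch feature vectors with scaled dot-product attention.

    branch_features: dict mapping branch name -> feature vector (same length d).
    query: hypothetical query vector of length d (assumed learned in practice).
    Returns the fused vector and the attention weight assigned to each branch.
    """
    names = list(branch_features)
    d = len(query)
    # One attention score per branch, scaled by sqrt(d).
    scores = [dot(query, branch_features[n]) / math.sqrt(d) for n in names]
    weights = softmax(scores)
    # Fused feature is the attention-weighted sum of the branch features.
    fused = [
        sum(w * branch_features[n][i] for w, n in zip(weights, names))
        for i in range(d)
    ]
    return fused, dict(zip(names, weights))

# Toy features for the three branches named in the abstract (values are made up).
features = {
    "text_video_consistency": [0.2, 0.9, 0.1, 0.4],
    "realness": [0.7, 0.1, 0.3, 0.5],
    "technical_quality": [0.5, 0.5, 0.5, 0.5],
}
query = [1.0, 0.0, 0.0, 1.0]  # hypothetical query vector

fused, weights = attention_fuse(features, query)
print(weights)
print(fused)
```

In this sketch the branch whose features align best with the query receives the largest weight, and the fused vector is a convex combination of the branch features, so each coordinate stays within the range spanned by the branches.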


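Source journal

Displays (Engineering and Technology: Engineering, Electrical & Electronic)

CiteScore: 4.60

Self-citation rate: 25.60%

Articles published: 138

Review time: 92 days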

About the journal: Displays is the international journal covering the research and development of display technology, its effective presentation and perception of information, and applications and systems including display-human interface. Technical papers on practical developments in Displays technology provide an effective channel to promote greater understanding and cross-fertilization across the diverse disciplines of the Displays community. Original research papers solving ergonomics issues at the display-human interface advance effective presentation of information. Tutorial papers covering fundamentals intended for display technologies and human factor engineers new to the field will also occasionally be featured.