MI3S: A multimodal large language model assisted quality assessment framework for AI-generated talking heads
Authors: Yingjie Zhou, Zicheng Zhang, Sijing Wu, Jun Jia, Yanwei Jiang, Wei Sun, Xiaohong Liu, Xiongkuo Min, Guangtao Zhai
Journal: Information Processing & Management, Volume 63, Issue 1, Article 104321 (Impact Factor 6.9; JCR Q1, Computer Science, Information Systems)
Published: 2025-08-12
DOI: 10.1016/j.ipm.2025.104321
URL: https://www.sciencedirect.com/science/article/pii/S0306457325002626
Although current speech-driven technologies enable the rapid generation of AI-generated talking heads (AGTHs), human supervision remains necessary to ensure the quality of the output. However, manual evaluation is time-consuming and labor-intensive, which makes it increasingly impractical for large-scale AGTH production. To overcome this limitation, we propose a novel objective quality assessment framework, MI3S, which employs a Multimodal Large Language Model (MLLM) to evaluate AGTHs across four key dimensions: Image quality, Image aesthetics, Identity consistency, and Sound-lip synchronization. To capture temporal dynamics more effectively, we introduce a variable-length video memory filter (VVMF), inspired by principles of human visual cognition. The MI3S framework supports both zero-shot inference and supervised learning paradigms. On the THQA dataset, which comprises 800 AGTHs, MI3S achieves a correlation coefficient of 0.7946 between its predictions and human perceptual ratings, exceeding existing quality assessment methods by 3.4% and thereby offering an efficient, robust, and objective solution for evaluating AGTH quality.
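The headline result is a correlation between predicted quality scores and human ratings. The abstract does not say which coefficient is reported; quality assessment work commonly uses the Spearman rank-order coefficient (SRCC) and the Pearson linear coefficient (PLCC), so the sketch below computes both. It is a minimal illustration, not the authors' evaluation code: the function names and the six score pairs are hypothetical and are not taken from the THQA dataset.

```python
from math import sqrt
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson linear correlation coefficient (PLCC)."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank-order correlation (SRCC): Pearson applied to ranks."""
    return pearson(average_ranks(x), average_ranks(y))


if __name__ == "__main__":
    # Hypothetical scores for six videos (illustrative only).
    human_mos = [4.2, 3.1, 2.5, 4.8, 1.9, 3.6]   # mean opinion scores
    predicted = [4.0, 3.3, 2.9, 4.5, 2.2, 3.0]   # model outputs
    print(f"SRCC = {spearman(human_mos, predicted):.4f}")
    print(f"PLCC = {pearson(human_mos, predicted):.4f}")
```

SRCC depends only on how the videos are ordered, so it is unaffected by any monotonic rescaling of the predicted scores, whereas PLCC also reflects how linearly the predictions track the human ratings.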
About the journal:
Information Processing and Management is dedicated to publishing cutting-edge original research at the convergence of computing and information science. Our scope encompasses theory, methods, and applications across various domains, including advertising, business, health, information science, information technology marketing, and social computing.
We aim to cater to the interests of both primary researchers and practitioners by offering an effective platform for the timely dissemination of advanced and topical issues in this interdisciplinary field. The journal places particular emphasis on original research articles, research survey articles, research method articles, and articles addressing critical applications of research. Join us in advancing knowledge and innovation at the intersection of computing and information science.