Video is Worth a Thousand Images: Exploring the Latest Trends in Long Video Generation

IF 28 · Zone 1 (Computer Science) · Q1, Computer Science, Theory & Methods

ACM Computing Surveys · Pub Date: 2025-10-10 · DOI: 10.1145/3771724

Faraz Waseem, Muhammad Shahzad

Citations: 0

Abstract

Video is Worth a Thousand Images: Exploring the Latest Trends in Long Video Generation

An image may convey a thousand words, but a video, composed of hundreds or thousands of image frames, tells a more intricate story. Despite significant progress in multimodal large language models (MLLMs), generating extended videos remains a formidable challenge. As of this writing, OpenAI’s Sora [1], the current state-of-the-art system, is still limited to producing videos of up to one minute in length. This limitation stems from the complexity of long video generation, which requires more than generative AI techniques for approximating density functions. Critical elements, such as planning, narrative construction, and spatiotemporal continuity, pose significant challenges. Integrating generative AI with a divide-and-conquer approach could improve scalability for longer videos while offering greater control. In this survey, we examine the current landscape of long video generation, covering foundational techniques such as GANs and diffusion models, video generation strategies, large-scale training datasets, quality metrics for evaluating long videos, and future research areas to address the limitations of existing video generation capabilities. We believe it would serve as a comprehensive foundation, offering extensive information to guide future advancements and research in the field of long video generation.

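Source journal: ACM Computing Surveys (Engineering & Technology: Computer Science, Theory & Methods)

CiteScore: 33.20
Self-citation rate: 0.60%
Articles published: 372
Review time: 12 months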

Journal description: ACM Computing Surveys (CSUR) is an academic journal that publishes surveys and tutorials on various areas of computing research and practice. The journal aims to provide comprehensive and easily understandable articles that guide readers through the literature and help them understand topics outside their specialties. In terms of impact, CSUR has a 2022 Impact Factor of 16.6 and is ranked 3rd out of 111 journals in the field of Computer Science, Theory & Methods. ACM Computing Surveys is indexed and abstracted in various services, including AI2 Semantic Scholar, Baidu, Clarivate/ISI: JCR, CNKI, DeepDyve, DTU, EBSCO: EDS/HOST, and IET Inspec, among others.