{"title":"用于视频段落字幕的内存增强型分层变换器","authors":"Benhui Zhang , Junyu Gao , Yuan Yuan","doi":"10.1016/j.neucom.2024.128835","DOIUrl":null,"url":null,"abstract":"<div><div>Video paragraph captioning aims to describe a video that contains multiple events with a paragraph of generated coherent sentences. Such a captioning task is full of challenges since the high requirements for visual–textual relevance and semantic coherence across the captioning paragraph of a video. In this work, we introduce a memory-enhanced hierarchical transformer for video paragraph captioning. Our model adopts a hierarchical structure, where the outer layer transformer extracts visual information from a global perspective and captures the relevancy between event segments throughout the entire video, while the inner layer transformer further mines local details within each event segment. By thoroughly exploring both global and local visual information at the video and event levels, our model can provide comprehensive visual feature cues for promising paragraph caption generation. Additionally, we design a memory module to capture similar patterns among event segments within a video, which preserves contextual information across event segments and updates its memory state accordingly. Experimental results on two popular datasets, ActivityNet Captions and YouCook2, demonstrate that our proposed model can achieve superior performance, generating higher quality caption while maintaining consistency in the content of video.</div></div>","PeriodicalId":19268,"journal":{"name":"Neurocomputing","volume":"615 ","pages":"Article 128835"},"PeriodicalIF":5.5000,"publicationDate":"2024-11-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Memory-enhanced hierarchical transformer for video paragraph captioning\",\"authors\":\"Benhui Zhang , Junyu Gao , Yuan Yuan\",\"doi\":\"10.1016/j.neucom.2024.128835\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><div>Video paragraph captioning aims to describe a video that contains multiple events with a paragraph of generated coherent sentences. Such a captioning task is full of challenges since the high requirements for visual–textual relevance and semantic coherence across the captioning paragraph of a video. In this work, we introduce a memory-enhanced hierarchical transformer for video paragraph captioning. Our model adopts a hierarchical structure, where the outer layer transformer extracts visual information from a global perspective and captures the relevancy between event segments throughout the entire video, while the inner layer transformer further mines local details within each event segment. By thoroughly exploring both global and local visual information at the video and event levels, our model can provide comprehensive visual feature cues for promising paragraph caption generation. Additionally, we design a memory module to capture similar patterns among event segments within a video, which preserves contextual information across event segments and updates its memory state accordingly. 
Experimental results on two popular datasets, ActivityNet Captions and YouCook2, demonstrate that our proposed model can achieve superior performance, generating higher quality caption while maintaining consistency in the content of video.</div></div>\",\"PeriodicalId\":19268,\"journal\":{\"name\":\"Neurocomputing\",\"volume\":\"615 \",\"pages\":\"Article 128835\"},\"PeriodicalIF\":5.5000,\"publicationDate\":\"2024-11-09\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Neurocomputing\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S0925231224016060\",\"RegionNum\":2,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Neurocomputing","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0925231224016060","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Memory-enhanced hierarchical transformer for video paragraph captioning
Video paragraph captioning aims to describe a video that contains multiple events with a paragraph of coherent generated sentences. The task is challenging because of the high requirements for visual–textual relevance and for semantic coherence across the caption paragraph of a video. In this work, we introduce a memory-enhanced hierarchical transformer for video paragraph captioning. Our model adopts a hierarchical structure: the outer transformer extracts visual information from a global perspective and captures the relevance between event segments throughout the entire video, while the inner transformer further mines local details within each event segment. By thoroughly exploring both global and local visual information at the video and event levels, our model provides comprehensive visual feature cues for paragraph caption generation. Additionally, we design a memory module that captures similar patterns among event segments within a video, preserving contextual information across event segments and updating its memory state accordingly. Experimental results on two popular datasets, ActivityNet Captions and YouCook2, demonstrate that our proposed model achieves superior performance, generating higher-quality captions while maintaining consistency with the video content.
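To make the described architecture concrete, the following is a minimal PyTorch sketch, not the authors' implementation: the class name, feature dimension, mean-pooling of segment frames, and the gated memory-update rule are all assumptions for exposition. It only illustrates how an inner (per-event) transformer, an outer (cross-event) transformer, and a recurrent memory state could fit together.

# Illustrative sketch only; the paper's exact design may differ.
import torch
import torch.nn as nn

class HierarchicalCaptionEncoder(nn.Module):
    """Two-level encoder: an inner transformer mines local details within
    each event segment; an outer transformer relates the event summaries
    across the whole video. A memory vector carries context between events."""

    def __init__(self, d_model=512, n_heads=8, n_layers=2):
        super().__init__()
        inner_layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        outer_layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.inner = nn.TransformerEncoder(inner_layer, n_layers)  # local, per-segment
        self.outer = nn.TransformerEncoder(outer_layer, n_layers)  # global, cross-segment
        # Gated memory update (an assumed rule, standing in for the paper's module).
        self.gate = nn.Linear(2 * d_model, d_model)

    def forward(self, events):
        # events: list of (num_frames_i, d_model) tensors, one per event segment
        memory = torch.zeros(events[0].size(-1))
        summaries = []
        for seg in events:
            # Pool the inner transformer's output into one feature per segment.
            local = self.inner(seg.unsqueeze(0)).mean(dim=1).squeeze(0)
            g = torch.sigmoid(self.gate(torch.cat([memory, local])))
            memory = g * memory + (1 - g) * local      # update memory state
            summaries.append(local + memory)           # context-enriched segment cue
        video = torch.stack(summaries).unsqueeze(0)    # (1, num_events, d_model)
        return self.outer(video).squeeze(0)            # global context across events

# Usage: three event segments with varying frame counts
enc = HierarchicalCaptionEncoder()
segs = [torch.randn(n, 512) for n in (8, 12, 6)]
out = enc(segs)  # (3, 512): one feature per event, to be fed to a caption decoder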
Journal introduction:
Neurocomputing publishes articles describing recent fundamental contributions in the field of neurocomputing. Neurocomputing theory, practice, and applications are the essential topics covered.