Parthasaarathy Sudarsanam;Irene Martín-Morató;Aapo Hakala;Tuomas Virtanen
{"title":"AVCaps: An Audio-Visual Dataset With Modality-Specific Captions","authors":"Parthasaarathy Sudarsanam;Irene Martín-Morató;Aapo Hakala;Tuomas Virtanen","doi":"10.1109/OJSP.2025.3578296","DOIUrl":null,"url":null,"abstract":"This paper introduces AVCaps, an audio-visual dataset that contains separate textual captions for the audio, visual, and audio-visual contents of video clips. The dataset contains 2061 video clips constituting a total of 28.8 hours. We provide up to 5 captions for the audio, visual, and audio-visual content of each clip, crowdsourced separately. Existing datasets focus on a single modality or do not provide modality-specific captions, limiting the study of how each modality contributes to overall comprehension in multimodal settings. Our dataset addresses this critical gap in multimodal research by offering a resource for studying how audio and visual content are captioned individually, as well as how audio-visual content is captioned in relation to these individual modalities. Crowdsourced audio-visual captions are prone to favor visual content over audio content. To avoid this we use large language models (LLMs) to generate three balanced audio-visual captions for each clip based on the crowdsourced captions. We present captioning and retrieval experiments to illustrate the effectiveness of modality-specific captions in evaluating model performance. Specifically, we show that the modality-specific captions allow us to quantitatively assess how well a model understands audio and visual information from a given video. Notably, we find that a model trained on the balanced LLM-generated audio-visual captions captures audio information more effectively compared to a model trained on crowdsourced audio-visual captions. This model achieves a 14% higher Sentence-BERT similarity on crowdsourced audio captions compared to a model trained on crowdsourced audio-visual captions, which are typically more biased towards visual information. We also discuss the possibilities in multimodal representation learning, question answering, developing new video captioning metrics, and generative AI that this dataset unlocks. The dataset is available publicly at Zenodo and Hugging Face.","PeriodicalId":73300,"journal":{"name":"IEEE open journal of signal processing","volume":"6 ","pages":"691-704"},"PeriodicalIF":2.9000,"publicationDate":"2025-06-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=11029114","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE open journal of signal processing","FirstCategoryId":"1085","ListUrlMain":"https://ieeexplore.ieee.org/document/11029114/","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"ENGINEERING, ELECTRICAL & ELECTRONIC","Score":null,"Total":0}
Citations: 0
Abstract
This paper introduces AVCaps, an audio-visual dataset that contains separate textual captions for the audio, visual, and audio-visual content of video clips. The dataset contains 2061 video clips totaling 28.8 hours. We provide up to five captions for the audio, visual, and audio-visual content of each clip, each crowdsourced separately. Existing datasets focus on a single modality or do not provide modality-specific captions, limiting the study of how each modality contributes to overall comprehension in multimodal settings. Our dataset addresses this critical gap in multimodal research by offering a resource for studying how audio and visual content are captioned individually, as well as how audio-visual content is captioned in relation to these individual modalities. Crowdsourced audio-visual captions are prone to favor visual content over audio content. To avoid this, we use large language models (LLMs) to generate three balanced audio-visual captions for each clip based on the crowdsourced captions. We present captioning and retrieval experiments to illustrate the effectiveness of modality-specific captions in evaluating model performance. Specifically, we show that modality-specific captions allow us to quantitatively assess how well a model understands the audio and visual information in a given video. Notably, we find that a model trained on the balanced LLM-generated audio-visual captions captures audio information more effectively than a model trained on crowdsourced audio-visual captions: it achieves 14% higher Sentence-BERT similarity against crowdsourced audio captions than a model trained on crowdsourced audio-visual captions, which are typically biased towards visual information. We also discuss the possibilities that this dataset unlocks in multimodal representation learning, question answering, the development of new video captioning metrics, and generative AI. The dataset is publicly available on Zenodo and Hugging Face.
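The 14% figure above refers to Sentence-BERT similarity, i.e. cosine similarity between sentence embeddings of a model-generated caption and the reference captions. The sketch below is a minimal illustration of how such a score could be computed with the sentence-transformers library; it is not the paper's evaluation code. The checkpoint name, the max-over-references aggregation, and the example captions are assumptions made here for illustration only.

```python
# Minimal sketch of a Sentence-BERT similarity score between a generated
# caption and a set of reference captions. Assumptions (not from the paper):
# the "all-MiniLM-L6-v2" checkpoint and max-over-references aggregation.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

def sbert_similarity(generated: str, references: list[str]) -> float:
    """Return the highest cosine similarity between the generated caption
    and any of the reference captions."""
    gen_emb = model.encode(generated, convert_to_tensor=True)
    ref_embs = model.encode(references, convert_to_tensor=True)
    scores = util.cos_sim(gen_emb, ref_embs)  # shape: (1, len(references))
    return scores.max().item()

# Hypothetical example: one generated audio-visual caption scored against
# crowdsourced audio captions of the same clip.
generated = "A dog barks while a car drives past on a wet road."
references = [
    "A dog is barking and a vehicle passes by.",
    "Barking dog with traffic noise in the background.",
]
print(f"Sentence-BERT similarity: {sbert_similarity(generated, references):.3f}")
```

How the paper aggregates the score over the multiple reference captions per clip (e.g. mean versus max) is not stated in the abstract; the max is used here purely to keep the sketch concrete.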