Parthasaarathy Sudarsanam;Irene Martín-Morató;Aapo Hakala;Tuomas Virtanen
{"title":"AVCaps: An Audio-Visual Dataset With Modality-Specific Captions","authors":"Parthasaarathy Sudarsanam;Irene Martín-Morató;Aapo Hakala;Tuomas Virtanen","doi":"10.1109/OJSP.2025.3578296","DOIUrl":null,"url":null,"abstract":"This paper introduces AVCaps, an audio-visual dataset that contains separate textual captions for the audio, visual, and audio-visual contents of video clips. The dataset contains 2061 video clips constituting a total of 28.8 hours. We provide up to 5 captions for the audio, visual, and audio-visual content of each clip, crowdsourced separately. Existing datasets focus on a single modality or do not provide modality-specific captions, limiting the study of how each modality contributes to overall comprehension in multimodal settings. Our dataset addresses this critical gap in multimodal research by offering a resource for studying how audio and visual content are captioned individually, as well as how audio-visual content is captioned in relation to these individual modalities. Crowdsourced audio-visual captions are prone to favor visual content over audio content. To avoid this we use large language models (LLMs) to generate three balanced audio-visual captions for each clip based on the crowdsourced captions. We present captioning and retrieval experiments to illustrate the effectiveness of modality-specific captions in evaluating model performance. Specifically, we show that the modality-specific captions allow us to quantitatively assess how well a model understands audio and visual information from a given video. Notably, we find that a model trained on the balanced LLM-generated audio-visual captions captures audio information more effectively compared to a model trained on crowdsourced audio-visual captions. This model achieves a 14% higher Sentence-BERT similarity on crowdsourced audio captions compared to a model trained on crowdsourced audio-visual captions, which are typically more biased towards visual information. We also discuss the possibilities in multimodal representation learning, question answering, developing new video captioning metrics, and generative AI that this dataset unlocks. The dataset is available publicly at Zenodo and Hugging Face.","PeriodicalId":73300,"journal":{"name":"IEEE open journal of signal processing","volume":"6 ","pages":"691-704"},"PeriodicalIF":2.9000,"publicationDate":"2025-06-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=11029114","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE open journal of signal processing","FirstCategoryId":"1085","ListUrlMain":"https://ieeexplore.ieee.org/document/11029114/","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"ENGINEERING, ELECTRICAL & ELECTRONIC","Score":null,"Total":0}
Citations: 0
Abstract
This paper introduces AVCaps, an audio-visual dataset that contains separate textual captions for the audio, visual, and audio-visual content of video clips. The dataset contains 2061 video clips totaling 28.8 hours. We provide up to five captions for the audio, visual, and audio-visual content of each clip, each crowdsourced separately. Existing datasets focus on a single modality or do not provide modality-specific captions, limiting the study of how each modality contributes to overall comprehension in multimodal settings. Our dataset addresses this critical gap in multimodal research by offering a resource for studying how audio and visual content are captioned individually, as well as how audio-visual content is captioned in relation to these individual modalities. Crowdsourced audio-visual captions are prone to favor visual content over audio content. To avoid this, we use large language models (LLMs) to generate three balanced audio-visual captions for each clip based on the crowdsourced captions. We present captioning and retrieval experiments to illustrate the effectiveness of modality-specific captions in evaluating model performance. Specifically, we show that modality-specific captions allow us to quantitatively assess how well a model understands the audio and visual information in a given video. Notably, we find that a model trained on the balanced LLM-generated audio-visual captions captures audio information more effectively than a model trained on crowdsourced audio-visual captions: it achieves 14% higher Sentence-BERT similarity against crowdsourced audio captions than a model trained on crowdsourced audio-visual captions, which are typically biased towards visual information. We also discuss the possibilities that this dataset unlocks in multimodal representation learning, question answering, the development of new video captioning metrics, and generative AI. The dataset is publicly available on Zenodo and Hugging Face.
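The 14% figure above refers to Sentence-BERT similarity, i.e. cosine similarity between sentence embeddings of a model-generated caption and the reference captions. The sketch below is a minimal illustration of how such a score could be computed with the sentence-transformers library; it is not the paper's evaluation code. The checkpoint name, the max-over-references aggregation, and the example captions are assumptions made here for illustration only.

```python
# Minimal sketch of a Sentence-BERT similarity score between a generated
# caption and a set of reference captions. Assumptions (not from the paper):
# the "all-MiniLM-L6-v2" checkpoint and max-over-references aggregation.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

def sbert_similarity(generated: str, references: list[str]) -> float:
    """Return the highest cosine similarity between the generated caption
    and any of the reference captions."""
    gen_emb = model.encode(generated, convert_to_tensor=True)
    ref_embs = model.encode(references, convert_to_tensor=True)
    scores = util.cos_sim(gen_emb, ref_embs)  # shape: (1, len(references))
    return scores.max().item()

# Hypothetical example: one generated audio-visual caption scored against
# crowdsourced audio captions of the same clip.
generated = "A dog barks while a car drives past on a wet road."
references = [
    "A dog is barking and a vehicle passes by.",
    "Barking dog with traffic noise in the background.",
]
print(f"Sentence-BERT similarity: {sbert_similarity(generated, references):.3f}")
```

How the paper aggregates the score over the multiple reference captions per clip (e.g. mean versus max) is not stated in the abstract; the max is used here purely to keep the sketch concrete.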