Audio Captioning Based on Combined Audio and Semantic Embeddings

Aysegül Özkaya Eren, M. Sert
{"title":"Audio Captioning Based on Combined Audio and Semantic Embeddings","authors":"Aysegül Özkaya Eren, M. Sert","doi":"10.1109/ISM.2020.00014","DOIUrl":null,"url":null,"abstract":"Audio captioning is a recently proposed task for automatically generating a textual description of a given audio clip. Most existing approaches use the encoder-decoder model without using semantic information. In this study, we propose a bi-directional Gated Recurrent Unit (BiGRU) model based on encoder-decoder architecture using audio and semantic embed-dings. To obtain semantic embeddings, we extract subject-verb embeddings using the subjects and verbs from the audio captions. We use a Multilayer Perceptron classifier to predict subject-verb embeddings of test audio clips for the testing stage. Within the aim of extracting audio features, in addition to log Mel energies, we use a pretrained audio neural network (PANN) as a feature extractor which is used for the first time in the audio captioning task to explore the usability of audio embeddings in the audio captioning task. We combine audio embeddings and semantic embeddings to feed the BiGRU-based encoder-decoder model. Following this, we evaluate our model on two audio captioning datasets: Clotho and AudioCaps. Experimental results show that the proposed BiGRU-based deep model significantly outperforms the state of the art results across different evaluation metrics and inclusion of semantic information enhance the captioning performance.","PeriodicalId":120972,"journal":{"name":"2020 IEEE International Symposium on Multimedia (ISM)","volume":"348 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2020-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"25","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2020 IEEE International Symposium on Multimedia (ISM)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ISM.2020.00014","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 25

Abstract

Audio captioning is a recently proposed task for automatically generating a textual description of a given audio clip. Most existing approaches use an encoder-decoder model without semantic information. In this study, we propose a bi-directional Gated Recurrent Unit (BiGRU) model based on an encoder-decoder architecture that uses both audio and semantic embeddings. To obtain semantic embeddings, we extract subject-verb embeddings from the subjects and verbs of the audio captions. At the testing stage, we use a Multilayer Perceptron classifier to predict the subject-verb embeddings of test audio clips. To extract audio features, we use a pretrained audio neural network (PANN) as a feature extractor in addition to log Mel energies; PANN is applied to audio captioning for the first time here to explore the usability of audio embeddings for this task. We combine the audio embeddings and semantic embeddings to feed the BiGRU-based encoder-decoder model. We then evaluate our model on two audio captioning datasets: Clotho and AudioCaps. Experimental results show that the proposed BiGRU-based deep model significantly outperforms state-of-the-art results across different evaluation metrics, and that the inclusion of semantic information enhances captioning performance.
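To make the architecture described above concrete, the following is a minimal PyTorch sketch of a BiGRU encoder-decoder that fuses an audio embedding with a subject-verb semantic embedding. It is not the authors' released code; all layer sizes, the feature dimensions (e.g., PANN or log-Mel frame features), and the fusion-by-concatenation strategy are illustrative assumptions.

```python
# Sketch only: a BiGRU audio encoder whose final states are concatenated with a
# semantic (subject-verb) embedding and used to initialize a GRU caption decoder.
import torch
import torch.nn as nn


class AudioSemanticCaptioner(nn.Module):
    def __init__(self, audio_dim=2048, semantic_dim=300,
                 hidden_dim=256, vocab_size=5000, embed_dim=256):
        super().__init__()
        # BiGRU encoder over per-frame audio features (assumed PANN or log-Mel frames).
        self.encoder = nn.GRU(audio_dim, hidden_dim,
                              batch_first=True, bidirectional=True)
        # Project [forward state; backward state; semantic embedding] to the
        # decoder's initial hidden state.
        self.init_proj = nn.Linear(2 * hidden_dim + semantic_dim, hidden_dim)
        # Word-level GRU decoder that emits caption tokens.
        self.word_embed = nn.Embedding(vocab_size, embed_dim)
        self.decoder = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, audio_feats, semantic_emb, caption_tokens):
        # audio_feats: (B, T, audio_dim); semantic_emb: (B, semantic_dim)
        # caption_tokens: (B, L) ground-truth tokens for teacher forcing.
        _, h = self.encoder(audio_feats)              # h: (2, B, hidden_dim)
        audio_ctx = torch.cat([h[0], h[1]], dim=-1)   # (B, 2*hidden_dim)
        fused = torch.cat([audio_ctx, semantic_emb], dim=-1)
        h0 = torch.tanh(self.init_proj(fused)).unsqueeze(0)  # (1, B, hidden_dim)
        dec_out, _ = self.decoder(self.word_embed(caption_tokens), h0)
        return self.out(dec_out)                      # (B, L, vocab_size) logits


if __name__ == "__main__":
    model = AudioSemanticCaptioner()
    audio = torch.randn(4, 100, 2048)       # 4 clips, 100 frames of audio embeddings
    semantic = torch.randn(4, 300)          # predicted subject-verb embeddings
    tokens = torch.randint(0, 5000, (4, 20))
    print(model(audio, semantic, tokens).shape)  # torch.Size([4, 20, 5000])
```

In this sketch the semantic embedding conditions the decoder only through its initial state; the paper's exact fusion point and decoder inputs may differ.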