{"title":"Audio Captioning Based on Combined Audio and Semantic Embeddings","authors":"Aysegül Özkaya Eren, M. Sert","doi":"10.1109/ISM.2020.00014","DOIUrl":null,"url":null,"abstract":"Audio captioning is a recently proposed task for automatically generating a textual description of a given audio clip. Most existing approaches use the encoder-decoder model without using semantic information. In this study, we propose a bi-directional Gated Recurrent Unit (BiGRU) model based on encoder-decoder architecture using audio and semantic embed-dings. To obtain semantic embeddings, we extract subject-verb embeddings using the subjects and verbs from the audio captions. We use a Multilayer Perceptron classifier to predict subject-verb embeddings of test audio clips for the testing stage. Within the aim of extracting audio features, in addition to log Mel energies, we use a pretrained audio neural network (PANN) as a feature extractor which is used for the first time in the audio captioning task to explore the usability of audio embeddings in the audio captioning task. We combine audio embeddings and semantic embeddings to feed the BiGRU-based encoder-decoder model. Following this, we evaluate our model on two audio captioning datasets: Clotho and AudioCaps. Experimental results show that the proposed BiGRU-based deep model significantly outperforms the state of the art results across different evaluation metrics and inclusion of semantic information enhance the captioning performance.","PeriodicalId":120972,"journal":{"name":"2020 IEEE International Symposium on Multimedia (ISM)","volume":"348 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2020-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"25","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2020 IEEE International Symposium on Multimedia (ISM)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ISM.2020.00014","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 25
Abstract
Audio captioning is a recently proposed task of automatically generating a textual description for a given audio clip. Most existing approaches use an encoder-decoder model without semantic information. In this study, we propose a bidirectional Gated Recurrent Unit (BiGRU) model based on an encoder-decoder architecture that uses both audio and semantic embeddings. To obtain the semantic embeddings, we extract subject-verb embeddings from the subjects and verbs of the audio captions. At the testing stage, a Multilayer Perceptron classifier predicts the subject-verb embeddings of the test audio clips. To extract audio features, in addition to log Mel energies we use a pretrained audio neural network (PANN) as a feature extractor, employed here for the first time in audio captioning to explore the usability of such audio embeddings for this task. We combine the audio and semantic embeddings to feed the BiGRU-based encoder-decoder model. We then evaluate our model on two audio captioning datasets: Clotho and AudioCaps. Experimental results show that the proposed BiGRU-based deep model significantly outperforms state-of-the-art results across different evaluation metrics, and that the inclusion of semantic information enhances captioning performance.
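To make the described architecture concrete, below is a minimal, illustrative sketch (not the authors' code) of the core idea: an audio embedding (e.g., frame-level PANN or log-Mel features) is concatenated with a subject-verb semantic embedding and fed to a BiGRU encoder whose state initializes a GRU decoder that generates caption tokens. All layer sizes, names, and shapes are hypothetical assumptions for illustration.

```python
import torch
import torch.nn as nn


class BiGRUCaptioner(nn.Module):
    """Sketch of a BiGRU encoder-decoder fed with combined audio + semantic embeddings."""

    def __init__(self, audio_dim=2048, semantic_dim=300, hidden_dim=256, vocab_size=5000):
        super().__init__()
        # Encoder: bidirectional GRU over the concatenated audio + semantic features
        self.encoder = nn.GRU(audio_dim + semantic_dim, hidden_dim,
                              batch_first=True, bidirectional=True)
        # Decoder: unidirectional GRU that generates the caption token by token
        self.embed = nn.Embedding(vocab_size, hidden_dim)
        self.decoder = nn.GRU(hidden_dim, 2 * hidden_dim, batch_first=True)
        self.out = nn.Linear(2 * hidden_dim, vocab_size)

    def forward(self, audio_feats, semantic_emb, captions):
        # audio_feats: (batch, time, audio_dim), e.g., PANN or log-Mel frame embeddings
        # semantic_emb: (batch, semantic_dim), predicted subject-verb embedding
        # captions: (batch, seq_len) token ids (teacher forcing during training)
        sem = semantic_emb.unsqueeze(1).expand(-1, audio_feats.size(1), -1)
        enc_in = torch.cat([audio_feats, sem], dim=-1)
        _, h = self.encoder(enc_in)                       # h: (2, batch, hidden_dim)
        h = torch.cat([h[0], h[1]], dim=-1).unsqueeze(0)  # merge directions -> (1, batch, 2*hidden)
        dec_out, _ = self.decoder(self.embed(captions), h)
        return self.out(dec_out)                          # (batch, seq_len, vocab_size)


# Usage with random tensors, just to show the expected shapes
model = BiGRUCaptioner()
logits = model(torch.randn(4, 100, 2048), torch.randn(4, 300),
               torch.randint(0, 5000, (4, 20)))
print(logits.shape)  # torch.Size([4, 20, 5000])
```

The design choice worth noting is that the semantic embedding is broadcast along the time axis and concatenated with every audio frame, so the encoder sees both modalities at each step; other fusion strategies (e.g., initializing the decoder with the semantic vector) are equally plausible readings of the abstract.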