Multimodal Representation of Advertisements Using Segment-level Autoencoders

Krishna Somandepalli, Victor R. Martinez, Naveen Kumar, Shrikanth S. Narayanan
{"title":"使用分段级自动编码器的广告多模态表示","authors":"Krishna Somandepalli, Victor R. Martinez, Naveen Kumar, Shrikanth S. Narayanan","doi":"10.1145/3242969.3243026","DOIUrl":null,"url":null,"abstract":"Automatic analysis of advertisements (ads) poses an interesting problem for learning multimodal representations. A promising direction of research is the development of deep neural network autoencoders to obtain inter-modal and intra-modal representations. In this work, we propose a system to obtain segment-level unimodal and joint representations. These features are concatenated, and then averaged across the duration of an ad to obtain a single multimodal representation. The autoencoders are trained using segments generated by time-aligning frames between the audio and video modalities with forward and backward context. In order to assess the multimodal representations, we consider the tasks of classifying an ad as funny or exciting in a publicly available dataset of 2,720 ads. For this purpose we train the segment-level autoencoders on a larger, unlabeled dataset of 9,740 ads, agnostic of the test set. Our experiments show that: 1) the multimodal representations outperform joint and unimodal representations, 2) the different representations we learn are complementary to each other, and 3) the segment-level multimodal representations perform better than classical autoencoders and cross-modal representations -- within the context of the two classification tasks. We obtain an improvement of about 5% in classification accuracy compared to a competitive baseline.","PeriodicalId":308751,"journal":{"name":"Proceedings of the 20th ACM International Conference on Multimodal Interaction","volume":"42 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2018-10-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"6","resultStr":"{\"title\":\"Multimodal Representation of Advertisements Using Segment-level Autoencoders\",\"authors\":\"Krishna Somandepalli, Victor R. Martinez, Naveen Kumar, Shrikanth S. Narayanan\",\"doi\":\"10.1145/3242969.3243026\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Automatic analysis of advertisements (ads) poses an interesting problem for learning multimodal representations. A promising direction of research is the development of deep neural network autoencoders to obtain inter-modal and intra-modal representations. In this work, we propose a system to obtain segment-level unimodal and joint representations. These features are concatenated, and then averaged across the duration of an ad to obtain a single multimodal representation. The autoencoders are trained using segments generated by time-aligning frames between the audio and video modalities with forward and backward context. In order to assess the multimodal representations, we consider the tasks of classifying an ad as funny or exciting in a publicly available dataset of 2,720 ads. For this purpose we train the segment-level autoencoders on a larger, unlabeled dataset of 9,740 ads, agnostic of the test set. Our experiments show that: 1) the multimodal representations outperform joint and unimodal representations, 2) the different representations we learn are complementary to each other, and 3) the segment-level multimodal representations perform better than classical autoencoders and cross-modal representations -- within the context of the two classification tasks. 
We obtain an improvement of about 5% in classification accuracy compared to a competitive baseline.\",\"PeriodicalId\":308751,\"journal\":{\"name\":\"Proceedings of the 20th ACM International Conference on Multimodal Interaction\",\"volume\":\"42 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2018-10-02\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"6\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the 20th ACM International Conference on Multimodal Interaction\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3242969.3243026\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 20th ACM International Conference on Multimodal Interaction","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3242969.3243026","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 6

Abstract

Automatic analysis of advertisements (ads) poses an interesting problem for learning multimodal representations. A promising direction of research is the development of deep neural network autoencoders to obtain inter-modal and intra-modal representations. In this work, we propose a system to obtain segment-level unimodal and joint representations. These features are concatenated, and then averaged across the duration of an ad to obtain a single multimodal representation. The autoencoders are trained using segments generated by time-aligning frames between the audio and video modalities with forward and backward context. In order to assess the multimodal representations, we consider the tasks of classifying an ad as funny or exciting in a publicly available dataset of 2,720 ads. For this purpose we train the segment-level autoencoders on a larger, unlabeled dataset of 9,740 ads, agnostic of the test set. Our experiments show that: 1) the multimodal representations outperform joint and unimodal representations, 2) the different representations we learn are complementary to each other, and 3) the segment-level multimodal representations perform better than classical autoencoders and cross-modal representations -- within the context of the two classification tasks. We obtain an improvement of about 5% in classification accuracy compared to a competitive baseline.
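The abstract describes the representation pipeline only at a high level: per-segment unimodal and joint autoencoder codes are concatenated and then averaged across the ad's duration. The sketch below illustrates that pooling step in PyTorch. It is a minimal sketch under assumed interfaces; the class and function names, layer sizes, and feature dimensions are illustrative, not the authors' implementation.

```python
# Minimal sketch of segment-level autoencoders and temporal averaging.
# All dimensions and names are assumptions for illustration only.
import torch
import torch.nn as nn


class SegmentAutoencoder(nn.Module):
    """A simple bottleneck autoencoder applied to one segment-level feature vector."""

    def __init__(self, in_dim: int, code_dim: int):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, code_dim), nn.ReLU())
        self.decoder = nn.Linear(code_dim, in_dim)

    def forward(self, x):
        code = self.encoder(x)
        return self.decoder(code), code  # reconstruction, bottleneck representation


def ad_representation(audio_segments, video_segments, audio_ae, video_ae, joint_ae):
    """Encode each time-aligned segment, concatenate unimodal and joint codes,
    then average over the ad's duration to obtain one multimodal vector.

    audio_segments, video_segments: tensors of shape (num_segments, feat_dim).
    """
    _, a_code = audio_ae(audio_segments)
    _, v_code = video_ae(video_segments)
    _, j_code = joint_ae(torch.cat([audio_segments, video_segments], dim=-1))
    segment_feats = torch.cat([a_code, v_code, j_code], dim=-1)  # (num_segments, 3 * code_dim)
    return segment_feats.mean(dim=0)  # single multimodal representation for the ad


if __name__ == "__main__":
    # Hypothetical feature sizes: 128-d audio and 512-d video segment features.
    audio_ae = SegmentAutoencoder(in_dim=128, code_dim=32)
    video_ae = SegmentAutoencoder(in_dim=512, code_dim=32)
    joint_ae = SegmentAutoencoder(in_dim=128 + 512, code_dim=32)
    ad_vec = ad_representation(torch.randn(40, 128), torch.randn(40, 512),
                               audio_ae, video_ae, joint_ae)
    print(ad_vec.shape)  # torch.Size([96])
```

The resulting fixed-length vector could then be fed to any standard classifier for the funny/exciting prediction tasks; the abstract does not specify which classifier the authors use.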