Evaluating Early Fusion Operators at Mid-Level Feature Space

Antonio A. R. Beserra, R. M. Kishi, R. Goularte
DOI: 10.1145/3428658.3431079
Published in: Proceedings of the Brazilian Symposium on Multimedia and the Web, 2020-11-30
Citations: 2

Abstract

Early fusion techniques have been proposed for video analysis tasks as a way to improve efficacy by generating compact data models capable of keeping the semantic clues present in multimodal data. The first attempts to fuse multimodal data employed fusion operators at the low-level feature space, losing data representativeness. This drove later research efforts to evolve simple operators into complex operations, which became, in general, inseparable from the processing of multimodal semantic clues. In this paper, we investigate the application of early multimodal fusion operators at the mid-level feature space. Five different operators (Concatenation, Sum, Gram, Average, and Maximum) were employed to fuse mid-level multimodal video features. The fused data derived from each operator were then used as input for two different video analysis tasks: Temporal Video Scene Segmentation and Video Classification. For each task, we performed a comparative analysis between the operators and related-work techniques designed for these tasks using complex fusion operations. The efficacy results reached by the operators were very close to those reached by the techniques, providing strong evidence that working on a more homogeneous feature space can reduce known low-level fusion drawbacks. In addition, operators make data fusion separable, allowing researchers to keep their focus on developing semantic-clue representations.
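The five operators named in the abstract can be sketched as simple element-wise or bilinear operations on mid-level feature vectors. The snippet below is a minimal illustration, not the paper's implementation: it assumes both modalities are represented as equal-length vectors, and it takes the Gram operator to be the flattened outer product of the two vectors, which may differ from the exact definition used in the paper.

```python
import numpy as np

def fuse(a: np.ndarray, b: np.ndarray, op: str) -> np.ndarray:
    """Fuse two mid-level feature vectors with a simple early-fusion operator.

    Hypothetical sketch: `a` and `b` stand for features of two modalities
    (e.g., visual and aural) assumed to have the same dimensionality.
    """
    if op == "concat":
        return np.concatenate([a, b])      # dimensionalities add up
    if op == "sum":
        return a + b                       # element-wise sum
    if op == "average":
        return (a + b) / 2.0               # element-wise average
    if op == "maximum":
        return np.maximum(a, b)            # element-wise maximum
    if op == "gram":
        return np.outer(a, b).ravel()      # flattened outer (Gram) product
    raise ValueError(f"unknown operator: {op}")

# toy example with 2-dimensional features
a = np.array([1.0, 2.0])
b = np.array([3.0, 4.0])
for op in ("concat", "sum", "average", "maximum", "gram"):
    print(op, fuse(a, b, op))
```

Note that Concatenation and Gram change the output dimensionality (2d and d² respectively), while Sum, Average, and Maximum preserve it; this trade-off between compactness and representativeness is part of what the comparative analysis in the paper examines.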