Differently processed modality and appropriate model selection lead to richer representation of the multimodal input

Saroj Kumar Panda, Tausif Diwan, Omprakash G. Kakde
{"title":"Differently processed modality and appropriate model selection lead to richer representation of the multimodal input","authors":"Saroj Kumar Panda, Tausif Diwan, Omprakash G. Kakde","doi":"10.1007/s41870-024-02113-4","DOIUrl":null,"url":null,"abstract":"<p>We aim to effectively solve and improvise the Meta Meme Challenge for the binary classification of hateful memes detection on a multimodal dataset launched by Meta. This problem has its challenges in terms of individual modality processing and its impact on the final classification of hateful memes. We focus on feature-level fusion methodologies in proposing the solutions for hateful memes detection in comparison with the decision-level fusion as feature-level fusion generates richer features’ representation for further processing. Appropriate model selection in multimodal data processing plays an important role in the downstream tasks. Moreover, inherent negativity associated with the visual modality may not be detected completely through the visual processing models, necessitating the differently processed visual data through some other techniques. Specifically, we propose two feature-level fusion-based methodologies for the aforesaid classification problem, employing VisualBERT for the effective representation of textual and visual modality. Additionally, we employ image captioning generating the textual captions from the visual modality of the multimodal input which is further fused with the actual text associated with the input through the Tensor Fusion Networks. Our proposed model considerably outperforms the state of the arts on accuracy and AuROC performance metrics.</p>","PeriodicalId":14138,"journal":{"name":"International Journal of Information Technology","volume":"15 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-08-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"International Journal of Information Technology","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1007/s41870-024-02113-4","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

We aim to effectively solve and improve upon the Meta Meme Challenge, a binary classification task for hateful meme detection on a multimodal dataset released by Meta. The problem poses challenges in how each individual modality is processed and how that processing affects the final classification of hateful memes. We focus on feature-level fusion methodologies, rather than decision-level fusion, in our proposed hateful meme detection solutions, since feature-level fusion produces a richer feature representation for further processing. Appropriate model selection in multimodal data processing also plays an important role in downstream tasks. Moreover, the inherent negativity associated with the visual modality may not be fully detected by visual processing models alone, which necessitates processing the visual data differently through other techniques. Specifically, we propose two feature-level fusion-based methodologies for this classification problem, employing VisualBERT for effective representation of the textual and visual modalities. Additionally, we employ image captioning to generate textual captions from the visual modality of the multimodal input, which are then fused with the actual text associated with the input through Tensor Fusion Networks. Our proposed model considerably outperforms the state of the art on the accuracy and AUROC metrics.
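The abstract describes fusing image-derived caption text with the meme's original text through Tensor Fusion Networks before classification. As a hedged illustration only, not the authors' implementation, the PyTorch sketch below shows a common Tensor-Fusion-style bimodal fusion: each modality's pooled feature vector is padded with a constant 1 and combined via an outer product, so the fused tensor preserves the unimodal features alongside their pairwise interactions before a small classifier head produces the binary hateful/not-hateful decision. The feature dimensions, hidden size, and classifier head are illustrative assumptions; in practice the two inputs would come from encoders such as VisualBERT and a caption-text encoder.

```python
import torch
import torch.nn as nn


class TensorFusionHead(nn.Module):
    """Sketch of Tensor-Fusion-style bimodal fusion for binary classification.

    Assumptions (not from the paper): pooled per-example feature vectors for
    the meme text and the generated caption, and a two-layer classifier head.
    """

    def __init__(self, text_dim: int, caption_dim: int, hidden_dim: int, num_classes: int = 2):
        super().__init__()
        # Outer product of (text_dim + 1) and (caption_dim + 1) vectors,
        # flattened, feeds the classifier.
        fused_dim = (text_dim + 1) * (caption_dim + 1)
        self.classifier = nn.Sequential(
            nn.Linear(fused_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_classes),
        )

    def forward(self, text_feat: torch.Tensor, caption_feat: torch.Tensor) -> torch.Tensor:
        # Append a constant 1 to each modality so the outer product retains
        # the unimodal features as well as their pairwise interactions.
        ones = text_feat.new_ones(text_feat.size(0), 1)
        t = torch.cat([text_feat, ones], dim=1)      # (B, text_dim + 1)
        c = torch.cat([caption_feat, ones], dim=1)   # (B, caption_dim + 1)
        fused = torch.bmm(t.unsqueeze(2), c.unsqueeze(1))  # (B, text_dim+1, caption_dim+1)
        return self.classifier(fused.flatten(1))


if __name__ == "__main__":
    # Toy run with random tensors standing in for encoder outputs; small
    # dimensions keep the flattened outer product manageable.
    model = TensorFusionHead(text_dim=128, caption_dim=128, hidden_dim=256)
    text_feat = torch.randn(4, 128)     # pooled meme-text features (assumed)
    caption_feat = torch.randn(4, 128)  # pooled caption features (assumed)
    logits = model(text_feat, caption_feat)
    print(logits.shape)  # torch.Size([4, 2])
```

The outer-product formulation grows quadratically with the feature dimensions, which is why feature-level fusion of this kind is typically applied to pooled, relatively low-dimensional representations rather than full token sequences.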
