Differently processed modality and appropriate model selection lead to richer representation of the multimodal input

Saroj Kumar Panda, Tausif Diwan, Omprakash G. Kakde

International Journal of Information Technology (2024). DOI: 10.1007/s41870-024-02113-4
Abstract
We aim to effectively solve and improve upon the Meta Meme Challenge, a binary classification task for detecting hateful memes on a multimodal dataset released by Meta. The problem is challenging in terms of how each modality is processed individually and how that processing affects the final classification of hateful memes. We focus on feature-level fusion methodologies, rather than decision-level fusion, in proposing solutions for hateful meme detection, since feature-level fusion generates richer feature representations for further processing. Appropriate model selection in multimodal data processing also plays an important role in downstream tasks. Moreover, the negativity inherent in the visual modality may not be detected completely by visual processing models alone, necessitating that the visual data also be processed differently through other techniques. Specifically, we propose two feature-level fusion-based methodologies for this classification problem, employing VisualBERT for the effective representation of the textual and visual modalities. Additionally, we employ image captioning to generate textual captions from the visual modality of the multimodal input, which are further fused with the actual text associated with the input through Tensor Fusion Networks. Our proposed model considerably outperforms the state of the art on the accuracy and AuROC performance metrics.
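To make the first methodology concrete, the sketch below illustrates feature-level fusion with the Hugging Face implementation of VisualBERT. The checkpoint name, the randomly generated region features, and the linear classification head are illustrative assumptions, not the authors' exact pipeline.

```python
import torch
from transformers import BertTokenizer, VisualBertModel

# Text branch: tokenize the text overlaid on the meme.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
inputs = tokenizer("sample meme text", return_tensors="pt")

# Visual branch: in practice these are region features from an object
# detector (e.g., Faster R-CNN); random tensors stand in here.
visual_embeds = torch.randn(1, 36, 2048)
visual_attention_mask = torch.ones(visual_embeds.shape[:-1], dtype=torch.long)
visual_token_type_ids = torch.ones(visual_embeds.shape[:-1], dtype=torch.long)
inputs.update({
    "visual_embeds": visual_embeds,
    "visual_attention_mask": visual_attention_mask,
    "visual_token_type_ids": visual_token_type_ids,
})

# VisualBERT jointly encodes both modalities; the pooled output is the
# fused, feature-level representation of the multimodal input.
model = VisualBertModel.from_pretrained("uclanlp/visualbert-vqa-coco-pre")
pooled = model(**inputs).pooler_output            # (1, 768)

# Assumed binary hateful / non-hateful classification head.
classifier = torch.nn.Linear(768, 2)
logits = classifier(pooled)
```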
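The second methodology fuses the generated image caption with the meme's own text. Below is a minimal sketch of the core Tensor Fusion Network operation, the outer product of the two embeddings after each is augmented with a constant 1 (Zadeh et al., 2017); the embedding dimensions and the MLP head are assumptions for illustration.

```python
import torch
import torch.nn as nn

class TensorFusion(nn.Module):
    """Outer-product fusion of a text embedding and a caption embedding,
    following the Tensor Fusion Network formulation."""

    def __init__(self, text_dim: int, caption_dim: int,
                 hidden_dim: int = 128, num_classes: int = 2):
        super().__init__()
        fused_dim = (text_dim + 1) * (caption_dim + 1)
        self.head = nn.Sequential(
            nn.Linear(fused_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_classes),
        )

    def forward(self, text_emb: torch.Tensor, caption_emb: torch.Tensor) -> torch.Tensor:
        ones = text_emb.new_ones(text_emb.size(0), 1)
        t = torch.cat([text_emb, ones], dim=1)              # (B, Dt + 1)
        c = torch.cat([caption_emb, ones], dim=1)           # (B, Dc + 1)
        fused = torch.bmm(t.unsqueeze(2), c.unsqueeze(1))   # (B, Dt + 1, Dc + 1)
        return self.head(fused.flatten(start_dim=1))

# Usage with placeholder embeddings for the meme text and its generated caption.
fusion = TensorFusion(text_dim=768, caption_dim=768)
logits = fusion(torch.randn(4, 768), torch.randn(4, 768))   # (4, 2)
```

The outer product captures multiplicative interactions between every pair of text and caption features, which is what makes the fused representation richer than simple concatenation.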