Successful YouTube video identification using multimodal deep learning
Lucas de Souza Rodrigues, K. Sakiyama, Leozitor Floro de Souza, E. Matsubara, B. Nogueira
Anais do X Symposium on Knowledge Discovery, Mining and Learning (KDMiLe 2022), published 2022-11-28. DOI: 10.5753/kdmile.2022.227792
Abstract
Text from titles and audio transcriptions, image thumbnails, and the numbers of likes, dislikes, and views are examples of the data available for a YouTube video. Despite this variability, most standard Deep Learning models use only one type of data, and the simultaneous use of multiple data sources for such problems is still rare. To shed light on these problems, we empirically evaluate eight different multimodal fusion operations on embeddings extracted from the image thumbnails and video titles of YouTube videos using standard Deep Learning models: a ResNet-based SE-Net for image feature extraction and BERT for text. Experimental results show that simple operations, such as summing or subtracting embeddings, can improve model accuracy. The multimodal fusion operations on this dataset achieved 81.3% accuracy, outperforming the unimodal models by 3.86% (text) and 5.79% (video).
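To illustrate the kind of fusion the abstract describes, the sketch below combines an image embedding and a text embedding with simple elementwise operations before classification. It is a minimal PyTorch illustration, not the authors' implementation: the shared dimension, the projection layers, the classifier head, and the choice of three of the fusion operations (sum, subtract, concatenate) are assumptions; the embedding sizes stand in for SE-Net (2048-d) and BERT (768-d) outputs.

```python
import torch
import torch.nn as nn

class FusionClassifier(nn.Module):
    """Hypothetical elementwise multimodal fusion: project image and text
    embeddings into a shared space, combine them, then classify."""

    def __init__(self, img_dim=2048, txt_dim=768, shared_dim=512,
                 num_classes=2, op="sum"):
        super().__init__()
        # Elementwise ops (sum/subtract) require a common dimensionality,
        # so both modalities are first projected to shared_dim.
        self.img_proj = nn.Linear(img_dim, shared_dim)
        self.txt_proj = nn.Linear(txt_dim, shared_dim)
        self.op = op
        # Concatenation doubles the fused dimension; elementwise ops keep it.
        fused_dim = shared_dim * 2 if op == "concat" else shared_dim
        self.head = nn.Linear(fused_dim, num_classes)

    def forward(self, img_emb, txt_emb):
        a = self.img_proj(img_emb)
        b = self.txt_proj(txt_emb)
        if self.op == "sum":
            fused = a + b
        elif self.op == "subtract":
            fused = a - b
        elif self.op == "concat":
            fused = torch.cat([a, b], dim=-1)
        else:
            raise ValueError(f"unknown fusion op: {self.op}")
        return self.head(fused)

# Toy usage with random stand-ins for SE-Net and BERT embeddings.
img_emb = torch.randn(4, 2048)   # batch of 4 thumbnail embeddings
txt_emb = torch.randn(4, 768)    # batch of 4 title embeddings
model = FusionClassifier(op="subtract")
logits = model(img_emb, txt_emb)  # shape: (4, 2)
```

The appeal of sum and subtract over concatenation is that they add no parameters beyond the projections and keep the fused vector the same size, which may explain why such simple operations can compete with, or beat, unimodal baselines.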