Successful Youtube video identification using multimodal deep learning

Lucas de Souza Rodrigues, K. Sakiyama, Leozitor Floro de Souza, E. Matsubara, B. Nogueira
{"title":"Successful Youtube video identification using multimodal deep learning","authors":"Lucas de Souza Rodrigues, K. Sakiyama, Leozitor Floro de Souza, E. Matsubara, B. Nogueira","doi":"10.5753/kdmile.2022.227792","DOIUrl":null,"url":null,"abstract":"Text from titles and audio transcriptions, image thumbnails, number of likes, dislikes, and views are examples of available data in a YouTube video. Despite the variability, most standard Deep Learning models use only one type of data. Moreover, the simultaneous use of multiple data sources for such problems is still rare. To shed light on these problems, we empirically evaluate eight different multimodal fusion operations using embeddings extracted from image thumbnails and video titles of YouTube videos using standard Deep Learning models, ResNet-based SE-Net for image feature extraction, and BERT to NLP. Experimental results show that simple operations such as sum or subtract embeddings can improve the accuracy of models. The multimodal fusion operations in this dataset achieved 81.3% accuracy, outperforming the unimodal models by 3.86% (text) and 5.79% (video).","PeriodicalId":417100,"journal":{"name":"Anais do X Symposium on Knowledge Discovery, Mining and Learning (KDMiLe 2022)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-11-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Anais do X Symposium on Knowledge Discovery, Mining and Learning (KDMiLe 2022)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.5753/kdmile.2022.227792","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

Abstract

Text from titles and audio transcriptions, image thumbnails, and the numbers of likes, dislikes, and views are examples of the data available for a YouTube video. Despite this variety, most standard Deep Learning models use only one type of data, and the simultaneous use of multiple data sources for such problems is still rare. To shed light on these problems, we empirically evaluate eight different multimodal fusion operations on embeddings extracted from the image thumbnails and titles of YouTube videos using standard Deep Learning models: a ResNet-based SE-Net for image feature extraction and BERT for text. Experimental results show that simple operations such as summing or subtracting embeddings can improve model accuracy. The multimodal fusion operations on this dataset achieved 81.3% accuracy, outperforming the unimodal models by 3.86% (text) and 5.79% (video).
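
The abstract itself contains no code, so the following is only a minimal PyTorch sketch of the general idea it describes: project a text embedding (e.g. a BERT [CLS] vector) and an image embedding (e.g. pooled SE-ResNet features) into a shared space, fuse them with a simple element-wise operation such as sum or subtraction, and classify the fused vector. The class name, dimensions, fusion options, and classifier head are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (not the paper's code) of element-wise multimodal fusion.
import torch
import torch.nn as nn

class LateFusionClassifier(nn.Module):
    def __init__(self, text_dim=768, image_dim=2048, fused_dim=512,
                 num_classes=2, op="sum"):
        super().__init__()
        # Project each modality into a common space so element-wise ops are valid.
        self.text_proj = nn.Linear(text_dim, fused_dim)
        self.image_proj = nn.Linear(image_dim, fused_dim)
        self.op = op
        self.classifier = nn.Linear(fused_dim, num_classes)

    def forward(self, text_emb, image_emb):
        t = self.text_proj(text_emb)    # e.g. BERT [CLS] embedding (assumed 768-d)
        v = self.image_proj(image_emb)  # e.g. pooled SE-ResNet features (assumed 2048-d)
        if self.op == "sum":
            fused = t + v
        elif self.op == "subtract":
            fused = t - v
        elif self.op == "multiply":
            fused = t * v
        else:
            raise ValueError(f"unknown fusion op: {self.op}")
        return self.classifier(fused)

# Usage with dummy embeddings for a batch of 4 videos:
model = LateFusionClassifier(op="sum")
logits = model(torch.randn(4, 768), torch.randn(4, 2048))
```

The appeal of such element-wise fusions, as the abstract notes, is that they add almost no parameters beyond the projection layers yet can outperform either unimodal model on its own.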