Successful YouTube video identification using multimodal deep learning
Lucas de Souza Rodrigues, K. Sakiyama, Leozitor Floro de Souza, E. Matsubara, B. Nogueira
Anais do X Symposium on Knowledge Discovery, Mining and Learning (KDMiLe 2022), published 2022-11-28. DOI: 10.5753/kdmile.2022.227792
Abstract
Text from titles and audio transcriptions, image thumbnails, and the numbers of likes, dislikes, and views are examples of the data available for a YouTube video. Despite this variability, most standard Deep Learning models use only one type of data, and the simultaneous use of multiple data sources for such problems is still rare. To shed light on these problems, we empirically evaluate eight different multimodal fusion operations on embeddings extracted from the image thumbnails and video titles of YouTube videos using standard Deep Learning models: a ResNet-based SE-Net for image feature extraction and BERT for text. Experimental results show that simple operations, such as summing or subtracting embeddings, can improve model accuracy. The multimodal fusion operations on this dataset achieved 81.3% accuracy, outperforming the unimodal models by 3.86% (text) and 5.79% (video).
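To illustrate the kind of fusion the abstract describes, the sketch below combines an image embedding and a text embedding with simple elementwise operations before classification. It is a minimal PyTorch illustration, not the authors' implementation: the shared dimension, the projection layers, the classifier head, and the choice of three of the fusion operations (sum, subtract, concatenate) are assumptions; the embedding sizes stand in for SE-Net (2048-d) and BERT (768-d) outputs.

```python
import torch
import torch.nn as nn

class FusionClassifier(nn.Module):
    """Hypothetical elementwise multimodal fusion: project image and text
    embeddings into a shared space, combine them, then classify."""

    def __init__(self, img_dim=2048, txt_dim=768, shared_dim=512,
                 num_classes=2, op="sum"):
        super().__init__()
        # Elementwise ops (sum/subtract) require a common dimensionality,
        # so both modalities are first projected to shared_dim.
        self.img_proj = nn.Linear(img_dim, shared_dim)
        self.txt_proj = nn.Linear(txt_dim, shared_dim)
        self.op = op
        # Concatenation doubles the fused dimension; elementwise ops keep it.
        fused_dim = shared_dim * 2 if op == "concat" else shared_dim
        self.head = nn.Linear(fused_dim, num_classes)

    def forward(self, img_emb, txt_emb):
        a = self.img_proj(img_emb)
        b = self.txt_proj(txt_emb)
        if self.op == "sum":
            fused = a + b
        elif self.op == "subtract":
            fused = a - b
        elif self.op == "concat":
            fused = torch.cat([a, b], dim=-1)
        else:
            raise ValueError(f"unknown fusion op: {self.op}")
        return self.head(fused)

# Toy usage with random stand-ins for SE-Net and BERT embeddings.
img_emb = torch.randn(4, 2048)   # batch of 4 thumbnail embeddings
txt_emb = torch.randn(4, 768)    # batch of 4 title embeddings
model = FusionClassifier(op="subtract")
logits = model(img_emb, txt_emb)  # shape: (4, 2)
```

The appeal of sum and subtract over concatenation is that they add no parameters beyond the projections and keep the fused vector the same size, which may explain why such simple operations can compete with, or beat, unimodal baselines.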