Comparison of early and late fusion techniques for movie trailer genre labelling

J. H. Mervitz, J. D. Villiers, J. P. Jacobs, M. H. O. Kloppers
Published in: 2020 IEEE 23rd International Conference on Information Fusion (FUSION), 2020-07-01
DOI: 10.23919/FUSION45008.2020.9190344 (https://doi.org/10.23919/FUSION45008.2020.9190344)
Citations: 1

Abstract

This paper explores automatic genre labelling of movie trailers using their audio-visual features, with a focus on fusion techniques (early fusion and late fusion) and the resulting improvement in classification accuracy. It proposes a novel combination of deep learned visual features (from a pretrained VGG-16 model), obtained using a state-of-the-art shot detector, and hand-crafted audio features. This combination of features, and an associated comparison of early and late fusion using them, has not previously been attempted in the literature. Two popular fusion techniques and three distinct classification algorithms are investigated to determine the best combination of fusion technique and classifier. The study uses a subset of the LMTD-9 movie trailer dataset restricted to four genres (action, comedy, drama and horror). The best-performing low-level audio features are timbre features extracted using the MIRtoolbox, followed by standalone mel-frequency cepstral coefficients. The best-performing high-level audio feature is tonality. Audio features are augmented by visual features extracted using a pretrained convolutional neural network (VGG-16). The early and late fusion methods are investigated together with three classification methods: extreme gradient boosting, a support vector machine and a neural network. Precision, recall, F1 score and confusion matrices are used to measure classification performance. Early fusion outperforms late fusion, with a classification performance gain of approximately 10% on the four-class problem. With early fusion, the support vector machine performs best (73.12% accuracy), followed by the extreme gradient boosting classifier (69.37% accuracy) and the neural network classifier (67.50% accuracy); chance is 25%. The results show that superior classification performance can be achieved by early fusion of low-level audio descriptors, high-level audio descriptors and high-level visual feature descriptors together with suitable classifiers.
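
The difference between the two fusion strategies compared in the abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: a nearest-centroid classifier stands in for the SVM, extreme gradient boosting and neural network classifiers, the feature vectors are synthetic numbers rather than MIRtoolbox or VGG-16 outputs, and late fusion is assumed to average per-modality class scores, which is one common rule (the abstract does not say which rule the paper uses). All function names are hypothetical.

```python
import math
import random

GENRES = ["action", "comedy", "drama", "horror"]


def fit_centroids(X, y):
    """Return {label: mean feature vector} for a nearest-centroid classifier."""
    sums, counts = {}, {}
    for x, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(x))
        for i, value in enumerate(x):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}


def class_scores(model, x):
    """Softmax over negative squared distances to each class centroid."""
    neg = {c: -sum((a - b) ** 2 for a, b in zip(x, mu)) for c, mu in model.items()}
    top = max(neg.values())
    exp = {c: math.exp(v - top) for c, v in neg.items()}
    total = sum(exp.values())
    return {c: e / total for c, e in exp.items()}


def fit_early(audio, visual, y):
    """Early fusion: concatenate the modalities, then train ONE classifier."""
    return fit_centroids([a + v for a, v in zip(audio, visual)], y)


def predict_early(model, a, v):
    scores = class_scores(model, a + v)
    return max(scores, key=scores.get)


def fit_late(audio, visual, y):
    """Late fusion: train one classifier PER modality."""
    return fit_centroids(audio, y), fit_centroids(visual, y)


def predict_late(models, a, v):
    """Assumed fusion rule: average the two per-modality score vectors."""
    audio_model, visual_model = models
    pa, pv = class_scores(audio_model, a), class_scores(visual_model, v)
    fused = {c: 0.5 * (pa[c] + pv[c]) for c in pa}
    return max(fused, key=fused.get)


def toy_split(n_per_class, rng):
    """Synthetic stand-in features: 3 'audio' and 4 'visual' dims per trailer."""
    audio, visual, labels = [], [], []
    for k, genre in enumerate(GENRES):
        for _ in range(n_per_class):
            audio.append([rng.gauss(k, 1.5) for _ in range(3)])
            visual.append([rng.gauss(-k, 1.5) for _ in range(4)])
            labels.append(genre)
    return audio, visual, labels


if __name__ == "__main__":
    rng = random.Random(0)
    a_tr, v_tr, y_tr = toy_split(50, rng)
    a_te, v_te, y_te = toy_split(20, rng)
    early, late = fit_early(a_tr, v_tr, y_tr), fit_late(a_tr, v_tr, y_tr)
    n = len(y_te)
    acc_early = sum(predict_early(early, a, v) == y
                    for a, v, y in zip(a_te, v_te, y_te)) / n
    acc_late = sum(predict_late(late, a, v) == y
                   for a, v, y in zip(a_te, v_te, y_te)) / n
    print(f"early fusion accuracy: {acc_early:.2f}")
    print(f"late fusion accuracy:  {acc_late:.2f}")
```

The accuracies printed on this toy data say nothing about the paper's reported results. The sketch only shows where the fusion happens: before training in the early case, and after per-modality scoring in the late case.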