{"title":"基于卷积神经网络和音频特征结合的多模态视频概念分类","authors":"Berkay Selbes, M. Sert","doi":"10.1109/SIU.2017.7960723","DOIUrl":null,"url":null,"abstract":"Video concept classification is a very important task for several applications such as content based video indexing and searching. In this study, we propose a multi-modal video classification method based on the feature-level fusion of audio-visual signals. In the proposed method, we extract Mel Frequency Cepstral Coefficient (MFCC) and convolutional neural network (CNN) features from the audio and visual parts of the video signal, respectively and calculate three statistical representations of the MFCC feature vectors. We perform feature level fusion of both modalities using the concatenation operator and train Support Vector Machine (SVM) classifiers using these multimodal features. We evaluate the effectiveness of our proposed method on the TRECVID video performance dataset for both single- and multi-modal cases. Our results show that, fusing standard deviation representation of the audio modality along with the GoogleNet CNN features improves the classification accuracy.","PeriodicalId":217576,"journal":{"name":"2017 25th Signal Processing and Communications Applications Conference (SIU)","volume":"92 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2017-05-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":"{\"title\":\"Multimodal video concept classification based on convolutional neural network and audio feature combination\",\"authors\":\"Berkay Selbes, M. Sert\",\"doi\":\"10.1109/SIU.2017.7960723\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Video concept classification is a very important task for several applications such as content based video indexing and searching. In this study, we propose a multi-modal video classification method based on the feature-level fusion of audio-visual signals. In the proposed method, we extract Mel Frequency Cepstral Coefficient (MFCC) and convolutional neural network (CNN) features from the audio and visual parts of the video signal, respectively and calculate three statistical representations of the MFCC feature vectors. We perform feature level fusion of both modalities using the concatenation operator and train Support Vector Machine (SVM) classifiers using these multimodal features. We evaluate the effectiveness of our proposed method on the TRECVID video performance dataset for both single- and multi-modal cases. 
Our results show that, fusing standard deviation representation of the audio modality along with the GoogleNet CNN features improves the classification accuracy.\",\"PeriodicalId\":217576,\"journal\":{\"name\":\"2017 25th Signal Processing and Communications Applications Conference (SIU)\",\"volume\":\"92 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2017-05-15\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"1\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2017 25th Signal Processing and Communications Applications Conference (SIU)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/SIU.2017.7960723\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2017 25th Signal Processing and Communications Applications Conference (SIU)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/SIU.2017.7960723","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Multimodal video concept classification based on convolutional neural network and audio feature combination
Video concept classification is an important task for applications such as content-based video indexing and search. In this study, we propose a multimodal video classification method based on feature-level fusion of audio-visual signals. In the proposed method, we extract Mel-Frequency Cepstral Coefficient (MFCC) features from the audio part and convolutional neural network (CNN) features from the visual part of the video signal, and compute three statistical representations of the MFCC feature vectors. We perform feature-level fusion of the two modalities using the concatenation operator and train Support Vector Machine (SVM) classifiers on these multimodal features. We evaluate the effectiveness of the proposed method on the TRECVID video dataset for both single- and multi-modal cases. Our results show that fusing the standard deviation representation of the audio modality with the GoogleNet CNN features improves classification accuracy.
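To make the described pipeline concrete, below is a minimal sketch of the feature-level fusion step under stated assumptions: librosa for MFCC extraction, 13 MFCC coefficients, and mean/std/max as the three statistical representations. The abstract confirms only that three statistics are computed and that the standard deviation representation fuses best with the GoogleNet features; the other choices here are illustrative, not the authors' exact setup.

```python
# Sketch of feature-level audio-visual fusion, assuming librosa + scikit-learn.
import numpy as np
import librosa
from sklearn.svm import SVC

def audio_representations(wav_path, n_mfcc=13):
    """Summarize frame-level MFCCs of one video's audio track as fixed-length vectors."""
    y, sr = librosa.load(wav_path, sr=None)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)  # shape: (n_mfcc, n_frames)
    # Three candidate statistics over the time axis; only the standard deviation
    # representation is confirmed by the abstract as the best-performing one.
    return {
        "mean": mfcc.mean(axis=1),
        "std": mfcc.std(axis=1),
        "max": mfcc.max(axis=1),
    }

def fuse(audio_vec, cnn_vec):
    """Feature-level fusion via the concatenation operator, as in the proposed method."""
    return np.concatenate([audio_vec, cnn_vec])

# Hypothetical usage: cnn_feats would come from a GoogleNet forward pass on video
# keyframes (e.g., a 1024-dim pooled activation); a placeholder stands in here.
cnn_feats = np.zeros(1024)
audio_feats = audio_representations("video_audio.wav")["std"]
x = fuse(audio_feats, cnn_feats)

# Given a matrix X of fused vectors (n_samples, n_features) and concept labels y,
# the per-concept classifiers would be trained as, e.g.:
# clf = SVC(kernel="linear").fit(X, y)
```

The key design point the abstract reports is that concatenating a compact statistical summary of the audio stream with the CNN visual features, rather than using either modality alone, is what yields the accuracy gain.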