Multimodal Sentiment Analysis using Audio and Text for Crime Detection
Mohammed Boukabous, M. Azizi
2022 2nd International Conference on Innovative Research in Applied Science, Engineering and Technology (IRASET), 2022-03-03
DOI: 10.1109/IRASET52964.2022.9738175
Citations: 10
Abstract
Thanks to advances in communication technologies and the widespread use of social media networks, individuals generate a significant amount of data every day that contains valuable emotional information. Over the last few decades, most research in sentiment analysis has concentrated on textual sentiment analysis, accomplished through text mining techniques. Audio sentiment analysis, on the other hand, is still in its infancy and has only recently started to attract the scientific community. In this paper, we use the XD-Violence dataset to develop a multimodal learning model that predicts crimes by incorporating both audio and text modalities into the same model. As an initial step, we benchmark the dataset on audio using CNN (Convolutional Neural Network) and RNN (Recurrent Neural Network) before moving on to text using BERT (Bidirectional Encoder Representations from Transformers). Finally, we combine CNN and BERT to obtain the best results, with an accuracy of 85.63%, a loss of 30.47%, and an F1-score of 85.16%.
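The abstract does not detail how the CNN audio branch and the BERT text branch are combined. One common scheme for this kind of multimodal model is late fusion: each encoder produces a fixed-size embedding, the embeddings are concatenated, and a classifier head makes the final prediction. The sketch below illustrates only that fusion step with NumPy; the random vectors, dimensions (128 for audio, 768 for text), and the linear classifier are illustrative stand-ins, not the paper's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for the two modality encoders. In the paper, audio features
# come from a CNN and text features from BERT; here we simulate both
# with fixed-size random vectors of plausible dimensions.
def audio_cnn_features(n_clips: int, dim: int = 128) -> np.ndarray:
    return rng.standard_normal((n_clips, dim))

def bert_text_features(n_clips: int, dim: int = 768) -> np.ndarray:
    return rng.standard_normal((n_clips, dim))

def fuse_and_classify(audio: np.ndarray, text: np.ndarray,
                      w: np.ndarray, b: float) -> np.ndarray:
    """Late fusion: concatenate the per-clip modality embeddings,
    then apply a linear binary classifier (crime / non-crime)
    followed by a sigmoid to get probabilities."""
    fused = np.concatenate([audio, text], axis=1)  # shape (n, 128 + 768)
    logits = fused @ w + b
    return 1.0 / (1.0 + np.exp(-logits))           # values in (0, 1)

n = 4
audio = audio_cnn_features(n)
text = bert_text_features(n)
w = rng.standard_normal(audio.shape[1] + text.shape[1]) * 0.01
probs = fuse_and_classify(audio, text, w, b=0.0)
print(probs.shape)
```

In practice the fused vector would feed a trained classification head, and the whole pipeline would be optimized end to end; the sketch only shows the data flow of concatenation-based fusion.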