Understanding and recognizing expressed emotions are two key factors in multimodal sentiment analysis. Human language is usually multimodal, comprising three modalities: visual, speech, and text, and each modality carries distinct information. For example, the text modality includes basic linguistic symbols, syntax, and speech acts; the speech modality includes tone, intonation, and vocal expression; and the visual modality includes information such as posture, body language, eye contact, and facial expressions. How to efficiently integrate inter-modal information has therefore become a hot topic in multimodal sentiment analysis. To this end, this paper proposes a cross-modal fusion network model. The model uses LSTM networks as the representation sub-networks for the language and visual modalities, and employs cross-modal fusion based on an improved Transformer to effectively fuse the information from the two modalities. To verify the effectiveness of the proposed model, it was evaluated on the IEMOCAP and MOSEI datasets, and the results show that the model improves sentiment classification accuracy.
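The abstract only outlines the architecture, so the following is a minimal PyTorch sketch of the general pattern it describes: per-modality LSTM encoders whose outputs are fused by Transformer-style cross-attention. All names, dimensions, the pooling step, and the classification head are illustrative assumptions; the paper's specific "improved Transformer" modifications are not reproduced here.

```python
# Minimal sketch (assumed layout, not the paper's exact model) of cross-modal
# fusion: LSTM representation sub-networks per modality, fused with
# Transformer-style cross-attention, then a simple classifier head.
import torch
import torch.nn as nn


class CrossModalFusionNet(nn.Module):
    def __init__(self, text_dim=300, visual_dim=35, hidden_dim=128,
                 num_heads=4, num_classes=4):
        # All sizes here are illustrative assumptions, not values from the paper.
        super().__init__()
        # LSTM representation sub-networks, one per modality
        self.text_lstm = nn.LSTM(text_dim, hidden_dim, batch_first=True)
        self.visual_lstm = nn.LSTM(visual_dim, hidden_dim, batch_first=True)
        # Cross-attention: text queries attend over visual keys/values,
        # and vice versa (a common Transformer-style fusion pattern)
        self.text_to_visual = nn.MultiheadAttention(hidden_dim, num_heads,
                                                    batch_first=True)
        self.visual_to_text = nn.MultiheadAttention(hidden_dim, num_heads,
                                                    batch_first=True)
        self.classifier = nn.Linear(2 * hidden_dim, num_classes)

    def forward(self, text_seq, visual_seq):
        # Encode each modality with its own LSTM
        t, _ = self.text_lstm(text_seq)      # (B, T_text, H)
        v, _ = self.visual_lstm(visual_seq)  # (B, T_vis, H)
        # Fuse: each modality attends to the other
        t_fused, _ = self.text_to_visual(query=t, key=v, value=v)
        v_fused, _ = self.visual_to_text(query=v, key=t, value=t)
        # Mean-pool over time, concatenate both views, classify
        pooled = torch.cat([t_fused.mean(dim=1), v_fused.mean(dim=1)], dim=-1)
        return self.classifier(pooled)


if __name__ == "__main__":
    model = CrossModalFusionNet()
    text = torch.randn(8, 50, 300)   # batch of 8 text feature sequences
    visual = torch.randn(8, 60, 35)  # batch of 8 visual feature sequences
    print(model(text, visual).shape)  # torch.Size([8, 4])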