Understanding the sentiments and emotions being expressed are two key factors in multimodal sentiment analysis. Human language is inherently multimodal, comprising three modalities: visual, acoustic, and textual, each of which carries distinct information. For example, the textual modality includes linguistic symbols, syntax, and speech acts; the acoustic modality includes tone, intonation, and vocal expression; and the visual modality includes posture, body language, eye contact, and facial expressions. How to efficiently integrate information across modalities has therefore become a central topic in the field of multimodal sentiment analysis. To this end, this article proposes a cross-modal fusion network model. The model uses LSTM networks as the representation subnetworks for the language and visual modalities, and employs a cross-modal fusion module based on an improved Transformer to fuse the information from the two modalities effectively. To verify the effectiveness of the proposed model, it was evaluated on the IEMOCAP and MOSEI datasets; the results show that the model improves the accuracy of sentiment classification.
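To make the overall architecture concrete, the following PyTorch sketch pairs per-modality LSTM encoders with a Transformer-style cross-modal attention block. It is a minimal illustration only: the class name `CrossModalFusionNet`, all layer sizes, and the use of a plain `nn.MultiheadAttention` block are assumptions, and the paper's "improved Transformer" fusion may differ in its details.

```python
# Minimal sketch of the described architecture, assuming PyTorch.
# Layer sizes, names, and the plain nn.MultiheadAttention fusion are
# illustrative assumptions, not the paper's exact design.
import torch
import torch.nn as nn

class CrossModalFusionNet(nn.Module):
    def __init__(self, text_dim=300, visual_dim=35, hidden=128,
                 n_heads=4, n_classes=7):
        super().__init__()
        # LSTM representation subnetworks, one per modality
        self.text_lstm = nn.LSTM(text_dim, hidden, batch_first=True)
        self.visual_lstm = nn.LSTM(visual_dim, hidden, batch_first=True)
        # Cross-modal attention: text queries attend to visual keys/values
        self.cross_attn = nn.MultiheadAttention(hidden, n_heads,
                                                batch_first=True)
        self.norm1 = nn.LayerNorm(hidden)
        self.norm2 = nn.LayerNorm(hidden)
        self.ffn = nn.Sequential(nn.Linear(hidden, hidden * 4),
                                 nn.ReLU(),
                                 nn.Linear(hidden * 4, hidden))
        self.classifier = nn.Linear(hidden, n_classes)

    def forward(self, text_seq, visual_seq):
        t, _ = self.text_lstm(text_seq)      # (B, Lt, H)
        v, _ = self.visual_lstm(visual_seq)  # (B, Lv, H)
        # Fuse: each text step attends over the visual sequence
        fused, _ = self.cross_attn(query=t, key=v, value=v)
        fused = self.norm1(t + fused)        # residual + norm, Transformer-style
        fused = self.norm2(fused + self.ffn(fused))
        # Pool over time and classify the emotion/sentiment label
        return self.classifier(fused.mean(dim=1))

# Example usage on random inputs
model = CrossModalFusionNet()
text = torch.randn(8, 20, 300)    # batch of 8, 20 word embeddings each
visual = torch.randn(8, 50, 35)   # batch of 8, 50 visual feature frames each
logits = model(text, visual)
print(logits.shape)               # torch.Size([8, 7])
```

The key design point the sketch tries to capture is that fusion happens at the sequence level: rather than concatenating pooled unimodal vectors, each text time step queries the entire visual sequence through attention, so alignment between the two streams is learned rather than assumed.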