{"title":"A vision and language hierarchical alignment for multimodal aspect-based sentiment analysis","authors":"Wang Zou, Xia Sun, Qiang Lu, Xuxin Wang, Jun Feng","doi":"10.1016/j.patcog.2025.111369","DOIUrl":null,"url":null,"abstract":"<div><div>In recent years, Multimodal Aspect-Based Sentiment Analysis (MABSA) has garnered attention from researchers. The MABSA technology can effectively perform Aspect Term Extraction (MATE) and Aspect Sentiment Classification (MASC) for Multimodal data. However, current MABSA work focuses on visual semantic information while neglecting the scene structure of images. Additionally, researchers using static alignment matrices cannot effectively capture complex vision features, such as spatial and action features among objects. In this paper, we propose a Vision and Language Hierarchical Alignment method (VLHA) for the MABSA task. The VLHA framework includes three modules: the multimodal structural alignment module, the multimodal semantic alignment module, and the cross-modal MABSA module. Firstly, we process the vision modality into a visual scene graph and image patches, and the text modality into a text dependency graph and word sequences. Secondly, we use the structural alignment module to achieve dynamic alignment learning between the visual scene graph and text dependency graph, and the semantic alignment module to achieve dynamic alignment learning between image patches and word sequences. Finally, we concatenate and fuse structural and semantic features in the cross-modal MABSA module. Additionally, VLHA designs a three-dimensional dynamic alignment matrix to guide the cross-attention for modal interaction learning. We conducted a series of experiments on two Twitter datasets, and the results show that the performance of the VLHA framework outperforms the baseline models. The structure of the visual modality facilitates the model in comprehensively understanding complex visual information.</div></div>","PeriodicalId":49713,"journal":{"name":"Pattern Recognition","volume":"162 ","pages":"Article 111369"},"PeriodicalIF":7.5000,"publicationDate":"2025-01-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Pattern Recognition","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0031320325000299","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Citations: 0
Abstract
In recent years, Multimodal Aspect-Based Sentiment Analysis (MABSA) has garnered attention from researchers. MABSA jointly performs Multimodal Aspect Term Extraction (MATE) and Multimodal Aspect Sentiment Classification (MASC) on multimodal data. However, current MABSA work focuses on visual semantic information while neglecting the scene structure of images. Moreover, static alignment matrices cannot effectively capture complex visual features, such as spatial and action relations among objects. In this paper, we propose a Vision and Language Hierarchical Alignment (VLHA) method for the MABSA task. The VLHA framework comprises three modules: a multimodal structural alignment module, a multimodal semantic alignment module, and a cross-modal MABSA module. First, we process the vision modality into a visual scene graph and image patches, and the text modality into a text dependency graph and word sequences. Second, the structural alignment module performs dynamic alignment learning between the visual scene graph and the text dependency graph, while the semantic alignment module performs dynamic alignment learning between image patches and word sequences. Finally, we concatenate and fuse the structural and semantic features in the cross-modal MABSA module. In addition, VLHA employs a three-dimensional dynamic alignment matrix to guide cross-attention during modal interaction learning. We conducted a series of experiments on two Twitter datasets, and the results show that VLHA outperforms the baseline models. Structuring the visual modality helps the model comprehensively understand complex visual information.
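Since the abstract only sketches the mechanism, the snippet below is a minimal PyTorch illustration of how a dynamic alignment matrix might guide cross-attention between word and patch features. The class name, all dimensions, and the additive way the alignment scores enter the attention logits are assumptions for illustration, not the paper's actual implementation.

```python
# Hypothetical sketch of alignment-guided cross-attention in the spirit of
# VLHA. The exact formulation is not given in the abstract; treating the
# dynamic alignment matrix as an additive attention bias is an assumption.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AlignmentGuidedCrossAttention(nn.Module):
    """Cross-attention from text tokens to image patches, biased by a
    per-instance (dynamic) alignment matrix rather than a static one."""

    def __init__(self, dim: int):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim)  # queries from the text side
        self.k_proj = nn.Linear(dim, dim)  # keys from the vision side
        self.v_proj = nn.Linear(dim, dim)  # values from the vision side
        self.scale = dim ** -0.5

    def forward(self, text: torch.Tensor, vision: torch.Tensor,
                align: torch.Tensor) -> torch.Tensor:
        # text:   (batch, n_words, dim)      word-sequence features
        # vision: (batch, n_patches, dim)    image-patch features
        # align:  (batch, n_words, n_patches) dynamic alignment scores,
        #         e.g. produced per instance by another subnetwork; the
        #         batch x words x patches tensor is one plausible reading
        #         of a "three-dimensional dynamic alignment matrix"
        q = self.q_proj(text)
        k = self.k_proj(vision)
        v = self.v_proj(vision)
        scores = q @ k.transpose(-2, -1) * self.scale  # raw attention logits
        scores = scores + align                        # alignment as a bias
        attn = F.softmax(scores, dim=-1)
        return attn @ v  # vision-enhanced text representations

# Usage on toy shapes:
layer = AlignmentGuidedCrossAttention(dim=64)
text = torch.randn(2, 10, 64)    # 2 sentences, 10 words each
vision = torch.randn(2, 49, 64)  # 7x7 grid of image patches
align = torch.randn(2, 10, 49)   # dynamic word-patch alignment scores
out = layer(text, vision, align)
print(out.shape)  # torch.Size([2, 10, 64])
```

Because the alignment tensor is computed per input rather than fixed in advance, the attention pattern can adapt to instance-specific visual structure, which is the property the abstract contrasts with static alignment matrices.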
About the journal
Pattern recognition is a mature yet rapidly evolving field that plays a crucial role in related areas such as computer vision, image processing, text analysis, and neural networks. It closely intersects with machine learning and is being applied in emerging areas like biometrics, bioinformatics, multimedia data analysis, and data science. The journal Pattern Recognition, established half a century ago in the early days of computer science, has since grown significantly in scope and influence.