{"title":"多模态命名实体识别的自适应多尺度语言强化","authors":"Enping Li;Tianrui Li;Huaishao Luo;Jielei Chu;Lixin Duan;Fengmao Lv","doi":"10.1109/TMM.2025.3543105","DOIUrl":null,"url":null,"abstract":"Over the recent years, multimodal named entity recognition has gained increasing attentions due to its wide applications in social media. The key factor of multimodal named entity recognition is to effectively fuse information of different modalities. Existing works mainly focus on reinforcing textual representations by fusing image features via the cross-modal attention mechanism. However, these works are limited in reinforcing the text modality at the token level. As a named entity usually contains several tokens, modeling token-level inter-modal interactions is suboptimal for the multimodal named entity recognition problem. In this work, we propose a multimodal named entity recognition approach dubbed Adaptive Multi-scale Language Reinforcement (AMLR) to implement entity-level language reinforcement. To this end, our model first expands token-level textual representations into multi-scale textual representations which are composed of language units of different lengths. After that, the visual information reinforces the language modality by modeling the cross-modal attention between images and expanded multi-scale textual representations. Unlike existing token-level language reinforcement methods, the word sequences of named entities can be directly interacted with the visual features as a whole, making the modeled cross-modal correlations more reasonable. Although the underlying entity is not given, the training procedure can encourage the relevant image contents to adaptively attend to the appropriate language units, making our approach not rely on the pipeline design. Comprehensive evaluation results on two public Twitter datasets clearly demonstrate the superiority of our proposed model.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"5312-5323"},"PeriodicalIF":9.7000,"publicationDate":"2025-02-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Adaptive Multi-Scale Language Reinforcement for Multimodal Named Entity Recognition\",\"authors\":\"Enping Li;Tianrui Li;Huaishao Luo;Jielei Chu;Lixin Duan;Fengmao Lv\",\"doi\":\"10.1109/TMM.2025.3543105\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Over the recent years, multimodal named entity recognition has gained increasing attentions due to its wide applications in social media. The key factor of multimodal named entity recognition is to effectively fuse information of different modalities. Existing works mainly focus on reinforcing textual representations by fusing image features via the cross-modal attention mechanism. However, these works are limited in reinforcing the text modality at the token level. As a named entity usually contains several tokens, modeling token-level inter-modal interactions is suboptimal for the multimodal named entity recognition problem. In this work, we propose a multimodal named entity recognition approach dubbed Adaptive Multi-scale Language Reinforcement (AMLR) to implement entity-level language reinforcement. To this end, our model first expands token-level textual representations into multi-scale textual representations which are composed of language units of different lengths. 
After that, the visual information reinforces the language modality by modeling the cross-modal attention between images and expanded multi-scale textual representations. Unlike existing token-level language reinforcement methods, the word sequences of named entities can be directly interacted with the visual features as a whole, making the modeled cross-modal correlations more reasonable. Although the underlying entity is not given, the training procedure can encourage the relevant image contents to adaptively attend to the appropriate language units, making our approach not rely on the pipeline design. Comprehensive evaluation results on two public Twitter datasets clearly demonstrate the superiority of our proposed model.\",\"PeriodicalId\":13273,\"journal\":{\"name\":\"IEEE Transactions on Multimedia\",\"volume\":\"27 \",\"pages\":\"5312-5323\"},\"PeriodicalIF\":9.7000,\"publicationDate\":\"2025-02-18\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IEEE Transactions on Multimedia\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://ieeexplore.ieee.org/document/10891515/\",\"RegionNum\":1,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, INFORMATION SYSTEMS\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Multimedia","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/10891515/","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}
Adaptive Multi-Scale Language Reinforcement for Multimodal Named Entity Recognition
Abstract: In recent years, multimodal named entity recognition has attracted increasing attention due to its wide applications in social media. The key to multimodal named entity recognition is effectively fusing information from different modalities. Existing works mainly focus on reinforcing textual representations by fusing image features via the cross-modal attention mechanism. However, these works are limited to reinforcing the text modality at the token level. Since a named entity usually spans several tokens, modeling inter-modal interactions at the token level is suboptimal for the multimodal named entity recognition problem. In this work, we propose a multimodal named entity recognition approach dubbed Adaptive Multi-scale Language Reinforcement (AMLR) that performs entity-level language reinforcement. To this end, our model first expands token-level textual representations into multi-scale textual representations composed of language units of different lengths. The visual information then reinforces the language modality through cross-modal attention between the images and the expanded multi-scale textual representations. Unlike existing token-level language reinforcement methods, the word sequence of a named entity can interact with the visual features directly as a whole, making the modeled cross-modal correlations more reasonable. Although the underlying entities are not given, the training procedure encourages the relevant image contents to adaptively attend to the appropriate language units, so our approach does not rely on a pipeline design. Comprehensive evaluation results on two public Twitter datasets clearly demonstrate the superiority of the proposed model.
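The abstract only sketches the architecture, but the two steps it describes (expanding token features into multi-scale language units, then letting those units attend over image regions) can be illustrated concretely. Below is a minimal PyTorch sketch of that idea; the function and module names, the hidden size, the average-pooling composition of spans, and the residual connection are all illustrative assumptions, not the paper's exact design.

```python
# Illustrative sketch of the AMLR idea: multi-scale expansion of token
# features followed by cross-modal attention where visual information
# reinforces the language units. Assumed design details, not the paper's.
import torch
import torch.nn as nn
import torch.nn.functional as F


def expand_multiscale(tokens: torch.Tensor, max_scale: int = 3) -> torch.Tensor:
    """Expand token-level features (B, T, D) into multi-scale units.

    Each scale s pools every contiguous span of s tokens with average
    pooling (an assumption; the paper may compose spans differently),
    so an entity-length word sequence becomes a single unit.
    Returns (B, U, D) with U = sum over s of (T - s + 1).
    """
    units = []
    for s in range(1, max_scale + 1):
        # avg_pool1d expects (B, D, T); pool each window of s tokens.
        pooled = F.avg_pool1d(tokens.transpose(1, 2), kernel_size=s, stride=1)
        units.append(pooled.transpose(1, 2))
    return torch.cat(units, dim=1)


class CrossModalReinforcement(nn.Module):
    """Multi-scale language units (queries) attend over image region
    features (keys/values), so the visual modality reinforces the text."""

    def __init__(self, d_model: int = 768, n_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, units: torch.Tensor, regions: torch.Tensor) -> torch.Tensor:
        # units: (B, U, D) multi-scale text units; regions: (B, R, D) image regions.
        reinforced, _ = self.attn(query=units, key=regions, value=regions)
        return self.norm(units + reinforced)  # residual connection (assumed)


# Usage on dummy shapes: 12 tokens, 49 image regions (e.g. a 7x7 CNN grid).
tokens = torch.randn(2, 12, 768)
regions = torch.randn(2, 49, 768)
units = expand_multiscale(tokens, max_scale=3)    # (2, 12+11+10 = 33, 768)
out = CrossModalReinforcement()(units, regions)   # (2, 33, 768)
print(units.shape, out.shape)
```

Because every span of up to max_scale tokens becomes its own query, the attention weights can concentrate on whichever unit matches an image region best, which is how training can adaptively favor entity-length units without entity boundaries being given in advance.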
Journal Introduction:
The IEEE Transactions on Multimedia delves into diverse aspects of multimedia technology and applications, covering circuits, networking, signal processing, systems, software, and systems integration. The scope aligns with the Fields of Interest of the sponsors, ensuring a comprehensive exploration of research in multimedia.