Xiaoqin Lin, Chentao Han, Jian Yao, Yue Li, Xujun Wang, Shufeng Jia
{"title":"面向信息检索的知识对齐多模态转换器","authors":"Xiaoqin Lin , Chentao Han , Jian Yao , Yue Li , Xujun Wang , Shufeng Jia","doi":"10.1016/j.aej.2025.06.055","DOIUrl":null,"url":null,"abstract":"<div><div>With the rapid advancement of artificial intelligence and the Internet of Things, data collected from multiple sensing modalities is growing rapidly in both volume and complexity. In this paper, we propose a novel deep learning framework called MKNNet, which combines modality alignment, Transformer-based fusion, and multi-loss optimization to construct a unified semantic embedding space for multimodal information retrieval. Our model leverages modality-specific encoders and attention-based fusion to achieve deep semantic consistency across modalities. Experimental results on MS-COCO and Flickr30K datasets demonstrate that MKNNet significantly outperforms state-of-the-art models such as CLIP and BLIP in terms of Recall and mAP. The proposed method enhances semantic alignment and retrieval accuracy, showing great potential for applications in smart cities, healthcare, and other multimodal Internet of Things scenarios.</div></div>","PeriodicalId":7484,"journal":{"name":"alexandria engineering journal","volume":"127 ","pages":"Pages 1029-1039"},"PeriodicalIF":6.8000,"publicationDate":"2025-07-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"MKNNet: Knowledge-aligned multimodal transformer for information retrieval\",\"authors\":\"Xiaoqin Lin , Chentao Han , Jian Yao , Yue Li , Xujun Wang , Shufeng Jia\",\"doi\":\"10.1016/j.aej.2025.06.055\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><div>With the rapid advancement of artificial intelligence and the Internet of Things, data collected from multiple sensing modalities is growing rapidly in both volume and complexity. In this paper, we propose a novel deep learning framework called MKNNet, which combines modality alignment, Transformer-based fusion, and multi-loss optimization to construct a unified semantic embedding space for multimodal information retrieval. Our model leverages modality-specific encoders and attention-based fusion to achieve deep semantic consistency across modalities. Experimental results on MS-COCO and Flickr30K datasets demonstrate that MKNNet significantly outperforms state-of-the-art models such as CLIP and BLIP in terms of Recall and mAP. 
The proposed method enhances semantic alignment and retrieval accuracy, showing great potential for applications in smart cities, healthcare, and other multimodal Internet of Things scenarios.</div></div>\",\"PeriodicalId\":7484,\"journal\":{\"name\":\"alexandria engineering journal\",\"volume\":\"127 \",\"pages\":\"Pages 1029-1039\"},\"PeriodicalIF\":6.8000,\"publicationDate\":\"2025-07-12\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"alexandria engineering journal\",\"FirstCategoryId\":\"5\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S1110016825008051\",\"RegionNum\":2,\"RegionCategory\":\"工程技术\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"ENGINEERING, MULTIDISCIPLINARY\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"alexandria engineering journal","FirstCategoryId":"5","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S1110016825008051","RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"ENGINEERING, MULTIDISCIPLINARY","Score":null,"Total":0}
MKNNet: Knowledge-aligned multimodal transformer for information retrieval
With the rapid advancement of artificial intelligence and the Internet of Things, data collected from multiple sensing modalities is growing in both volume and complexity. In this paper, we propose a novel deep learning framework called MKNNet, which combines modality alignment, Transformer-based fusion, and multi-loss optimization to construct a unified semantic embedding space for multimodal information retrieval. Our model leverages modality-specific encoders and attention-based fusion to achieve deep semantic consistency across modalities. Experimental results on the MS-COCO and Flickr30K datasets demonstrate that MKNNet significantly outperforms state-of-the-art models such as CLIP and BLIP in terms of recall and mAP. The proposed method improves semantic alignment and retrieval accuracy, showing strong potential for applications in smart cities, healthcare, and other multimodal Internet of Things scenarios.
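The abstract names three components: modality-specific encoders, Transformer-based fusion, and a multi-loss objective over a shared embedding space. The sketch below illustrates how such a pipeline could fit together in PyTorch. It is a minimal sketch under stated assumptions, not the paper's actual implementation: the class name MKNNetSketch, all dimensions, the pooling scheme, and the choice of a symmetric InfoNCE alignment loss as one component of the multi-loss objective are illustrative assumptions, since the abstract does not specify them.

```python
# Minimal sketch of the pipeline the abstract describes: modality-specific
# encoders, Transformer-based fusion, and a contrastive alignment loss over
# a shared embedding space. Names, dimensions, and loss choices are
# assumptions for illustration only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MKNNetSketch(nn.Module):
    def __init__(self, img_feat_dim=2048, txt_vocab=30522, d_model=512):
        super().__init__()
        # Modality-specific encoders projecting into a shared d_model space.
        self.img_proj = nn.Linear(img_feat_dim, d_model)      # e.g. CNN region features
        self.txt_embed = nn.Embedding(txt_vocab, d_model)     # token embeddings
        self.txt_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True),
            num_layers=2)
        # Transformer-based fusion: self-attention over both token sequences
        # lets image regions and text tokens attend to each other.
        self.fusion = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True),
            num_layers=2)
        self.logit_scale = nn.Parameter(torch.tensor(2.659))  # learnable temperature

    def encode(self, img_feats, txt_ids):
        img = self.img_proj(img_feats)                    # (B, R, d)
        txt = self.txt_encoder(self.txt_embed(txt_ids))   # (B, T, d)
        fused = self.fusion(torch.cat([img, txt], dim=1)) # (B, R+T, d)
        # Pool each modality's fused tokens into one retrieval embedding.
        img_emb = F.normalize(fused[:, :img.size(1)].mean(1), dim=-1)
        txt_emb = F.normalize(fused[:, img.size(1):].mean(1), dim=-1)
        return img_emb, txt_emb

def alignment_loss(img_emb, txt_emb, scale):
    # Symmetric InfoNCE loss: one plausible component of a multi-loss
    # objective; matched image-text pairs sit on the diagonal.
    logits = scale.exp() * img_emb @ txt_emb.t()
    target = torch.arange(logits.size(0), device=logits.device)
    return (F.cross_entropy(logits, target)
            + F.cross_entropy(logits.t(), target)) / 2

# Toy usage: 4 image-text pairs, 36 region features each, 16 text tokens.
model = MKNNetSketch()
img_emb, txt_emb = model.encode(torch.randn(4, 36, 2048),
                                torch.randint(0, 30522, (4, 16)))
loss = alignment_loss(img_emb, txt_emb, model.logit_scale)
```

At retrieval time, queries and candidates encoded this way would be ranked by cosine similarity of their normalized embeddings, which is exactly the quantity the contrastive loss optimizes; recall@K and mAP, the metrics reported in the abstract, are then computed over that ranking.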
Journal Introduction:
Alexandria Engineering Journal is an international journal devoted to publishing high-quality papers in the fields of engineering and applied science. The journal is indexed in Engineering Information Services (EIS) and Chemical Abstracts (CA). Papers published in Alexandria Engineering Journal are grouped into five sections, according to the following classification:
• Mechanical, Production, Marine and Textile Engineering
• Electrical Engineering, Computer Science and Nuclear Engineering
• Civil and Architecture Engineering
• Chemical Engineering and Applied Sciences
• Environmental Engineering