{"title":"MMAgentRec, a personalized multi-modal recommendation agent with large language model.","authors":"Xiaochen Xiao","doi":"10.1038/s41598-025-96458-w","DOIUrl":null,"url":null,"abstract":"<p><p>In multimodal recommendation, various data types, including text, images, and user dialogues, are utilized. However, it faces two primary challenges. Firstly, identifying user requirements is challenging due to their inherent complexity and diverse intentions. Secondly, the scarcity of high quality datasets and the unnaturalness of recommendation systems pose pressing issues. Especially interactive datasets,and datasets that can evaluate large models and human temporal interactions.In multimodal recommendation, users often face problems such as fragmented information and unclear needs. At the same time, data scarcity affects the accuracy and comprehensiveness of model evaluation and recommendation. This is a pain point in multimodal recommendation. Addressing these issues presents a significant opportunity for advancement. Combining multimodal backgrounds with large language models offers prospects for alleviating pain points. This integration enables systems to support a broader array of inputs, facilitating seamless dialogues and coherent responses. This article employs multimodal techniques, introducing cross-attention mechanisms, self-reflection mechanisms, along with multi-graph neural networks and residual networks. Multimodal techniques are responsible for handling data input problems. Cross-attention mechanisms are used to handle the combination of images and texts. Multi-graph neural networks and residual networks are used to build a recommendation system framework to improve the accuracy of recommendations. These are combined with an adapted large language model (LLM) using the reflection methodology,LLM takes advantage of its ease of communication with humans, proposing an autonomous decision-making and intelligent recommendation-capable multimodal system with self-reflective capabilities. The system includes a recommendation module that seeks advice from different domain experts based on user requirements. Through experimentation, our multimodal system has made significant strides in understanding user intent based on input keywords, demonstrating superiority over classic multimodal recommendation algorithms such as Blip2, clip. This indicates that our system can intelligently generate suggestions, meeting user requirements and enhancing user experience. Our approach provides novel perspectives for the development of multimodal recommendation systems, holding substantial practical application potential and promising to propel their evolution in the information technology domain. This indicates that our system can intelligently generate suggestions, meeting user requirements and enhancing user experience. Our approach provides novel perspectives for the development of multimodal recommendation systems, holding substantial practical application potential and promising to propel their evolution in the information technology domain. We conducted extensive evaluations to assess the effectiveness of our proposed model, including an ablation study, comparison with state-of-the-art methods, and performance analysis on multiple datasets. Ablation Study results demonstrate that the full model achieves the highest performance across all metrics, with an accuracy of 0.9526, precision of 0.94, recall of 0.95, and an F1 score of 0.94. 
Removing key components leads to performance degradation, with the exclusion of the LLM component having the most significant impact, reducing the F1 score to 0.91. The absence of MGCN and Cross-Attention also results in lower accuracy, confirming their critical role in enhancing model effectiveness. Comparison with state-of-the-art methods indicates that our model outperforms LightGCN and DualGNN in all key metrics. Specifically, LightGCN achieves an accuracy of 0.9210, while DualGNN reaches 0.9285, both falling short of the proposed model's performance. These results validate the superiority of our approach in handling complex multimodal tasks. Experimental results on multiple datasets further highlight the effectiveness of MGCN and Cross-Attention. On the QK-Video and QB-Video datasets, MGCN achieves the highest recall scores, with Recall@5 reaching 0.6556 and 0.6856, and Recall@50 attaining 0.9559 and 0.9059, respectively. Cross-Attention exhibits strong early recall capabilities, achieving Recall@10 of 0.8522 on the Tourism dataset. In contrast, Clip and Blip2 show moderate recall performance, with Clip achieving only 0.3423 for Recall@5 and Blip2 reaching 0.4531 on the Tourism dataset. Overall, our model consistently surpasses existing approaches, with MGCN and Cross-Attention demonstrating superior retrieval and classification performance across various tasks, underscoring their effectiveness in visual question answering (VQA). At the same time, this paper has constructed a comprehensive dataset in this field, each column contains 9004 data entries.</p>","PeriodicalId":21811,"journal":{"name":"Scientific Reports","volume":"15 1","pages":"12062"},"PeriodicalIF":3.8000,"publicationDate":"2025-04-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Scientific Reports","FirstCategoryId":"103","ListUrlMain":"https://doi.org/10.1038/s41598-025-96458-w","RegionNum":2,"RegionCategory":"综合性期刊","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"MULTIDISCIPLINARY SCIENCES","Score":null,"Total":0}
Citations: 0
Abstract
In multimodal recommendation, diverse data types are utilized, including text, images, and user dialogues. However, the field faces two primary challenges. First, identifying user requirements is difficult because of their inherent complexity and the diversity of user intentions. Second, the scarcity of high-quality datasets and the unnaturalness of existing recommendation systems are pressing issues, particularly the lack of interactive datasets and of datasets that can evaluate temporal interactions between large models and humans. Users often face fragmented information and unclear needs, while data scarcity limits the accuracy and comprehensiveness of model evaluation and recommendation; these are the pain points of multimodal recommendation, and addressing them presents a significant opportunity for advancement. Combining multimodal inputs with large language models offers a way to alleviate these pain points, enabling systems to support a broader array of inputs and to sustain seamless dialogues with coherent responses.

This article employs multimodal techniques and introduces cross-attention mechanisms and self-reflection mechanisms alongside multi-graph neural networks and residual networks. The multimodal techniques handle data input; the cross-attention mechanisms fuse images with text; and the multi-graph neural networks and residual networks form a recommendation framework that improves recommendation accuracy. These components are combined with an adapted large language model (LLM) through a reflection methodology: leveraging the LLM's ease of communication with humans, we propose a multimodal system with self-reflective capabilities that supports autonomous decision-making and intelligent recommendation. The system includes a recommendation module that seeks advice from different domain experts according to user requirements. In experiments, our multimodal system makes significant strides in understanding user intent from input keywords, outperforming classic multimodal recommendation algorithms such as BLIP-2 and CLIP. This indicates that the system can intelligently generate suggestions that meet user requirements and enhance user experience. Our approach provides novel perspectives for the development of multimodal recommendation systems, holds substantial practical application potential, and promises to propel the evolution of such systems in the information technology domain.
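The article itself is prose-only; as a rough illustration of the kind of image-text cross-attention fusion described above, the following PyTorch sketch shows text tokens attending over image patch features. Module names, dimensions, and the residual-plus-normalization layout are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class CrossModalAttention(nn.Module):
    """Illustrative fusion block: text tokens (queries) attend over
    image patch features (keys/values). All sizes are assumptions."""

    def __init__(self, dim: int = 256, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, text_feats, image_feats):
        # text_feats: (batch, n_text_tokens, dim)
        # image_feats: (batch, n_image_patches, dim)
        fused, _ = self.attn(query=text_feats, key=image_feats, value=image_feats)
        # Residual connection plus layer norm, echoing the residual
        # networks mentioned in the abstract.
        return self.norm(text_feats + fused)

# Toy usage with random tensors standing in for encoder outputs.
text = torch.randn(2, 16, 256)   # e.g., 16 text tokens per sample
image = torch.randn(2, 49, 256)  # e.g., 7x7 grid of image patches
out = CrossModalAttention()(text, image)
print(out.shape)  # torch.Size([2, 16, 256])
```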
We conducted extensive evaluations of the proposed model, including an ablation study, a comparison with state-of-the-art methods, and a performance analysis on multiple datasets. In the ablation study, the full model achieves the highest performance across all metrics, with an accuracy of 0.9526, precision of 0.94, recall of 0.95, and an F1 score of 0.94. Removing key components degrades performance: excluding the LLM component has the most significant impact, reducing the F1 score to 0.91, and removing MGCN or Cross-Attention also lowers accuracy, confirming their critical role in model effectiveness. In the comparison with state-of-the-art methods, our model outperforms LightGCN and DualGNN on all key metrics; LightGCN achieves an accuracy of 0.9210 and DualGNN 0.9285, both falling short of the proposed model. Results on multiple datasets further highlight the effectiveness of MGCN and Cross-Attention: on the QK-Video and QB-Video datasets, MGCN achieves the highest recall, with Recall@5 of 0.6556 and 0.6856 and Recall@50 of 0.9559 and 0.9059, respectively, while Cross-Attention exhibits strong early recall, reaching Recall@10 of 0.8522 on the Tourism dataset. In contrast, CLIP and BLIP-2 show only moderate recall on the Tourism dataset, with Recall@5 of 0.3423 for CLIP and 0.4531 for BLIP-2. Overall, our model consistently surpasses existing approaches, with MGCN and Cross-Attention demonstrating superior retrieval and classification performance across tasks, underscoring their effectiveness in visual question answering (VQA). In addition, we have constructed a comprehensive dataset for this field, in which each column contains 9004 data entries.
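For reference, the Recall@K figures reported above are conventionally computed per user as the fraction of held-out relevant items that appear among the top-K recommendations. A minimal sketch with hypothetical data follows; this is not the paper's evaluation code.

```python
def recall_at_k(recommended, relevant, k):
    """Mean Recall@K over users: |relevant items in top-k| / |relevant items|."""
    scores = []
    for recs, rel in zip(recommended, relevant):
        if not rel:
            continue  # skip users with no held-out relevant items
        hits = len(set(recs[:k]) & set(rel))
        scores.append(hits / len(rel))
    return sum(scores) / len(scores) if scores else 0.0

# Hypothetical example: 2 of a user's 3 relevant items appear in the top 5.
print(recall_at_k([["a", "b", "c", "d", "e"]],
                  [{"a", "c", "z"}], k=5))  # -> 0.666...
```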
Journal introduction
We publish original research from all areas of the natural sciences, psychology, medicine and engineering. You can learn more about what we publish by browsing our specific scientific subject areas below or explore Scientific Reports by browsing all articles and collections.
Scientific Reports has a 2-year impact factor of 4.380 (2021) and is the 6th most-cited journal in the world, with more than 540,000 citations in 2020 (Clarivate Analytics, 2021).
•Engineering
Engineering covers all aspects of engineering, technology, and applied science. It plays a crucial role in the development of technologies to address some of the world's biggest challenges, helping to save lives and improve the way we live.
•Physical sciences
Physical sciences are those academic disciplines that aim to uncover the underlying laws of nature — often written in the language of mathematics. It is a collective term for areas of study including astronomy, chemistry, materials science and physics.
•Earth and environmental sciences
Earth and environmental sciences cover all aspects of Earth and planetary science and broadly encompass solid Earth processes, surface and atmospheric dynamics, Earth system history, climate and climate change, marine and freshwater systems, and ecology. They also consider the interactions between humans and these systems.
•Biological sciences
Biological sciences encompass all the divisions of natural sciences examining various aspects of vital processes. The concept includes anatomy, physiology, cell biology, biochemistry and biophysics, and covers all organisms, from microorganisms and animals to plants.
•Health sciences
The health sciences study health, disease and healthcare. This field of study aims to develop knowledge, interventions and technology for use in healthcare to improve the treatment of patients.