{"title":"MMAgentRec, a personalized multi-modal recommendation agent with large language model.","authors":"Xiaochen Xiao","doi":"10.1038/s41598-025-96458-w","DOIUrl":null,"url":null,"abstract":"<p><p>In multimodal recommendation, various data types, including text, images, and user dialogues, are utilized. However, it faces two primary challenges. Firstly, identifying user requirements is challenging due to their inherent complexity and diverse intentions. Secondly, the scarcity of high quality datasets and the unnaturalness of recommendation systems pose pressing issues. Especially interactive datasets,and datasets that can evaluate large models and human temporal interactions.In multimodal recommendation, users often face problems such as fragmented information and unclear needs. At the same time, data scarcity affects the accuracy and comprehensiveness of model evaluation and recommendation. This is a pain point in multimodal recommendation. Addressing these issues presents a significant opportunity for advancement. Combining multimodal backgrounds with large language models offers prospects for alleviating pain points. This integration enables systems to support a broader array of inputs, facilitating seamless dialogues and coherent responses. This article employs multimodal techniques, introducing cross-attention mechanisms, self-reflection mechanisms, along with multi-graph neural networks and residual networks. Multimodal techniques are responsible for handling data input problems. Cross-attention mechanisms are used to handle the combination of images and texts. Multi-graph neural networks and residual networks are used to build a recommendation system framework to improve the accuracy of recommendations. These are combined with an adapted large language model (LLM) using the reflection methodology,LLM takes advantage of its ease of communication with humans, proposing an autonomous decision-making and intelligent recommendation-capable multimodal system with self-reflective capabilities. The system includes a recommendation module that seeks advice from different domain experts based on user requirements. Through experimentation, our multimodal system has made significant strides in understanding user intent based on input keywords, demonstrating superiority over classic multimodal recommendation algorithms such as Blip2, clip. This indicates that our system can intelligently generate suggestions, meeting user requirements and enhancing user experience. Our approach provides novel perspectives for the development of multimodal recommendation systems, holding substantial practical application potential and promising to propel their evolution in the information technology domain. This indicates that our system can intelligently generate suggestions, meeting user requirements and enhancing user experience. Our approach provides novel perspectives for the development of multimodal recommendation systems, holding substantial practical application potential and promising to propel their evolution in the information technology domain. We conducted extensive evaluations to assess the effectiveness of our proposed model, including an ablation study, comparison with state-of-the-art methods, and performance analysis on multiple datasets. Ablation Study results demonstrate that the full model achieves the highest performance across all metrics, with an accuracy of 0.9526, precision of 0.94, recall of 0.95, and an F1 score of 0.94. 
Removing key components leads to performance degradation, with the exclusion of the LLM component having the most significant impact, reducing the F1 score to 0.91. The absence of MGCN and Cross-Attention also results in lower accuracy, confirming their critical role in enhancing model effectiveness. Comparison with state-of-the-art methods indicates that our model outperforms LightGCN and DualGNN in all key metrics. Specifically, LightGCN achieves an accuracy of 0.9210, while DualGNN reaches 0.9285, both falling short of the proposed model's performance. These results validate the superiority of our approach in handling complex multimodal tasks. Experimental results on multiple datasets further highlight the effectiveness of MGCN and Cross-Attention. On the QK-Video and QB-Video datasets, MGCN achieves the highest recall scores, with Recall@5 reaching 0.6556 and 0.6856, and Recall@50 attaining 0.9559 and 0.9059, respectively. Cross-Attention exhibits strong early recall capabilities, achieving Recall@10 of 0.8522 on the Tourism dataset. In contrast, Clip and Blip2 show moderate recall performance, with Clip achieving only 0.3423 for Recall@5 and Blip2 reaching 0.4531 on the Tourism dataset. Overall, our model consistently surpasses existing approaches, with MGCN and Cross-Attention demonstrating superior retrieval and classification performance across various tasks, underscoring their effectiveness in visual question answering (VQA). At the same time, this paper has constructed a comprehensive dataset in this field, each column contains 9004 data entries.</p>","PeriodicalId":21811,"journal":{"name":"Scientific Reports","volume":"15 1","pages":"12062"},"PeriodicalIF":3.8000,"publicationDate":"2025-04-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Scientific Reports","FirstCategoryId":"103","ListUrlMain":"https://doi.org/10.1038/s41598-025-96458-w","RegionNum":2,"RegionCategory":"综合性期刊","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"MULTIDISCIPLINARY SCIENCES","Score":null,"Total":0}
Citations: 0
Abstract
In multimodal recommendation, diverse data types are utilized, including text, images, and user dialogues. However, the field faces two primary challenges. First, identifying user requirements is difficult because of their inherent complexity and the diversity of user intentions. Second, the scarcity of high-quality datasets and the unnaturalness of existing recommendation systems are pressing issues, particularly the lack of interactive datasets and of datasets that can evaluate temporal interactions between large models and humans. Users often face fragmented information and unclear needs, while data scarcity limits the accuracy and comprehensiveness of model evaluation and recommendation; these are the pain points of multimodal recommendation, and addressing them presents a significant opportunity for advancement. Combining multimodal inputs with large language models offers a way to alleviate these pain points, enabling systems to support a broader array of inputs and to sustain seamless dialogues with coherent responses.

This article employs multimodal techniques and introduces cross-attention mechanisms and self-reflection mechanisms alongside multi-graph neural networks and residual networks. The multimodal techniques handle data input; the cross-attention mechanisms fuse images with text; and the multi-graph neural networks and residual networks form a recommendation framework that improves recommendation accuracy. These components are combined with an adapted large language model (LLM) through a reflection methodology: leveraging the LLM's ease of communication with humans, we propose a multimodal system with self-reflective capabilities that supports autonomous decision-making and intelligent recommendation. The system includes a recommendation module that seeks advice from different domain experts according to user requirements. In experiments, our multimodal system makes significant strides in understanding user intent from input keywords, outperforming classic multimodal recommendation algorithms such as BLIP-2 and CLIP. This indicates that the system can intelligently generate suggestions that meet user requirements and enhance user experience. Our approach provides novel perspectives for the development of multimodal recommendation systems, holds substantial practical application potential, and promises to propel the evolution of such systems in the information technology domain.
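The article itself is prose-only; as a rough illustration of the kind of image-text cross-attention fusion described above, the following PyTorch sketch shows text tokens attending over image patch features. Module names, dimensions, and the residual-plus-normalization layout are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class CrossModalAttention(nn.Module):
    """Illustrative fusion block: text tokens (queries) attend over
    image patch features (keys/values). All sizes are assumptions."""

    def __init__(self, dim: int = 256, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, text_feats, image_feats):
        # text_feats: (batch, n_text_tokens, dim)
        # image_feats: (batch, n_image_patches, dim)
        fused, _ = self.attn(query=text_feats, key=image_feats, value=image_feats)
        # Residual connection plus layer norm, echoing the residual
        # networks mentioned in the abstract.
        return self.norm(text_feats + fused)

# Toy usage with random tensors standing in for encoder outputs.
text = torch.randn(2, 16, 256)   # e.g., 16 text tokens per sample
image = torch.randn(2, 49, 256)  # e.g., 7x7 grid of image patches
out = CrossModalAttention()(text, image)
print(out.shape)  # torch.Size([2, 16, 256])
```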
We conducted extensive evaluations of the proposed model, including an ablation study, a comparison with state-of-the-art methods, and a performance analysis on multiple datasets. In the ablation study, the full model achieves the highest performance across all metrics, with an accuracy of 0.9526, precision of 0.94, recall of 0.95, and an F1 score of 0.94. Removing key components degrades performance: excluding the LLM component has the most significant impact, reducing the F1 score to 0.91, and removing MGCN or Cross-Attention also lowers accuracy, confirming their critical role in model effectiveness. In the comparison with state-of-the-art methods, our model outperforms LightGCN and DualGNN on all key metrics; LightGCN achieves an accuracy of 0.9210 and DualGNN 0.9285, both falling short of the proposed model. Results on multiple datasets further highlight the effectiveness of MGCN and Cross-Attention: on the QK-Video and QB-Video datasets, MGCN achieves the highest recall, with Recall@5 of 0.6556 and 0.6856 and Recall@50 of 0.9559 and 0.9059, respectively, while Cross-Attention exhibits strong early recall, reaching Recall@10 of 0.8522 on the Tourism dataset. In contrast, CLIP and BLIP-2 show only moderate recall on the Tourism dataset, with Recall@5 of 0.3423 for CLIP and 0.4531 for BLIP-2. Overall, our model consistently surpasses existing approaches, with MGCN and Cross-Attention demonstrating superior retrieval and classification performance across tasks, underscoring their effectiveness in visual question answering (VQA). In addition, we have constructed a comprehensive dataset for this field, in which each column contains 9004 data entries.
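For reference, the Recall@K figures reported above are conventionally computed per user as the fraction of held-out relevant items that appear among the top-K recommendations. A minimal sketch with hypothetical data follows; this is not the paper's evaluation code.

```python
def recall_at_k(recommended, relevant, k):
    """Mean Recall@K over users: |relevant items in top-k| / |relevant items|."""
    scores = []
    for recs, rel in zip(recommended, relevant):
        if not rel:
            continue  # skip users with no held-out relevant items
        hits = len(set(recs[:k]) & set(rel))
        scores.append(hits / len(rel))
    return sum(scores) / len(scores) if scores else 0.0

# Hypothetical example: 2 of a user's 3 relevant items appear in the top 5.
print(recall_at_k([["a", "b", "c", "d", "e"]],
                  [{"a", "c", "z"}], k=5))  # -> 0.666...
```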
Journal introduction
We publish original research from all areas of the natural sciences, psychology, medicine and engineering. You can learn more about what we publish by browsing our specific scientific subject areas below or explore Scientific Reports by browsing all articles and collections.
Scientific Reports has a 2-year impact factor of 4.380 (2021) and is the 6th most-cited journal in the world, with more than 540,000 citations in 2020 (Clarivate Analytics, 2021).
•Engineering
Engineering covers all aspects of engineering, technology, and applied science. It plays a crucial role in the development of technologies to address some of the world's biggest challenges, helping to save lives and improve the way we live.
•Physical sciences
Physical sciences are those academic disciplines that aim to uncover the underlying laws of nature — often written in the language of mathematics. It is a collective term for areas of study including astronomy, chemistry, materials science and physics.
•Earth and environmental sciences
Earth and environmental sciences cover all aspects of Earth and planetary science and broadly encompass solid Earth processes, surface and atmospheric dynamics, Earth system history, climate and climate change, marine and freshwater systems, and ecology. They also consider the interactions between humans and these systems.
•Biological sciences
Biological sciences encompass all the divisions of natural sciences examining various aspects of vital processes. The concept includes anatomy, physiology, cell biology, biochemistry and biophysics, and covers all organisms, from microorganisms and animals to plants.
•Health sciences
The health sciences study health, disease and healthcare. This field of study aims to develop knowledge, interventions and technology for use in healthcare to improve the treatment of patients.