{"title":"A multimodal framework for enhancing E-commerce information management using vision transformers and large language models","authors":"Anitha Balachandran , Mohammad Masum","doi":"10.1016/j.jjimei.2025.100355","DOIUrl":null,"url":null,"abstract":"<div><div>In the rapidly advancing field of visual search technology, traditional methods that rely only on visual features often struggle with accuracy and relevance. This challenge is particularly evident in e-commerce, where precise product recommendations are critical, and is further complicated by keyword stuffing in product descriptions. To address these limitations, this study introduces BiLens, a multimodal recommendation framework that integrates both visual and textual information. BiLens leverages large language models (LLMs) to generate descriptive captions from image queries, which are transformed into word embeddings, and extracts visual features using Vision Transformers (ViT). The visual and textual representations are integrated using an early fusion strategy and compared using cosine similarity, enabling deeper contextual understanding and enhancing the accuracy and relevance of product recommendations in capturing customer intent. A comprehensive evaluation was conducted using Amazon product data across five categories, testing various image captioning models and embedding methods—including BLIP-2, ViT-GPT2, BLIP-Image-Captioning-Large, Florence-2-large, GIT (microsoft/git-base-coco), Word2Vec, GloVe, BERT, and ELMo. The combination of Florence-2-large and BERT emerged as the most effective, achieving a <span><math><mrow><mi>p</mi><mi>r</mi><mi>e</mi><mi>c</mi><mi>i</mi><mi>s</mi><mi>i</mi><mi>o</mi><mi>n</mi></mrow></math></span> of <span><math><mrow><mn>0.81</mn><mspace></mspace><mo>±</mo><mspace></mspace><mn>0.14</mn></mrow></math></span> and <span><math><mrow><mi>F</mi><mn>1</mn></mrow></math></span> score of <span><math><mrow><mn>0.49</mn><mspace></mspace><mo>±</mo><mspace></mspace><mn>0.16</mn></mrow></math></span>. This setup was further validated on the Myntra dataset, showing generalizability with <span><math><mrow><mi>p</mi><mi>r</mi><mi>e</mi><mi>c</mi><mi>i</mi><mi>s</mi><mi>i</mi><mi>o</mi><mi>n</mi></mrow></math></span> of <span><math><mrow><mn>0.59</mn><mspace></mspace><mo>±</mo><mspace></mspace><mn>0.27</mn></mrow></math></span>, <span><math><mrow><mi>r</mi><mi>e</mi><mi>c</mi><mi>a</mi><mi>l</mi><mi>l</mi></mrow></math></span> of <span><math><mrow><mn>0.47</mn><mspace></mspace><mo>±</mo><mspace></mspace><mn>0.25</mn></mrow></math></span>, and <span><math><mrow><mi>F</mi><mn>1</mn></mrow></math></span> score of <span><math><mrow><mn>0.52</mn><mspace></mspace><mo>±</mo><mspace></mspace><mn>0.24</mn></mrow></math></span>. 
Comparisons with image-only and text-only baselines confirmed the superiority of the fusion-based approach, with statistically significant improvements in F1 scores, underscoring BiLens’s ability to deliver more accurate, context-aware product recommendations.</div></div>","PeriodicalId":100699,"journal":{"name":"International Journal of Information Management Data Insights","volume":"5 2","pages":"Article 100355"},"PeriodicalIF":0.0000,"publicationDate":"2025-07-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"International Journal of Information Management Data Insights","FirstCategoryId":"1085","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S2667096825000370","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
In the rapidly advancing field of visual search technology, traditional methods that rely only on visual features often struggle with accuracy and relevance. This challenge is particularly evident in e-commerce, where precise product recommendations are critical, and is further complicated by keyword stuffing in product descriptions. To address these limitations, this study introduces BiLens, a multimodal recommendation framework that integrates both visual and textual information. BiLens leverages large language models (LLMs) to generate descriptive captions from image queries, which are transformed into word embeddings, and extracts visual features using Vision Transformers (ViT). The visual and textual representations are integrated using an early fusion strategy and compared using cosine similarity, enabling deeper contextual understanding and enhancing the accuracy and relevance of product recommendations in capturing customer intent. A comprehensive evaluation was conducted using Amazon product data across five categories, testing various image captioning models and embedding methods, including BLIP-2, ViT-GPT2, BLIP-Image-Captioning-Large, Florence-2-large, GIT (microsoft/git-base-coco), Word2Vec, GloVe, BERT, and ELMo. The combination of Florence-2-large and BERT emerged as the most effective, achieving a precision of 0.81 ± 0.14 and an F1 score of 0.49 ± 0.16. This setup was further validated on the Myntra dataset, showing generalizability with a precision of 0.59 ± 0.27, recall of 0.47 ± 0.25, and F1 score of 0.52 ± 0.24. Comparisons with image-only and text-only baselines confirmed the superiority of the fusion-based approach, with statistically significant improvements in F1 scores, underscoring BiLens's ability to deliver more accurate, context-aware product recommendations.
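To make the described pipeline concrete, the sketch below shows one way the fusion-and-ranking step could be implemented with off-the-shelf Hugging Face checkpoints. It is a minimal illustration under several assumptions not stated in the abstract: the specific checkpoints (google/vit-base-patch16-224-in21k, bert-base-uncased), concatenation as the early-fusion operator, and the helper functions (image_features, text_features, fuse, rank_products) are all illustrative choices, and the captioning step (Florence-2-large in the paper) is assumed to have already produced a caption string for the query image.

```python
import numpy as np
import torch
from PIL import Image
from transformers import ViTImageProcessor, ViTModel, BertTokenizer, BertModel
from sklearn.metrics.pairwise import cosine_similarity

# Illustrative checkpoints; the paper does not name the exact ViT/BERT variants.
vit_processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224-in21k")
vit = ViTModel.from_pretrained("google/vit-base-patch16-224-in21k").eval()
bert_tok = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased").eval()

def image_features(image: Image.Image) -> np.ndarray:
    """Global visual feature: ViT [CLS] token embedding."""
    inputs = vit_processor(images=image, return_tensors="pt")
    with torch.no_grad():
        out = vit(**inputs)
    return out.last_hidden_state[:, 0, :].squeeze(0).numpy()

def text_features(caption: str) -> np.ndarray:
    """Textual feature: BERT [CLS] embedding of the generated caption."""
    inputs = bert_tok(caption, return_tensors="pt", truncation=True)
    with torch.no_grad():
        out = bert(**inputs)
    return out.last_hidden_state[:, 0, :].squeeze(0).numpy()

def fuse(img_vec: np.ndarray, txt_vec: np.ndarray) -> np.ndarray:
    # Early fusion by concatenation; one plausible reading of "early fusion",
    # not necessarily the exact operator used in the paper.
    return np.concatenate([img_vec, txt_vec])

def rank_products(query_image: Image.Image, caption: str, catalog):
    """catalog: list of (product_id, fused_vector) pairs built the same way offline."""
    q = fuse(image_features(query_image), text_features(caption)).reshape(1, -1)
    vecs = np.stack([v for _, v in catalog])
    sims = cosine_similarity(q, vecs)[0]
    order = np.argsort(-sims)
    return [(catalog[i][0], float(sims[i])) for i in order]
```

In this sketch the catalog items are embedded once offline with the same image and caption encoders, so query-time work reduces to two forward passes, a concatenation, and a cosine-similarity ranking.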