{"title":"A multimodal framework for enhancing E-commerce information management using vision transformers and large language models","authors":"Anitha Balachandran , Mohammad Masum","doi":"10.1016/j.jjimei.2025.100355","DOIUrl":null,"url":null,"abstract":"<div><div>In the rapidly advancing field of visual search technology, traditional methods that rely only on visual features often struggle with accuracy and relevance. This challenge is particularly evident in e-commerce, where precise product recommendations are critical, and is further complicated by keyword stuffing in product descriptions. To address these limitations, this study introduces BiLens, a multimodal recommendation framework that integrates both visual and textual information. BiLens leverages large language models (LLMs) to generate descriptive captions from image queries, which are transformed into word embeddings, and extracts visual features using Vision Transformers (ViT). The visual and textual representations are integrated using an early fusion strategy and compared using cosine similarity, enabling deeper contextual understanding and enhancing the accuracy and relevance of product recommendations in capturing customer intent. A comprehensive evaluation was conducted using Amazon product data across five categories, testing various image captioning models and embedding methods—including BLIP-2, ViT-GPT2, BLIP-Image-Captioning-Large, Florence-2-large, GIT (microsoft/git-base-coco), Word2Vec, GloVe, BERT, and ELMo. The combination of Florence-2-large and BERT emerged as the most effective, achieving a <span><math><mrow><mi>p</mi><mi>r</mi><mi>e</mi><mi>c</mi><mi>i</mi><mi>s</mi><mi>i</mi><mi>o</mi><mi>n</mi></mrow></math></span> of <span><math><mrow><mn>0.81</mn><mspace></mspace><mo>±</mo><mspace></mspace><mn>0.14</mn></mrow></math></span> and <span><math><mrow><mi>F</mi><mn>1</mn></mrow></math></span> score of <span><math><mrow><mn>0.49</mn><mspace></mspace><mo>±</mo><mspace></mspace><mn>0.16</mn></mrow></math></span>. This setup was further validated on the Myntra dataset, showing generalizability with <span><math><mrow><mi>p</mi><mi>r</mi><mi>e</mi><mi>c</mi><mi>i</mi><mi>s</mi><mi>i</mi><mi>o</mi><mi>n</mi></mrow></math></span> of <span><math><mrow><mn>0.59</mn><mspace></mspace><mo>±</mo><mspace></mspace><mn>0.27</mn></mrow></math></span>, <span><math><mrow><mi>r</mi><mi>e</mi><mi>c</mi><mi>a</mi><mi>l</mi><mi>l</mi></mrow></math></span> of <span><math><mrow><mn>0.47</mn><mspace></mspace><mo>±</mo><mspace></mspace><mn>0.25</mn></mrow></math></span>, and <span><math><mrow><mi>F</mi><mn>1</mn></mrow></math></span> score of <span><math><mrow><mn>0.52</mn><mspace></mspace><mo>±</mo><mspace></mspace><mn>0.24</mn></mrow></math></span>. 
Comparisons with image-only and text-only baselines confirmed the superiority of the fusion-based approach, with statistically significant improvements in F1 scores, underscoring BiLens’s ability to deliver more accurate, context-aware product recommendations.</div></div>","PeriodicalId":100699,"journal":{"name":"International Journal of Information Management Data Insights","volume":"5 2","pages":"Article 100355"},"PeriodicalIF":0.0000,"publicationDate":"2025-07-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"International Journal of Information Management Data Insights","FirstCategoryId":"1085","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S2667096825000370","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
In the rapidly advancing field of visual search technology, traditional methods that rely only on visual features often struggle with accuracy and relevance. This challenge is particularly evident in e-commerce, where precise product recommendations are critical, and is further complicated by keyword stuffing in product descriptions. To address these limitations, this study introduces BiLens, a multimodal recommendation framework that integrates both visual and textual information. BiLens leverages large language models (LLMs) to generate descriptive captions from image queries, which are transformed into word embeddings, and extracts visual features using Vision Transformers (ViT). The visual and textual representations are integrated using an early fusion strategy and compared using cosine similarity, enabling deeper contextual understanding and enhancing the accuracy and relevance of product recommendations in capturing customer intent. A comprehensive evaluation was conducted using Amazon product data across five categories, testing various image captioning models and embedding methods, including BLIP-2, ViT-GPT2, BLIP-Image-Captioning-Large, Florence-2-large, GIT (microsoft/git-base-coco), Word2Vec, GloVe, BERT, and ELMo. The combination of Florence-2-large and BERT emerged as the most effective, achieving a precision of 0.81 ± 0.14 and an F1 score of 0.49 ± 0.16. This setup was further validated on the Myntra dataset, showing generalizability with a precision of 0.59 ± 0.27, recall of 0.47 ± 0.25, and F1 score of 0.52 ± 0.24. Comparisons with image-only and text-only baselines confirmed the superiority of the fusion-based approach, with statistically significant improvements in F1 scores, underscoring BiLens's ability to deliver more accurate, context-aware product recommendations.
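To make the described pipeline concrete, the sketch below shows one way the fusion-and-ranking step could be implemented with off-the-shelf Hugging Face checkpoints. It is a minimal illustration under several assumptions not stated in the abstract: the specific checkpoints (google/vit-base-patch16-224-in21k, bert-base-uncased), concatenation as the early-fusion operator, and the helper functions (image_features, text_features, fuse, rank_products) are all illustrative choices, and the captioning step (Florence-2-large in the paper) is assumed to have already produced a caption string for the query image.

```python
import numpy as np
import torch
from PIL import Image
from transformers import ViTImageProcessor, ViTModel, BertTokenizer, BertModel
from sklearn.metrics.pairwise import cosine_similarity

# Illustrative checkpoints; the paper does not name the exact ViT/BERT variants.
vit_processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224-in21k")
vit = ViTModel.from_pretrained("google/vit-base-patch16-224-in21k").eval()
bert_tok = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased").eval()

def image_features(image: Image.Image) -> np.ndarray:
    """Global visual feature: ViT [CLS] token embedding."""
    inputs = vit_processor(images=image, return_tensors="pt")
    with torch.no_grad():
        out = vit(**inputs)
    return out.last_hidden_state[:, 0, :].squeeze(0).numpy()

def text_features(caption: str) -> np.ndarray:
    """Textual feature: BERT [CLS] embedding of the generated caption."""
    inputs = bert_tok(caption, return_tensors="pt", truncation=True)
    with torch.no_grad():
        out = bert(**inputs)
    return out.last_hidden_state[:, 0, :].squeeze(0).numpy()

def fuse(img_vec: np.ndarray, txt_vec: np.ndarray) -> np.ndarray:
    # Early fusion by concatenation; one plausible reading of "early fusion",
    # not necessarily the exact operator used in the paper.
    return np.concatenate([img_vec, txt_vec])

def rank_products(query_image: Image.Image, caption: str, catalog):
    """catalog: list of (product_id, fused_vector) pairs built the same way offline."""
    q = fuse(image_features(query_image), text_features(caption)).reshape(1, -1)
    vecs = np.stack([v for _, v in catalog])
    sims = cosine_similarity(q, vecs)[0]
    order = np.argsort(-sims)
    return [(catalog[i][0], float(sims[i])) for i in order]
```

In this sketch the catalog items are embedded once offline with the same image and caption encoders, so query-time work reduces to two forward passes, a concatenation, and a cosine-similarity ranking.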