RecBERT: Semantic Recommendation Engine with Large Language Model Enhanced Query Segmentation for k-Nearest Neighbors Ranking Retrieval
Author: Richard Wu
DOI: 10.23919/ICN.2024.0004
Journal: Intelligent and Converged Networks, vol. 5, no. 1, pp. 42-52
Published: 2024-01-09 (Journal Article)
PDF: https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10387238
Publisher page: https://ieeexplore.ieee.org/document/10387238/
Citations: 0
Abstract
The increasing volume of user traffic on Internet discussion forums has produced a huge amount of unstructured natural language data in the form of user comments. Most modern recommendation systems rely on manual tagging, in which administrators label the features of the class, or story, to which a user comment corresponds. Another common approach is to use pre-trained word embeddings to compare class descriptions for textual similarity, then apply a distance metric such as cosine similarity or Euclidean distance to find the top $k$ neighbors. However, neither approach fully utilizes this user-generated unstructured natural language data, which limits the scope of these recommendation systems. This paper studies the application of domain adaptation to a transformer over the set of user comments to be indexed, and the use of simple contrastive learning in the sentence-transformer fine-tuning process to generate meaningful semantic embeddings for the user comments that apply to each class. To match a query containing content from multiple user comments belonging to the same class, the construction of a subquery channel for computing class-level similarity is proposed. This channel segments the aggregate query into subqueries and performs a k-nearest neighbors (KNN) search on each individual subquery. RecBERT achieves state-of-the-art performance, outperforming other state-of-the-art models in accuracy, precision, recall, and F1 score for classifying comments among four to eight classes. RecBERT outperforms the most precise state-of-the-art model (distilRoBERTa) in precision by 6.97% when matching comments among eight classes.
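The embedding-comparison baseline the abstract describes can be sketched as a brute-force cosine-similarity top-$k$ search. This is only an illustrative sketch, not the paper's implementation; the function name and toy vectors are assumptions made here for demonstration.

```python
import numpy as np

def top_k_neighbors(query_vec, class_embeddings, k=3):
    """Return indices of the k classes whose embeddings are most
    cosine-similar to the query embedding (brute-force KNN sketch)."""
    q = query_vec / np.linalg.norm(query_vec)
    m = class_embeddings / np.linalg.norm(class_embeddings, axis=1, keepdims=True)
    sims = m @ q                      # cosine similarity of each class to the query
    return np.argsort(-sims)[:k]     # indices of the k most similar classes
```

In practice the embeddings would come from a sentence transformer and the search would use an approximate-nearest-neighbor index rather than a full matrix product, but the ranking logic is the same.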
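The subquery channel for class-level similarity might be sketched as follows: each subquery (already segmented and embedded; the paper's LLM-enhanced segmentation is not reproduced here) is matched against comment embeddings, and each class is scored by its best-matching comment per subquery, averaged over subqueries. The function names and the mean-of-max aggregation are assumptions for illustration, not the paper's exact scoring rule.

```python
import numpy as np

def _normalize(m):
    return m / np.linalg.norm(m, axis=-1, keepdims=True)

def rank_classes(subquery_vecs, comment_vecs, comment_classes, n_classes):
    """Rank classes for an aggregate query split into subqueries.
    Each class's score is the mean, over subqueries, of its single
    best cosine similarity against that class's comments."""
    sq = _normalize(np.asarray(subquery_vecs, dtype=float))
    cm = _normalize(np.asarray(comment_vecs, dtype=float))
    sims = sq @ cm.T                 # (n_subqueries, n_comments) cosine similarities
    labels = np.asarray(comment_classes)
    scores = np.zeros(n_classes)
    for c in range(n_classes):       # assumes every class has at least one comment
        scores[c] = sims[:, labels == c].max(axis=1).mean()
    return np.argsort(-scores)       # class indices, best-scoring first
```

Running a per-subquery KNN and aggregating at the class level lets a long query that mixes content from several comments of one class still score that class highly, which a single whole-query embedding would dilute.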