RecBERT: Semantic Recommendation Engine with Large Language Model Enhanced Query Segmentation for k-Nearest Neighbors Ranking Retrieval
Author: Richard Wu
DOI: 10.23919/ICN.2024.0004
Journal: Intelligent and Converged Networks, vol. 5, no. 1, pp. 42-52
Published: 2024-01-09 (Journal Article)
PDF: https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10387238
Publisher page: https://ieeexplore.ieee.org/document/10387238/
Citations: 0
Abstract
The increasing volume of user traffic on Internet discussion forums has produced a huge amount of unstructured natural language data in the form of user comments. Most modern recommendation systems rely on manual tagging, in which administrators label the features of the class, or story, to which a user comment corresponds. Another common approach is to use pre-trained word embeddings to compare class descriptions for textual similarity, then apply a distance metric such as cosine similarity or Euclidean distance to find the top $k$ neighbors. However, neither approach fully utilizes this user-generated unstructured natural language data, which limits the scope of these recommendation systems. This paper studies the application of domain adaptation to a transformer over the set of user comments to be indexed, and the use of simple contrastive learning in the sentence-transformer fine-tuning process to generate meaningful semantic embeddings for the user comments that apply to each class. To match a query containing content from multiple user comments belonging to the same class, the construction of a subquery channel for computing class-level similarity is proposed. This channel segments the aggregate query into subqueries and performs a k-nearest neighbors (KNN) search on each individual subquery. RecBERT achieves state-of-the-art performance, outperforming other state-of-the-art models in accuracy, precision, recall, and F1 score for classifying comments among four to eight classes. RecBERT outperforms the most precise state-of-the-art model (distilRoBERTa) in precision by 6.97% when matching comments among eight classes.
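The embedding-comparison baseline the abstract describes can be sketched as a brute-force cosine-similarity top-$k$ search. This is only an illustrative sketch, not the paper's implementation; the function name and toy vectors are assumptions made here for demonstration.

```python
import numpy as np

def top_k_neighbors(query_vec, class_embeddings, k=3):
    """Return indices of the k classes whose embeddings are most
    cosine-similar to the query embedding (brute-force KNN sketch)."""
    q = query_vec / np.linalg.norm(query_vec)
    m = class_embeddings / np.linalg.norm(class_embeddings, axis=1, keepdims=True)
    sims = m @ q                      # cosine similarity of each class to the query
    return np.argsort(-sims)[:k]     # indices of the k most similar classes
```

In practice the embeddings would come from a sentence transformer and the search would use an approximate-nearest-neighbor index rather than a full matrix product, but the ranking logic is the same.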
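The subquery channel for class-level similarity might be sketched as follows: each subquery (already segmented and embedded; the paper's LLM-enhanced segmentation is not reproduced here) is matched against comment embeddings, and each class is scored by its best-matching comment per subquery, averaged over subqueries. The function names and the mean-of-max aggregation are assumptions for illustration, not the paper's exact scoring rule.

```python
import numpy as np

def _normalize(m):
    return m / np.linalg.norm(m, axis=-1, keepdims=True)

def rank_classes(subquery_vecs, comment_vecs, comment_classes, n_classes):
    """Rank classes for an aggregate query split into subqueries.
    Each class's score is the mean, over subqueries, of its single
    best cosine similarity against that class's comments."""
    sq = _normalize(np.asarray(subquery_vecs, dtype=float))
    cm = _normalize(np.asarray(comment_vecs, dtype=float))
    sims = sq @ cm.T                 # (n_subqueries, n_comments) cosine similarities
    labels = np.asarray(comment_classes)
    scores = np.zeros(n_classes)
    for c in range(n_classes):       # assumes every class has at least one comment
        scores[c] = sims[:, labels == c].max(axis=1).mean()
    return np.argsort(-scores)       # class indices, best-scoring first
```

Running a per-subquery KNN and aggregating at the class level lets a long query that mixes content from several comments of one class still score that class highly, which a single whole-query embedding would dilute.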