Closed-Form Models of Accuracy Loss due to Subsampling in SVD Collaborative Filtering

IF 6.2 1区计算机科学 Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Big Data Mining and Analytics Pub Date : 2022-11-24 DOI:10.26599/BDMA.2022.9020024

Samin Poudel;Marwan Bikdash

{"title":"Closed-Form Models of Accuracy Loss due to Subsampling in SVD Collaborative Filtering","authors":"Samin Poudel;Marwan Bikdash","doi":"10.26599/BDMA.2022.9020024","DOIUrl":null,"url":null,"abstract":"We postulate and analyze a nonlinear subsampling accuracy loss (SSAL) model based on the root mean square error (RMSE) and two SSAL models based on the mean square error (MSE), suggested by extensive preliminary simulations. The SSAL models predict accuracy loss in terms of subsampling parameters like the fraction of users dropped (FUD) and the fraction of items dropped (FID). We seek to investigate whether the models depend on the characteristics of the dataset in a constant way across datasets when using the SVD collaborative filtering (CF) algorithm. The dataset characteristics considered include various densities of the rating matrix and the numbers of users and items. Extensive simulations and rigorous regression analysis led to empirical symmetrical SSAL models in terms of FID and FUD whose coefficients depend only on the data characteristics. The SSAL models came out to be multi-linear in terms of odds ratios of dropping a user (or an item) vs. not dropping it. Moreover, one MSE deterioration model turned out to be linear in the FID and FUD odds where their interaction term has a zero coefficient. Most importantly, the models are constant in the sense that they are written in closed-form using the considered data characteristics (densities and numbers of users and items). The models are validated through extensive simulations based on 850 synthetically generated primary (pre-subsampling) matrices derived from the 25M MovieLens dataset. Nearly 460 000 subsampled rating matrices were then simulated and subjected to the singular value decomposition (SVD) CF algorithm. Further validation was conducted using the 1M MovieLens and the Yahoo! Music Rating datasets. The models were constant and significant across all 3 datasets.","PeriodicalId":52355,"journal":{"name":"Big Data Mining and Analytics","volume":"6 1","pages":"72-84"},"PeriodicalIF":6.2000,"publicationDate":"2022-11-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/iel7/8254253/9962810/09963626.pdf","citationCount":"1","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Big Data Mining and Analytics","FirstCategoryId":"1093","ListUrlMain":"https://ieeexplore.ieee.org/document/9963626/","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}

引用次数: 1

Abstract

We postulate and analyze a nonlinear subsampling accuracy loss (SSAL) model based on the root mean square error (RMSE) and two SSAL models based on the mean square error (MSE), suggested by extensive preliminary simulations. The SSAL models predict accuracy loss in terms of subsampling parameters like the fraction of users dropped (FUD) and the fraction of items dropped (FID). We seek to investigate whether the models depend on the characteristics of the dataset in a constant way across datasets when using the SVD collaborative filtering (CF) algorithm. The dataset characteristics considered include various densities of the rating matrix and the numbers of users and items. Extensive simulations and rigorous regression analysis led to empirical symmetrical SSAL models in terms of FID and FUD whose coefficients depend only on the data characteristics. The SSAL models came out to be multi-linear in terms of odds ratios of dropping a user (or an item) vs. not dropping it. Moreover, one MSE deterioration model turned out to be linear in the FID and FUD odds where their interaction term has a zero coefficient. Most importantly, the models are constant in the sense that they are written in closed-form using the considered data characteristics (densities and numbers of users and items). The models are validated through extensive simulations based on 850 synthetically generated primary (pre-subsampling) matrices derived from the 25M MovieLens dataset. Nearly 460 000 subsampled rating matrices were then simulated and subjected to the singular value decomposition (SVD) CF algorithm. Further validation was conducted using the 1M MovieLens and the Yahoo! Music Rating datasets. The models were constant and significant across all 3 datasets.

查看原文本刊更多论文

SVD协同滤波中子采样导致精度损失的闭式模型

我们假设并分析了一个基于均方根误差（RMSE）的非线性二次采样精度损失（SSAL）模型和两个基于均方误差（MSE）的SSAL模型，这两个模型是由大量的初步模拟提出的。SSAL模型根据二次采样参数预测准确性损失，如丢弃用户的分数（FUD）和丢弃项目的分数（FID）。当使用SVD协同过滤（CF）算法时，我们试图研究模型是否以恒定的方式在数据集之间依赖于数据集的特征。所考虑的数据集特征包括评级矩阵的各种密度以及用户和项目的数量。广泛的模拟和严格的回归分析导致了FID和FUD方面的经验对称SSAL模型，其系数仅取决于数据特征。SSAL模型在丢弃用户（或物品）与不丢弃用户（和物品）的比值比方面是多线性的。此外，一个MSE恶化模型在FID和FUD比值中是线性的，其中它们的交互项具有零系数。最重要的是，模型是恒定的，因为它们是使用所考虑的数据特征（用户和项目的密度和数量）以封闭形式编写的。基于从25M MovieLens数据集导出的850个合成生成的初级（预子采样）矩阵，通过广泛的模拟对模型进行了验证。然后模拟了近46万个子采样的评级矩阵，并对其进行奇异值分解（SVD）CF算法。使用1M MovieLens和Yahoo！音乐分级数据集。这些模型在所有3个数据集中都是恒定且显著的。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文求助全文

来源期刊

Big Data Mining and Analytics Computer Science-Computer Science Applications

CiteScore

20.90

自引率

2.20%

发文量

期刊介绍： Big Data Mining and Analytics, a publication by Tsinghua University Press, presents groundbreaking research in the field of big data research and its applications. This comprehensive book delves into the exploration and analysis of vast amounts of data from diverse sources to uncover hidden patterns, correlations, insights, and knowledge. Featuring the latest developments, research issues, and solutions, this book offers valuable insights into the world of big data. It provides a deep understanding of data mining techniques, data analytics, and their practical applications. Big Data Mining and Analytics has gained significant recognition and is indexed and abstracted in esteemed platforms such as ESCI, EI, Scopus, DBLP Computer Science, Google Scholar, INSPEC, CSCD, DOAJ, CNKI, and more. With its wealth of information and its ability to transform the way we perceive and utilize data, this book is a must-read for researchers, professionals, and anyone interested in the field of big data analytics.