{"title":"Debias Can be Unreliable: Mitigating Bias Issue in Evaluating Debiasing Recommendation","authors":"Chengbing Wang, Wentao Shi, Jizhi Zhang, Wenjie Wang, Hang Pan, Fuli Feng","doi":"arxiv-2409.04810","DOIUrl":null,"url":null,"abstract":"Recent work has improved recommendation models remarkably by equipping them\nwith debiasing methods. Due to the unavailability of fully-exposed datasets,\nmost existing approaches resort to randomly-exposed datasets as a proxy for\nevaluating debiased models, employing traditional evaluation scheme to\nrepresent the recommendation performance. However, in this study, we reveal\nthat traditional evaluation scheme is not suitable for randomly-exposed\ndatasets, leading to inconsistency between the Recall performance obtained\nusing randomly-exposed datasets and that obtained using fully-exposed datasets.\nSuch inconsistency indicates the potential unreliability of experiment\nconclusions on previous debiasing techniques and calls for unbiased Recall\nevaluation using randomly-exposed datasets. To bridge the gap, we propose the\nUnbiased Recall Evaluation (URE) scheme, which adjusts the utilization of\nrandomly-exposed datasets to unbiasedly estimate the true Recall performance on\nfully-exposed datasets. We provide theoretical evidence to demonstrate the\nrationality of URE and perform extensive experiments on real-world datasets to\nvalidate its soundness.","PeriodicalId":501281,"journal":{"name":"arXiv - CS - Information Retrieval","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2024-09-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Information Retrieval","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.04810","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
Recent work has improved recommendation models remarkably by equipping them with debiasing methods. Because fully-exposed datasets are unavailable, most existing approaches resort to randomly-exposed datasets as a proxy for evaluating debiased models, employing the traditional evaluation scheme to measure recommendation performance. In this study, however, we reveal that the traditional evaluation scheme is not suitable for randomly-exposed datasets, leading to an inconsistency between the Recall performance obtained on randomly-exposed datasets and that obtained on fully-exposed datasets. Such inconsistency indicates that experimental conclusions about previous debiasing techniques may be unreliable, and it calls for unbiased Recall evaluation using randomly-exposed datasets. To bridge this gap, we propose the Unbiased Recall Evaluation (URE) scheme, which adjusts how randomly-exposed datasets are used so as to obtain an unbiased estimate of the true Recall performance on fully-exposed datasets. We provide theoretical evidence for the rationality of URE and perform extensive experiments on real-world datasets to validate its soundness.
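
To make the distinction concrete, the sketch below contrasts the traditional scheme (ranking only the randomly-exposed, labeled items) with an illustrative estimator that ranks the full item catalog and uses the uniformly sampled labels to estimate full-exposure Recall@K. This is a minimal sketch of the general unbiased-estimation idea under a uniform-random-exposure assumption, not the paper's URE scheme; the function names, the toy data, and the estimator itself are our own illustrative assumptions.

```python
import numpy as np

def recall_at_k_traditional(exposed_scores, exposed_labels, k):
    # Traditional scheme: rank only the randomly-exposed (labeled) items and
    # compute Recall@K within that small candidate set.
    topk = np.argsort(-exposed_scores)[:k]
    hits = exposed_labels[topk].sum()
    total_relevant = exposed_labels.sum()
    return hits / total_relevant if total_relevant > 0 else float("nan")

def recall_at_k_unbiased_estimate(full_scores, exposed_idx, exposed_labels, k):
    # Illustrative estimator (hypothetical, not the paper's URE): rank the FULL
    # item catalog with the model's scores, then estimate full-exposure Recall@K
    # from the uniformly sampled labels.  Under uniform random exposure, the
    # sampled hit count and sampled relevant count estimate the full-catalog
    # counts up to the same scaling factor, which cancels in the ratio.
    ranks = np.empty(len(full_scores), dtype=int)
    ranks[np.argsort(-full_scores)] = np.arange(len(full_scores))
    in_topk = ranks[exposed_idx] < k              # top-K w.r.t. the full catalog
    hits = np.sum(exposed_labels & in_topk)
    total_relevant = exposed_labels.sum()
    return hits / total_relevant if total_relevant > 0 else float("nan")

# Toy usage with synthetic scores and a uniformly sampled exposed subset.
rng = np.random.default_rng(0)
n_items = 10_000
full_scores = rng.normal(size=n_items)                        # model scores for all items
exposed_idx = rng.choice(n_items, size=200, replace=False)    # uniform random exposure
exposed_labels = rng.random(200) < 0.1                        # relevance observed only for exposed items
print(recall_at_k_traditional(full_scores[exposed_idx], exposed_labels, k=50))
print(recall_at_k_unbiased_estimate(full_scores, exposed_idx, exposed_labels, k=50))
```

The two numbers answer different questions: the first measures Recall among the small exposed candidate set, while the second estimates Recall over the whole catalog, which is why the traditional scheme and a full-exposure-oriented estimate can disagree.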