On the Effectiveness of Sampled Softmax Loss for Item Recommendation

IF 5.4 2区计算机科学 Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS

ACM Transactions on Information Systems Pub Date : 2023-12-13 DOI:10.1145/3637061

Jiancan Wu, Xiang Wang, Xingyu Gao, Jiawei Chen, Hongcheng Fu, Tianyu Qiu, Xiangnan He

{"title":"On the Effectiveness of Sampled Softmax Loss for Item Recommendation","authors":"Jiancan Wu, Xiang Wang, Xingyu Gao, Jiawei Chen, Hongcheng Fu, Tianyu Qiu, Xiangnan He","doi":"10.1145/3637061","DOIUrl":null,"url":null,"abstract":"The learning objective plays a fundamental role to build a recommender system. Most methods routinely adopt either pointwise (e.g., binary cross-entropy) or pairwise (e.g., BPR) loss to train the model parameters, while rarely pay attention to softmax loss, which assumes the probabilities of all classes sum up to 1, due to its computational complexity when scaling up to large datasets or intractability for streaming data where the complete item space is not always available. The sampled softmax (SSM) loss emerges as an efficient substitute for softmax loss. Its special case, InfoNCE loss, has been widely used in self-supervised learning and exhibited remarkable performance for contrastive learning. Nonetheless, limited recommendation work uses the SSM loss as the learning objective. Worse still, none of them explores its properties thoroughly and answers “Does SSM loss suit for item recommendation?” and “What are the conceptual advantages of SSM loss, as compared with the prevalent losses?”, to the best of our knowledge. In this work, we aim to offer a better understanding of SSM for item recommendation. Specifically, we first theoretically reveal three model-agnostic advantages: (1) mitigating popularity bias, which is beneficial to long-tail recommendation; (2) mining hard negative samples, which offers informative gradients to optimize model parameters; and (3) maximizing the ranking metric, which facilitates top-K performance. However, based on our empirical studies, we recognize that the default choice of cosine similarity function in SSM limits its ability in learning the magnitudes of representation vectors. As such, the combinations of SSM with the models that also fall short in adjusting magnitudes (e.g., matrix factorization) may result in poor representations. One step further, we provide mathematical proof that message passing schemes in graph convolution networks can adjust representation magnitude according to node degree, which naturally compensates for the shortcoming of SSM. Extensive experiments on four benchmark datasets justify our analyses, demonstrating the superiority of SSM for item recommendation. Our implementations are available in both TensorFlow and PyTorch.","PeriodicalId":50936,"journal":{"name":"ACM Transactions on Information Systems","volume":"5 1","pages":""},"PeriodicalIF":5.4000,"publicationDate":"2023-12-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"ACM Transactions on Information Systems","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.1145/3637061","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}

引用次数: 0

Abstract

The learning objective plays a fundamental role to build a recommender system. Most methods routinely adopt either pointwise (e.g., binary cross-entropy) or pairwise (e.g., BPR) loss to train the model parameters, while rarely pay attention to softmax loss, which assumes the probabilities of all classes sum up to 1, due to its computational complexity when scaling up to large datasets or intractability for streaming data where the complete item space is not always available. The sampled softmax (SSM) loss emerges as an efficient substitute for softmax loss. Its special case, InfoNCE loss, has been widely used in self-supervised learning and exhibited remarkable performance for contrastive learning. Nonetheless, limited recommendation work uses the SSM loss as the learning objective. Worse still, none of them explores its properties thoroughly and answers “Does SSM loss suit for item recommendation?” and “What are the conceptual advantages of SSM loss, as compared with the prevalent losses?”, to the best of our knowledge.

In this work, we aim to offer a better understanding of SSM for item recommendation. Specifically, we first theoretically reveal three model-agnostic advantages: (1) mitigating popularity bias, which is beneficial to long-tail recommendation; (2) mining hard negative samples, which offers informative gradients to optimize model parameters; and (3) maximizing the ranking metric, which facilitates top-K performance. However, based on our empirical studies, we recognize that the default choice of cosine similarity function in SSM limits its ability in learning the magnitudes of representation vectors. As such, the combinations of SSM with the models that also fall short in adjusting magnitudes (e.g., matrix factorization) may result in poor representations. One step further, we provide mathematical proof that message passing schemes in graph convolution networks can adjust representation magnitude according to node degree, which naturally compensates for the shortcoming of SSM. Extensive experiments on four benchmark datasets justify our analyses, demonstrating the superiority of SSM for item recommendation. Our implementations are available in both TensorFlow and PyTorch.

查看原文本刊更多论文

论采样软最大损失在项目推荐中的有效性

学习目标对建立推荐系统起着至关重要的作用。大多数方法通常采用点状损失（如二元交叉熵）或成对损失（如 BPR）来训练模型参数，而很少关注软最大损失，因为软最大损失假定所有类别的概率总和为 1，这是因为它在扩展到大型数据集时计算复杂，或者对于并非总是有完整项目空间的流数据来说难以处理。采样软最大（SSM）损失作为软最大损失的有效替代品应运而生。它的特例--InfoNCE 损失已被广泛应用于自我监督学习中，并在对比学习中表现出卓越的性能。然而，使用 SSM 损失作为学习目标的推荐工作非常有限。更糟糕的是，据我们所知，没有一项研究深入探讨了 SSM 损失的特性，并回答了 "SSM 损失是否适合项目推荐？"以及 "与流行的损失相比，SSM 损失在概念上有哪些优势？"的问题。在这项工作中，我们旨在更好地理解用于项目推荐的 SSM。具体来说，我们首先从理论上揭示了与模型无关的三个优势：(1) 减少流行度偏差，这有利于长尾推荐；(2) 挖掘硬负样本，这为优化模型参数提供了信息梯度；(3) 最大化排名度量，这有利于 Top-K 性能。然而，根据我们的实证研究，我们认识到 SSM 默认选择的余弦相似度函数限制了其学习表示向量大小的能力。因此，将 SSM 与同样无法调整大小的模型（如矩阵因式分解）相结合，可能会导致较差的表示。我们进一步提供了数学证明，即图卷积网络中的消息传递方案可以根据节点度调整表示量，这自然弥补了 SSM 的不足。在四个基准数据集上进行的广泛实验证明了我们的分析，证明了 SSM 在项目推荐方面的优越性。我们的实现可在 TensorFlow 和 PyTorch 中使用。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文求助全文

来源期刊

ACM Transactions on Information Systems 工程技术-计算机：信息系统

CiteScore

9.40

自引率

14.30%

发文量

165

审稿时长

>12 weeks

期刊介绍： The ACM Transactions on Information Systems (TOIS) publishes papers on information retrieval (such as search engines, recommender systems) that contain: new principled information retrieval models or algorithms with sound empirical validation; observational, experimental and/or theoretical studies yielding new insights into information retrieval or information seeking; accounts of applications of existing information retrieval techniques that shed light on the strengths and weaknesses of the techniques; formalization of new information retrieval or information seeking tasks and of methods for evaluating the performance on those tasks; development of content (text, image, speech, video, etc) analysis methods to support information retrieval and information seeking; development of computational models of user information preferences and interaction behaviors; creation and analysis of evaluation methodologies for information retrieval and information seeking; or surveys of existing work that propose a significant synthesis. The information retrieval scope of ACM Transactions on Information Systems (TOIS) appeals to industry practitioners for its wealth of creative ideas, and to academic researchers for its descriptions of their colleagues'' work.