Utilizing Contrastive Learning To Address Long Tail Issue in Product Categorization

L. Chen, Tianqi Wang
DOI: 10.1145/3511808.3557522
Proceedings of the 31st ACM International Conference on Information & Knowledge Management
Published: 2022-10-17
Citations: 0

Abstract

Neural network models trained with supervised learning have become dominant. Although high performance can be achieved when training data is ample, performance on labels with few training instances can be poor. This performance gap caused by imbalanced data is known as the long-tail issue, and it affects many neural network models deployed in practice. In this talk, we first review machine learning approaches that address the long-tail issue. Next, we report on our effort to apply a recent long-tail-addressing method to the item categorization (IC) task, which aims to classify product description texts into leaf nodes of a category taxonomy tree. In particular, we adapted a method that decouples the classification task into (a) learning representations with the K-positive contrastive loss (KCL) and (b) training a classifier on a balanced data set. Using SimCSE as our self-supervised backbone, we demonstrated that the proposed method works on the IC text classification task. In addition, we spotted a shortcoming in the KCL: false-negative (FN) instances may harm the representation-learning step. After eliminating FN instances, IC performance (measured by macro-F1) improved further.
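The two-stage recipe above (KCL representation learning, then a classifier trained on balanced data) hinges on how positives and negatives are drawn for each anchor. The sketch below is a minimal NumPy illustration of a K-positive contrastive loss with optional false-negative filtering; it reflects our reading of the abstract, not the authors' implementation — the function name, the batch-level setup, and the deterministic choice of the first k positives (a stand-in for random sampling) are assumptions.

```python
import numpy as np

def kcl_loss(embeddings, labels, k=2, tau=0.1, drop_false_negatives=True):
    """K-positive contrastive loss over a batch of embeddings.

    For each anchor, up to k same-label instances serve as positives.
    With drop_false_negatives=True, remaining same-label instances are
    removed from the denominator instead of being treated as negatives,
    which is the FN-elimination idea described in the abstract.
    """
    # L2-normalize so that dot products are cosine similarities.
    z = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = z @ z.T / tau  # temperature-scaled similarity matrix
    n = len(labels)
    losses = []
    for i in range(n):
        same = [j for j in range(n) if j != i and labels[j] == labels[i]]
        diff = [j for j in range(n) if j != i and labels[j] != labels[i]]
        if not same:
            continue  # no positive available for this anchor
        positives = same[:k]   # stand-in for randomly sampling k positives
        extras = same[k:]      # same-label instances that were not sampled
        if drop_false_negatives:
            denom_idx = positives + diff           # FN instances excluded
        else:
            denom_idx = positives + extras + diff  # FN instances kept
        log_denom = np.log(np.exp(sims[i, denom_idx]).sum())
        # InfoNCE-style term averaged over the k positives.
        losses.append(np.mean([log_denom - sims[i, p] for p in positives]))
    return float(np.mean(losses))
```

Because filtering shrinks the denominator while the numerator terms stay the same, the loss with FN elimination is never larger than without it on the same batch — consistent with the intuition that same-label "negatives" only add a spurious repulsion term.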