Utilizing Contrastive Learning To Address Long Tail Issue in Product Categorization
L. Chen, Tianqi Wang
Proceedings of the 31st ACM International Conference on Information & Knowledge Management, 2022-10-17
https://doi.org/10.1145/3511808.3557522
Neural network models trained with supervised learning have become dominant. Although they can achieve high performance when training data is ample, performance on labels with sparse training instances can be poor. This performance degradation caused by imbalanced data is known as the long-tail (LT) issue, and it affects many NN models used in practice. In this talk, we will first review machine learning approaches that address the long-tail issue. Next, we will report on our effort to apply one recent LT-addressing method to the item categorization (IC) task, which aims to classify product description texts into leaf nodes of a category taxonomy tree. In particular, we adopted a new method that decouples the entire classification task into (a) learning representations using the K-positive contrastive loss (KCL) and (b) training a classifier on a balanced data set. Using SimCSE as our self-learning backbone, we demonstrated that the proposed method works on the IC text classification task. In addition, we spotted a shortcoming in KCL: false negative (FN) instances may harm the representation learning step. After eliminating FN instances, IC performance (measured by macro-F1) improved further.
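The two stages described in the abstract can be illustrated with a small NumPy sketch. It is not the authors' implementation: the function names `kcl_loss` and `balanced_indices`, the temperature `tau`, the per-batch sampling of positives, and the `drop_false_negatives` switch are all assumptions made for illustration. In particular, the abstract does not say how FN instances were eliminated; here it is assumed that same-class items not sampled as positives are simply removed from the contrastive denominator.

```python
import numpy as np

def kcl_loss(emb, labels, k=2, tau=0.1, drop_false_negatives=False, rng=None):
    """Simplified K-positive contrastive loss over one batch (illustrative only)."""
    rng = np.random.default_rng(0) if rng is None else rng
    z = emb / np.linalg.norm(emb, axis=1, keepdims=True)  # unit-normalise rows
    sim = z @ z.T / tau                                   # scaled cosine similarity
    n = len(labels)
    losses = []
    for i in range(n):
        # all other batch items that share the anchor's label
        same = np.flatnonzero((labels == labels[i]) & (np.arange(n) != i))
        if same.size == 0:
            continue  # no positive available for this anchor
        # use at most k positives, regardless of how large the class is
        pos = rng.choice(same, size=min(k, same.size), replace=False)
        keep = np.arange(n) != i  # never contrast an anchor with itself
        if drop_false_negatives:
            # same-class items that were not sampled as positives would
            # otherwise act as negatives; remove them from the denominator
            keep[np.setdiff1d(same, pos)] = False
        log_denom = np.log(np.exp(sim[i, keep]).sum())
        losses.append(np.mean(log_denom - sim[i, pos]))
    return float(np.mean(losses))

def balanced_indices(labels, per_class, rng):
    """Draw the same number of examples from every class (with replacement)."""
    return np.concatenate([
        rng.choice(np.flatnonzero(labels == c), size=per_class, replace=True)
        for c in np.unique(labels)
    ])

# Toy long-tailed batch: class 0 is the head, classes 1 and 2 are the tail.
emb = np.random.default_rng(1).normal(size=(8, 4))
labels = np.array([0, 0, 0, 0, 0, 1, 1, 2])

with_fn = kcl_loss(emb, labels, k=1, rng=np.random.default_rng(0))
without_fn = kcl_loss(emb, labels, k=1, drop_false_negatives=True,
                      rng=np.random.default_rng(0))
idx = balanced_indices(labels, per_class=3, rng=np.random.default_rng(0))
```

With the same sampled positives, removing the unused same-class items can only shrink the denominator, so `without_fn` is lower than `with_fn` on this toy batch. The `balanced_indices` helper stands in for stage (b): it yields an index set with equal counts per class on which a classifier could then be trained over the learned representations.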