Deep Learning for Extreme Multi-label Text Classification

Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval. Pub Date: 2017-08-07. DOI: 10.1145/3077136.3080834

Jingzhou Liu, Wei-Cheng Chang, Yuexin Wu, Yiming Yang


Citations: 524

Abstract

Extreme multi-label text classification (XMTC) refers to the problem of assigning to each document its most relevant subset of class labels from an extremely large label collection, where the number of labels could reach hundreds of thousands or millions. The huge label space raises research challenges such as data sparsity and scalability. Significant progress has been made in recent years by the development of new machine learning methods, such as tree induction with large-margin partitions of the instance spaces and label-vector embedding in the target space. However, deep learning has not been explored for XMTC, despite its big successes in other related areas. This paper presents the first attempt at applying deep learning to XMTC, with a family of new Convolutional Neural Network (CNN) models which are tailored for multi-label classification in particular. With a comparative evaluation of 7 state-of-the-art methods on 6 benchmark datasets where the number of labels is up to 670,000, we show that the proposed CNN approach successfully scaled to the largest datasets, and consistently produced the best or the second best results on all the datasets. On the Wikipedia dataset with over 2 million documents and 500,000 labels in particular, it outperformed the second best method by 11.7%~15.3% in precision@K and by 11.5%~11.7% in NDCG@K for K = 1,3,5.
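The abstract reports results in precision@K and NDCG@K for K = 1, 3, 5. As a rough illustration of what these two ranking metrics measure, here is a minimal sketch for a single document, assuming binary relevance (a label is either in the true label set or not) and the commonly used log2 rank discount. The excerpt above does not spell out the paper's exact formulation, so the function names, the example scores, and the discount choice are assumptions made for illustration only.

```python
import math


def top_k_labels(scores, k):
    """Return the indices of the k highest-scored labels, best first."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]


def precision_at_k(scores, relevant, k):
    """Fraction of the k highest-scored labels that are truly relevant."""
    hits = sum(1 for i in top_k_labels(scores, k) if i in relevant)
    return hits / k


def ndcg_at_k(scores, relevant, k):
    """DCG of the predicted top-k divided by the best achievable DCG.

    Assumes binary gains and a 1 / log2(rank + 1) discount, rank starting at 1.
    """
    dcg = sum(
        1.0 / math.log2(rank + 2)  # rank is 0-based here, hence +2
        for rank, i in enumerate(top_k_labels(scores, k))
        if i in relevant
    )
    ideal_hits = min(k, len(relevant))
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


# Hypothetical model scores for 5 labels; labels 0, 3 and 4 are the true ones.
scores = [0.9, 0.1, 0.8, 0.3, 0.7]
relevant = {0, 3, 4}

for k in (1, 3, 5):
    p = precision_at_k(scores, relevant, k)
    n = ndcg_at_k(scores, relevant, k)
    print(f"K={k}: precision@K={p:.3f}, NDCG@K={n:.3f}")
```

In an evaluation over a dataset, these per-document values would typically be averaged across all test documents. Precision@K only counts how many of the top K labels are correct, while NDCG@K also rewards placing the correct labels nearer the top of the ranking.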



