Deep Learning for Extreme Multi-label Text Classification

Jingzhou Liu, Wei-Cheng Chang, Yuexin Wu, Yiming Yang
Published in: Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2017-08-07
DOI: 10.1145/3077136.3080834 (https://doi.org/10.1145/3077136.3080834)
Citations: 524

Abstract

Extreme multi-label text classification (XMTC) refers to the problem of assigning to each document its most relevant subset of class labels from an extremely large label collection, where the number of labels could reach hundreds of thousands or millions. The huge label space raises research challenges such as data sparsity and scalability. Significant progress has been made in recent years by the development of new machine learning methods, such as tree induction with large-margin partitions of the instance spaces and label-vector embedding in the target space. However, deep learning has not been explored for XMTC, despite its big successes in other related areas. This paper presents the first attempt at applying deep learning to XMTC, with a family of new Convolutional Neural Network (CNN) models which are tailored for multi-label classification in particular. With a comparative evaluation of 7 state-of-the-art methods on 6 benchmark datasets where the number of labels is up to 670,000, we show that the proposed CNN approach successfully scaled to the largest datasets, and consistently produced the best or the second best results on all the datasets. On the Wikipedia dataset with over 2 million documents and 500,000 labels in particular, it outperformed the second best method by 11.7%~15.3% in precision@K and by 11.5%~11.7% in NDCG@K for K = 1,3,5.
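The abstract reports results in precision@K and NDCG@K without defining them. The following is a minimal sketch of how these two ranking metrics are commonly computed for a single document with binary label relevance. The function names, the base-2 logarithm in the discount, and the toy scores are assumptions made for illustration; they are not taken from the paper's evaluation code.

```python
import math


def top_k_labels(scores, k):
    """Return the indices of the k highest-scored labels, best first."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]


def precision_at_k(scores, relevant, k):
    """Fraction of the k highest-scored labels that are truly relevant."""
    hits = sum(1 for label in top_k_labels(scores, k) if label in relevant)
    return hits / k


def ndcg_at_k(scores, relevant, k):
    """DCG of the top-k ranking divided by the DCG of an ideal ranking."""
    # A relevant label at 0-based position `rank` contributes 1 / log2(rank + 2).
    dcg = sum(
        1.0 / math.log2(rank + 2)
        for rank, label in enumerate(top_k_labels(scores, k))
        if label in relevant
    )
    # The ideal ranking places all relevant labels (at most k of them) first.
    ideal_hits = min(k, len(relevant))
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


# Toy example (invented): five labels, of which labels 0 and 4 are relevant.
scores = [0.9, 0.1, 0.8, 0.3, 0.7]
relevant = {0, 4}
for k in (1, 3, 5):
    print(k, precision_at_k(scores, relevant, k), ndcg_at_k(scores, relevant, k))
```

In a full evaluation these per-document values would be averaged over the test set. Both metrics look only at the top of the ranking, which suits label spaces where each document has only a handful of relevant labels among hundreds of thousands.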