{"title":"霍克工业级多标签文档分类器","authors":"Arshad Javeed","doi":"10.1016/j.nlp.2024.100115","DOIUrl":null,"url":null,"abstract":"<div><div>There are a plethora of methods for solving the classical multi-label document classification problem. However, when it comes to deployment and usage in an industry setting, most if not all the contemporary approaches fail to address some of the vital aspects or requirements of an ideal solution: i) ability to operate on variable-length texts or rambling documents, ii) catastrophic forgetting problem, and iii) ability to visualize the model’s predictions. The paper describes the significance of these problems in detail and adopts the hydranet architecture to address these problems. The proposed architecture views documents as a sequence of sentences and leverages sentence-level embeddings for input representation, turning the problem into a sequence classification task. Furthermore, two specific architectures are explored as the architectures for the heads, Bi-LSTM and transformer heads. The proposed architecture is benchmarked on some of the popular benchmarking datasets such as Web of Science - 5763, Web of Science - 11967, BBC Sports, and BBC News datasets. The experimental results reveal that the proposed model performs at least as best as previous SOTA architectures and even outperforms prior SOTA in a few cases, along with the added advantages of the practicality issues discussed. The ablation study includes comparisons of the impact of the attention mechanism and the application of weighted loss functions to train the task-specific heads in the hydranet. The claims regarding catastrophic forgetfulness are further corroborated by empirical evaluations under incremental learning scenarios. The results reveal the robustness of the proposed architecture compared to other benchmarks.</div></div>","PeriodicalId":100944,"journal":{"name":"Natural Language Processing Journal","volume":"9 ","pages":"Article 100115"},"PeriodicalIF":0.0000,"publicationDate":"2024-10-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Hawk: An industrial-strength multi-label document classifier\",\"authors\":\"Arshad Javeed\",\"doi\":\"10.1016/j.nlp.2024.100115\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><div>There are a plethora of methods for solving the classical multi-label document classification problem. However, when it comes to deployment and usage in an industry setting, most if not all the contemporary approaches fail to address some of the vital aspects or requirements of an ideal solution: i) ability to operate on variable-length texts or rambling documents, ii) catastrophic forgetting problem, and iii) ability to visualize the model’s predictions. The paper describes the significance of these problems in detail and adopts the hydranet architecture to address these problems. The proposed architecture views documents as a sequence of sentences and leverages sentence-level embeddings for input representation, turning the problem into a sequence classification task. Furthermore, two specific architectures are explored as the architectures for the heads, Bi-LSTM and transformer heads. The proposed architecture is benchmarked on some of the popular benchmarking datasets such as Web of Science - 5763, Web of Science - 11967, BBC Sports, and BBC News datasets. 
The experimental results reveal that the proposed model performs at least as best as previous SOTA architectures and even outperforms prior SOTA in a few cases, along with the added advantages of the practicality issues discussed. The ablation study includes comparisons of the impact of the attention mechanism and the application of weighted loss functions to train the task-specific heads in the hydranet. The claims regarding catastrophic forgetfulness are further corroborated by empirical evaluations under incremental learning scenarios. The results reveal the robustness of the proposed architecture compared to other benchmarks.</div></div>\",\"PeriodicalId\":100944,\"journal\":{\"name\":\"Natural Language Processing Journal\",\"volume\":\"9 \",\"pages\":\"Article 100115\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2024-10-23\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Natural Language Processing Journal\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S2949719124000633\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Natural Language Processing Journal","FirstCategoryId":"1085","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S2949719124000633","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
There is a plethora of methods for solving the classical multi-label document classification problem. However, when it comes to deployment and usage in an industrial setting, most, if not all, contemporary approaches fail to address some vital requirements of an ideal solution: i) the ability to operate on variable-length or rambling documents, ii) the catastrophic forgetting problem, and iii) the ability to visualize the model's predictions. The paper describes the significance of these problems in detail and adopts the hydranet architecture to address them. The proposed architecture views a document as a sequence of sentences and leverages sentence-level embeddings for the input representation, turning the problem into a sequence classification task. Two specific architectures are explored for the task-specific heads: Bi-LSTM heads and transformer heads. The proposed architecture is benchmarked on popular datasets such as Web of Science - 5763, Web of Science - 11967, BBC Sports, and BBC News. The experimental results reveal that the proposed model performs at least as well as previous state-of-the-art (SOTA) architectures and even outperforms prior SOTA in a few cases, while also addressing the practicality issues discussed. The ablation study compares the impact of the attention mechanism and of weighted loss functions used to train the task-specific heads in the hydranet. The claims regarding catastrophic forgetting are further corroborated by empirical evaluations under incremental learning scenarios, where the results reveal the robustness of the proposed architecture relative to other benchmarks.
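To make the architectural idea concrete, below is a minimal PyTorch sketch of the hydranet pattern the abstract describes: documents arrive as sequences of sentence embeddings, a shared input feeds independent task-specific heads (a Bi-LSTM head is shown), and a new label set is added as a new head without disturbing previously trained ones. All class names, dimensions, and the training snippet are illustrative assumptions, not the paper's actual implementation.

```python
# Illustrative sketch of a hydranet-style multi-label classifier.
# Names and dimensions are assumptions for demonstration only.
import torch
import torch.nn as nn

class BiLSTMHead(nn.Module):
    """One task-specific head: a Bi-LSTM over the document's sentence
    embeddings, followed by a multi-label (sigmoid) classifier."""
    def __init__(self, emb_dim: int, hidden_dim: int, num_labels: int):
        super().__init__()
        self.lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True,
                            bidirectional=True)
        self.classifier = nn.Linear(2 * hidden_dim, num_labels)

    def forward(self, sent_embs: torch.Tensor) -> torch.Tensor:
        # sent_embs: (batch, num_sentences, emb_dim)
        _, (h_n, _) = self.lstm(sent_embs)
        # Concatenate the final forward and backward hidden states.
        h = torch.cat([h_n[-2], h_n[-1]], dim=-1)  # (batch, 2 * hidden_dim)
        return self.classifier(h)                  # one logit per label

class Hydranet(nn.Module):
    """One head per task; training a new task touches only its own head,
    which is the intuition behind robustness to catastrophic forgetting."""
    def __init__(self, emb_dim: int = 384, hidden_dim: int = 128):
        super().__init__()
        self.emb_dim, self.hidden_dim = emb_dim, hidden_dim
        self.heads = nn.ModuleDict()

    def add_head(self, task: str, num_labels: int) -> None:
        self.heads[task] = BiLSTMHead(self.emb_dim, self.hidden_dim,
                                      num_labels)

    def forward(self, sent_embs: torch.Tensor, task: str) -> torch.Tensor:
        return self.heads[task](sent_embs)

# Usage: sentence embeddings would come from a pre-trained sentence
# encoder; random tensors stand in for them here.
model = Hydranet()
model.add_head("bbc_news", num_labels=5)
docs = torch.randn(8, 20, 384)            # 8 docs x 20 sentences x 384 dims
logits = model(docs, task="bbc_news")     # (8, 5) multi-label logits
loss = nn.BCEWithLogitsLoss()(logits, torch.zeros_like(logits))
```

Because each task trains only its own head on a shared sentence-level representation, gradients from a new task never reach the parameters of earlier heads, which is consistent with the incremental-learning robustness the abstract reports.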