Hawk: An industrial-strength multi-label document classifier

Arshad Javeed
{"title":"Hawk: An industrial-strength multi-label document classifier","authors":"Arshad Javeed","doi":"10.1016/j.nlp.2024.100115","DOIUrl":null,"url":null,"abstract":"<div><div>There are a plethora of methods for solving the classical multi-label document classification problem. However, when it comes to deployment and usage in an industry setting, most if not all the contemporary approaches fail to address some of the vital aspects or requirements of an ideal solution: i) ability to operate on variable-length texts or rambling documents, ii) catastrophic forgetting problem, and iii) ability to visualize the model’s predictions. The paper describes the significance of these problems in detail and adopts the hydranet architecture to address these problems. The proposed architecture views documents as a sequence of sentences and leverages sentence-level embeddings for input representation, turning the problem into a sequence classification task. Furthermore, two specific architectures are explored as the architectures for the heads, Bi-LSTM and transformer heads. The proposed architecture is benchmarked on some of the popular benchmarking datasets such as Web of Science - 5763, Web of Science - 11967, BBC Sports, and BBC News datasets. The experimental results reveal that the proposed model performs at least as best as previous SOTA architectures and even outperforms prior SOTA in a few cases, along with the added advantages of the practicality issues discussed. The ablation study includes comparisons of the impact of the attention mechanism and the application of weighted loss functions to train the task-specific heads in the hydranet. The claims regarding catastrophic forgetfulness are further corroborated by empirical evaluations under incremental learning scenarios. The results reveal the robustness of the proposed architecture compared to other benchmarks.</div></div>","PeriodicalId":100944,"journal":{"name":"Natural Language Processing Journal","volume":"9 ","pages":"Article 100115"},"PeriodicalIF":0.0000,"publicationDate":"2024-10-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Natural Language Processing Journal","FirstCategoryId":"1085","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S2949719124000633","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

Abstract

There is a plethora of methods for solving the classical multi-label document classification problem. However, when it comes to deployment and usage in an industry setting, most, if not all, contemporary approaches fail to address some vital requirements of an ideal solution: i) the ability to operate on variable-length texts or rambling documents, ii) the catastrophic forgetting problem, and iii) the ability to visualize the model’s predictions. The paper describes the significance of these problems in detail and adopts the hydranet architecture to address them. The proposed architecture views a document as a sequence of sentences and leverages sentence-level embeddings for the input representation, turning the problem into a sequence classification task. Furthermore, two specific architectures are explored for the heads: Bi-LSTM and transformer heads. The proposed architecture is benchmarked on popular datasets such as Web of Science - 5763, Web of Science - 11967, BBC Sports, and BBC News. The experimental results reveal that the proposed model performs at least as well as previous SOTA architectures and even outperforms prior SOTA in a few cases, while also addressing the practicality issues discussed above. The ablation study compares the impact of the attention mechanism and of weighted loss functions used to train the task-specific heads of the hydranet. The claims regarding catastrophic forgetting are further corroborated by empirical evaluations under incremental learning scenarios, where the results reveal the robustness of the proposed architecture compared to other benchmarks.
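To make the setup concrete, below is a minimal sketch in PyTorch (not the paper's actual code) of a hydranet-style classifier: a document arrives as a sequence of precomputed sentence embeddings, and independent task-specific Bi-LSTM heads map that sequence to multi-label logits. All names and sizes here (`BiLSTMHead`, `HydraNet`, the task names, the label counts, the embedding dimension 384) are illustrative assumptions, as is the mean pooling over sentences; the `pos_weight` argument of `BCEWithLogitsLoss` stands in for the weighted loss explored in the ablation study.

```python
# Hedged sketch of a hydranet-style multi-label classifier.
# Assumes sentence embeddings are precomputed by a frozen sentence
# encoder; only the lightweight heads are trained.
import torch
import torch.nn as nn


class BiLSTMHead(nn.Module):
    """One task-specific head: a Bi-LSTM over the sentence sequence."""

    def __init__(self, emb_dim: int, hidden: int, num_labels: int):
        super().__init__()
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True,
                            bidirectional=True)
        self.classifier = nn.Linear(2 * hidden, num_labels)

    def forward(self, sent_embs: torch.Tensor) -> torch.Tensor:
        # sent_embs: (batch, num_sentences, emb_dim)
        out, _ = self.lstm(sent_embs)
        pooled = out.mean(dim=1)        # average over sentences (assumption)
        return self.classifier(pooled)  # multi-label logits


class HydraNet(nn.Module):
    """Shared frozen body + independent heads, one per task.

    Adding a new task trains only its new head, leaving earlier heads
    untouched -- the mechanism the paper leans on against catastrophic
    forgetting under incremental learning.
    """

    def __init__(self, emb_dim: int, hidden: int, tasks: dict):
        super().__init__()
        self.heads = nn.ModuleDict({
            name: BiLSTMHead(emb_dim, hidden, n) for name, n in tasks.items()
        })

    def forward(self, sent_embs: torch.Tensor, task: str) -> torch.Tensor:
        return self.heads[task](sent_embs)


# Training with a class-weighted multi-label loss (label counts are
# placeholders, not the actual dataset label counts):
model = HydraNet(emb_dim=384, hidden=128, tasks={"wos": 11, "bbc": 5})
pos_weight = torch.ones(11)  # placeholder per-label weights
loss_fn = nn.BCEWithLogitsLoss(pos_weight=pos_weight)
logits = model(torch.randn(2, 30, 384), task="wos")  # 2 docs, 30 sentences
loss = loss_fn(logits, torch.zeros(2, 11))
```

Because every label set lives in its own head while the shared encoder stays fixed, incremental learning reduces to appending and training a fresh head, which is the usual hydranet argument for sidestepping catastrophic forgetting.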