Zero-Shot Taxonomy Mapping for Document Classification

IF 0.9 Q4 COMPUTER SCIENCE, INFORMATION SYSTEMS

Applied Computing Review Pub Date : 2023-03-27 DOI:10.1145/3555776.3577653

L. Bongiovanni, Luca Bruno, Fabrizio Dominici, Giuseppe Rizzo


Citations: 0

Abstract

Classifying documents according to a custom internal hierarchical taxonomy is a common problem for many organizations that deal with textual data. The vast majority of approaches to this challenge are supervised methods, which produce good results on specific datasets but have two major drawbacks: they require an entire corpus of annotated documents, and the resulting models are not directly applicable to a different taxonomy. In this paper, we address this issue by proposing a method that classifies text according to a custom hierarchical taxonomy without any labelled data. The idea is first to leverage the semantic information encoded in pre-trained Deep Language Models to assign a prior relevance score to each label of the taxonomy using zero-shot classification, and second to take advantage of the hierarchical structure to reinforce this prior belief. Experiments are conducted on three hierarchically annotated datasets: WebOfScience, DBpedia Extracts and Amazon Product Reviews, which differ widely in the type of language used and have taxonomies two or three levels deep. We first compare different zero-shot methods, and then show that our hierarchy-aware approach substantially improves results on every dataset.
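The two-step idea in the abstract (a zero-shot prior score per label, then reinforcement through the hierarchy) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the taxonomy, the prior scores (made-up numbers standing in for the output of a zero-shot model), the function names, and the blending rule with its weight `alpha`. The abstract does not state the authors' actual scoring formula, so this should not be read as their method.

```python
from typing import Dict, Optional

# Hypothetical two-level taxonomy: label -> parent label (None for top-level labels).
PARENT: Dict[str, Optional[str]] = {
    "Computer Science": None,
    "Medicine": None,
    "Machine Learning": "Computer Science",
    "Cardiology": "Medicine",
}

# Made-up prior relevance scores for one document, standing in for
# the output of a zero-shot classifier (assumption, not real results).
PRIOR: Dict[str, float] = {
    "Computer Science": 0.8,
    "Medicine": 0.25,
    "Machine Learning": 0.5,
    "Cardiology": 0.55,
}


def hierarchy_aware_scores(
    prior: Dict[str, float],
    parent: Dict[str, Optional[str]],
    alpha: float = 0.5,
) -> Dict[str, float]:
    """Blend each label's prior with the mean prior of its ancestors.

    Illustrative rule only: score = (1 - alpha) * prior + alpha * mean(ancestor priors).
    Top-level labels have no ancestors and keep their prior unchanged.
    """
    scores: Dict[str, float] = {}
    for label, own_prior in prior.items():
        ancestor_priors = []
        node = parent.get(label)
        while node is not None:  # walk up to the root
            ancestor_priors.append(prior[node])
            node = parent.get(node)
        if ancestor_priors:
            mean_ancestors = sum(ancestor_priors) / len(ancestor_priors)
            scores[label] = (1 - alpha) * own_prior + alpha * mean_ancestors
        else:
            scores[label] = own_prior
    return scores


def best_leaf(scores: Dict[str, float], parent: Dict[str, Optional[str]]) -> str:
    """Return the highest-scoring label that is not the parent of any other label."""
    parents = {p for p in parent.values() if p is not None}
    leaves = [label for label in scores if label not in parents]
    return max(leaves, key=lambda label: scores[label])


if __name__ == "__main__":
    print("flat choice:", best_leaf(PRIOR, PARENT))
    print("hierarchy-aware choice:", best_leaf(hierarchy_aware_scores(PRIOR, PARENT), PARENT))
```

In this toy setup the flat prior slightly favours a leaf whose parent scores poorly, while the blended score favours the leaf whose parent is also relevant. That is the kind of reinforcement the abstract describes, although the paper's own mechanism may differ.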


Source journal

Applied Computing Review (COMPUTER SCIENCE, INFORMATION SYSTEMS)

Self-citation rate

40.00%

Publication count