On Horizontal and Vertical Separation in Hierarchical Text Classification

Proceedings of the 2016 ACM International Conference on the Theory of Information Retrieval. Pub Date: 2016-09-02. DOI: 10.1145/2970398.2970408

Mostafa Dehghani, H. Azarbonyad, J. Kamps, Maarten Marx


Citations: 11

Abstract

Hierarchy is an effective and common way of organizing data and representing their relationships at different levels of abstraction. However, hierarchical data dependencies make it difficult to estimate "separable" models that can distinguish between the entities in the hierarchy. Extracting separable models of hierarchical entities requires us to take their relative position into account and to consider the different types of dependencies in the hierarchy. In this paper, we investigate the effect of separability in text-based entity classification and argue that in hierarchical classification, a separation property should be established between entities not only in the same layer, but also in different layers. Our main findings are as follows. First, we analyse the importance of separability of the data representation in the task of classification and, based on that, we introduce the "Strong Separation Principle" for optimizing the expected effectiveness of classifiers' decisions based on the separation property. Second, we present Significant Words Language Models (SWLM), which capture all, and only, the essential features of hierarchical entities according to their relative position in the hierarchy, resulting in horizontally and vertically separable models. Third, we validate our claims on real-world data and demonstrate how SWLM improves the accuracy of classification and how it provides models that are transferable over time. Although the discussion in this paper focuses on the classification problem, the models are applicable to any information access task on data that has, or can be mapped to, a hierarchical structure.

