On the Hardness of Category Tree Construction

Database theory-- ICDT : International Conference ... proceedings. International Conference on Database Theory
Pub Date: 2022-01-01
DOI: 10.4230/LIPIcs.ICDT.2022.4

Shay Gershtein, Uri Avron, Ido Guy, T. Milo, Slava Novgorodov

Pages: 4:1-4:17

Citations: 1

Abstract

Category trees, or taxonomies, are rooted trees where each node, called a category, corresponds to a set of related items. The construction of taxonomies has been studied in various domains, including e-commerce, document management, and question answering. Multiple algorithms for automating construction have been proposed, employing a variety of clustering approaches and crowdsourcing. However, no formal model to capture such categorization problems has been devised, and their complexity has not been studied. To address this, we propose in this work a combinatorial model that captures many practical settings and show that the aforementioned empirical approach has been warranted, as we prove strong inapproximability bounds for various problem variants and special cases when the goal is to produce a categorization of the maximum utility. In our model, the input is a set of $n$ weighted item sets that the tree would ideally contain as categories. Each category, rather than perfectly match the corresponding input set, is allowed to exceed a given threshold for a given similarity function. The goal is to produce a tree that maximizes the total weight of the sets for which it contains a matching category. A key parameter is an upper bound on the number of categories an item may belong to, which produces the hardness of the problem, as initially each item may be contained in an arbitrary number of input sets. For this model, we prove inapproximability bounds, of order $\tilde{\Theta}(\sqrt{n})$ or $\tilde{\Theta}(n)$, for various problem variants and special cases, loosely justifying the aforementioned heuristic approach. Our work includes reductions based on parameterized randomized constructions that highlight how various problem parameters and properties of the input may affect the hardness. Moreover, for the special case where the category must be identical to the corresponding input set, we devise an algorithm whose approximation guarantee depends solely on a more granular parameter, allowing improved worst-case guarantees. Finally, we also generalize our results to DAG-based and non-hierarchical categorization.
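To make the objective described in the abstract concrete, the following is a minimal Python sketch of how a candidate categorization could be scored. It is not the authors' algorithm or their exact formal model. Several choices are assumptions made only for illustration: the flat (non-hierarchical) setting is used instead of a tree, Jaccard similarity stands in for the unspecified similarity function, a match requires similarity greater than or equal to the threshold, one category may match several input sets, and the function names `jaccard`, `respects_item_bound`, and `utility` are hypothetical.

```python
from collections import Counter


def jaccard(a, b):
    """Jaccard similarity |a & b| / |a | b|; two empty sets count as identical.

    Assumption: the abstract does not name the similarity function.
    """
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def respects_item_bound(categories, k):
    """Check that every item belongs to at most k categories (flat setting)."""
    counts = Counter(item for cat in categories for item in cat)
    return all(c <= k for c in counts.values())


def utility(weighted_sets, categories, threshold):
    """Total weight of the input sets that have a matching category.

    Assumption: a set is matched when some category reaches the threshold,
    and a single category may match more than one input set.
    """
    total = 0.0
    for items, weight in weighted_sets:
        if any(jaccard(items, cat) >= threshold for cat in categories):
            total += weight
    return total


# Illustrative input: three weighted item sets and a two-category candidate.
weighted_sets = [
    (frozenset({"a", "b", "c"}), 3.0),
    (frozenset({"c", "d"}), 2.0),
    (frozenset({"a", "d"}), 1.0),
]
categories = [frozenset({"a", "b"}), frozenset({"c", "d"})]

print(respects_item_bound(categories, 1))       # each item is in one category
print(utility(weighted_sets, categories, 0.6))  # relaxed matching
print(utility(weighted_sets, categories, 1.0))  # exact-match special case
```

With a bound of one category per item, the item sets {a, b, c}, {c, d}, and {a, d} cannot all be kept as categories, since they overlap. This illustrates why the bound, rather than the scoring itself, is the source of difficulty: the sketch only evaluates a given candidate, while the hardness results concern finding the best one.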
