Automated Category Tree Construction: Hardness Bounds and Algorithms

Impact factor 2.2 · Zone 2 (Computer Science) · JCR Q3: COMPUTER SCIENCE, INFORMATION SYSTEMS
Shay Gershtein, Uri Avron, Ido Guy, Tova Milo, Slava Novgorodov
ACM Transactions on Database Systems · Journal Article · Published 2024-05-09
DOI: 10.1145/3664283 (https://doi.org/10.1145/3664283)

Abstract

Category trees, or taxonomies, are rooted trees where each node, called a category, corresponds to a set of related items. The construction of taxonomies has been studied in various domains, including e-commerce, document management, and question answering. Multiple algorithms for automating construction have been proposed, employing a variety of clustering approaches and crowdsourcing. However, no formal model to capture such categorization problems has been devised, and their complexity has not been studied. To address this, we propose in this work a combinatorial model that captures many practical settings and show that the aforementioned empirical approach has been warranted, as we prove strong inapproximability bounds for various problem variants and special cases when the goal is to produce a categorization of the maximum utility.

In our model, the input is a set of n weighted item sets that the tree would ideally contain as categories. A category need not match its corresponding input set perfectly; it is enough for the two to exceed a given threshold under a given similarity function. The goal is to produce a tree that maximizes the total weight of the input sets for which it contains a matching category. A key parameter is an upper bound on the number of categories an item may belong to. This bound is the source of the problem's hardness, since initially each item may be contained in an arbitrary number of input sets.
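The objective described above can be illustrated with a minimal sketch. Several details here are assumptions rather than the paper's definitions: Jaccard similarity stands in for the unspecified similarity function, a match is taken to mean similarity at least the threshold, the candidate tree is represented only by the flat list of its categories' item sets, and the item bound counts every category containing the item. The names `jaccard`, `respects_item_bound`, and `utility` are hypothetical.

```python
from collections import Counter


def jaccard(a, b):
    """Jaccard similarity of two sets (assumed similarity function)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def respects_item_bound(categories, k):
    """Return True if no item appears in more than k categories.

    Assumption: every category containing an item counts toward the bound.
    """
    counts = Counter(item for cat in categories for item in cat)
    return all(c <= k for c in counts.values())


def utility(input_sets, categories, threshold):
    """Total weight of the input sets matched by at least one category.

    input_sets: list of (item_set, weight) pairs.
    categories: item sets of the nodes of a candidate tree (structure ignored).
    A match is assumed to mean similarity >= threshold, so that
    threshold = 1.0 corresponds to the exact-match special case.
    """
    total = 0.0
    for items, weight in input_sets:
        if any(jaccard(items, cat) >= threshold for cat in categories):
            total += weight
    return total


# Toy instance: three weighted input sets and two candidate categories.
input_sets = [
    (frozenset({"a", "b", "c"}), 3.0),
    (frozenset({"c", "d"}), 2.0),
    (frozenset({"a", "d"}), 1.0),
]
categories = [frozenset({"a", "b"}), frozenset({"c", "d"})]

print(respects_item_bound(categories, k=1))
print(utility(input_sets, categories, threshold=0.6))
print(utility(input_sets, categories, threshold=1.0))
```

In this toy instance each item lies in exactly one category, so a bound of 1 is respected. At threshold 0.6 the first two input sets are matched (similarities 2/3 and 1), while at threshold 1.0 only the second is. The sketch only evaluates a given candidate; it does not construct a tree, which is the hard part the abstract addresses.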

For this model, we prove inapproximability bounds of order \(\tilde{\Theta}(\sqrt{n})\) or \(\tilde{\Theta}(n)\) for various problem variants and special cases, loosely justifying the aforementioned heuristic approach. Our work includes reductions based on parameterized randomized constructions that highlight how various problem parameters and properties of the input may affect the hardness. Moreover, for the special case where each category must be identical to the corresponding input set, we devise an algorithm whose approximation guarantee depends solely on a more granular parameter, allowing improved worst-case guarantees as well as the application of practical exact solvers. We further provide efficient algorithms with much improved approximation guarantees for practical special cases where the cardinalities of the input sets, or the number of input sets each item belongs to, are not too large. Finally, we also generalize our results to DAG-based and non-hierarchical categorization.

Source journal: ACM Transactions on Database Systems (Engineering and Technology; Computer Science: Software Engineering)
CiteScore: 5.60
Self-citation rate: 0.00%
Articles published: 15
Review time: >12 weeks
Journal description: Heavily used in both academic and corporate R&D settings, ACM Transactions on Database Systems (TODS) is a key publication for computer scientists working in data abstraction, data modeling, and designing data management systems. Topics include storage and retrieval, transaction management, distributed and federated databases, semantics of data, intelligent databases, and operations and algorithms relating to these areas. In this rapidly changing field, TODS provides insights into the thoughts of the best minds in database R&D.