Cost-Effective Conceptual Design for Information Extraction

IF 2.2; Zone 2, Computer Science; Q3, COMPUTER SCIENCE, INFORMATION SYSTEMS

ACM Transactions on Database Systems. Pub Date: 2015-06-30. DOI: 10.1145/2716321

Arash Termehchy, A. Vakilian, Yodsawalai Chodpathumwan, M. Winslett


Citations: 0

Abstract



Cost-Effective Conceptual Design for Information Extraction

It is well established that extracting and annotating occurrences of entities in a collection of unstructured text documents with their concepts improves the effectiveness of answering queries over the collection. However, it is very resource intensive to create and maintain large annotated collections. Since the available resources of an enterprise are limited and/or its users may have urgent information needs, it may have to select only a subset of relevant concepts for extraction and annotation. We call this subset a conceptual design for the annotated collection. In this article, we introduce and formally define the problem of cost-effective conceptual design: given a collection, a set of relevant concepts, and a fixed budget, one would like to find a conceptual design that most improves the effectiveness of answering queries over the collection. We provide efficient algorithms for special cases of the problem and prove that it is generally NP-hard in the number of relevant concepts. We propose three efficient approximations to solve the problem: a greedy algorithm, approximate popularity maximization (APM for short), and approximate annotation-benefit maximization (AAM for short). We show that, if there are no constraints regarding the overlap of concepts, APM is a fully polynomial-time approximation scheme. We also prove that, if the relevant concepts are mutually exclusive, the greedy algorithm delivers a constant approximation ratio when the concepts are equally costly, APM has a constant approximation ratio, and AAM is a fully polynomial-time approximation scheme. Our empirical results using a Wikipedia collection and a search engine query log validate the proposed formalization of the problem and show that APM and AAM efficiently compute conceptual designs. They also indicate that, in general, APM delivers the optimal conceptual designs if the relevant concepts are not mutually exclusive. Also, if the relevant concepts are mutually exclusive, the conceptual designs delivered by AAM improve the effectiveness of answering queries over the collection more than the solutions provided by APM.
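
To make the budgeted-selection setting concrete, the following is a minimal sketch under strong simplifying assumptions. It is not the paper's greedy, APM, or AAM algorithm, whose definitions the abstract does not give. It assumes each concept has a positive integer annotation cost and a single additive benefit score, which reduces the problem to a 0/1 knapsack; the names `Concept`, `greedy_design`, and `optimal_design`, and all numbers in the example, are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Concept:
    name: str       # hypothetical concept label, e.g. "person"
    cost: int       # assumed annotation cost, a positive whole number of budget units
    benefit: float  # assumed (additive) gain in query effectiveness if annotated


def greedy_design(concepts: List[Concept], budget: int) -> List[Concept]:
    """Pick concepts in descending benefit-per-cost order while they still fit."""
    chosen: List[Concept] = []
    remaining = budget
    for c in sorted(concepts, key=lambda c: c.benefit / c.cost, reverse=True):
        if c.cost <= remaining:
            chosen.append(c)
            remaining -= c.cost
    return chosen


def optimal_design(concepts: List[Concept], budget: int) -> List[Concept]:
    """Exact 0/1 knapsack by dynamic programming over integer budgets."""
    # best[b] = (total benefit, chosen concepts) using at most b budget units.
    best: List[Tuple[float, List[Concept]]] = [(0.0, [])] * (budget + 1)
    for c in concepts:
        # Iterate budgets downward so each concept is used at most once.
        for b in range(budget, c.cost - 1, -1):
            candidate = best[b - c.cost][0] + c.benefit
            if candidate > best[b][0]:
                best[b] = (candidate, best[b - c.cost][1] + [c])
    return best[budget][1]


if __name__ == "__main__":
    # Illustrative numbers only; they are not taken from the paper.
    concepts = [
        Concept("person", cost=6, benefit=10.0),
        Concept("place", cost=5, benefit=8.0),
        Concept("organization", cost=5, benefit=8.0),
    ]
    for label, design in [("greedy", greedy_design(concepts, 10)),
                          ("optimal", optimal_design(concepts, 10))]:
        names = [c.name for c in design]
        print(label, names, sum(c.benefit for c in design))
```

With a budget of 10, the greedy pass takes only "person" (benefit 10), while the dynamic program finds "place" plus "organization" (benefit 16). This gap is the kind that motivates studying approximation guarantees. The exact dynamic program runs in time proportional to the number of concepts times the budget, so it is practical only for small integer budgets.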


Source journal

ACM Transactions on Database Systems (Engineering and Technology - Computer Science: Software Engineering)

CiteScore: 5.60

Self-citation rate: 0.00%

Articles published: (not listed)

Review time: >12 weeks

About the journal: Heavily used in both academic and corporate R&D settings, ACM Transactions on Database Systems (TODS) is a key publication for computer scientists working in data abstraction, data modeling, and designing data management systems. Topics include storage and retrieval, transaction management, distributed and federated databases, semantics of data, intelligent databases, and operations and algorithms relating to these areas. In this rapidly changing field, TODS provides insights into the thoughts of the best minds in database R&D.