Cost-Effective Conceptual Design for Information Extraction

Impact factor 2.2; Zone 2 (Computer Science); JCR Q3 (COMPUTER SCIENCE, INFORMATION SYSTEMS)
Arash Termehchy, A. Vakilian, Yodsawalai Chodpathumwan, M. Winslett
Journal: ACM Transactions on Database Systems, volume 44 1, pages 12:1-12:39. Publication date: 2015-06-30. Publication type: Journal Article. Open access: no. DOI: 10.1145/2716321 (https://doi.org/10.1145/2716321). Platform: Semanticscholar.
Citation count: 0

Abstract

It is well established that extracting and annotating occurrences of entities in a collection of unstructured text documents with their concepts improves the effectiveness of answering queries over the collection. However, creating and maintaining large annotated collections is very resource intensive. Since the available resources of an enterprise are limited and/or its users may have urgent information needs, it may have to select only a subset of relevant concepts for extraction and annotation. We call this subset a conceptual design for the annotated collection. In this article, we introduce and formally define the problem of cost-effective conceptual design: given a collection, a set of relevant concepts, and a fixed budget, one would like to find a conceptual design that most improves the effectiveness of answering queries over the collection. We provide efficient algorithms for special cases of the problem and prove that it is generally NP-hard in the number of relevant concepts. We propose three efficient approximations to solve the problem: a greedy algorithm, approximate popularity maximization (APM for short), and approximate annotation-benefit maximization (AAM for short). We show that, if there are no constraints regarding the overlap of concepts, APM is a fully polynomial-time approximation scheme. We also prove that, if the relevant concepts are mutually exclusive, the greedy algorithm delivers a constant approximation ratio provided the concepts are equally costly, APM has a constant approximation ratio, and AAM is a fully polynomial-time approximation scheme. Our empirical results using a Wikipedia collection and a search engine query log validate the proposed formalization of the problem and show that APM and AAM efficiently compute conceptual designs. They also indicate that, in general, APM delivers the optimal conceptual designs if the relevant concepts are not mutually exclusive. Also, if the relevant concepts are mutually exclusive, the conceptual designs delivered by AAM improve the effectiveness of answering queries over the collection more than the solutions provided by APM.
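To make the budgeted-selection setting concrete, here is a minimal sketch under strong simplifying assumptions. It treats each concept as having a fixed cost and an independent, additive benefit, which turns the problem into a plain 0/1 knapsack. The abstract does not give the article's benefit function or the details of its greedy algorithm, APM, or AAM, so none of this should be read as the authors' method. The names `Concept`, `greedy_design`, and `best_design`, and all numbers, are hypothetical.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Concept:
    name: str
    cost: float     # assumed cost of extracting and annotating the concept (> 0)
    benefit: float  # assumed additive benefit, e.g. share of queries using it


def greedy_design(concepts, budget):
    """Generic knapsack-style greedy: take concepts in order of
    benefit per unit cost while they still fit in the budget."""
    chosen = []
    remaining = budget
    ranked = sorted(concepts, key=lambda c: c.benefit / c.cost, reverse=True)
    for c in ranked:
        if c.cost <= remaining:
            chosen.append(c)
            remaining -= c.cost
    return chosen


def best_design(concepts, budget):
    """Exhaustive search over all subsets; exponential in the number of
    concepts, so only usable on tiny instances as a reference point."""
    best, best_benefit = [], 0.0
    for r in range(1, len(concepts) + 1):
        for subset in combinations(concepts, r):
            cost = sum(c.cost for c in subset)
            benefit = sum(c.benefit for c in subset)
            if cost <= budget and benefit > best_benefit:
                best, best_benefit = list(subset), benefit
    return best


if __name__ == "__main__":
    candidates = [
        Concept("person", cost=4, benefit=40),
        Concept("location", cost=3, benefit=27),
        Concept("organization", cost=3, benefit=27),
    ]
    print([c.name for c in greedy_design(candidates, budget=6)])
    print([c.name for c in best_design(candidates, budget=6)])
```

On this toy instance the greedy choice spends 4 of the 6 budget units on "person" (benefit 40) and can afford nothing else, while the exhaustive search finds "location" plus "organization" (benefit 54). The gap is one illustration of why approximation guarantees matter for this kind of selection problem; the exhaustive search itself does not scale, which is consistent with the NP-hardness stated in the abstract.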
Source journal
ACM Transactions on Database Systems (Engineering and Technology - Computer Science: Software Engineering)
CiteScore: 5.60
Self-citation rate: 0.00%
Articles published: 15
Review time: >12 weeks
Journal description: Heavily used in both academic and corporate R&D settings, ACM Transactions on Database Systems (TODS) is a key publication for computer scientists working in data abstraction, data modeling, and designing data management systems. Topics include storage and retrieval, transaction management, distributed and federated databases, semantics of data, intelligent databases, and operations and algorithms relating to these areas. In this rapidly changing field, TODS provides insights into the thoughts of the best minds in database R&D.