Entity summarization: State of the art and future challenges

IF 2.1 | Zone 3: Computer Science | Q3: COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Journal of Web Semantics | Pub Date: 2021-05-01 | DOI: 10.1016/j.websem.2021.100647

Qingxia Liu, Gong Cheng, Kalpa Gunaratna, Yuzhong Qu


Citations: 28

Abstract

The increasing availability of semantic data has substantially enhanced Web applications. Semantic data such as RDF data is commonly represented as entity-property-value triples. The magnitude of semantic data, in particular the large number of triples describing an entity, could overload users with excessive amounts of information. This has motivated fruitful research on the automated generation of summaries for entity descriptions to satisfy users’ information needs efficiently and effectively. We focus on this prominent topic of entity summarization, and our research objective is to present the first comprehensive survey of entity summarization research. Rather than reviewing each method separately, we make the following contributions: (1) identifying and classifying technical features of existing methods to form a high-level overview, (2) identifying and classifying frameworks for combining multiple technical features adopted by existing methods, (3) collecting known benchmarks for intrinsic evaluation and efforts for extrinsic evaluation, and (4) suggesting research directions for future work. By investigating the literature, we synthesized two hierarchies of techniques. The first hierarchy categorizes generic technical features into several perspectives: frequency and centrality, informativeness, and diversity and coverage. In the second hierarchy we present domain-specific and task-specific technical features, including the use of domain knowledge, context awareness, and personalization. Our review demonstrated that existing methods are mainly unsupervised and that they combine multiple technical features using various frameworks: random surfer models, similarity-based grouping, MMR-like re-ranking, or combinatorial optimization. We also found a few deep-learning-based methods in recent research. Current evaluation results and our case study showed that the problem of entity summarization is still far from being solved. Based on the limitations of existing methods revealed in the review, we identified several future directions: the use of semantics, human factors, machine and deep learning, non-extractive methods, and interactive methods.
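To make one of the frameworks named in the abstract concrete, the following is a minimal sketch of MMR-like re-ranking applied to entity-property-value triples. It is an illustration under stated assumptions, not a method taken from the survey: the entity `ex:Alice`, the relevance scores, the toy `similarity` function, and the trade-off parameter `lam` are all hypothetical. The idea is to repeatedly pick the triple that best balances its own relevance against its redundancy with triples already selected.

```python
from typing import Dict, List, Tuple

# A triple is (entity, property, value); all data below is made up.
Triple = Tuple[str, str, str]


def similarity(a: Triple, b: Triple) -> float:
    """Toy redundancy measure (an assumption, not from the survey):
    1.0 if the triples share a property, 0.5 if they only share a
    value, otherwise 0.0."""
    if a[1] == b[1]:
        return 1.0
    if a[2] == b[2]:
        return 0.5
    return 0.0


def mmr_summarize(
    triples: List[Triple],
    relevance: Dict[Triple, float],
    k: int,
    lam: float = 0.7,
) -> List[Triple]:
    """Greedily select up to k triples, trading relevance against
    redundancy with the triples selected so far (MMR-style)."""
    selected: List[Triple] = []
    candidates = list(triples)
    while candidates and len(selected) < k:

        def mmr_score(t: Triple) -> float:
            # Redundancy is the highest similarity to any selected triple.
            redundancy = max((similarity(t, s) for s in selected), default=0.0)
            return lam * relevance[t] - (1.0 - lam) * redundancy

        best = max(candidates, key=mmr_score)
        selected.append(best)
        candidates.remove(best)
    return selected


# Hypothetical entity description with hand-assigned relevance scores.
triples: List[Triple] = [
    ("ex:Alice", "rdf:type", "ex:Person"),
    ("ex:Alice", "rdf:type", "ex:Agent"),
    ("ex:Alice", "ex:worksFor", "ex:Acme"),
    ("ex:Alice", "ex:birthYear", "1990"),
]
relevance: Dict[Triple, float] = {
    triples[0]: 0.9,
    triples[1]: 0.8,
    triples[2]: 0.7,
    triples[3]: 0.6,
}

summary = mmr_summarize(triples, relevance, k=2)
print(summary)
```

With these toy scores, the second `rdf:type` triple is skipped despite its higher relevance, because it is redundant with the first selection; setting `lam=1.0` disables the diversity penalty and reduces the procedure to plain relevance ranking. In practice the relevance scores would come from features such as those the abstract lists (frequency and centrality, informativeness), which this sketch does not model.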


Source journal

Journal of Web Semantics (Engineering and Technology - Computer Science: Artificial Intelligence)

CiteScore

6.20

Self-citation rate

12.00%

Articles published

Review time

14.6 weeks

Journal description: The Journal of Web Semantics is an interdisciplinary journal based on research and applications of various subject areas that contribute to the development of a knowledge-intensive and intelligent service Web. These areas include knowledge technologies, ontology, agents, databases, and the semantic grid; disciplines such as information retrieval, language technology, human-computer interaction, and knowledge discovery are of major relevance as well. All aspects of Semantic Web development are covered. The publication of large-scale experiments and their analysis is also encouraged, to clearly illustrate scenarios and methods that introduce semantics into existing Web interfaces, contents, and services. The journal emphasizes the publication of papers that combine theories, methods, and experiments from different subject areas in order to deliver innovative semantic methods and applications.