Improving access to scientific literature: a semantic IR perspective

Proceedings of the 5th Spanish Conference on Information Retrieval. Pub Date: 2018-06-26. DOI: 10.1145/3230599.3230601

D. Buscaldi


Citations: 0

Abstract

The flow of data and publications in almost every field of research is growing continuously. Some estimates place the annual growth rate in the number of scientific publications between 2.2% and 14%, depending on the type and domain of the publication [6]. This data deluge creates a bottleneck for scientific progress and a challenge for existing search engines. The problems to be solved are old ones: the ambiguity of concepts, especially across research fields (for instance, "lattice" in computer science vs. physics), and the synonymy (or quasi-synonymy) of concepts that are expressed in different ways (for instance, "opinion mining" and "sentiment analysis"). These issues may affect various tasks: a researcher compiling the state of the art on a specific topic, an editor finding reviewers for a given paper, or a government official studying a project proposal, among others.

The need to go beyond mere document retrieval in the context of scientific literature is corroborated by the proliferation of related projects and works and by the organization of new shared tasks, in particular the ScienceIE task at SemEval-2017, focused on the identification of keyphrases representing topics, methods, data and tools [1], and Task 7 at SemEval-2018, on semantic relation extraction and classification in scientific papers [3]. Some recent works address the problem with the help of structured lists of known keywords, such as Rexplore [7], which integrates statistical analysis with semantic technologies, or by analyzing the citation network among papers, as in CiteSpace [2]. In most cases, the relevance, or impact, of a paper is assessed by the number of citations it receives. However, Oren Etzioni observed that "Academics may cite papers for non-essential reasons - out of courtesy, for completeness or to promote their own publications. These superfluous citations can impede literature searches and exaggerate a paper's importance"; it is therefore necessary to use Artificial Intelligence to discover the meaning and importance of a specific citation.

Recently, at LIPN, we started working on access to scientific information from a semantic information retrieval perspective, leveraging ontologies and similar semantic resources for this task. The first step was to build a typology of semantic relations [4] that are often used in the state-of-the-art sections of scientific papers. Some of these relations link methods to the problems they solve; others link a resource to a system that used it. This typology can evolve or be integrated into more complex ontologies. The next step was to verify whether these relations can be detected automatically. We focused on unsupervised methods that exploit information from keywords and patterns around the entities connected by the relations, and tested whether semantic embeddings could improve the results [5]. We produced a set of annotated documents that were used for Task 7 at SemEval-2018, where various participants showed the effectiveness of deep neural network (DNN) methods for detecting and classifying the relations [3]. The results show that these methods can usually predict the type of a relation with high accuracy (85-90%) when they are given information about the linked entities, but much work remains on the detection of the relations themselves (about 50% for the best system).
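
As a rough illustration of the keyword-and-pattern idea mentioned in the abstract, the following Python sketch labels a pair of entities by looking for cue phrases in the text between them. It is not the method of [4] or [5]: the relation labels, the cue phrases, and the function name `detect_relation` are all hypothetical and chosen only to mirror the two relation kinds named above (a method and the problem it solves, a system and a resource it used).

```python
import re
from typing import Optional

# Hypothetical cue patterns (assumptions, not taken from the typology in [4]).
# Each relation label maps to a regex tested on the text BETWEEN two entities.
CUE_PATTERNS = {
    "METHOD_SOLVES_PROBLEM": re.compile(
        r"\b(?:for|to solve|to address|applied to)\b", re.IGNORECASE
    ),
    "SYSTEM_USES_RESOURCE": re.compile(
        r"\b(?:uses|using|relies on|trained on|based on)\b", re.IGNORECASE
    ),
}


def detect_relation(sentence: str, entity_a: str, entity_b: str) -> Optional[str]:
    """Return the first relation label whose cue occurs between the entities.

    Returns None if either entity is missing, if entity_b does not follow
    entity_a, or if no cue pattern matches the text between them.
    """
    start = sentence.find(entity_a)
    if start == -1:
        return None
    gap_start = start + len(entity_a)
    end = sentence.find(entity_b, gap_start)
    if end == -1:
        return None
    between = sentence[gap_start:end]
    # Patterns are tried in insertion order; the first match wins.
    for label, pattern in CUE_PATTERNS.items():
        if pattern.search(between):
            return label
    return None


if __name__ == "__main__":
    print(detect_relation(
        "We apply conditional random fields to solve keyphrase extraction.",
        "conditional random fields",
        "keyphrase extraction",
    ))
```

A hand-written cue list like this is brittle, since short cues such as "for" match in many unrelated contexts. That brittleness is in line with the abstract's remark that detecting relations remains much harder than classifying them once the linked entities are known.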



