Improving access to scientific literature: a semantic IR perspective
D. Buscaldi
Proceedings of the 5th Spanish Conference on Information Retrieval, 2018-06-26
DOI: 10.1145/3230599.3230601 (https://doi.org/10.1145/3230599.3230601)
Citations: 0
Abstract
The flow of data and publications in almost every field of research is growing continuously. Some estimates place the annual growth rate in the number of scientific publications between 2.2% and 14%, depending on the type and domain of the publication [6]. This data deluge presents a bottleneck for scientific progress and a challenge for existing search engines. The problems to be solved are long-standing ones: the ambiguity of a concept, especially across research fields (for instance, "lattice" in computer science vs. physics), and the synonymy (or quasi-synonymy) of concepts that are expressed in different ways, for instance "opinion mining" and "sentiment analysis". These issues may affect various tasks: a researcher compiling the state of the art on a specific topic, an editor finding reviewers for a given paper, or a government official studying a project proposal, among others. The need to go beyond mere document retrieval in the context of scientific literature is corroborated by the proliferation of related projects and works and by the organization of new shared tasks, in particular the ScienceIE task at SemEval-2017, focused on the identification of keyphrases representing topics, methods, data and tools [1], and Task 7 at SemEval-2018, on semantic relation extraction and classification in scientific papers [3]. Some recent works address the problem with the help of structured lists of known keywords, such as Rexplore [7], which integrates statistical analysis with semantic technologies, or by analyzing the citation network among papers, as in CiteSpace [2]. In most cases, the relevance, or impact, of a paper is assessed by the number of citations it receives. However, Oren Etzioni observed that "Academics may cite papers for non-essential reasons - out of courtesy, for completeness or to promote their own publications.
These superfluous citations can impede literature searches and exaggerate a paper's importance" and it is therefore necessary to use Artificial Intelligence to discover the meaning and importance of a specific citation. Recently, at LIPN, we started working on access to scientific information from a semantic information retrieval perspective, leveraging ontologies and similar semantic resources for this task. The first step was to build a typology of the semantic relations [4] that are often used in the state-of-the-art sections of scientific papers. Some of these relations link methods and the problems they solve; others link a resource and a system that used it. This typology can evolve or be integrated into more complex ontologies. The next step was to verify whether these relations can be detected automatically. We focused on unsupervised methods that exploit keywords and patterns around the entities connected by the relations, and tested whether semantic embeddings could improve the results [5]. We produced a set of annotated documents that were used for Task 7 at SemEval-2018, where various participants showed the effectiveness of Deep Neural Network (DNN) methods for detecting and classifying the relations [3]. The results show that these methods can usually predict the type of a relation with high accuracy (85-90%) when they are given the linked entities, but much work remains on detecting the relations themselves (about 50% for the best system).
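
As an illustration only, the idea of exploiting keywords and patterns around a pair of entities can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the method evaluated in [5]: the cue patterns, the two relation labels, and the function name `detect_relation` are all hypothetical, chosen to mirror the two kinds of relation mentioned above (a method and the problem it solves, a system and the resource it uses).

```python
import re

# Hypothetical cue patterns. The relation labels are illustrative names,
# not the categories of the typology described in [4].
CUE_PATTERNS = {
    "METHOD_SOLVES_PROBLEM": re.compile(r"\b(for|to solve|to address|applied to)\b"),
    "SYSTEM_USES_RESOURCE": re.compile(r"\b(uses|using|based on|trained on|relies on)\b"),
}


def detect_relation(sentence, entity_a, entity_b):
    """Return the label whose cue occurs between the two entities, else None."""
    text = sentence.lower()
    start_a = text.find(entity_a.lower())
    start_b = text.find(entity_b.lower())
    if start_a == -1 or start_b == -1:
        return None  # one of the entities is not mentioned in the sentence
    # Keep only the text between the end of the first entity and the start of the second.
    if start_a <= start_b:
        between = text[start_a + len(entity_a):start_b]
    else:
        between = text[start_b + len(entity_b):start_a]
    # The first label whose cue pattern matches wins (dictionary order).
    for label, pattern in CUE_PATTERNS.items():
        if pattern.search(between):
            return label
    return None


if __name__ == "__main__":
    print(detect_relation(
        "Our parser is trained on the Penn Treebank.",
        "parser",
        "Penn Treebank",
    ))
```

A rule set of this kind only classifies a pair of entities it is handed. Deciding which pairs in a sentence are related at all is the harder detection step reported above.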