Methods and techniques to automatic entity linking in Russian

Trudy Instituta sistemnogo programmirovaniia RAN. Pub Date: 2022-01-01. DOI: 10.15514/ispras-2022-34(4)-13

A. Mezentseva, E. Bruches, Tatiana Batura



Abstract

There is growing interest in solving NLP tasks with the help of external knowledge sources, for example in information retrieval, question answering, and dialogue systems. It is therefore important to link entities mentioned in the processed text to entries in a knowledge base. This article is devoted to entity linking with Wikidata as the external knowledge base. We treat scientific terms in Russian as entities. A traditional entity linking system has three stages: entity recognition, candidate generation from the knowledge base, and candidate ranking. Our system takes as input raw text in which the terms are already identified. To generate candidates, we use string matching between the terms in the input text and entities from Wikidata. Candidate ranking is the most complicated stage because it requires semantic information. We conducted several candidate ranking experiments with different models, including an approach based on cosine similarity, classical machine learning algorithms, and neural networks. We also extended the RUSERRC dataset with manually annotated data for model training. The results show that the approach based on cosine similarity outperforms the others and does not require manually annotated data. The dataset and system are open-sourced and available to other researchers.
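The two stages the abstract describes for pre-identified terms, candidate generation by string match and candidate ranking by cosine similarity, can be illustrated with a minimal sketch. This is not the authors' system. The abstract does not say which text representation is used, so the bag-of-words vectors below are an assumption chosen to keep the example self-contained. The toy knowledge base, its placeholder IDs (not real Wikidata identifiers), and all function names are hypothetical.

```python
import math
from collections import Counter

# Hypothetical toy knowledge base: (id, label, description).
# The IDs are placeholders, NOT real Wikidata identifiers.
TOY_KB = [
    ("X1", "ядро", "центральная часть атома состоящая из протонов и нейтронов"),
    ("X2", "ядро", "центральная часть операционной системы управляющая ресурсами компьютера"),
    ("X3", "процесс", "выполняющаяся программа в операционной системе"),
]


def generate_candidates(term, kb):
    """Candidate generation: case-insensitive exact string match on labels."""
    needle = term.strip().lower()
    return [entry for entry in kb if entry[1].lower() == needle]


def bag_of_words(text):
    """Assumed representation: whitespace-token counts (a stand-in for embeddings)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_candidates(context, candidates):
    """Candidate ranking: score each description against the term's context."""
    context_vec = bag_of_words(context)
    scored = [
        (cosine(context_vec, bag_of_words(description)), kb_id)
        for kb_id, _label, description in candidates
    ]
    return sorted(scored, reverse=True)


context = "ядро операционной системы распределяет ресурсы компьютера между процессами"
candidates = generate_candidates("ядро", TOY_KB)
ranking = rank_candidates(context, candidates)
best_id = ranking[0][1]
print(best_id)
```

In this toy setup the ambiguous term "ядро" (atomic nucleus vs. operating-system kernel) matches two entries, and the context decides between them. No labelled training data is involved, which is consistent with the advantage the abstract attributes to the cosine-similarity approach.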

