Training set augmentation in training neural-network language model for ontology population
Authors: P. Lomov, M. Malozemova
Journal: Transaction Kola Science Centre
Publication date: 2021-12-27
Publication type: Journal Article
DOI: 10.37614/2307-5252.2021.5.12.002
URL: https://doi.org/10.37614/2307-5252.2021.5.12.002
Citations: 1
Abstract
This paper continues earlier research on ontology population, in which a neural-network language model is trained on an automatically generated training set and then used to analyze texts and discover new concepts to add to the ontology. The paper focuses on text data augmentation, that is, increasing the size of the training set by modifying its samples. It also addresses the problem of clarifying the concepts found during the automatic formation of the training set, i.e. adjusting their boundaries in sentences. A brief overview is given of existing approaches to text data augmentation and to the extraction of so-called nested named entities (nested NER). A procedure is proposed for clarifying the boundaries of the concepts discovered in the training set and for augmenting that set, so that a neural-network language model can then be trained to identify new ontology concepts in domain texts. The results of an experimental evaluation of the trained model and the main directions for further research are discussed.
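To make "increasing the size of the training set by modifying its samples" concrete, the sketch below shows one commonly used augmentation for token-labeled data: mention replacement, in which a labeled concept in one sentence is swapped for a concept taken from another sentence. The abstract does not say which augmentation the authors use, so this is an illustration under assumptions and not their procedure. The BIO tags `B-CONCEPT` / `I-CONCEPT` and the helpers `extract_spans`, `replace_mention`, and `augment` are hypothetical names introduced here.

```python
import random
from typing import List, Tuple

# A sample is a sentence given as (token, tag) pairs.
# The tag names are an assumption, not taken from the paper.
Sample = List[Tuple[str, str]]


def extract_spans(sample: Sample) -> List[Tuple[int, int]]:
    """Return [start, end) index pairs of concept mentions in a sample."""
    spans = []
    start = None
    for i, (_, tag) in enumerate(sample):
        if tag == "B-CONCEPT":
            if start is not None:      # previous mention ends here
                spans.append((start, i))
            start = i
        elif tag == "I-CONCEPT" and start is not None:
            continue                   # still inside the current mention
        else:
            if start is not None:      # mention closed by an outside token
                spans.append((start, i))
                start = None
    if start is not None:              # mention runs to the end of the sentence
        spans.append((start, len(sample)))
    return spans


def replace_mention(sample: Sample, span: Tuple[int, int],
                    new_tokens: List[str]) -> Sample:
    """Return a copy of the sample with one mention swapped and re-tagged."""
    start, end = span
    tagged = [(tok, "B-CONCEPT" if j == 0 else "I-CONCEPT")
              for j, tok in enumerate(new_tokens)]
    return sample[:start] + tagged + sample[end:]


def augment(samples: List[Sample], rng: random.Random) -> List[Sample]:
    """Create one new sample per mention by substituting a different mention."""
    # Pool of all concept mentions seen in the training set.
    pool = [[tok for tok, _ in s[a:b]]
            for s in samples for a, b in extract_spans(s)]
    augmented = []
    for s in samples:
        for a, b in extract_spans(s):
            original = [tok for tok, _ in s[a:b]]
            candidates = [m for m in pool if m != original]
            if candidates:
                augmented.append(replace_mention(s, (a, b), rng.choice(candidates)))
    return augmented


s1 = [("The", "O"), ("neural", "B-CONCEPT"), ("network", "I-CONCEPT"), ("learns", "O")]
s2 = [("An", "O"), ("ontology", "B-CONCEPT"), ("grows", "O")]
new_samples = augment([s1, s2], random.Random(0))
```

Because the replacement is re-tagged token by token, the labels stay aligned with the text even when the substituted concept has a different length. A realistic pipeline would likely also filter substitutions that produce implausible sentences, which this sketch does not attempt.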