{"title":"Selection of Effective Sentences from a Corpus to Improve the Accuracy of Identification of Protein Names","authors":"Kazunori Miyanishi, Tomonobu Ozaki, T. Ohkawa","doi":"10.2197/IPSJTBIO.2.93","DOIUrl":null,"url":null,"abstract":"As the number of documents about protein structural analysis increases, a method of automatically identifying protein names in them is required. However, the accuracy of identification is not high if the training data set is not large enough. We consider a method to extend a training data set based on machine learning using an available corpus. Such a corpus usually consists of documents about a certain kind of organism species, and documents about different kinds of organism species tend to have different vocabularies. Therefore, depending on the target document or corpus, it is not effective for the accurate identification to simply use a corpus as a training data set. In order to improve the accuracy, we propose a method to select sentences that have a positive effect on identification and to extend the training data set with the selected sentences. In the proposed method, a portion of a set of tagged sentences is used as a validation set. The process to select sentences is iterated using the result of the identification of protein names in a validation set as feedback. In the experiment, compared with the baseline, a method without a corpus, with a whole corpus, or with a part of a corpus chosen at random, the accuracy of the proposed method was higher than any baseline method. Thus, it was confirmed that the proposed method selected effective sentences.","PeriodicalId":38959,"journal":{"name":"IPSJ Transactions on Bioinformatics","volume":"2 1","pages":"93-100"},"PeriodicalIF":0.0000,"publicationDate":"2009-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.2197/IPSJTBIO.2.93","citationCount":"1","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IPSJ Transactions on Bioinformatics","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.2197/IPSJTBIO.2.93","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"Biochemistry, Genetics and Molecular Biology","Score":null,"Total":0}
引用次数: 1
Abstract
As the number of documents about protein structural analysis increases, a method of automatically identifying protein names in them is required. However, the accuracy of identification is not high if the training data set is not large enough. We consider a method to extend a training data set based on machine learning using an available corpus. Such a corpus usually consists of documents about a certain kind of organism species, and documents about different kinds of organism species tend to have different vocabularies. Therefore, depending on the target document or corpus, it is not effective for the accurate identification to simply use a corpus as a training data set. In order to improve the accuracy, we propose a method to select sentences that have a positive effect on identification and to extend the training data set with the selected sentences. In the proposed method, a portion of a set of tagged sentences is used as a validation set. The process to select sentences is iterated using the result of the identification of protein names in a validation set as feedback. In the experiment, compared with the baseline, a method without a corpus, with a whole corpus, or with a part of a corpus chosen at random, the accuracy of the proposed method was higher than any baseline method. Thus, it was confirmed that the proposed method selected effective sentences.