Novel Unsupervised Named Entity Recognition Used in Text Annotation Tool (OntoMate) At Rat Genome Database

O. Ghiasvand, M. Shimoyama
Proceedings of the 8th ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics. Published 2017-08-20. DOI: https://doi.org/10.1145/3107411.3108198
Citations: 1

Abstract

In model organism databases, an important task is converting free text from the biomedical literature into a structured data format. Curators at the Rat Genome Database (RGD), the primary source of rat genomic, genetic, and physiological data, spend considerable time and effort curating functional information for genes, QTLs, and strains from the literature. To increase curation efficiency and prioritize literature for data extraction, OntoMate was developed at RGD. The tool tags PubMed abstracts with genes, gene names, gene mutations, organism names, and terms from 16 ontologies/vocabularies, including synonyms and aliases, used to represent functional information. In this project, we used an unsupervised tagging method to reduce the human effort needed to create training data. In this approach, a machine learning tool based on decision tree classification techniques was developed. Mentions that belong uniquely to one semantic type serve as positive samples, and those whose semantic types lie outside the desired group are assumed to be negative samples. An interface allows the user to create a complex query incorporating terms from any of the ontologies, gene symbols, organisms, dates, and other parameters. The results return abstracts along with all tagged parameters indicated in the query, as well as children of the chosen ontology terms. The user can further filter results through a panel that lists organisms, genes, and diseases with the number of papers returned for each. Abstracts and papers are ranked by relevance to the query. The tool is fully integrated into the curation software, so citations and abstracts can be entered automatically into the RGD database and assigned IDs, and the genes and ontology terms in the tags can be checked to create annotations linked to the paper. The system was built with a scalable and open architecture, and the literature is updated daily. The tool uses Solr indexing technology and categorizes papers based on a relevance score. It indexes and tags more than 27 million abstracts. With the use of bioNLP tools, RGD has added more automation to its curation workflow.
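
The sample-selection rule described in the abstract can be illustrated with a minimal Python sketch. This is not OntoMate's implementation: the toy lexicon, the function name `label_mentions`, and the choice to leave ambiguous mentions unlabeled are all assumptions made for illustration. The abstract states only that mentions belonging uniquely to a semantic type act as positive samples and that mentions with other semantic types are assumed to be negative.

```python
# Hypothetical toy lexicon: lowercase surface form -> semantic types it is listed under.
# In a real system this would come from the ontologies and vocabularies themselves.
LEXICON = {
    "hypertension": {"disease"},
    "tp53": {"gene"},
    "cat": {"gene", "organism"},  # ambiguous: a gene symbol and an animal name
    "rat": {"organism"},
}


def label_mentions(mentions, target_type, lexicon=LEXICON):
    """Derive training labels for one semantic type without manual annotation.

    positive:  the mention maps to exactly one semantic type, the target
    negative:  the mention is known, and none of its types is the target
    unlabeled: everything else (ambiguous or unknown); skipping these is
               an assumption of this sketch, not something the source states
    """
    positives, negatives, unlabeled = [], [], []
    for mention in mentions:
        types = lexicon.get(mention.lower(), set())
        if types == {target_type}:
            positives.append(mention)
        elif types and target_type not in types:
            negatives.append(mention)
        else:
            unlabeled.append(mention)
    return positives, negatives, unlabeled


positives, negatives, unlabeled = label_mentions(
    ["TP53", "cat", "Rat", "hypertension", "foo"], target_type="gene"
)
```

The positive and negative lists produced this way could then supply training labels for a decision tree classifier over contextual features of each mention. The abstract does not specify the features or the training procedure, so they are omitted here.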