Using natural language processing and the gene ontology to populate a structured pathway database

David Dehoney, R. Harte, Yan Lu, Daniel Chin
Computational Systems Bioinformatics. CSB2003. Proceedings of the 2003 IEEE Bioinformatics Conference. Published 2003-08-11. DOI: 10.1109/CSB.2003.1227433 (https://doi.org/10.1109/CSB.2003.1227433)
Citations: 6

Abstract

Reading the literature is one of the most time-consuming tasks a busy scientist has to contend with. As the volume of literature continues to grow, there is a need to sort through this information more efficiently. Mapping the pathways of genes and proteins of interest is one goal that requires frequent reference to the literature. Pathway databases can help here, and scientists currently have a choice between buying access to externally curated pathway databases and building their own in house. However, such databases are either expensive to license or slow to populate manually. Building upon easily available, open-source tools, we have developed a pipeline to automate the collection, structuring and storage of gene and protein interaction data from the literature. As a team of both biologists and computer scientists, we integrated our natural language processing (NLP) software with the Gene Ontology (GO) to collect unstructured text data and translate it into structured interaction data. For NLP we used a machine learning approach with a rule induction program, RAPIER (http://www.cs.utexas.edu/users/ml/rapier.html). RAPIER was modified to learn rules from tagged documents and was then trained on a corpus tagged by expert curators. The resulting rules were used to extract information from a test corpus automatically. Extracted genes and proteins were mapped onto LocusLink, and extracted interactions were mapped onto GO. Once structured in this way, the information was stored in a pathway database, and this formal structure allowed us to perform advanced data mining and visualization.
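The structuring and storage steps described in the abstract can be pictured with a minimal Python sketch. It is not the authors' implementation: it does not call RAPIER, and it assumes the extraction step has already produced (agent, verb, target) slots. The names `Extraction`, `GENE_TO_LOCUSLINK`, `VERB_TO_GO`, `structure`, `store` and `run_demo`, the table layout, and the tiny hand-written lookup tables are all hypothetical and chosen only for illustration.

```python
import sqlite3
from typing import NamedTuple, Optional, Tuple


class Extraction(NamedTuple):
    """One interaction as free text, as an extraction step might emit it."""
    agent: str
    verb: str
    target: str


# Toy lookup tables (assumptions for illustration, not the authors' data).
# Gene names map to LocusLink-style numeric identifiers.
GENE_TO_LOCUSLINK = {"p53": 7157, "tp53": 7157, "mdm2": 4193}
# Interaction verbs map to Gene Ontology term identifiers.
VERB_TO_GO = {
    "binds": "GO:0005515",           # protein binding
    "phosphorylates": "GO:0016310",  # phosphorylation
}


def structure(ex: Extraction) -> Optional[Tuple[int, str, int]]:
    """Map free-text slots onto identifiers; return None if any slot is unknown."""
    agent_id = GENE_TO_LOCUSLINK.get(ex.agent.lower())
    target_id = GENE_TO_LOCUSLINK.get(ex.target.lower())
    go_id = VERB_TO_GO.get(ex.verb.lower())
    if agent_id is None or target_id is None or go_id is None:
        return None
    return (agent_id, go_id, target_id)


def store(conn: sqlite3.Connection, rows) -> None:
    """Persist structured interactions in a single hypothetical table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS interaction "
        "(agent_id INTEGER, go_id TEXT, target_id INTEGER)"
    )
    conn.executemany("INSERT INTO interaction VALUES (?, ?, ?)", rows)
    conn.commit()


def run_demo():
    """Structure two extractions, store the mappable one, and read it back."""
    extractions = [
        Extraction("MDM2", "binds", "p53"),
        Extraction("FooX", "binds", "p53"),  # unknown gene name, so it is skipped
    ]
    rows = [row for row in map(structure, extractions) if row is not None]
    conn = sqlite3.connect(":memory:")
    store(conn, rows)
    return conn.execute(
        "SELECT agent_id, go_id, target_id FROM interaction"
    ).fetchall()
```

Storing identifiers rather than raw strings is what makes later querying straightforward: synonyms such as "p53" and "TP53" collapse onto one key, and interactions can be grouped by GO term. A real system would need curated synonym lists and a policy for unmapped names, which this sketch simply drops.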