Refining the extraction of relevant documents from biomedical literature to create a corpus for pathway text mining

R. Harte, Yan Lu, Stephen Osborn, David Dehoney, Daniel Chin
Computational Systems Bioinformatics. CSB2003. Proceedings of the 2003 IEEE Bioinformatics Conference. Published 2003-08-11. DOI: 10.1109/CSB.2003.1227432 (https://doi.org/10.1109/CSB.2003.1227432)
Citations: 4

Abstract

For biologists to keep up with developments in their own and related fields, automation is desirable so that a rapidly growing literature can be read and interpreted more efficiently. Identifying proteins or genes and their interactions can facilitate the mapping of canonical or evolving pathways from the literature. To mine such data, we developed procedures and tools to pre-qualify documents for further analysis. Initially, a corpus of documents for proteins of interest was built using alternate symbols from LocusLink and the Stanford SOURCE as MEDLINE search terms. The query was then refined by combining the optimum keywords with MeSH terms in a Boolean query to minimize false positives. The document space was examined using a strategy employing latent semantic indexing (LSI), which uses Entrez's "related papers" utility for MEDLINE. Relationships between documents were visualized as an undirected graph and scored by their relatedness. Distinct document clusters, formed by the most highly connected related papers, are mostly composed of abstracts relating to one aspect of research. This feature was used to filter out irrelevant abstracts, which reduced corpus size by 10% to 30% depending on the domain. The excluded documents were examined to confirm their lack of relevance. The resulting corpora consisted of the most relevant documents, thus reducing the number of false positives and irrelevant examples in the training set for pathway mapping. Documents were tagged, using a modified version of GATE2, with terms based on GO for rule induction using RAPIER.
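
The graph-based filtering step described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not give their scoring function or cluster definition, so the sketch assumes that a "cluster" is a connected component of the undirected related-papers graph and that a component is kept only if it reaches a minimum size. The function names, the `min_cluster_size` parameter, and the document identifiers are all hypothetical, and the related-paper lists are hard-coded rather than fetched from Entrez.

```python
def build_graph(related):
    """Build an undirected graph from related-paper lists.

    `related` maps each document id in the corpus to the ids reported as
    related to it. Links to documents outside the corpus are ignored.
    """
    graph = {doc: set() for doc in related}
    for doc, neighbours in related.items():
        for other in neighbours:
            if other in related and other != doc:
                graph[doc].add(other)
                graph[other].add(doc)  # undirected: add the reverse edge
    return graph


def connected_components(graph):
    """Return the connected components of the graph as a list of sets."""
    seen = set()
    components = []
    for start in graph:
        if start in seen:
            continue
        stack = [start]
        component = set()
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            component.add(node)
            stack.extend(graph[node] - seen)
        components.append(component)
    return components


def filter_corpus(related, min_cluster_size=3):
    """Keep only documents in components of at least `min_cluster_size`.

    The size threshold is an assumed stand-in for the relatedness scoring
    mentioned in the abstract.
    """
    kept = set()
    for component in connected_components(build_graph(related)):
        if len(component) >= min_cluster_size:
            kept |= component
    return kept


# Hypothetical corpus: A, B, C are mutually related; D and E form a small
# pair; F has no related documents inside the corpus.
related = {
    "A": ["B", "C"],
    "B": ["A", "C"],
    "C": ["A"],
    "D": ["E"],
    "E": [],
    "F": [],
}
kept = filter_corpus(related, min_cluster_size=3)
print(sorted(kept))
```

With this toy input the three-document cluster is retained and the pair and the isolated document are dropped. As in the abstract, the excluded documents would still need to be inspected by hand to confirm that they are irrelevant.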