Text mining in genomics and systems biology

Data and Text Mining in Bioinformatics. Pub Date: 2008-10-30. DOI: 10.1145/1458449.1458453

A. Valencia


Citations: 6

Abstract

There is a growing need to complement the information available for the analysis of biological systems in Systems Biology and Genomics projects. This need makes it attractive to integrate information extracted directly from textual sources using Information Extraction and Text Mining approaches. My group has been working on developing Text Mining approaches and on integrating them into large-scale projects together with other experimental and bioinformatics methods. On this occasion I will present the developments related to the characterization of the human mitotic spindle apparatus, carried out in the context of the ENFIN NoE. For these and other applications, it is crucial to have an accurate estimate of the capabilities of current Text Mining systems. The BioCreative II challenge, organized by CNIO, MITRE and NCBI in collaboration with the MINT and INTACT databases (http://biocreative.sourceforge.net, Genome Biology, August 2008 Special Issue), provides such an overview. BioCreative II comprised two tasks: 1) gene name identification and normalization, where many systems were able to achieve a consistent 80% balanced precision/recall; and 2) protein interaction detection, which was divided into four sub-tasks: a) ranking of publications by their relevance to the experimental determination of protein interactions, b) detection of protein interaction partners in text, c) detection of key sentences describing protein interactions, and d) detection of the experimental technique used to determine the interactions. The results were quite good in the categories of publication ranking, detection of experimental methods, and highlighting of relevant sentences, while they pointed to persistent problems in the correct normalization of gene/protein names. Furthermore, BioCreative has channeled the collaboration of several teams into the creation of the first Text Mining meta-server (The BioCreative Meta-server, Leitner et al., Genome Biology 2008 BioCreative special issue).
We are now working on the preparation of BioCreative III, with a particular focus on fostering the creation of Text Mining systems that can be integrated into Genome analysis pipelines and contribute effectively to the understanding of complex Biological Systems.
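
To make the "80% balanced precision/recall" figure concrete, the sketch below shows one common way such a score is computed for a normalization task: the set of gene identifiers a system returns is compared with a gold-standard set, and precision and recall are combined into their harmonic mean (the F1 score). Reading "balanced precision/recall" as F1 is an assumption made here, not something the abstract states, and the function name and the identifiers are hypothetical, chosen purely for illustration.

```python
def precision_recall_f1(predicted, gold):
    """Return (precision, recall, F1) for two collections of identifiers."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    # Guard against empty sets so the ratios are always defined.
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical gold-standard and system-predicted gene identifiers
# for a single document (not real BioCreative data).
gold = {"GENE:1", "GENE:2", "GENE:3", "GENE:4", "GENE:5"}
predicted = {"GENE:1", "GENE:2", "GENE:3", "GENE:4", "GENE:9"}

p, r, f = precision_recall_f1(predicted, gold)
print(f"precision={p:.2f} recall={r:.2f} F1={f:.2f}")
```

In this toy case four of the five predicted identifiers are correct and four of the five gold identifiers are found, so precision, recall, and F1 all come out at 0.80.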
