Data and Text Mining in Bioinformatics: Latest Papers

Knowledge-based gene symbol disambiguation
Data and Text Mining in Bioinformatics Pub Date : 2008-10-30 DOI: 10.1145/1458449.1458466
He Tan
{"title":"Knowledge-based gene symbol disambiguation","authors":"He Tan","doi":"10.1145/1458449.1458466","DOIUrl":"https://doi.org/10.1145/1458449.1458466","url":null,"abstract":"Since there is no standard naming convention for genes and gene products, gene symbol disambiguation (GSD) has become a major challenge when mining biomedical literature. Several GSD methods have been proposed based on MEDLINE references to genes. However, gene databases, e.g. Entrez Gene, now provide plenty of information about genes, and many biomedical ontologies, e.g. UMLS Metathesaurus and Semantic Network, have been developed. These knowledge sources could be used for disambiguation. In this paper we propose a method that relies on information about gene candidates from gene databases, contexts of gene symbols and biomedical ontologies. We implement our method and evaluate the performance of the implementation using BioCreAtIvE II data sets.","PeriodicalId":143937,"journal":{"name":"Data and Text Mining in Bioinformatics","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2008-10-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122035338","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 5
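The abstract above describes disambiguating a gene symbol by comparing its textual context with what is known about each candidate gene. The sketch below is one simple way to illustrate that idea, not the paper's method: it ranks candidates by word overlap between the context sentence and a free-text description of each candidate. The function name `rank_candidates`, the candidate identifiers, and the descriptions are all hypothetical.

```python
def rank_candidates(context, candidates):
    """Rank candidate genes by word overlap with the symbol's context.

    context:    text surrounding the ambiguous gene symbol.
    candidates: dict mapping a candidate id to a description string
                (hypothetical stand-in for a gene database record).
    Word overlap is an assumed scoring rule, used only for illustration.
    """
    context_words = set(context.lower().split())
    scores = {
        gene_id: len(context_words & set(description.lower().split()))
        for gene_id, description in candidates.items()
    }
    # Highest overlap first; ties broken alphabetically for determinism.
    return sorted(scores, key=lambda gene_id: (-scores[gene_id], gene_id))


context = "the kinase phosphorylates its substrate during mitosis"
candidates = {
    "candidate_1": "serine threonine kinase active in mitosis",
    "candidate_2": "membrane transporter for glucose uptake",
}
ranking = rank_candidates(context, candidates)
```

A knowledge-based system would presumably replace raw word overlap with ontology-aware matching, but the ranking step would keep the same shape.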
Text mining in genomics and systems biology
Data and Text Mining in Bioinformatics Pub Date : 2008-10-30 DOI: 10.1145/1458449.1458453
A. Valencia
{"title":"Text mining in genomics and systems biology","authors":"A. Valencia","doi":"10.1145/1458449.1458453","DOIUrl":"https://doi.org/10.1145/1458449.1458453","url":null,"abstract":"There is an increasing need to complement the information available for the analysis of biological systems in Systems Biology and Genomics projects, a need that motivates the integration of information extracted directly from textual sources using Information Extraction and Text Mining approaches. My group has been developing Text Mining approaches and integrating them into large-scale projects together with other experimental and bioinformatics methods. On this occasion I will present the developments related to the characterization of the human mitotic spindle apparatus, carried out in the context of the ENFIN NoE. For these and other applications it is crucial to have an accurate estimate of the capacity of current Text Mining systems. The BioCreative II challenge, organized by CNIO, MITRE and NCBI in collaboration with the MINT and INTACT databases (http://biocreative.sourceforge.net, Genome Biology, August 2008 Special Issue), provides such an overview. BioCreative II comprised two tasks: 1) gene name identification and normalization, where many systems were able to achieve a consistent 80% balanced precision/recall; and 2) protein interaction detection, which was divided into four sub-tasks: a) ranking of publications by their relevance to the experimental determination of protein interactions, b) detection of protein interaction partners in text, c) detection of key sentences describing protein interactions, and d) detection of the experimental technique used to determine the interactions. The results were quite good in the categories of publication ranking, detection of experimental methods, and highlighting of relevant sentences, while they pointed to persistent problems in the correct normalization of gene/protein names. Furthermore, BioCreative has channeled the collaboration of several teams into the creation of the first Text Mining meta-server (The BioCreative Meta-server, Leitner et al., Genome Biology 2008 BioCreative special issue). We are now working on the preparation of BioCreative III, with particular focus on fostering the creation of Text Mining systems that can be integrated into genome analysis pipelines and contribute effectively to the understanding of complex biological systems.","PeriodicalId":143937,"journal":{"name":"Data and Text Mining in Bioinformatics","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2008-10-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128948700","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 6
Predicting protein-protein relationships from literature using collapsed variational latent dirichlet allocation
Data and Text Mining in Bioinformatics Pub Date : 2008-10-30 DOI: 10.1145/1458449.1458467
Tatsuya Asou, K. Eguchi
{"title":"Predicting protein-protein relationships from literature using collapsed variational latent dirichlet allocation","authors":"Tatsuya Asou, K. Eguchi","doi":"10.1145/1458449.1458467","DOIUrl":"https://doi.org/10.1145/1458449.1458467","url":null,"abstract":"This paper investigates applying statistical topic models to extract and predict relationships between biological entities, especially protein mentions. A statistical topic model, Latent Dirichlet Allocation (LDA), is promising; however, it has not been investigated for such a task. In this paper, we apply the state-of-the-art Collapsed Variational Bayesian inference and Gibbs sampling inference to estimate the LDA model, and compare them from the viewpoints of log-likelihood, classification accuracy and retrieval effectiveness. We demonstrate through experiments that the Collapsed Variational LDA gives better results than Gibbs sampling, especially in terms of classification accuracy and retrieval effectiveness in the task of protein-protein relationship prediction.","PeriodicalId":143937,"journal":{"name":"Data and Text Mining in Bioinformatics","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2008-10-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115193288","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 7
Biological pathways as features for microarray data classification
Data and Text Mining in Bioinformatics Pub Date : 2008-10-30 DOI: 10.1145/1458449.1458455
Brian Quanz, Meeyoung Park, Jun Huan
{"title":"Biological pathways as features for microarray data classification","authors":"Brian Quanz, Meeyoung Park, Jun Huan","doi":"10.1145/1458449.1458455","DOIUrl":"https://doi.org/10.1145/1458449.1458455","url":null,"abstract":"Classification using microarray gene expression data is an important task in bioinformatics. Due to the high dimensionality and small sample size that characterizes microarray data, there has recently been a drive to incorporate any available information in addition to the expression data in the classification process. As a result, much work has begun on selecting biological pathways that are closely related to a clinical outcome of interest using the gene expression data, and incorporating this pathway information opens up new avenues for classification. As opposed to previous approaches that consider individual genes as features, we propose a new approach that treats biological pathways as features. Each pathway found to be significantly related to an outcome of interest is treated as a feature, and is mapped to a feature value. We define several methods for mapping pathways to features, and compare the performance of several classifiers using our feature transformations to that of the classifiers using individual genes as features for different feature selection methods.","PeriodicalId":143937,"journal":{"name":"Data and Text Mining in Bioinformatics","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2008-10-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125315768","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 10
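The abstract above treats each selected pathway as a single feature that is mapped to a feature value. It does not say which mappings the authors define, so the sketch below assumes one plausible choice for illustration: the mean expression of the pathway's member genes in each sample. The function name `pathway_features` and the toy gene and pathway names are hypothetical.

```python
from statistics import mean


def pathway_features(expression, pathways):
    """Map gene-level expression to one value per pathway for each sample.

    expression: list of dicts, one per sample, mapping gene -> expression level.
    pathways:   dict mapping pathway name -> list of member genes.
    The mean over member genes is an assumed mapping, not the paper's.
    """
    names = sorted(pathways)
    rows = []
    for sample in expression:
        row = []
        for name in names:
            # Use only member genes that were actually measured in this sample.
            values = [sample[g] for g in pathways[name] if g in sample]
            row.append(mean(values) if values else 0.0)
        rows.append(row)
    return names, rows


samples = [
    {"G1": 2.0, "G2": 4.0, "G3": -1.0},
    {"G1": 0.0, "G2": 1.0, "G3": 3.0},
]
pathways = {"pathway_A": ["G1", "G2"], "pathway_B": ["G3", "G4"]}
names, features = pathway_features(samples, pathways)
```

The resulting rows can be passed to any classifier in place of the original gene-level matrix, which is what reduces the dimensionality from thousands of genes to a handful of pathways.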
A data integration method for exploring gene regulatory mechanisms
Data and Text Mining in Bioinformatics Pub Date : 2008-10-30 DOI: 10.1145/1458449.1458468
Jane Synnergren, B. Olsson, Jonas Gamalielsson
{"title":"A data integration method for exploring gene regulatory mechanisms","authors":"Jane Synnergren, B. Olsson, Jonas Gamalielsson","doi":"10.1145/1458449.1458468","DOIUrl":"https://doi.org/10.1145/1458449.1458468","url":null,"abstract":"Systems biology aims to understand the behavior of and interaction between various components of the living cell, such as genes, proteins, and metabolites. A large number of components are involved in these complex systems, and the diversity of relationships between the components can be overwhelming. There is therefore a need for analysis methods incorporating data integration. We here present a method for exploring gene regulatory mechanisms which integrates various types of data to assist the identification of important components in gene regulation mechanisms. By first analyzing gene expression data, a set of differentially expressed genes is selected. These genes are then further investigated by combining various types of biological information, such as clustering results, promoter sequences, binding sites, transcription factors and other previously published information regarding the selected genes. Inspired by Information Fusion research, we also mapped functions of the proposed method to the well-known OODA-model to facilitate application of this data integration method in other research communities. We have successfully applied the method to genes identified as differentially expressed in human embryonic stem cells at different stages of differentiation towards cardiac cells. We identified 15 novel motifs that may represent important binding sites in the cardiac cell lineage.","PeriodicalId":143937,"journal":{"name":"Data and Text Mining in Bioinformatics","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2008-10-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127610038","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 1
Peptide programs: applying fragment programs to protein classification
Data and Text Mining in Bioinformatics Pub Date : 2008-10-30 DOI: 10.1145/1458449.1458459
A. O. Falcão, Daniel Faria, António E. N. Ferreira
{"title":"Peptide programs: applying fragment programs to protein classification","authors":"A. O. Falcão, Daniel Faria, António E. N. Ferreira","doi":"10.1145/1458449.1458459","DOIUrl":"https://doi.org/10.1145/1458449.1458459","url":null,"abstract":"Functional prediction/classification of proteins is a central problem in bioinformatics. Alignment methods are a useful approach, but have limitations, which have prompted the development and use of machine learning approaches. However, traditional machine learning approaches are unable to exploit sequence data directly, and instead use derived sequence features or kernel functions to obtain a feature space. Because theoretically all information necessary to predict a protein's structure and function is contained in its sequence, a methodology that could exploit sequence data directly could be advantageous. A novel machine learning methodology for protein classification, inspired by the concept of fragment programs, is presented. This methodology consists of assigning a minimal computer program to each of the 20 amino acids, and then representing a protein as the program resulting from sequentially applying the programs of the amino acids which compose its sequence. The basic concepts of the methodology presented (peptide programs) are discussed and a framework is proposed for their implementation, including instruction set, virtual machine, evaluation procedures and convergence methods. The methodology is tested in the binary classification of 33,500 enzymes into 182 distinct Enzyme Commission (EC) classes. The average Matthews correlation coefficient of the binary classifiers is 0.75 in training and 0.68 in validation. Overall, the results obtained demonstrate the potential of the proposed methodology, and its ability to extract knowledge from sequence data, using very few computational resources.","PeriodicalId":143937,"journal":{"name":"Data and Text Mining in Bioinformatics","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2008-10-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127221253","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 3
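The abstract above reports classifier quality as a Matthews correlation coefficient (MCC). For readers unfamiliar with the measure, the sketch below computes the standard MCC for a binary classifier from the confusion-matrix counts, MCC = (TP·TN − FP·FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)). The labels in the example are made up and have nothing to do with the paper's enzyme data; returning 0.0 when the denominator is zero is a common convention assumed here.

```python
from math import sqrt


def matthews_corrcoef(y_true, y_pred):
    """Matthews correlation coefficient for binary labels encoded as 0/1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denominator = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Assumed convention: report 0.0 when any marginal count is zero.
    return (tp * tn - fp * fn) / denominator if denominator else 0.0


y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 1]
mcc = matthews_corrcoef(y_true, y_pred)
```

MCC ranges from -1 to 1 and stays informative when classes are imbalanced, which matters when each of many binary classifiers separates one class from all the others.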
Microarray data analysis with PCA in a DBMS
Data and Text Mining in Bioinformatics Pub Date : 2008-10-30 DOI: 10.1145/1458449.1458456
W. Rinsurongkawong, C. Ordonez
{"title":"Microarray data analysis with PCA in a DBMS","authors":"W. Rinsurongkawong, C. Ordonez","doi":"10.1145/1458449.1458456","DOIUrl":"https://doi.org/10.1145/1458449.1458456","url":null,"abstract":"Microarray data sets contain expression levels of thousands of genes. The statistical analysis of such data sets is typically performed outside a DBMS with statistical packages or mathematical libraries. In this work, we focus on analyzing them inside the DBMS. This is a difficult problem because microarray data sets have high dimensionality, but small size. First, due to DBMS limitations on a maximum number of columns per table, the data set has to be pivoted and transformed before analysis. More importantly, the correlation matrix on tens of thousands of genes has millions of values. While most high dimensional data sets can be analyzed with the classical PCA method, small, but high dimensional, data sets can only be analyzed with Singular Value Decomposition (SVD). We adapt the Householder tridiagonalization and QR factorization numerical methods to solve SVD inside the DBMS. Since these mathematical methods require many matrix operations, which are hard to express in SQL, query optimizations and efficient UDFs are developed to get good performance. Our proposed techniques achieve processing times comparable with those from the R package, a well-known statistical tool. We experimentally show our methods scale well with high dimensionality.","PeriodicalId":143937,"journal":{"name":"Data and Text Mining in Bioinformatics","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2008-10-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126229565","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 4
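The abstract above mentions pivoting a wide microarray table, which has too many gene columns for a DBMS, before analysis. The sketch below illustrates that step in Python rather than SQL, and it is not the authors' implementation. It turns one row per sample into (sample, gene, value) triples, then accumulates the small sample-by-sample matrix of dot products gene by gene, which is roughly what a self-join on gene with a GROUP BY on the sample pair would compute. The eigenvalues of that matrix are the squared singular values of the data matrix, so it stays tractable when samples are few and genes are many. All names and values are hypothetical.

```python
from collections import defaultdict


def pivot_long(wide_rows, gene_names):
    """Turn one-row-per-sample data into (sample_id, gene, value) triples."""
    return [
        (sample_id, gene, value)
        for sample_id, values in enumerate(wide_rows)
        for gene, value in zip(gene_names, values)
    ]


def gram_matrix(triples, n_samples):
    """Sample-by-sample dot products, accumulated one gene at a time."""
    by_gene = defaultdict(list)
    for sample_id, gene, value in triples:
        by_gene[gene].append((sample_id, value))
    gram = [[0.0] * n_samples for _ in range(n_samples)]
    for entries in by_gene.values():
        # Every pair of samples that share this gene contributes one product.
        for i, value_i in entries:
            for j, value_j in entries:
                gram[i][j] += value_i * value_j
    return gram


genes = ["g1", "g2", "g3"]
wide = [[1.0, 2.0, 0.0], [0.0, 1.0, 3.0]]
triples = pivot_long(wide, genes)
gram = gram_matrix(triples, n_samples=len(wide))
```

With n samples and d genes, the long table has n·d rows but the resulting matrix is only n × n, in contrast to the d × d correlation matrix the abstract describes as the bottleneck.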
Identification of temporal association rules from time-series microarray data set: temporal association rules
Data and Text Mining in Bioinformatics Pub Date : 2008-10-26 DOI: 10.1145/1458449.1458457
Hojung Nam, K. Lee, Doheon Lee
{"title":"Identification of temporal association rules from time-series microarray data set: temporal association rules","authors":"Hojung Nam, K. Lee, Doheon Lee","doi":"10.1145/1458449.1458457","DOIUrl":"https://doi.org/10.1145/1458449.1458457","url":null,"abstract":"One of the most challenging problems in mining gene expression data is to identify how the expression of any particular gene affects the expression of other genes. To elucidate the relationships between genes, an association rule mining (ARM) method has been applied to microarray gene expression data. A conventional ARM method, however, is limited in extracting temporal dependencies between genes, though the temporal information is indispensable for discovering underlying regulation mechanisms in biological pathways. In this paper, therefore, we propose a novel method, referred to as temporal association rule mining (TARM), which can extract temporal dependencies among related genes. A temporal association rule has the form [gene A ↑, gene B↓] → (7 min)[gene C], which states that a high expression level of gene A and significant repression of gene B are followed by significant expression of gene C after 7 minutes. The proposed TARM method is tested with a Saccharomyces cerevisiae cell cycle time-series microarray gene expression data set. In the parameter fitting phase of TARM, the best parameter set [threshold = ±0.8, support cutoff = 3 transactions, confidence cutoff = 90%], which extracted the largest number of correct associations in the KEGG cell cycle pathway, has been chosen for the rule mining phase. Furthermore, comparing the precision scores of TARM (0.38) and a Bayesian network (0.16), the TARM method showed better accuracy. With the best parameter set, numbers of temporal association rules with five transcriptional time delays (0, 7, 14, 21, 28 minutes) are extracted from gene expression data of 799 genes which are pre-identified cell cycle relevant genes, while a comparably small number of rules is extracted from randomly shuffled gene expression data of 799 genes. From the extracted temporal association rules, associated genes which play the same role in biological processes within a short transcriptional time delay, and some temporal dependencies between genes with specific biological processes, are identified.","PeriodicalId":143937,"journal":{"name":"Data and Text Mining in Bioinformatics","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2008-10-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126889796","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 1
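The abstract above describes rules of the form [gene A ↑, gene B ↓] → (delay)[gene C], filtered by an expression threshold, a support cutoff and a confidence cutoff. The sketch below shows how one such rule could be scored. The abstract does not define support or confidence precisely, so these definitions are assumptions: each time point is treated as a transaction, support counts the time points where the antecedent holds and the consequent holds `delay_steps` later, and confidence is support divided by the number of time points where the antecedent holds. The expression values are made up, and one step is assumed to correspond to one 7-minute sampling interval.

```python
def discretize(series, threshold=0.8):
    """Label each value 'up', 'down', or None using a symmetric threshold."""
    return [
        "up" if value >= threshold else "down" if value <= -threshold else None
        for value in series
    ]


def rule_stats(states, antecedent, consequent, delay_steps):
    """Support and confidence of: antecedent -> (delay_steps) consequent.

    states:     dict gene -> list of discretized labels, one per time point.
    antecedent: dict gene -> label required at time t.
    consequent: (gene, label) required at time t + delay_steps.
    """
    gene_c, label_c = consequent
    n_points = len(states[gene_c])
    antecedent_hits = 0
    support = 0
    for t in range(n_points - delay_steps):
        if all(states[gene][t] == label for gene, label in antecedent.items()):
            antecedent_hits += 1
            if states[gene_c][t + delay_steps] == label_c:
                support += 1
    confidence = support / antecedent_hits if antecedent_hits else 0.0
    return support, confidence


expression = {
    "A": [1.0, 0.9, 0.1, 1.2, 0.0],
    "B": [-1.0, -0.9, 0.0, -1.1, 0.0],
    "C": [0.0, 1.0, 0.9, 0.0, 1.3],
}
states = {gene: discretize(values) for gene, values in expression.items()}
support, confidence = rule_stats(
    states, {"A": "up", "B": "down"}, ("C", "up"), delay_steps=1
)
```

Under the cutoffs quoted in the abstract (support of at least 3 transactions, confidence of at least 90%), this toy rule would be kept, since it holds at all three time points where the antecedent is satisfied.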