Rohit Kugaonkar, A. Gangopadhyay, Y. Yesha, A. Joshi, Y. Yesha, M. Grasso, Mary Brady, N. Rishe
{"title":"Finding associations among SNPS for prostate cancer using collaborative filtering","authors":"Rohit Kugaonkar, A. Gangopadhyay, Y. Yesha, A. Joshi, Y. Yesha, M. Grasso, Mary Brady, N. Rishe","doi":"10.1145/2390068.2390080","DOIUrl":"https://doi.org/10.1145/2390068.2390080","url":null,"abstract":"Prostate cancer is the second leading cause of cancer-related deaths among men. Because prostate cancer is slow-growing, surgical treatment is sometimes not required for less aggressive cancers. Recent debates over prostate-specific antigen (PSA) screening have drawn new attention to prostate cancer. Genome-based screening can potentially help in assessing the risk of developing prostate cancer. Due to the complicated nature of prostate cancer, studying the entire genome is essential to find genomic traits. Due to the high cost of studying all Single Nucleotide Polymorphisms (SNPs), it is essential to find tag SNPs which can represent other SNPs. Earlier methods to find tag SNPs using associations between SNPs either use SNP location information or are based on data from very few SNP markers in each sample. Our study is based on 2300 samples with 550,000 SNPs each. We have not used SNP location information or any predefined standard cut-offs to find tag SNPs. Our approach uses collaborative filtering methods to find pairwise associations among SNPs and thus list the top-N tag SNPs. We have found 25 tag SNPs which have the highest similarities to other SNPs. In addition, we found 16 more SNPs which are highly correlated with the known high-risk SNPs that are associated with prostate cancer. 
We used some of these newly found SNPs with 5 different classification algorithms and observed some improvement in prostate cancer prediction accuracy over using the original known high risk SNPs.","PeriodicalId":143937,"journal":{"name":"Data and Text Mining in Bioinformatics","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2012-10-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114446130","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Detecting type 2 diabetes causal single nucleotide polymorphism combinations from a genome-wide association study dataset with optimal filtration","authors":"Chiyong Kang, Hyeji Yu, G. Yi","doi":"10.1145/2390068.2390070","DOIUrl":"https://doi.org/10.1145/2390068.2390070","url":null,"abstract":"The identification of causal single nucleotide polymorphisms (SNPs) for complex diseases like type 2 diabetes (T2D) is a challenge because of the low statistical power of individual markers from a genome-wide association study (GWAS). SNP combinations are suggested to compensate for the low statistical power of individual markers, but SNP combinations from GWAS generate high computational complexity. Hence, we aim to detect T2D causal SNP combinations from a GWAS dataset with optimal filtration and to discover the biological meaning of the detected SNP combinations. Optimal filtration can enhance the statistical power of SNP combinations by comparing the error rates of SNP combinations from various Bonferroni thresholds and p-value range-based thresholds combined with linkage disequilibrium (LD) pruning. T2D causal SNP combinations are selected using random forests with variable selection from an optimal SNP dataset. The selected SNPs with SNP combinations are mapped with multi-dimensional levels of T2D-related information and gene set enrichment analysis (GSEA). A T2D causal SNP combination containing 101 SNPs from the Wellcome Trust Case Control Consortium (WTCCC) GWAS dataset is selected, with an error rate of 10.25%. Matching with known disease genes and gene sets revealed the relationships between T2D and SNP combinations. We propose a detection method for complex disease causal SNP combinations from an optimal SNP dataset by using random forests with variable selection. 
Mapping the biological meanings of detected SNP combinations can help uncover complex disease mechanisms.","PeriodicalId":143937,"journal":{"name":"Data and Text Mining in Bioinformatics","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2012-10-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133699051","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Junkyu Lee, Seongsoon Kim, Sunwon Lee, Kyubum Lee, Jaewoo Kang
{"title":"High precision rule based PPI extraction and per-pair basis performance evaluation","authors":"Junkyu Lee, Seongsoon Kim, Sunwon Lee, Kyubum Lee, Jaewoo Kang","doi":"10.1145/2390068.2390082","DOIUrl":"https://doi.org/10.1145/2390068.2390082","url":null,"abstract":"Virtually all current PPI extraction studies focus on improving F-score, aiming to balance the performance on both precision and recall. However, in many realistic scenarios involving large corpora, one can benefit more from an extremely high precision PPI extraction tool than a high-recall counterpart. We also argue that the current \"per-instance\" basis performance evaluation method should be revisited. In order to address these problems, we introduce a new rule-based PPI extraction method equipped with a set of ultra-high precision extraction rules. We also propose a new \"per-pair\" basis performance metric, which is more pragmatic in practice. The proposed PPI extraction method achieves 95-96% per-pair and 94-97% per-instance precisions on the AIMed benchmark corpus.","PeriodicalId":143937,"journal":{"name":"Data and Text Mining in Bioinformatics","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2012-10-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128794604","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Jaejoon Choi, Kwangmin Kim, Min-Keun Song, Doheon Lee
{"title":"TNMCA: generation and application of network motif based inference models for drug repositioning","authors":"Jaejoon Choi, Kwangmin Kim, Min-Keun Song, Doheon Lee","doi":"10.1145/2390068.2390081","DOIUrl":"https://doi.org/10.1145/2390068.2390081","url":null,"abstract":"With the growth of public biomedical data, Undiscovered Public Knowledge (UPK, proposed by Swanson) has become an important research topic in biology. Drug repositioning, which infers alternative indications for approved drugs, is one of the best-known UPK tasks. Many researchers have tried to find novel candidates for existing drugs, but these previous approaches are not fully automated, require manual adaptation to each task, and cannot cover diverse biomedical entities. In addition, they can infer only pre-defined cases using a limited set of patterns. In this paper, we propose the Typed Network Motif Comparison Algorithm (TNMCA) to discover novel drug indications using topological patterns of data. Typed network motifs (TNMs) are connected sub-graphs that store the types of data rather than their values. While previous research depends on the ABC model (or extensions of it), TNMCA uses more general patterns as its inference models. TNMCA can also infer not only the existence of an interaction but also its type. TNMCA is suited to multi-level biomedical interaction data because TNMs depend on the different types of entities and relations. To validate our method, we apply TNMCA to a public database, the Comparative Toxicogenomics Database (CTD). 
The results show that TNMCA could infer meaningful indications with high performance (AUC=0.7469) compared to the ABC model (AUC=0.7050).","PeriodicalId":143937,"journal":{"name":"Data and Text Mining in Bioinformatics","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2012-10-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134067663","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Session details: Keynote address","authors":"Doheon Lee","doi":"10.1145/3260180","DOIUrl":"https://doi.org/10.1145/3260180","url":null,"abstract":"","PeriodicalId":143937,"journal":{"name":"Data and Text Mining in Bioinformatics","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2011-10-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126386633","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Jinoh Oh, Taehoon Kim, Sun Park, Wook-Shin Han, Hwanjo Yu
{"title":"Dynamic concept ontology construction for pubmed queries","authors":"Jinoh Oh, Taehoon Kim, Sun Park, Wook-Shin Han, Hwanjo Yu","doi":"10.1145/1871871.1871885","DOIUrl":"https://doi.org/10.1145/1871871.1871885","url":null,"abstract":"Exploring PubMed to find relevant information is challenging and time-consuming, as PubMed typically returns a large list of articles in response to a query. Existing work on improving search quality on PubMed has focused on helping with PubMed query formulation, clustering the results, or ranking by relevance. This paper proposes a novel system that dynamically constructs a concept ontology based on the search results and visualizes concepts related to the query in the form of an ontology. The concept ontology can make PubMed search more effective by detecting related concepts and their relations hidden in the documents. The ontology can broaden the user's knowledge by recommending new concepts unexpected by the user, and also serves to narrow down the search results by recommending additional query terms. The ontology is constructed in real time in response to a query and is integrated within our PubMed search engine, called RefMED. Our system is accessible at \"http://dm.hwanjoyu.org/refmed\".","PeriodicalId":143937,"journal":{"name":"Data and Text Mining in Bioinformatics","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2010-10-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134632303","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Isabel Segura-Bedmar, Mario Crespo, César de Pablo-Sánchez, Paloma Martínez
{"title":"DrugNerAR: linguistic rule-based anaphora resolver for drug-drug interaction extraction in pharmacological documents","authors":"Isabel Segura-Bedmar, Mario Crespo, César de Pablo-Sánchez, Paloma Martínez","doi":"10.1145/1651318.1651324","DOIUrl":"https://doi.org/10.1145/1651318.1651324","url":null,"abstract":"DrugNerAR, a drug anaphora resolution system, is presented to address the problem of co-referring expressions in pharmacological literature. This development is part of a larger and innovative study about automatic drug-drug interaction extraction. In addition, a corpus has been developed in order to analyze the phenomena and evaluate the current approach. The system applies a set of linguistic rules inspired by Centering Theory to the analysis provided by a biomedical syntactic parser. Semantic information provided by the Unified Medical Language System (UMLS) is also integrated in order to improve the recognition and the resolution of nominal drug anaphors. This linguistic rule-based approach shows very promising results for the challenge of accounting for anaphoric expressions in pharmacological texts.","PeriodicalId":143937,"journal":{"name":"Data and Text Mining in Bioinformatics","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2009-11-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125945295","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Mining cancer genes with running-sum statistics","authors":"Inho Park, Kwang-H. Lee, Doheon Lee","doi":"10.1145/1651318.1651326","DOIUrl":"https://doi.org/10.1145/1651318.1651326","url":null,"abstract":"In this paper, we propose a new method to detect candidate cancer genes for developing molecular biomarkers or therapeutic targets from cancer microarray datasets. To resolve the problems that the molecular heterogeneity of cancers causes for gene prioritization, our proposed method is intended to identify genes that are over- or under-expressed not only across all cancer samples but also in a subgroup of cancer samples. To this end, we propose the RS score for gene ranking, calculated with a weighted running-sum statistic on the ordered list of expression values of each gene. We apply the proposed method to publicly available prostate cancer microarray datasets, showing that it can identify previously well-known prostate cancer associated genes such as ERG, HPN, and AMACR at the top of the list of candidate genes. Embedding samples, represented as vectors of the expression values of the top 20 genes, into a two-dimensional space using the commute time embedding shows the distinction between normal samples and cancer samples in the independent test datasets as well as in the training datasets. 
We further evaluate the proposed method by estimating classification performance on the independent test datasets, where it shows better classification performance than the other cancer outlier profile approaches.","PeriodicalId":143937,"journal":{"name":"Data and Text Mining in Bioinformatics","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2009-11-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114860664","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
P. Prabhu, S. Navathe, Stephen Tyler, V. Dasigi, N. Narkhede, Balaji Palanisamy
{"title":"LITSEEK: public health literature search by metadata enhancement with external knowledge bases","authors":"P. Prabhu, S. Navathe, Stephen Tyler, V. Dasigi, N. Narkhede, Balaji Palanisamy","doi":"10.1145/1651318.1651337","DOIUrl":"https://doi.org/10.1145/1651318.1651337","url":null,"abstract":"Biomedical literature is an important source of information in any researcher's investigation of genes, risk factors, diseases and drugs. Often the information sought by public health researchers is distributed across multiple disparate sources that may include publications from PubMed, genomic, proteomic and pathway databases, gene expression and clinical resources and biomedical ontologies. The unstructured nature of this information makes it difficult to find the relevant parts manually, and comprehensive knowledge is even more difficult to synthesize automatically. In this paper we report on LITSEEK (LITerature Search by metadata Enhancement with External Knowledgebases), a system we have developed for the benefit of researchers at the Centers for Disease Control (CDC) to enable them to search the HuGE (Human Genome for Epidemiology) database of PubMed articles from a pharmacogenomic perspective. Besides analyzing text using TFIDF ranking and indexing of the important terms, the proposed system incorporates an automatic consultation with PharmGKB, a human-curated knowledge base about drugs, related diseases and genes, as well as with the Gene Ontology, a human-curated, well-accepted ontology. We highlight the main components of our approach and illustrate how the search is enhanced by incorporating additional concepts in terms of genes/drugs/diseases (called metadata for ease of reference) from PharmGKB. Various measurements are reported with respect to the addition of these metadata terms. Preliminary results in terms of precision based on expert user feedback from CDC are encouraging. 
Further evaluation of the search procedure by actual researchers is under way.","PeriodicalId":143937,"journal":{"name":"Data and Text Mining in Bioinformatics","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2009-11-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125181432","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Sarvnaz Karimi, J. Zobel, Stefan Pohl, Falk Scholer
{"title":"The challenge of high recall in biomedical systematic search","authors":"Sarvnaz Karimi, J. Zobel, Stefan Pohl, Falk Scholer","doi":"10.1145/1651318.1651338","DOIUrl":"https://doi.org/10.1145/1651318.1651338","url":null,"abstract":"Clinical systematic reviews are based on expert, laborious search of well-annotated literature. Boolean search on bibliographic databases, such as MEDLINE, continues to be the preferred discovery method, but the size of these databases, now approaching 20 million records, makes it impossible to fully trust these searching methods. We are investigating the trade-offs between Boolean and ranked retrieval. Our findings show that although Boolean search has limitations, it is not obvious that ranking is superior, and illustrate that a single query cannot be used to resolve an information need. Our experiments show that a combination of less complicated Boolean queries and ranked retrieval outperforms either of them individually, leading to possible time savings over the current process.","PeriodicalId":143937,"journal":{"name":"Data and Text Mining in Bioinformatics","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2009-11-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115468500","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}