S. Song, G. Heo, Ha Jin Kim, H. Jung, Yonghwan Kim, Min Song
{"title":"Grounded Feature Selection for Biomedical Relation Extraction by the Combinative Approach","authors":"S. Song, G. Heo, Ha Jin Kim, H. Jung, Yonghwan Kim, Min Song","doi":"10.1145/2665970.2665975","DOIUrl":"https://doi.org/10.1145/2665970.2665975","url":null,"abstract":"Relation extraction is an important task in biomedical areas such as protein-protein interaction, gene-disease interactions, and drug-disease interactions. In recent years, it has been widely researched to automatically extract biomedical relations in a vast amount of biomedical text data. In this paper, we propose a hybrid approach to extracting relations based on a rule-based approach feature set. We then use different classification algorithms such as SVM, Naïve Bayes, and Decision Tree classifiers for relation classification. The rationale for adopting shallow parsing and other NLP techniques to extract relations is twofold: simplicity and robustness. We select seven features with the rule-based shallow parsing technique and evaluate the performance with four different PPI public corpora. Our experimental results show the stable performance in F-measure even with the relatively fewer features.","PeriodicalId":143937,"journal":{"name":"Data and Text Mining in Bioinformatics","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2014-11-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134620168","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Suhyun Ha, Sunyong Yoo, Moonshik Shin, J. Kwak, O. Kwon, M. Choi, K. Kang, Hojung Nam, Doheon Lee
{"title":"Integrative Database for Exploring Compound Combinations of Natural Products for Medical Effects","authors":"Suhyun Ha, Sunyong Yoo, Moonshik Shin, J. Kwak, O. Kwon, M. Choi, K. Kang, Hojung Nam, Doheon Lee","doi":"10.1145/2665970.2665986","DOIUrl":"https://doi.org/10.1145/2665970.2665986","url":null,"abstract":"Natural products used in dietary supplements, complementary and alternative medicine (CAM) and conventional medicine are composites of multiple chemical compounds. These chemical compounds potentially offer an extensive source for drug discovery with accumulated knowledge of efficacy and safety. However, existing natural product related databases have drawbacks in both standardization and structuralization of information. Therefore, in this work, we construct an integrated database of natural products by mapping the prescription, herb, compound, and phenotype information to international identifiers and structuralizing the efficacy information through database integration and text-mining methods. We expect that the constructed database could serve as a fundamental resource for the natural products research.","PeriodicalId":143937,"journal":{"name":"Data and Text Mining in Bioinformatics","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2014-11-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115450589","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Kwangmin Kim, Sejoon Lee, Kyunghyun Park, Dongjin Jang, Doheon Lee
{"title":"Mining Context-Specific Rules from the Literature for Virtual Human Model Simulation","authors":"Kwangmin Kim, Sejoon Lee, Kyunghyun Park, Dongjin Jang, Doheon Lee","doi":"10.1145/2665970.2665987","DOIUrl":"https://doi.org/10.1145/2665970.2665987","url":null,"abstract":"Computer-based virtual human model is believed to be the promising solution for drug response identification. Literature mining is a competitive method to extract those biological rules for human model simulation, since existing public databases provide only limited amount of information applicable for the simulation. Here we propose the method for mining context-specific rules from the literature, for future application to virtual human model simulation. Integrating the existing biological databases, we have constructed formalized ontology. From the PubMed literature, we have tagged 11 distinct types of biological entities using both of conditional random field (CRF) and dictionary based Named Entity Recognition (NER). Recognized named entities were normalized and mapped to formalized ontology. Context-specific biological rules between named entities, characterized by increase/decrease features, were extracted by pattern-based method utilizing regular expression. As the result, we have obtained the organ-context specific biological rules. Further research on enhanced rule and context extraction will follow.","PeriodicalId":143937,"journal":{"name":"Data and Text Mining in Bioinformatics","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2014-11-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129247469","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Display of Conceptual Structures in the Epidemiologic Literature","authors":"E. H. Kim, S. Song, Yonghwan Kim, Min Song","doi":"10.1145/2665970.2665983","DOIUrl":"https://doi.org/10.1145/2665970.2665983","url":null,"abstract":"Biomedical literature from PubMed contains various types of entities such as diseases or organisms. The rapid growth of their size makes it harder to conceptualize; however, displaying the natural terms that occurred in the text is more effective in understanding the target corpus to be searched than suggesting a concept related to a user query. Thus, we consider the natural common words that biomedical information users actually write and speak. We extract bio-related terms from the corpus mapping with the UMLS. We show entity-based networks with natural language terms as they are shown in the text. In this paper, we present simple and precise associative networks of natural terms in the biomedical literature. The entity-based networks and entity relations can make understanding the biomedical literature corpus more effective and easier by detecting related terms and their hidden relations in the documents. We considered bio-entities and their relations in the biomedical literature and focused on the representation of a graphic display that can improve users' perception about a large corpus. To this end, epidemiology as an experimental domain was chosen and we extract entities from the corpus mapping the UMLS and draw their relations inferred by the Semantic Network of the UMLS. Then we calculate term frequencies, co-occurrences, and term pair similarities (See Figure 1). In results, distinguished networks that display conceptual structures in the biomedical literature with a natural language and not a concept were demonstrated (See Figure 2). The networks we present provide more comprehension of the biomedical collection.","PeriodicalId":143937,"journal":{"name":"Data and Text Mining in Bioinformatics","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2014-11-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132066107","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Identifying Cancer Subtypes based on Somatic Mutation Profile","authors":"Sungchul Kim, Lee Sael, Hwanjo Yu","doi":"10.1145/2665970.2665980","DOIUrl":"https://doi.org/10.1145/2665970.2665980","url":null,"abstract":"Tumor stratification is one of the basic tasks in cancer genomics for a better understanding of the tumor heterogeneity and better targeted treatments. There are various biological data that can be used to stratify tumors including gene expression and sequencing data. In this work, we use the somatic mutation data. Two types of somatic mutation profiles are generated and clustered using k-means clustering with appropriate distance measures to obtain cancer subtypes for each cancer type: binary somatic mutation profile and weighted somatic mutation profile. According to the predictive power of clinical features and survival time of the identified subtypes, the binary somatic mutation profile with Jaccard distance (B-Jac) performed the best and the weighted somatic mutation profile with Euclidean distance (W-Euc) performed comparably. Both approaches performed significantly better than the typical usage of somatic mutation, i.e. the binary somatic mutation profile with Euclidean distance (B-Euc).","PeriodicalId":143937,"journal":{"name":"Data and Text Mining in Bioinformatics","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2014-11-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133384810","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Seyeol Yoon, J. Jung, Hasun Yu, Mijin Kwon, Sungji Choo, Kyunghyun Park, Dongjin Jang, Sangwoo Kim, Doheon Lee
{"title":"Systematic Identification of Context-dependent Conflicting Information in Biological Pathways","authors":"Seyeol Yoon, J. Jung, Hasun Yu, Mijin Kwon, Sungji Choo, Kyunghyun Park, Dongjin Jang, Sangwoo Kim, Doheon Lee","doi":"10.1145/2665970.2665973","DOIUrl":"https://doi.org/10.1145/2665970.2665973","url":null,"abstract":"Interactions between biological entities such as genes, proteins and metabolites, so called pathways, are key features to understand molecular mechanisms of life. As pathway information is being accumulated rapidly through various knowledge resources, there are growing interests in maintaining integrity of the heterogeneous databases. Here, we defined conflict as a status where two contradictory evidences (i.e. 'A increases B' and 'A decreases B') coexist in the same pathway. This conflict damages unity so that inference of simulation on the integrated pathway network might be unreliable. We defined rule and rule group. A rule consists of interaction of two entities, meta-relation (increase or decrease), and contexts terms about tissue specificity or environmental conditions. The rules, which have the same interaction, are grouped into a rule group. If the rules don't have unanimous meta-relation, the rule group and the rules are judged as being conflicting. This analysis revealed that almost 20% of known interactions suffer from conflicting information and conflicting information occurred much more frequently in the literatures than the public database. With consideration for dual functions depending on context, we thought it might resolve conflict to consider context. We grouped rules, which have the same context terms as well as interaction. It's revealed that up to 86% of the conflicts could be resolved by considering context. Subsequent analysis also showed that those contradictory records generally compete with each other closely, but some information might be suspicious when their evidence levels are seriously imbalanced. By identifying and resolving the conflicts, we expect that pathway databases can be cleaned and used for better secondary analyses such as gene/protein annotation, network dynamics and qualitative/quantitative simulation.","PeriodicalId":143937,"journal":{"name":"Data and Text Mining in Bioinformatics","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2014-11-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130130017","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Injury Narrative Text Classification: A Preliminary Study","authors":"Lin Chen, K. Vallmuur, R. Nayak","doi":"10.1145/2665970.2665976","DOIUrl":"https://doi.org/10.1145/2665970.2665976","url":null,"abstract":"Description of a patient's injuries is recorded in narrative text form by hospital emergency departments. For statistical reporting, this text data needs to be mapped to pre-defined codes. Existing research in this field uses the Naïve Bayes probabilistic method to build classifiers for mapping. In this paper, we focus on providing guidance on the selection of a classification method. We build a number of classifiers belonging to different classification families such as decision tree, probabilistic, neural networks, and instance-based, ensemble-based and kernel-based linear classifiers. An extensive pre-processing is carried out to ensure the quality of data and, hence, the quality classification outcome. The records with a null entry in injury description are removed. The misspelling correction process is carried out by finding and replacing the misspelt word with a soundlike word. Meaningful phrases have been identified and kept, instead of removing the part of phrase as a stop word. The abbreviations appearing in many forms of entry are manually identified and only one form of abbreviations is used. Clustering is utilised to discriminate between non-frequent and frequent terms. This process reduced the number of text features dramatically from about 28,000 to 5000. The medical narrative text injury dataset, under consideration, is composed of many short documents. The data can be characterized as high-dimensional and sparse, i.e., few features are irrelevant but features are correlated with one another. Therefore, Matrix factorization techniques such as Singular Value Decomposition (SVD) and Non Negative Matrix Factorization (NNMF) have been used to map the processed feature space to a lower-dimensional feature space. Classifiers with this reduced feature space have been built. In experiments, a set of tests are conducted to reflect which classification method is best for the medical text classification. The Non Negative Matrix Factorization with Support Vector Machine method can achieve 93% precision which is higher than all the tested traditional classifiers. We also found that TF/IDF weighting which works well for long text classification is inferior to binary weighting in short document classification. Another finding is that the Top-n terms should be removed in consultation with medical experts, as it affects the classification performance.","PeriodicalId":143937,"journal":{"name":"Data and Text Mining in Bioinformatics","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2014-11-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134370820","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
J. Jung, Hasun Yu, Seyeol Yoon, Mijin Kwon, Sungji Choo, Sangwoo Kim, Doheon Lee
{"title":"Construction of Multi-level Networks Incorporating Molecule, Cell, Organ and Phenotype Properties for Drug-induced Phenotype Prediction","authors":"J. Jung, Hasun Yu, Seyeol Yoon, Mijin Kwon, Sungji Choo, Sangwoo Kim, Doheon Lee","doi":"10.1145/2665970.2665989","DOIUrl":"https://doi.org/10.1145/2665970.2665989","url":null,"abstract":"Inferring drug-induced phenotypes via computational approaches can give a substantial support to drug discovery procedure. However, existing computational models that are mainly based on a single cell or a single organ model are thought to be limited because the phenotypes are consequences of stochastic biochemical processes among distant cells/organs as well as molecules confined in one cell. Therefore, there is an urgent demand for a new computational model that represents heterogeneous biochemical interactions spanning the entire human body. To meet the demand, we constructed multi-level networks that incorporate previously uncovered high-level properties such as molecules, cells, organs, and phenotypes. Currently, the networks consist of 1,776,506 edges including molecular networks within 76 pre-defined cell-types, inter-cell interactions among the cell-types, and gene (protein) relations to 429 phenotypes. We are also planning to verify if known drug-induced phenotypes are reproducible in the networks using a Petri-net based simulation.","PeriodicalId":143937,"journal":{"name":"Data and Text Mining in Bioinformatics","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2014-11-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114596708","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Y. Jeong, Dahee Lee, Namgi Han, Won Chul Kim, Min Song
{"title":"Biomedical Named Entity Recognition Based on the Combination of Regional and Global Text Features","authors":"Y. Jeong, Dahee Lee, Namgi Han, Won Chul Kim, Min Song","doi":"10.1145/2665970.2665990","DOIUrl":"https://doi.org/10.1145/2665970.2665990","url":null,"abstract":"The biomedical information extraction, especially Named Entity Recognition (NER), is a primary task in biomedical text-mining due to the rapid growth of large-scale literature. Extracting biomedical entities aims at identifying specific entities (words or phrases) from those unstructured text data. In this work, we introduce a novel biomedical NER system utilizing a combination of regional and global text features: linguistic, lexical, contextual, and syntactic features. Our system adopts Conditional Random Fields (CRFs) [1] as a machine learning algorithm and consists of two major pipelines (see Figure 1). We especially focus on constructing the first pipeline for text processing in a modularized manner and discovering rich feature sets regarding comprehensive linguistics and contexts. To implement the CRF framework in the second pipeline, our system uses a modified version of Mallet [2] to take advantage of feature induction. As a result of 10-fold cross-validation, our system achieves from 0.99% up to 18.47% of F-measure improvement as well as the highest precision compared to existing open-source biomedical NER systems on GENETAG corpus [3]. We figure out that several components such as abundant key features, external resources, and feature induction contribute to the performance of the proposed system.","PeriodicalId":143937,"journal":{"name":"Data and Text Mining in Bioinformatics","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2014-11-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128755255","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Yonghui Wu, J. Denny, S. Rosenbloom, R. Miller, D. Giuse, Min Song, Hua Xu
{"title":"A prototype application for real-time recognition and disambiguation of clinical abbreviations","authors":"Yonghui Wu, J. Denny, S. Rosenbloom, R. Miller, D. Giuse, Min Song, Hua Xu","doi":"10.1145/2512089.2512096","DOIUrl":"https://doi.org/10.1145/2512089.2512096","url":null,"abstract":"To save time, healthcare providers frequently use abbreviations while authoring clinical documents. Nevertheless, abbreviations that authors deem unambiguous often confuse other readers, including clinicians, patients, and natural language processing (NLP) systems. Most current clinical NLP systems \"post-process\" notes long after clinicians enter them into electronic health record systems (EHRs). Such post-processing cannot guarantee 100% accuracy in abbreviation identification and disambiguation, since multiple alternative interpretations exist. In this paper, authors describe a prototype system for real-time Clinical Abbreviation Recognition and Disambiguation (CARD) -- i.e., a system that interacts with authors during note generation to verify correct abbreviation senses. The CARD system design anticipates future integration with web-based clinical documentation systems to improve quality of healthcare records. The prototype application embodies three word sense disambiguation (WSD) methods. We evaluated the accuracy and response times of the prototype CARD system in a simulated study. Using an existing test data set of 25 commonly observed, highly ambiguous clinical abbreviations the evaluation demonstrated that the best WSD method had an accuracy of 88.8%, and a reasonable average response time of 1.6 milliseconds per each abbreviation. The study indicates potential feasibility of real-time NLP-enabled abbreviation disambiguation within clinical documentation systems.","PeriodicalId":143937,"journal":{"name":"Data and Text Mining in Bioinformatics","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2013-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130790124","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}