Data and Text Mining in Bioinformatics: Latest Articles, Page 2

Grounded Feature Selection for Biomedical Relation Extraction by the Combinative Approach

Data and Text Mining in Bioinformatics Pub Date : 2014-11-07 DOI: 10.1145/2665970.2665975

S. Song, G. Heo, Ha Jin Kim, H. Jung, Yonghwan Kim, Min Song

Citations: 10

Integrative Database for Exploring Compound Combinations of Natural Products for Medical Effects

Data and Text Mining in Bioinformatics Pub Date : 2014-11-07 DOI: 10.1145/2665970.2665986

Suhyun Ha, Sunyong Yoo, Moonshik Shin, J. Kwak, O. Kwon, M. Choi, K. Kang, Hojung Nam, Doheon Lee

Citations: 0

Mining Context-Specific Rules from the Literature for Virtual Human Model Simulation

Data and Text Mining in Bioinformatics Pub Date : 2014-11-07 DOI: 10.1145/2665970.2665987

Kwangmin Kim, Sejoon Lee, Kyunghyun Park, Dongjin Jang, Doheon Lee

Abstract: A computer-based virtual human model is believed to be a promising solution for drug response identification. Literature mining is a competitive method for extracting the biological rules needed for human model simulation, since existing public databases provide only a limited amount of information applicable to the simulation. Here we propose a method for mining context-specific rules from the literature, for future application to virtual human model simulation. By integrating existing biological databases, we constructed a formalized ontology. From the PubMed literature, we tagged 11 distinct types of biological entities using both conditional random field (CRF) and dictionary-based Named Entity Recognition (NER). Recognized named entities were normalized and mapped to the formalized ontology. Context-specific biological rules between named entities, characterized by increase/decrease features, were extracted by a pattern-based method using regular expressions. As a result, we obtained organ-context-specific biological rules. Further research on enhanced rule and context extraction will follow.
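The pattern-based extraction step described in this abstract can be illustrated with a small sketch. The verb lists, the single-token entity assumption, and the `extract_rules` helper are all hypothetical; the abstract does not publish the actual patterns, and real entities would come from the NER step rather than from the regular expression itself.

```python
import re

# Hypothetical trigger verbs; the paper's real pattern set is not given.
INCREASE = r"increases?|elevates?|up-?regulates?|activates?"
DECREASE = r"decreases?|reduces?|down-?regulates?|inhibits?"

# Matches "<source> <verb> <target> [in (the) <context>]".
# Entities are assumed to be single tokens to keep the sketch short.
RULE = re.compile(
    rf"(?P<source>[\w-]+)\s+(?P<verb>{INCREASE}|{DECREASE})\s+(?P<target>[\w-]+)"
    rf"(?:\s+in\s+(?:the\s+)?(?P<context>\w+))?",
    re.IGNORECASE,
)

def extract_rules(sentence):
    """Return (source, direction, target, context) tuples found in a sentence."""
    rules = []
    for m in RULE.finditer(sentence):
        is_increase = re.fullmatch(INCREASE, m["verb"], re.IGNORECASE)
        direction = "increase" if is_increase else "decrease"
        rules.append((m["source"], direction, m["target"], m["context"]))
    return rules

print(extract_rules("Insulin decreases glucose in the liver."))
# [('Insulin', 'decrease', 'glucose', 'liver')]
```

The optional trailing group is what makes a rule context-specific in this sketch: when no "in <organ>" phrase follows, the context is simply `None`.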

Citations: 0

A Display of Conceptual Structures in the Epidemiologic Literature

Data and Text Mining in Bioinformatics Pub Date : 2014-11-07 DOI: 10.1145/2665970.2665983

E. H. Kim, S. Song, Yonghwan Kim, Min Song

Abstract: Biomedical literature from PubMed contains various types of entities, such as diseases or organisms. The rapid growth in its size makes it harder to conceptualize; however, displaying the natural terms that occur in the text is more effective for understanding the target corpus to be searched than suggesting a concept related to a user query. Thus, we consider the natural, common words that biomedical information users actually write and speak. We extract bio-related terms from the corpus by mapping them to the UMLS, and we show entity-based networks with natural language terms as they appear in the text. In this paper, we present simple and precise associative networks of natural terms in the biomedical literature. The entity-based networks and entity relations can make understanding the biomedical literature corpus easier and more effective by detecting related terms and their hidden relations in the documents. We considered bio-entities and their relations in the biomedical literature and focused on a graphic display that can improve users' perception of a large corpus. To this end, we chose epidemiology as the experimental domain, extracted entities from the corpus by mapping them to the UMLS, and drew their relations as inferred from the Semantic Network of the UMLS. We then calculated term frequencies, co-occurrences, and term-pair similarities (see Figure 1). As a result, we demonstrated distinct networks that display conceptual structures in the biomedical literature using natural language terms rather than concepts (see Figure 2). The networks we present support better comprehension of the biomedical collection.
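The three statistics named in this abstract (term frequencies, co-occurrences, and term-pair similarities) can be sketched as follows. The abstract does not say which similarity measure was used, so the Jaccard coefficient over document sets is an assumption here, as are the toy documents and the `cooccurrence_stats` helper.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_stats(docs):
    """docs: list of term lists, one per document (terms already extracted)."""
    term_freq = Counter()    # total occurrences of each term
    doc_sets = {}            # term -> indices of the documents containing it
    pair_counts = Counter()  # sorted term pair -> documents containing both
    for i, terms in enumerate(docs):
        term_freq.update(terms)
        unique = sorted(set(terms))
        for t in unique:
            doc_sets.setdefault(t, set()).add(i)
        pair_counts.update(combinations(unique, 2))
    # Jaccard similarity (assumed measure): shared documents / documents with either term.
    similarity = {
        (a, b): n / len(doc_sets[a] | doc_sets[b])
        for (a, b), n in pair_counts.items()
    }
    return term_freq, pair_counts, similarity

docs = [
    ["malaria", "mosquito", "fever"],
    ["malaria", "fever", "fever"],
    ["influenza", "fever"],
]
tf, pairs, sim = cooccurrence_stats(docs)
print(tf["fever"], pairs[("fever", "malaria")])  # 4 2
```

Pairs are stored in alphabetical order, so `("fever", "malaria")` is the key regardless of the order in which the terms appear in a document. The pair counts are the natural edge weights for the kind of associative network the abstract describes.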

Citations: 2

Identifying Cancer Subtypes based on Somatic Mutation Profile

Data and Text Mining in Bioinformatics Pub Date : 2014-11-07 DOI: 10.1145/2665970.2665980

Sungchul Kim, Lee Sael, Hwanjo Yu

Citations: 10

Systematic Identification of Context-dependent Conflicting Information in Biological Pathways

Data and Text Mining in Bioinformatics Pub Date : 2014-11-07 DOI: 10.1145/2665970.2665973

Seyeol Yoon, J. Jung, Hasun Yu, Mijin Kwon, Sungji Choo, Kyunghyun Park, Dongjin Jang, Sangwoo Kim, Doheon Lee

Abstract: Interactions between biological entities such as genes, proteins, and metabolites, so-called pathways, are key to understanding the molecular mechanisms of life. As pathway information accumulates rapidly across various knowledge resources, there is growing interest in maintaining the integrity of these heterogeneous databases. Here, we define a conflict as a state in which two contradictory pieces of evidence (i.e., 'A increases B' and 'A decreases B') coexist in the same pathway. Such a conflict damages consistency, so simulation-based inference on the integrated pathway network might be unreliable. We define a rule and a rule group. A rule consists of an interaction between two entities, a meta-relation (increase or decrease), and context terms describing tissue specificity or environmental conditions. Rules that have the same interaction are grouped into a rule group. If the rules in a group do not have a unanimous meta-relation, the rule group and its rules are judged to be conflicting. This analysis revealed that almost 20% of known interactions suffer from conflicting information, and that conflicting information occurs much more frequently in the literature than in public databases. Because entities can have dual functions depending on context, we hypothesized that considering context might resolve conflicts. We therefore grouped rules that share the same context terms as well as the same interaction, and found that up to 86% of the conflicts could be resolved by considering context. Subsequent analysis also showed that contradictory records generally compete closely with each other, but some information might be suspect when the evidence levels are seriously imbalanced. By identifying and resolving the conflicts, we expect that pathway databases can be cleaned and used for better secondary analyses such as gene/protein annotation, network dynamics, and qualitative/quantitative simulation.
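The rule-group definition in this abstract maps naturally onto a grouping operation. The sketch below is a minimal illustration under stated assumptions: rules are plain tuples, each rule carries a single context term, and the toy data and `conflicting_groups` helper are invented for the example rather than taken from the paper.

```python
from collections import defaultdict

# Each rule: (source, target, meta_relation, context). Toy data, not from the paper.
rules = [
    ("A", "B", "increase", "liver"),
    ("A", "B", "decrease", "muscle"),
    ("C", "D", "increase", "liver"),
    ("C", "D", "decrease", "liver"),
    ("E", "F", "increase", "brain"),
]

def conflicting_groups(rules, use_context=False):
    """Group rules by interaction (optionally also by context) and
    return the group keys whose meta-relations are not unanimous."""
    groups = defaultdict(set)
    for source, target, relation, context in rules:
        key = (source, target, context) if use_context else (source, target)
        groups[key].add(relation)
    return {key for key, relations in groups.items() if len(relations) > 1}

without_context = conflicting_groups(rules)
with_context = conflicting_groups(rules, use_context=True)

# An interaction counts as resolved when none of its
# context-specific groups is still conflicting.
unresolved = {(s, t) for s, t, _ in with_context}
resolved = without_context - unresolved
print(sorted(without_context), sorted(resolved))
# [('A', 'B'), ('C', 'D')] [('A', 'B')]
```

Here the A-B conflict disappears once liver and muscle are treated separately, while C-D remains contradictory within a single context, which is the kind of record the abstract flags for closer inspection.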

Citations: 2

Injury Narrative Text Classification: A Preliminary Study

Data and Text Mining in Bioinformatics Pub Date : 2014-11-07 DOI: 10.1145/2665970.2665976

Lin Chen, K. Vallmuur, R. Nayak

Abstract: Descriptions of patients' injuries are recorded as narrative text by hospital emergency departments. For statistical reporting, this text data needs to be mapped to pre-defined codes. Existing research in this field uses the Naïve Bayes probabilistic method to build classifiers for this mapping. In this paper, we focus on providing guidance on the selection of a classification method. We build a number of classifiers belonging to different classification families: decision tree, probabilistic, neural network, instance-based, ensemble-based, and kernel-based linear classifiers. Extensive pre-processing is carried out to ensure the quality of the data and, hence, the quality of the classification outcome. Records with a null entry in the injury description are removed. Misspellings are corrected by finding each misspelt word and replacing it with a similar-sounding word. Meaningful phrases are identified and kept, instead of removing part of a phrase as a stop word. Abbreviations that appear in many forms are manually identified, and only one form of each abbreviation is used. Clustering is utilised to discriminate between non-frequent and frequent terms. This process reduced the number of text features dramatically, from about 28,000 to 5,000.

The medical narrative text injury dataset under consideration is composed of many short documents. The data can be characterized as high-dimensional and sparse, i.e., few features are irrelevant but features are correlated with one another. Therefore, matrix factorization techniques such as Singular Value Decomposition (SVD) and Non-Negative Matrix Factorization (NNMF) are used to map the processed feature space to a lower-dimensional feature space. Classifiers are then built on this reduced feature space.

In the experiments, a set of tests is conducted to determine which classification method is best for medical text classification. Non-Negative Matrix Factorization combined with a Support Vector Machine achieves 93% precision, which is higher than all the tested traditional classifiers. We also found that TF/IDF weighting, which works well for long text classification, is inferior to binary weighting in short-document classification. Another finding is that the top-n terms should be removed in consultation with medical experts, as their removal affects classification performance.
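The two term-weighting schemes compared in this abstract can be contrasted with a short sketch. The abstract does not state which TF/IDF variant was used, so the plain `tf * log(N / df)` form is an assumption, and the three toy injury narratives and both helper functions are hypothetical.

```python
import math
from collections import Counter

def binary_vectors(docs, vocab):
    """1 if the term occurs in the document, else 0."""
    vectors = []
    for doc in docs:
        present = set(doc)
        vectors.append([1 if term in present else 0 for term in vocab])
    return vectors

def tfidf_vectors(docs, vocab):
    """Plain tf * log(N / df) weighting (assumed variant)."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append(
            [tf[t] * math.log(n / df[t]) if df[t] else 0.0 for t in vocab]
        )
    return vectors

docs = [
    ["fell", "from", "ladder"],
    ["cut", "finger", "knife"],
    ["fell", "off", "bike"],
]
vocab = sorted({term for doc in docs for term in doc})
print(vocab)
print(binary_vectors(docs, vocab)[0])
print([round(w, 3) for w in tfidf_vectors(docs, vocab)[0]])
```

In documents this short almost every term frequency is 1, so TF/IDF reduces to a per-term rescaling of the binary vector. That observation is at least consistent with the abstract's finding that TF/IDF offers no advantage on short narratives, although the sketch does not demonstrate the classification result itself.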

Citations: 0

Construction of Multi-level Networks Incorporating Molecule, Cell, Organ and Phenotype Properties for Drug-induced Phenotype Prediction

Data and Text Mining in Bioinformatics Pub Date : 2014-11-07 DOI: 10.1145/2665970.2665989

J. Jung, Hasun Yu, Seyeol Yoon, Mijin Kwon, Sungji Choo, Sangwoo Kim, Doheon Lee

Abstract: Inferring drug-induced phenotypes via computational approaches can give substantial support to the drug discovery process. However, existing computational models, which are mainly based on a single-cell or single-organ model, are thought to be limited because phenotypes are consequences of stochastic biochemical processes among distant cells and organs as well as among molecules confined to one cell. Therefore, there is an urgent demand for a new computational model that represents heterogeneous biochemical interactions spanning the entire human body. To meet this demand, we constructed multi-level networks that incorporate previously uncovered high-level properties such as molecules, cells, organs, and phenotypes. Currently, the networks consist of 1,776,506 edges, including molecular networks within 76 pre-defined cell types, inter-cell interactions among the cell types, and gene (protein) relations to 429 phenotypes. We also plan to verify whether known drug-induced phenotypes are reproducible in the networks using a Petri-net-based simulation.

Citations: 0

Biomedical Named Entity Recognition Based on the Combination of Regional and Global Text Features

Data and Text Mining in Bioinformatics Pub Date : 2014-11-07 DOI: 10.1145/2665970.2665990

Y. Jeong, Dahee Lee, Namgi Han, Won Chul Kim, Min Song

Abstract: Biomedical information extraction, especially Named Entity Recognition (NER), is a primary task in biomedical text mining due to the rapid growth of large-scale literature. Biomedical entity extraction aims to identify specific entities (words or phrases) in unstructured text data. In this work, we introduce a novel biomedical NER system that uses a combination of regional and global text features: linguistic, lexical, contextual, and syntactic features. Our system adopts Conditional Random Fields (CRFs) [1] as its machine learning algorithm and consists of two major pipelines (see Figure 1). We focus especially on constructing the first pipeline for text processing in a modular manner and on discovering rich feature sets covering comprehensive linguistics and contexts. To implement the CRF framework in the second pipeline, our system uses a modified version of Mallet [2] to take advantage of feature induction. In 10-fold cross-validation, our system achieves an F-measure improvement of 0.99% up to 18.47%, as well as the highest precision, compared with existing open-source biomedical NER systems on the GENETAG corpus [3]. We find that several components, such as abundant key features, external resources, and feature induction, contribute to the performance of the proposed system.
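To give a sense of what lexical and contextual features for a CRF tagger can look like, here is a minimal sketch of a per-token feature function. The feature names and the `token_features` helper are illustrative assumptions; the abstract does not list the system's actual feature set, and the global and syntactic features it mentions are not covered here.

```python
import re

def token_features(tokens, i):
    """Build a feature dict for tokens[i]; names are illustrative only."""
    tok = tokens[i]
    # Word shape: digits -> 0, uppercase -> A, lowercase -> a.
    shape = re.sub(r"\d", "0", tok)
    shape = re.sub(r"[A-Z]", "A", shape)
    shape = re.sub(r"[a-z]", "a", shape)
    return {
        "lower": tok.lower(),
        "has_digit": any(c.isdigit() for c in tok),
        "has_upper": any(c.isupper() for c in tok),
        "has_hyphen": "-" in tok,
        "shape": shape,
        # Neighbouring tokens give a simple contextual signal.
        "prev": tokens[i - 1].lower() if i > 0 else "<BOS>",
        "next": tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>",
    }

tokens = ["The", "IL-2", "gene", "was", "expressed"]
print(token_features(tokens, 1))
```

Mixed-case tokens with digits and hyphens, such as gene and protein names, produce distinctive shapes like `AA-0`, which is one reason shape features are common in biomedical NER.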

Citations: 0

A prototype application for real-time recognition and disambiguation of clinical abbreviations

Data and Text Mining in Bioinformatics Pub Date : 2013-11-01 DOI: 10.1145/2512089.2512096

Yonghui Wu, J. Denny, S. Rosenbloom, R. Miller, D. Giuse, Min Song, Hua Xu

Abstract: To save time, healthcare providers frequently use abbreviations while authoring clinical documents. Nevertheless, abbreviations that authors deem unambiguous often confuse other readers, including clinicians, patients, and natural language processing (NLP) systems. Most current clinical NLP systems "post-process" notes long after clinicians enter them into electronic health record systems (EHRs). Such post-processing cannot guarantee 100% accuracy in abbreviation identification and disambiguation, since multiple alternative interpretations exist. In this paper, the authors describe a prototype system for real-time Clinical Abbreviation Recognition and Disambiguation (CARD), i.e., a system that interacts with authors during note generation to verify correct abbreviation senses. The CARD system design anticipates future integration with web-based clinical documentation systems to improve the quality of healthcare records. The prototype application embodies three word sense disambiguation (WSD) methods. We evaluated the accuracy and response times of the prototype CARD system in a simulated study. Using an existing test data set of 25 commonly observed, highly ambiguous clinical abbreviations, the evaluation demonstrated that the best WSD method had an accuracy of 88.8% and a reasonable average response time of 1.6 milliseconds per abbreviation. The study indicates the potential feasibility of real-time NLP-enabled abbreviation disambiguation within clinical documentation systems.

Citations: 12