Pierre Zweigenbaum, Thomas Lavergne, Natalia Grabar, Thierry Hamon, Sophie Rosset, Cyril Grouin
{"title":"Combining an expert-based medical entity recognizer to a machine-learning system: methods and a case study.","authors":"Pierre Zweigenbaum, Thomas Lavergne, Natalia Grabar, Thierry Hamon, Sophie Rosset, Cyril Grouin","doi":"10.4137/BII.S11770","DOIUrl":"https://doi.org/10.4137/BII.S11770","url":null,"abstract":"<p><p>Medical entity recognition is currently generally performed by data-driven methods based on supervised machine learning. Expert-based systems, where linguistic and domain expertise are directly provided to the system, are often combined with data-driven systems. We present here a case study where an existing expert-based medical entity recognition system, Ogmios, is combined with a data-driven system, Caramba, based on a linear-chain Conditional Random Field (CRF) classifier. Our case study specifically highlights the risk of overfitting incurred by an expert-based system. We observe that it prevents the combination of the 2 systems from obtaining improvements in precision, recall, or F-measure, and analyze the underlying mechanisms through a post-hoc feature-level analysis. Wrapping the expert-based system alone as attributes input to a CRF classifier does boost its F-measure from 0.603 to 0.710, bringing it on par with the data-driven system. The generalization of this method remains to be further investigated. 
</p>","PeriodicalId":88397,"journal":{"name":"Biomedical informatics insights","volume":"6 Suppl 1","pages":"51-62"},"PeriodicalIF":0.0,"publicationDate":"2013-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.4137/BII.S11770","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"31747827","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Mindy K Ross, Ko-Wei Lin, Karen Truong, Abhishek Kumar, Mike Conway
{"title":"Text Categorization of Heart, Lung, and Blood Studies in the Database of Genotypes and Phenotypes (dbGaP) Utilizing n-grams and Metadata Features.","authors":"Mindy K Ross, Ko-Wei Lin, Karen Truong, Abhishek Kumar, Mike Conway","doi":"10.4137/BII.S11987","DOIUrl":"https://doi.org/10.4137/BII.S11987","url":null,"abstract":"<p><p>The database of Genotypes and Phenotypes (dbGaP) allows researchers to understand phenotypic contribution to genetic conditions, generate new hypotheses, confirm previous study results, and identify control populations. However, effective use of the database is hindered by suboptimal study retrieval. Our objective is to evaluate text classification techniques to improve study retrieval in the context of the dbGaP database. We utilized standard machine learning algorithms (naive Bayes, support vector machines, and the C4.5 decision tree) trained on dbGaP study text and incorporated n-gram features and study metadata to identify heart, lung, and blood studies. We used the χ² feature selection algorithm to identify features that contributed most to classification performance and experimented with dbGaP associated PubMed papers as a proxy for topicality. Classifier performance was favorable in comparison to keyword-based search results. It was determined that text categorization is a useful complement to document retrieval techniques in the dbGaP. 
</p>","PeriodicalId":88397,"journal":{"name":"Biomedical informatics insights","volume":"6 ","pages":"35-45"},"PeriodicalIF":0.0,"publicationDate":"2013-07-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.4137/BII.S11987","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"31641327","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Using conversation topics for predicting therapy outcomes in schizophrenia.","authors":"Christine Howes, Matthew Purver, Rose McCabe","doi":"10.4137/BII.S11661","DOIUrl":"https://doi.org/10.4137/BII.S11661","url":null,"abstract":"<p><p>Previous research shows that aspects of doctor-patient communication in therapy can predict patient symptoms, satisfaction and future adherence to treatment (a significant problem with conditions such as schizophrenia). However, automatic prediction has so far shown success only when based on low-level lexical features, and it is unclear how well these can generalize to new data, or whether their effectiveness is due to their capturing aspects of style, structure or content. Here, we examine the use of topic as a higher-level measure of content, more likely to generalize and to have more explanatory power. Investigations show that while topics predict some important factors such as patient satisfaction and ratings of therapy quality, they lack the full predictive power of lower-level features. For some factors, unsupervised methods produce models comparable to manual annotation. </p>","PeriodicalId":88397,"journal":{"name":"Biomedical informatics insights","volume":"6 Suppl 1","pages":"39-50"},"PeriodicalIF":0.0,"publicationDate":"2013-07-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.4137/BII.S11661","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"31655624","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Sunghwan Sohn, Cheryl Clark, Scott R Halgrim, Sean P Murphy, Siddhartha R Jonnalagadda, Kavishwar B Wagholikar, Stephen T Wu, Christopher G Chute, Hongfang Liu
{"title":"Analysis of cross-institutional medication description patterns in clinical narratives.","authors":"Sunghwan Sohn, Cheryl Clark, Scott R Halgrim, Sean P Murphy, Siddhartha R Jonnalagadda, Kavishwar B Wagholikar, Stephen T Wu, Christopher G Chute, Hongfang Liu","doi":"10.4137/BII.S11634","DOIUrl":"https://doi.org/10.4137/BII.S11634","url":null,"abstract":"<p><p>A large amount of medication information resides in the unstructured text found in electronic medical records, which requires advanced techniques to be properly mined. In clinical notes, medication information follows certain semantic patterns (eg, medication, dosage, frequency, and mode). Some medication descriptions contain additional word(s) between medication attributes. Therefore, it is essential to understand the semantic patterns as well as the patterns of the context interspersed among them (ie, context patterns) to effectively extract comprehensive medication information. In this paper we examined both semantic and context patterns, and compared those found in Mayo Clinic and i2b2 challenge data. We found that some variations exist between the institutions but the dominant patterns are common. </p>","PeriodicalId":88397,"journal":{"name":"Biomedical informatics insights","volume":"6 Suppl 1","pages":"7-16"},"PeriodicalIF":0.0,"publicationDate":"2013-06-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.4137/BII.S11634","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"31574766","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Towards Converting Clinical Phrases into SNOMED CT Expressions.","authors":"Rohit J Kate","doi":"10.4137/BII.S11645","DOIUrl":"https://doi.org/10.4137/BII.S11645","url":null,"abstract":"<p><p>Converting information contained in natural language clinical text into computer-amenable structured representations can automate many clinical applications. As a step towards that goal, we present a method which could help in converting novel clinical phrases into new expressions in SNOMED CT, a standard clinical terminology. Since expressions in SNOMED CT are written in terms of their relations with other SNOMED CT concepts, we formulate the important task of identifying relations between clinical phrases and SNOMED CT concepts. We present a machine learning approach for this task and using the dataset of existing SNOMED CT relations we show that it performs well. </p>","PeriodicalId":88397,"journal":{"name":"Biomedical informatics insights","volume":"6 Suppl 1","pages":"29-37"},"PeriodicalIF":0.0,"publicationDate":"2013-06-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.4137/BII.S11645","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"31574768","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Siddhartha Jonnalagadda, Trevor Cohen, Stephen Wu, Hongfang Liu, Graciela Gonzalez
{"title":"Using empirically constructed lexical resources for named entity recognition.","authors":"Siddhartha Jonnalagadda, Trevor Cohen, Stephen Wu, Hongfang Liu, Graciela Gonzalez","doi":"10.4137/BII.S11664","DOIUrl":"https://doi.org/10.4137/BII.S11664","url":null,"abstract":"<p><p>Because of privacy concerns and the expense involved in creating an annotated corpus, existing small annotated corpora might not have sufficient examples for learning to statistically extract all the named entities precisely. In this work, we evaluate what value may lie in automatically generated features based on distributional semantics when using machine-learning named entity recognition (NER). The features we generated and experimented with include n-nearest words, support vector machine (SVM)-regions, and term clustering, all of which are considered distributional semantic features. Adding the n-nearest words feature to a baseline system resulted in a greater increase in F-score than adding a manually constructed lexicon. Although the need for relatively small annotated corpora for retraining is not obviated, lexicons empirically derived from unannotated text can not only supplement manually created lexicons, but also replace them. This phenomenon is observed in extracting concepts from both biomedical literature and clinical notes. </p>","PeriodicalId":88397,"journal":{"name":"Biomedical informatics insights","volume":"6 Suppl 1","pages":"17-27"},"PeriodicalIF":0.0,"publicationDate":"2013-06-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.4137/BII.S11664","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"31574767","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Computational semantics in clinical text.","authors":"Stephen Wu","doi":"10.4137/BII.S11847","DOIUrl":"https://doi.org/10.4137/BII.S11847","url":null,"abstract":"This special issue of Biomedical Informatics Insights presents the full paper proceedings of the first workshop on Computational Semantics in Clinical Text (CSCT), held in 2013. Along with Nigam Shah and Kevin Bretonnel Cohen, my co-organizers, I am grateful for BII’s willingness to produce this forward-looking publication.","PeriodicalId":88397,"journal":{"name":"Biomedical informatics insights","volume":"6 Suppl 1","pages":"3-5"},"PeriodicalIF":0.0,"publicationDate":"2013-06-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.4137/BII.S11847","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"31574765","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Computational semantics in clinical text supplement.","authors":"John P Pestian","doi":"10.4137/BII.S11868","DOIUrl":"https://doi.org/10.4137/BII.S11868","url":null,"abstract":"As scientists, we create and disseminate knowledge. Resources from various benefactors open the doors of discovery. Likewise, we are obliged to disseminate our findings where they will have an impact. We want our thoughts and words to be heard. Yet, neither creation nor dissemination of newfound knowledge is easy. Some facts are more stubborn than others; prying them loose and describing them takes effort and discipline. In the 1980s some of the challenges to dissemination were reduced when open-access journals emerged. While the hallowed peer-review process remained, these journals provided access to knowledge without financial, legal or technical constraints to the reader. They provided an innovative venue to disseminate findings by using the world wide web as the main source of distribution. 1 The impact of these journals is growing. In 2000 there were 740 open-access journals that produced 19,500 articles. In 2009, this grew to 4769 journals and 191,850 articles; this represents 20% of scholarly publications. 2 In the open access world, the journal increasingly assumes the distribution role formerly undertaken by institutional libraries, while maintaining essential editorial quality. Intuitively, the increased accessibility of open access journals ought to lead to a greater number of citations. Numerous studies have verified this. 3 Multiple studies have shown that articles published in an open access journal are referenced more frequently than those published elsewhere.3,4 I acknowledge that other factors influence whether a paper is cited aside from its publication in an open access journal: it must be widely accessible through the channels that researchers employ and, at the risk of making a trite argument, the paper must have sufficient merit to justify being cited. 
All of this supports the emerging importance of Biomedical Informatics Insights as a vehicle for disseminating scientific findings. In this special issue we present a second series of conference proceedings. The first, Sentiment Analysis of Suicide Notes: A Shared Task, 5 produced over 20 manuscripts and was published soon after the conference. This issue reviews the scientific productivity of the first Computational Semantics in Clinical Text conference. This conference, chaired by Drs. Stephen Wu, Nigam Shah, and Kevin Bretonnel Cohen is described elsewhere, but it is an honor for Biomedical Informatics Insights to be the repository of the proceedings.","PeriodicalId":88397,"journal":{"name":"Biomedical informatics insights","volume":"6 Suppl 1","pages":"1-2"},"PeriodicalIF":0.0,"publicationDate":"2013-06-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.4137/BII.S11868","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"31574764","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Sylvia Halász, Philip Brown, Cem Oktay, Arif Alper Cevik, Isa Kılıçaslan, Colin Goodall, Dennis G Cochrane, Thomas R Fowler, Guy Jacobson, Simon Tse, John R Allegra
{"title":"Using n-Grams for Syndromic Surveillance in a Turkish Emergency Department Without English Translation: A Feasibility Study.","authors":"Sylvia Halász, Philip Brown, Cem Oktay, Arif Alper Cevik, Isa Kılıçaslan, Colin Goodall, Dennis G Cochrane, Thomas R Fowler, Guy Jacobson, Simon Tse, John R Allegra","doi":"10.4137/BII.S11334","DOIUrl":"10.4137/BII.S11334","url":null,"abstract":"<p><strong>Introduction: </strong>Syndromic surveillance is designed for early detection of disease outbreaks. An important data source for syndromic surveillance is free-text chief complaints (CCs), which are generally recorded in the local language. For automated syndromic surveillance, CCs must be classified into predefined syndromic categories. The n-gram classifier is created by using text fragments to measure associations between chief complaints (CC) and a syndromic grouping of ICD codes.</p><p><strong>Objectives: </strong>The objective was to create a Turkish n-gram CC classifier for the respiratory syndrome and then compare daily volumes between the n-gram CC classifier and a respiratory ICD-10 code grouping on a test set of data.</p><p><strong>Methods: </strong>The design was a feasibility study based on retrospective cohort data. The setting was a university hospital emergency department (ED) in Turkey. Included were all ED visits in the 2002 database of this hospital. Two of the authors created a respiratory grouping of International Classification of Diseases, 10th Revision ICD-10-CM codes by consensus, chosen to be similar to a standard respiratory (RESP) grouping of ICD codes created by the Electronic Surveillance System for Early Notification of Community-based Epidemics (ESSENCE), a project of the Centers for Disease Control and Prevention. An n-gram method adapted from AT&T Labs' technologies was applied to the first 10 months of data as a training set to create a Turkish CC RESP classifier. 
The classifier was then tested on the subsequent 2 months of visits to generate a time series graph and determine the correlation between daily volumes measured by the CC classifier and those measured by the RESP ICD-10 grouping.</p><p><strong>Results: </strong>The Turkish ED database contained 30,157 visits. The correlation (R²) of n-gram versus ICD-10 for the test set was 0.78.</p><p><strong>Conclusion: </strong>The n-gram method automatically created a CC RESP classifier of the Turkish CCs that performed similarly to the ICD-10 RESP grouping. The n-gram technique has the advantage of systematic, consistent, and rapid deployment as well as language independence.</p>","PeriodicalId":88397,"journal":{"name":"Biomedical informatics insights","volume":"6 ","pages":"29-33"},"PeriodicalIF":0.0,"publicationDate":"2013-04-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.4137/BII.S11334","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"31450686","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Recognizing scientific artifacts in biomedical literature.","authors":"Tudor Groza, Hamed Hassanzadeh, Jane Hunter","doi":"10.4137/BII.S11572","DOIUrl":"https://doi.org/10.4137/BII.S11572","url":null,"abstract":"<p><p>Today's search engines and digital libraries offer little or no support for discovering those scientific artifacts (hypotheses, supporting/contradicting statements, or findings) that form the core of scientific written communication. Consequently, we currently have no means of identifying central themes within a domain or to detect gaps between accepted knowledge and newly emerging knowledge as a means for tracking the evolution of hypotheses from incipient phases to maturity or decline. We present a hybrid Machine Learning approach using an ensemble of four classifiers, for recognizing scientific artifacts (ie, hypotheses, background, motivation, objectives, and findings) within biomedical research publications, as a precursory step to the general goal of automatically creating argumentative discourse networks that span across multiple publications. The performance achieved by the classifiers ranges from 15.30% to 78.39%, subject to the target class. The set of features used for classification has led to promising results. 
Furthermore, their use strictly in a local, publication scope, ie, without aggregating corpus-wide statistics, increases the versatility of the ensemble of classifiers and enables its direct applicability without the necessity of re-training.</p>","PeriodicalId":88397,"journal":{"name":"Biomedical informatics insights","volume":"6 ","pages":"15-27"},"PeriodicalIF":0.0,"publicationDate":"2013-04-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.4137/BII.S11572","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"31408254","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}