João Vitor Andrioli de Souza, Yohan Bonescki Gumiel, Lucas E. S. Oliveira, C. Moro
Title: Named Entity Recognition for Clinical Portuguese Corpus with Conditional Random Fields and Semantic Groups
DOI: 10.5753/SBCAS.2019.6269 (https://doi.org/10.5753/SBCAS.2019.6269)
Published in: Anais do Simpósio Brasileiro de Computação Aplicada à Saúde (SBCAS 2019)
Publication date: 2019-06-11
Citations: 4
Abstract
Considering the difficulty of extracting entities from Electronic Health Record (EHR) texts in Portuguese, we explore the Conditional Random Fields (CRF) algorithm to build a Named Entity Recognition (NER) system based on a corpus of clinical Portuguese data annotated by experts. We present the challenges and methods involved in classifying Abbreviations, Disorders, Procedures and Chemicals within the texts. With a meaningful set of features and the best-performing parameters, the results demonstrate that the method is promising and may support other biomedical tasks. Nonetheless, further experiments with more features, different architectures and more sophisticated preprocessing steps are needed.
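To make the idea of feature selection for a CRF tagger concrete, the following is a minimal sketch of token-level feature extraction of the kind commonly fed to a linear-chain CRF. It is not the authors' feature set: the abstract does not list the features used, so every feature name below, the example sentence, and its BIO labels are assumptions made purely for illustration. Only the four label categories (Abbreviation, Disorder, Procedure, Chemical) come from the abstract.

```python
from typing import Dict, List


def token_features(tokens: List[str], i: int) -> Dict[str, object]:
    """Build a feature dict for the token at position i.

    The feature set is hypothetical (not taken from the paper): surface
    form, suffix, capitalisation, digits, and a one-token context window.
    """
    word = tokens[i]
    feats: Dict[str, object] = {
        "bias": 1.0,
        "lower": word.lower(),
        "suffix3": word[-3:].lower(),
        "is_upper": word.isupper(),   # all-caps tokens often signal abbreviations
        "is_title": word.istitle(),
        "has_digit": any(ch.isdigit() for ch in word),
        "length": len(word),
    }
    if i > 0:
        feats["prev_lower"] = tokens[i - 1].lower()
    else:
        feats["BOS"] = True           # beginning of sentence
    if i < len(tokens) - 1:
        feats["next_lower"] = tokens[i + 1].lower()
    else:
        feats["EOS"] = True           # end of sentence
    return feats


def sentence_features(tokens: List[str]) -> List[Dict[str, object]]:
    """Map a tokenised sentence to one feature dict per token."""
    return [token_features(tokens, i) for i in range(len(tokens))]


# Invented example sentence and labels, not drawn from the annotated corpus.
tokens = ["Paciente", "com", "HAS", "em", "uso", "de", "losartana"]
labels = ["O", "O", "B-Abbreviation", "O", "O", "O", "B-Chemical"]

X = sentence_features(tokens)
for tok, lab, feats in zip(tokens, labels, X):
    print(tok, lab, sorted(feats))
```

A list of per-token feature dicts paired with a label sequence is the input shape accepted by common CRF toolkits such as sklearn-crfsuite, so a sketch like this could serve as the preprocessing step before training; which features actually help on clinical Portuguese text would need to be determined experimentally.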