Developing a Transformer-based Clinical Part-of-Speech Tagger for Brazilian Portuguese
Elisa Terumi Rubel Schneider, Yohan Bonescki Gumiel, L. A. F. D. Oliveira, Carolina de Oliveira Montenegro, Laura Rubel Barzotto, C. Moro, A. Pagano, E. Paraiso
Journal of health informatics, published 2023-07-20
DOI: https://doi.org/10.59681/2175-4411.v15.iespecial.2023.1086
Abstract
Electronic Health Records are a valuable source of information that can be extracted by means of natural language processing (NLP) tasks, such as morphosyntactic word tagging. Although there have been significant advances in health NLP, such as the Transformer architecture, languages such as Portuguese are still underrepresented. This paper presents taggers developed for Portuguese texts, fine-tuned from the BioBERTpt (clinical/biomedical) and BERTimbau (generic) models on a POS-tagged corpus. We achieved an accuracy of 0.9826, state-of-the-art for the corpus used. In addition, we performed a human-based evaluation of the trained models and of others in the literature, using authentic clinical narratives. Our clinical model achieved an accuracy of 0.8145, compared to 0.7656 for the generic model. It also showed competitive results compared to models trained specifically on clinical texts, which indicates that the domain of the base model affects performance on NLP tasks.
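As an illustration of the accuracy figures reported above, the following is a minimal sketch of how tagging accuracy could be computed, assuming it is measured per token as the share of predicted tags that match the gold tags. The abstract does not state the exact scoring procedure, so this is an assumption; the function name `token_accuracy` and the toy sentences are hypothetical and are not data from the paper.

```python
from typing import Sequence


def token_accuracy(
    gold: Sequence[Sequence[str]], pred: Sequence[Sequence[str]]
) -> float:
    """Return the fraction of tokens whose predicted tag equals the gold tag.

    Assumption: accuracy is token-level and every sentence has exactly one
    predicted tag per gold tag.
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must contain the same number of sentences")

    correct = 0
    total = 0
    for gold_sent, pred_sent in zip(gold, pred):
        if len(gold_sent) != len(pred_sent):
            raise ValueError("each sentence needs one predicted tag per gold tag")
        # Count the positions where the two tag sequences agree.
        correct += sum(g == p for g, p in zip(gold_sent, pred_sent))
        total += len(gold_sent)

    if total == 0:
        raise ValueError("no tokens to score")
    return correct / total


# Toy, made-up tag sequences (hypothetical, not taken from the paper's corpus).
gold = [["DET", "NOUN", "VERB"], ["NOUN", "ADJ"]]
pred = [["DET", "NOUN", "VERB"], ["NOUN", "NOUN"]]
print(token_accuracy(gold, pred))  # 4 of the 5 tags match
```

Under this reading, a score of 0.8145 would mean that roughly 81 of every 100 tokens in the evaluated clinical narratives received the tag the human evaluators considered correct.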