Challenges and Issues on Extracting Named Entities from Oncology Clinical Notes
Luiz Henrique Pereira Niero, João Vitor Andrioli de Souza, Luciana Martins Gomes da Silva, Yohan Bonescki Gumiel, N. H. Borges, G. Piotto, Gustavo Giavarini, Lucas E. S. Oliveira
Journal of health informatics, published 2023-07-20
DOI: 10.59681/2175-4411.v15.iespecial.2023.1097 (https://doi.org/10.59681/2175-4411.v15.iespecial.2023.1097)
Abstract
This article describes the annotation of a multi-institutional corpus of oncology clinical texts and the training of Named Entity Recognition (NER) models on that corpus. The models were obtained by fine-tuning BioBERTpt, a Bidirectional Encoder Representations from Transformers (BERT) model adapted to the medical-biological domain of the Portuguese language. To observe how model behavior changes as training data grows, we trained models on incrementally larger portions of the annotated corpus and compared each model's results against the amount of data used to train it. We found that models trained on smaller but fully revised datasets performed better than models trained on larger datasets with little revision.
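The incremental-data setup described in the abstract can be illustrated with a minimal sketch. This is not the authors' code: the function name `incremental_subsets`, the fractions, the seed, and the placeholder note identifiers are all hypothetical, since the abstract does not state the split sizes. The sketch only builds nested training subsets; fine-tuning BioBERTpt on each subset and scoring it on a common test set is assumed to happen afterwards.

```python
import random


def incremental_subsets(documents, fractions=(0.25, 0.5, 0.75, 1.0), seed=13):
    """Return nested training subsets, one per fraction, smallest first.

    The fractions and seed are illustrative assumptions, not values
    taken from the paper.
    """
    shuffled = list(documents)
    # Shuffle once so that every larger subset contains the smaller ones.
    random.Random(seed).shuffle(shuffled)
    subsets = []
    for fraction in fractions:
        size = max(1, round(fraction * len(shuffled)))
        subsets.append(shuffled[:size])
    return subsets


# Hypothetical stand-ins for annotated clinical notes.
notes = [f"note_{i}" for i in range(8)]
subsets = incremental_subsets(notes)
for subset in subsets:
    print(len(subset), "training documents")
```

Shuffling once and slicing prefixes keeps the subsets nested, so any difference between two models comes from the added documents rather than from a different random sample.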