Anastasios Lamproudis, Sara Mora, Therese Olsen Svenning, Torbjørn Torsvik, Taridzo Chomutare, Phuong Dinh Ngo, Hercules Dalianis
De-identifying Norwegian Clinical Text using Resources from Swedish and Danish
AMIA Annual Symposium Proceedings, vol. 2023, pp. 456-464. Published 2024-01-11.
Full text (PDF): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10785939/pdf/
Abstract
The lack of relevant annotated datasets is a key limitation when applying Natural Language Processing techniques to many tasks, among them the identification of Protected Health Information (PHI) in Norwegian clinical text. This work explores whether resources from Swedish, a language very closely related to Norwegian, can be exploited for that task. The Swedish dataset is annotated with PHI. Different processing and text augmentation techniques are evaluated, along with their impact on the final performance of the model. The augmentation techniques, such as injecting and generating both Norwegian and Scandinavian named entities in the Swedish training corpus, were shown to improve performance on the de-identification task for both Danish and Norwegian text. This trend was also confirmed by evaluating model performance on a sample of Norwegian gastro-surgical clinical text.
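
To make the entity-injection idea concrete, the sketch below swaps each PHI span in a labelled Swedish sentence for a randomly chosen Norwegian entity of the same class. It is only an illustration under stated assumptions, not the authors' implementation: the label names (`First_Name`, `Last_Name`, `Location`), the tiny `NORWEGIAN_ENTITIES` gazetteer, and the `inject_entities` helper are all hypothetical, and the abstract does not specify the tagging scheme or the entity sources that were actually used.

```python
import random

# Hypothetical gazetteer; a real setup would draw on much larger name lists.
NORWEGIAN_ENTITIES = {
    "First_Name": ["Ingrid", "Håkon", "Sigrid"],
    "Last_Name": ["Haugen", "Bakke", "Solberg"],
    "Location": ["Tromsø", "Bergen", "Bodø"],
}


def inject_entities(tokens, labels, gazetteer, rng):
    """Replace each PHI span with a random same-class gazetteer entry.

    Assumes one plain class label per token ("O" for non-PHI), so adjacent
    tokens with the same label are treated as a single span.
    """
    out_tokens, out_labels = [], []
    i = 0
    while i < len(tokens):
        label = labels[i]
        if label == "O" or label not in gazetteer:
            # Non-PHI tokens (or classes without replacements) are kept as-is.
            out_tokens.append(tokens[i])
            out_labels.append(label)
            i += 1
            continue
        # Find the end of the run of tokens sharing this label.
        j = i
        while j < len(tokens) and labels[j] == label:
            j += 1
        # A replacement may span several tokens; each one gets the same label.
        replacement = rng.choice(gazetteer[label]).split()
        out_tokens.extend(replacement)
        out_labels.extend([label] * len(replacement))
        i = j
    return out_tokens, out_labels


# Example: a made-up Swedish sentence with three PHI tokens.
tokens = ["Patienten", "Anna", "Svensson", "bor", "i", "Göteborg", "."]
labels = ["O", "First_Name", "Last_Name", "O", "O", "Location", "O"]
new_tokens, new_labels = inject_entities(
    tokens, labels, NORWEGIAN_ENTITIES, random.Random(0)
)
print(list(zip(new_tokens, new_labels)))
```

Because the labels are carried over with the replaced tokens, the augmented sentence can be added to the training corpus without re-annotation. The surrounding Swedish context is left untouched here; whether and how it should also be adapted is outside what this sketch covers.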