{"title":"Domain adaptation of transformer-based neural network model for clinical note classification in Indian healthcare","authors":"Swati Saigaonkar, Vaibhav Narawade","doi":"10.1007/s41870-024-02053-z","DOIUrl":null,"url":null,"abstract":"<p>The exploration of clinical notes has garnered attention, primarily owing to the wealth of unstructured information they encompass. Although substantial research has been carried out, notable gaps persist. One such gap pertains to the absence of work on real-time Indian data. The work commenced by initially using Medical Information Mart for Intensive Care (MIMIC III) dataset, concentrating on diseases such as Asthma, Myocardial Infarction (MI), and Chronic Kidney Diseases (CKD), for training the model. A novel model, transformer-based, was built which incorporated knowledge of abbreviations, symptoms, and domain knowledge of the diseases, named as SM-DBERT + + . Subsequently, the model was applied to an Indian dataset using transfer learning, where domain knowledge extracted from Indian sources was utilized to adapt to domain differences. Further, an ensemble of pre-trained models was built, applying transfer learning principles. Through this comprehensive methodology, we aimed to bridge the gap pertaining to the application of deep learning techniques to Indian healthcare datasets. The results obtained were better than fine-tuned Bidirectional Encoder Representations from Transformers (BERT), Distilled BERT (DISTILBERT) and A Lite BERT (ALBERT) models and also other specialized models like Scientific BERT (SCI-BERT), Clinical Biomedical BERT (Clinical Bio-BERT), and Biomedical BERT (BIOBERT) with an accuracy of 0.93 when full notes were used and an accuracy of 0.89 when extracted sections were used. It has demonstrated that model trained on one dataset yields good results on another similar dataset as this model incorporates domain knowledge which could be modified during transfer learning to adapt to the new domain.</p>","PeriodicalId":14138,"journal":{"name":"International Journal of Information Technology","volume":"107 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-08-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"International Journal of Information Technology","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1007/s41870-024-02053-z","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
The exploration of clinical notes has garnered attention, primarily owing to the wealth of unstructured information they contain. Although substantial research has been carried out, notable gaps persist; one such gap is the absence of work on real-time Indian data. The work commenced by training the model on the Medical Information Mart for Intensive Care (MIMIC-III) dataset, concentrating on diseases such as Asthma, Myocardial Infarction (MI), and Chronic Kidney Disease (CKD). A novel transformer-based model, named SM-DBERT++, was built, incorporating knowledge of abbreviations, symptoms, and domain knowledge of the diseases. Subsequently, the model was applied to an Indian dataset using transfer learning, where domain knowledge extracted from Indian sources was used to adapt to domain differences. Further, an ensemble of pre-trained models was built, applying transfer-learning principles. Through this comprehensive methodology, we aimed to bridge the gap in applying deep learning techniques to Indian healthcare datasets. The results were better than fine-tuned Bidirectional Encoder Representations from Transformers (BERT), Distilled BERT (DistilBERT), and A Lite BERT (ALBERT) models, as well as other specialized models such as Scientific BERT (SciBERT), Clinical Biomedical BERT (Clinical BioBERT), and Biomedical BERT (BioBERT), with an accuracy of 0.93 when full notes were used and 0.89 when extracted sections were used. This demonstrates that a model trained on one dataset yields good results on another, similar dataset, since the model incorporates domain knowledge that can be modified during transfer learning to adapt to the new domain.
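The abstract does not disclose SM-DBERT++'s internals, but the two-stage recipe it describes (fine-tune a transformer classifier on MIMIC-III, then continue training on Indian notes) can be sketched. The snippet below is a minimal, hypothetical illustration using a stock Hugging Face DistilBERT classifier; the model checkpoint, label set, learning rates, and example notes are assumptions, and the paper's abbreviation, symptom, and domain-knowledge injection is omitted.

```python
# Hypothetical sketch of the two-stage transfer-learning recipe described in
# the abstract. A plain DistilBERT classifier stands in for SM-DBERT++, whose
# knowledge-injection components are not specified in the abstract.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

LABELS = ["Asthma", "Myocardial Infarction", "Chronic Kidney Disease"]  # assumed label set

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=len(LABELS)
)

def encode(notes, labels):
    """Tokenize clinical notes and attach integer class labels."""
    batch = tokenizer(notes, truncation=True, padding=True,
                      max_length=512, return_tensors="pt")
    batch["labels"] = torch.tensor(labels)
    return batch

def fine_tune(model, batch, epochs=3, lr=2e-5):
    """Single-batch fine-tuning loop; a real run would iterate a DataLoader."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    model.train()
    for _ in range(epochs):
        optimizer.zero_grad()
        loss = model(**batch).loss  # cross-entropy from the classification head
        loss.backward()
        optimizer.step()
    return model

# Stage 1: fine-tune on MIMIC-III-style notes (placeholder example text).
mimic_batch = encode(["pt c/o wheezing and SOB; hx of asthma ..."], [0])
model = fine_tune(model, mimic_batch)

# Stage 2: transfer to Indian notes by continuing training at a lower
# learning rate, letting the classifier adapt to the new domain.
indian_batch = encode(["patient reports chest pain radiating to left arm ..."], [1])
model = fine_tune(model, indian_batch, lr=1e-5)
```

The ensemble step the abstract mentions would combine several such fine-tuned models (e.g., BERT, DistilBERT, ALBERT), for instance by averaging logits or majority-voting predictions; that detail is left out of the sketch for brevity.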