P. Deshpande, A. Rasin, Roselyne B. Tchoua, J. Furst, D. Raicu, Sameer Kiran Antani
Enhancing Recall Using Data Cleaning for Biomedical Big Data
2020 IEEE 33rd International Symposium on Computer-Based Medical Systems (CBMS), published 2020-07-01
DOI: 10.1109/CBMS49503.2020.00057 (https://doi.org/10.1109/CBMS49503.2020.00057)
Citations: 2
Abstract
In clinical practice, large amounts of heterogeneous medical data are generated daily. This data could support biomedical research and serve as a diagnostic reference for physicians. However, heterogeneous data must be integrated before it can be analyzed. The integration process includes a pre-processing data cleaning phase that eliminates inconsistencies and errors originating from each data source. In this paper, we describe a workflow for cleaning heterogeneous biomedical data sources. Our novel data cleaning approach can be applied to replace missing text and to increase the number of relevant cases retrieved by search queries. When the threshold for missing category replacement is met, our method achieves a missing content replacement precision of 85%, an improvement of 18% over the baseline state of our datasets.