Case study - Feature engineering inspired by domain experts on real world medical data
Olof Björneld, Martin Carlsson, Welf Löwe
Intelligence-based medicine, volume 8, Article 100110, published 2023-01-01
DOI: 10.1016/j.ibmed.2023.100110
https://www.sciencedirect.com/science/article/pii/S2666521223000248
Citation count: 0
Abstract
Performing data mining projects for knowledge discovery on health data that is produced in daily health care and stored in electronic health records (EHR) can be time-consuming. This study shows that involving a data scientist improves classification performance. We performed a case study comprising two real-world medical research projects, comparing feature engineering and knowledge discovery on the basis of classification performance. The first project (P1) comprised 82,742 patients and addressed the research question “Can we predict patient falls by use of EHR data?”; the second project (P2) included 23,396 patients and focused on “Negative side effects of antiepileptic drug consumption on bone structure”.
The study yielded three salient results. (i) It is valuable for medical researchers to involve a data scientist when conducting research on real-world medical data. This finding was supported by an analysis of classification metrics obtained with iteratively engineered features. The features were generated by domain experts and computer scientists in collaboration with medical researchers. We named this process domain knowledge-driven feature engineering (KDFE).
Classification performance was evaluated with the area under the receiver operating characteristic curve (AUROC). (ii) Domain experts benefit from KDFE in quantitative terms. Compared with the baseline, the average AUROC obtained with the engineered features rose from 0.62 to 0.82 for P1 and from 0.61 to 0.89 for P2 (p-values << 0.001). (iii) The engineered features were represented in a systematic structure, which forms the foundation of a theoretical model for automated KDFE (aKDFE).
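As a reading aid for the AUROC comparison above, the following is a minimal sketch of how the metric can be computed and compared between a baseline and an engineered feature set. It is not the authors' pipeline: the function `auroc`, the labels, and both score lists are hypothetical toy values chosen for illustration, and the pairwise (Mann-Whitney) formulation is just one standard way to obtain the area under the ROC curve.

```python
def auroc(labels, scores):
    """Area under the ROC curve via the Mann-Whitney U statistic.

    Equals the probability that a randomly chosen positive case receives a
    higher score than a randomly chosen negative case; ties count as one half.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUROC needs at least one positive and one negative case")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical toy data, not taken from the study.
labels = [1, 1, 1, 0, 0, 0]
baseline_scores = [0.6, 0.4, 0.5, 0.5, 0.3, 0.7]    # scores from weakly informative features
engineered_scores = [0.9, 0.7, 0.4, 0.5, 0.2, 0.1]  # scores that separate the classes better

baseline_auroc = auroc(labels, baseline_scores)
engineered_auroc = auroc(labels, engineered_scores)
print(f"baseline AUROC:   {baseline_auroc:.2f}")
print(f"engineered AUROC: {engineered_auroc:.2f}")
```

An AUROC of 0.5 corresponds to chance-level ranking and 1.0 to perfect separation, which is the scale on which the reported improvements for P1 and P2 should be read.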
To our knowledge, this is the first study to show through quantitative measures that KDFE adds value to real-world medical research projects. However, the method is not limited to the medical domain; other areas with similar data properties should also benefit from KDFE.