Case study - Feature engineering inspired by domain experts on real world medical data

Intelligence-based medicine. Pub Date: 2023-01-01. DOI: 10.1016/j.ibmed.2023.100110

Olof Björneld, Martin Carlsson, Welf Löwe

Intelligence-based medicine, volume 8, Article 100110. Full text: https://www.sciencedirect.com/science/article/pii/S2666521223000248


Abstract

Data mining projects for knowledge discovery that use health data produced in daily health care and stored in electronic health records (EHR) can be time consuming. This study exemplifies that involving a data scientist improves classification performance. We performed a case study comprising two real-world medical research projects and compared feature engineering and knowledge discovery on the basis of classification performance. The first project (P1) comprised 82,742 patients and addressed the research question “Can we predict patient falls by use of EHR data?”; the second project (P2) included 23,396 patients and focused on “Negative side effects of antiepileptic drug consumption on bone structure”.

The study yielded three salient results. (i) It is valuable for medical researchers to involve a data scientist when conducting research based on real-world medical data. This finding was supported by an analysis of classification metrics obtained with iteratively engineered features. The features were generated by domain experts and computer scientists in collaboration with medical researchers. We named this process domain knowledge-driven feature engineering (KDFE).
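As a rough illustration of what an engineered feature can look like, the sketch below aggregates event-level records into one feature row per patient. The record layout, the feature names, and the function `engineer_features` are hypothetical and chosen only for illustration; the abstract does not specify which features the authors built.

```python
from collections import defaultdict

# Hypothetical event-level rows: (patient_id, event_type, code, day).
# The codes are placeholders, not real drug or visit codes.
events = [
    ("p1", "drug", "drug_A", 3),
    ("p1", "drug", "drug_B", 40),
    ("p1", "visit", "emergency", 41),
    ("p2", "drug", "drug_A", 10),
]


def engineer_features(events, reference_day):
    """Aggregate raw events into one feature dict per patient (illustrative only)."""
    per_patient = defaultdict(list)
    for patient_id, event_type, code, day in events:
        per_patient[patient_id].append((event_type, code, day))

    features = {}
    for patient_id, rows in per_patient.items():
        drug_codes = {code for event_type, code, _ in rows if event_type == "drug"}
        last_day = max(day for _, _, day in rows)
        features[patient_id] = {
            "n_distinct_drugs": len(drug_codes),                # breadth of medication
            "n_events": len(rows),                              # overall record activity
            "days_since_last_event": reference_day - last_day,  # recency
        }
    return features


patient_features = engineer_features(events, reference_day=50)
```

In a collaboration of the kind described above, a domain expert would propose which aggregations are clinically meaningful and the data scientist would implement and iterate on them.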

Classification performance was evaluated with the area under the receiver operating characteristic curve (AUROC). (ii) Domain experts benefit from KDFE in quantitative terms. Compared with the baseline, the average AUROC obtained with the engineered features rose from 0.62 to 0.82 for P1 and from 0.61 to 0.89 for P2 (p-values << 0.001). (iii) The engineered features were represented in a systematic structure, which forms the foundation of a theoretical model for automated KDFE (aKDFE).
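For readers unfamiliar with AUROC, the following minimal sketch computes it through the pairwise-ranking (Mann-Whitney) formulation: the fraction of positive/negative pairs in which the positive case receives the higher score, with ties counted as one half. The function name `auroc` and the toy labels and scores are assumptions made for illustration; they are not data from the study, and the abstract does not state how the authors computed the metric.

```python
def auroc(labels, scores):
    """AUROC as the probability that a random positive outranks a random negative."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative label")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive correctly ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(positives) * len(negatives))


# Toy example (made-up numbers): 3 of the 4 positive/negative pairs are ordered correctly.
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
print(auroc(toy_labels, toy_scores))  # 0.75
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect separation, which is the scale on which the reported rise from roughly 0.6 to above 0.8 should be read.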

To our knowledge, this is the first study to show, via quantitative measures, that KDFE adds value in real-world settings. The method is not limited to the medical domain, however; other areas with similar data properties should also benefit from KDFE.
