Clipper: An efficient cluster-based data pruning technique for biomedical data to increase the accuracy of machine learning model prediction

IF 4.3 · Zone 3 (Computer Science) · JCR Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Egyptian Informatics Journal · Pub Date: 2025-03-20 · DOI: 10.1016/j.eij.2025.100641

M.B. Karadeniz, Ebru Efeoğlu, Burak Çelik, Adem Kocyigit, Bahattin Türetken

Volume 30, Article 100641 · Journal Article · https://www.sciencedirect.com/science/article/pii/S1110866525000349

Citations: 0

Abstract

The exponential rise in clinical research costs can potentially be mitigated by half through the implementation of machine learning-driven efficient data processing techniques. Traditional methods like data preprocessing and hyperparameter tuning, which are effective for model optimization, often introduce complexities that can diminish the benefits of machine learning integration. To overcome this issue, we present Clipper: a novel, cluster-based data pruning approach designed specifically for biomedical data, aiming to enhance the predictive accuracy of machine learning models. Clipper’s key advantage lies in its ability to automate the data pruning process, optimizing accuracy without the need for manual hyperparameter adjustments—a typically cumbersome aspect of machine learning tasks. Upon comprehensive comparative analysis, the proposed Clipper methodology demonstrates superior performance across various medical and biological datasets. Our experiments reveal Clipper’s consistent superiority over baseline models, with significant accuracy improvements: 44% for Heart Disease, 7% for Breast Cancer, 40% for Parkinson’s, and 20% for Raisin classification. Specifically, the model achieves remarkable predictive accuracy, with classification rates of 99.5% for Heart Disease, 99.64% for Breast Cancer, 99.47% for Parkinson’s Disease, and 93% for Raisin Classification, thereby substantially outperforming contemporary state-of-the-art computational techniques. The empirical evidence suggests that Clipper serves as an effective accuracy enhancer for baseline models, eliminating the need for parameter tuning or complex preprocessing steps. Furthermore, Clipper produces robust outputs even at very low split rates, where baseline models typically perform poorly.
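The abstract does not spell out how Clipper selects samples to prune, so the following is only a generic illustration of what cluster-based pruning can look like, not the authors' algorithm. It assumes a plain k-means clustering and a hypothetical rule, `prune_by_cluster_majority`, that drops training samples whose label disagrees with the majority label of their cluster. The function names, the deterministic initialization, and the toy data are all assumptions made for this sketch.

```python
from collections import Counter
from math import dist


def kmeans(points, k, iters=20):
    """Plain Lloyd's k-means; the first k points serve as initial centers (an assumption)."""
    centers = [tuple(p) for p in points[:k]]
    assign = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point goes to its nearest center.
        assign = [min(range(k), key=lambda c: dist(p, centers[c])) for p in points]
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:  # keep the old center if a cluster is empty
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return assign


def prune_by_cluster_majority(points, labels, k):
    """Return indices of samples whose label matches their cluster's majority label.

    Hypothetical pruning rule for illustration only; not taken from the paper.
    """
    assign = kmeans(points, k)
    majority = {}
    for c in set(assign):
        in_cluster = [lab for lab, a in zip(labels, assign) if a == c]
        majority[c] = Counter(in_cluster).most_common(1)[0][0]
    return [i for i, (lab, a) in enumerate(zip(labels, assign)) if lab == majority[a]]


# Toy data: two well-separated groups, plus one sample (index 6) that sits
# in the first group but carries the second group's label.
points = [(0, 0), (10, 10), (0, 1), (1, 0), (10, 11), (11, 10), (0.5, 0.5)]
labels = [0, 1, 0, 0, 1, 1, 1]

keep = prune_by_cluster_majority(points, labels, k=2)
print(keep)  # → [0, 1, 2, 3, 4, 5]
```

In a setup like this, the retained indices would be used to filter the training split before fitting an unmodified baseline classifier, which matches the abstract's framing of pruning as a step that leaves the model and its hyperparameters untouched.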


Source journal

Egyptian Informatics Journal (Decision Sciences: Management Science and Operations Research)

CiteScore

11.10

Self-citation rate

1.90%

Articles published

Review time

110 days

Journal description: The Egyptian Informatics Journal is published by the Faculty of Computers and Artificial Intelligence, Cairo University. The journal provides a forum for state-of-the-art research and development in computing, including computer science, information technology, information systems, operations research, and decision support. It encourages submissions of innovative, previously unpublished work in these areas from academic, research, and commercial sources.