Kevin Fee, Suneil Jain, Ross G Murphy, Anna Jurek-Loughrey
{"title":"Towards a Biological Evaluation Framework for Oversampling (BEFO) gene expression data.","authors":"Kevin Fee, Suneil Jain, Ross G Murphy, Anna Jurek-Loughrey","doi":"10.1016/j.jbi.2025.104932","DOIUrl":null,"url":null,"abstract":"<p><p>Machine learning (ML) techniques are progressively being used in biomedical research to improve diagnostic and prognostic accuracy when used in conjunction with a clinician as a decision support system. However, many datasets used in biomedical research often suffer from severe class imbalance due to small population sizes, which causes machine learning models to become biased to majority class samples. Current oversampling methods primarily focus on balancing datasets without adequately validating the biological relevance of synthetic data, risking the clinical applicability of downstream model predictions. To address these shortcomings, we propose the Biological Evaluation Framework for Oversampling (BEFO) designed to ensure that synthetic gene expression samples accurately reflect the biological patterns present in original datasets. This innovation not only mitigates bias but enhances the trustworthiness of predictive models in clinical scenarios. We have developed a ranking method for synthetic samples based on this and evaluated each sample's inclusion based on its rank. This ranking method calculates the WGCNA gene co-expression clusters on the original dataset. Several random forests are constructed to assess the alignment of each synthetic sample to each cluster. Only synthetic samples more important than real samples are included in a study. The experimental results demonstrate that our proposed ML oversampling framework can improve the biological feasibility of oversampled datasets by an average of 11%, leading to improved classification performance by an average of 9% when compared against five state-of-the-art (SOTA) oversampling methods and ten classification algorithms across six real world gene expressions datasets. Thereby establishing a new standard for synthetic data evaluation in biomedical ML applications.</p>","PeriodicalId":15263,"journal":{"name":"Journal of Biomedical Informatics","volume":" ","pages":"104932"},"PeriodicalIF":4.5000,"publicationDate":"2025-10-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of Biomedical Informatics","FirstCategoryId":"3","ListUrlMain":"https://doi.org/10.1016/j.jbi.2025.104932","RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS","Score":null,"Total":0}
Citations: 0
Abstract
Machine learning (ML) techniques are increasingly used in biomedical research to improve diagnostic and prognostic accuracy when deployed alongside a clinician as a decision support system. However, many datasets used in biomedical research suffer from severe class imbalance due to small population sizes, which causes machine learning models to become biased toward majority-class samples. Current oversampling methods focus primarily on balancing datasets without adequately validating the biological relevance of the synthetic data, putting the clinical applicability of downstream model predictions at risk. To address these shortcomings, we propose the Biological Evaluation Framework for Oversampling (BEFO), designed to ensure that synthetic gene expression samples accurately reflect the biological patterns present in the original datasets. This not only mitigates bias but also enhances the trustworthiness of predictive models in clinical scenarios. Building on this framework, we developed a ranking method for synthetic samples and evaluate each sample's inclusion based on its rank. The ranking method computes WGCNA gene co-expression clusters on the original dataset, and several random forests are then constructed to assess the alignment of each synthetic sample with each cluster. Only synthetic samples ranked as more important than real samples are retained in a study. The experimental results demonstrate that our proposed ML oversampling framework improves the biological feasibility of oversampled datasets by an average of 11% and classification performance by an average of 9% when compared against five state-of-the-art (SOTA) oversampling methods and ten classification algorithms across six real-world gene expression datasets. This establishes a new standard for synthetic data evaluation in biomedical ML applications.
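The following is a minimal, hypothetical sketch of the filtering idea the abstract describes: cluster genes by co-expression, score each synthetic sample's alignment to those clusters with random forests, and keep only synthetic samples that rank at least as well as real ones. It does not reproduce the paper's actual implementation: correlation-based hierarchical clustering stands in for WGCNA module detection, out-of-bag "realness" probabilities stand in for the paper's importance ranking, and all function names and the median-score threshold are illustrative assumptions.

```python
# Hypothetical sketch of a BEFO-style filter for synthetic gene expression samples.
# Assumptions (not from the paper): hierarchical clustering replaces WGCNA,
# per-cluster random forests score "realness" via out-of-bag probabilities,
# and synthetic samples are kept if they score at least the median real sample.
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.ensemble import RandomForestClassifier


def cluster_genes(X_real, n_clusters=5):
    """Group genes into co-expression clusters (stand-in for WGCNA modules)."""
    corr = np.corrcoef(X_real.T)                       # gene-by-gene correlation
    dist = 1.0 - np.abs(corr)                          # co-expression dissimilarity
    condensed = dist[np.triu_indices_from(dist, k=1)]  # condensed distance vector
    Z = linkage(condensed, method="average")
    return fcluster(Z, t=n_clusters, criterion="maxclust")


def befo_style_filter(X_real, X_synth, n_clusters=5, random_state=0):
    """Return indices of synthetic samples whose summed per-cluster 'realness'
    score is at least the median score of the real samples themselves."""
    labels = cluster_genes(X_real, n_clusters)
    real_scores = np.zeros(len(X_real))
    synth_scores = np.zeros(len(X_synth))
    for c in np.unique(labels):
        genes = labels == c
        X = np.vstack([X_real[:, genes], X_synth[:, genes]])
        y = np.r_[np.ones(len(X_real)), np.zeros(len(X_synth))]  # 1 = real
        rf = RandomForestClassifier(n_estimators=200, oob_score=True,
                                    random_state=random_state).fit(X, y)
        p_real = rf.oob_decision_function_[:, 1]       # out-of-bag P(real)
        real_scores += p_real[: len(X_real)]
        synth_scores += p_real[len(X_real):]
    threshold = np.median(real_scores)
    return np.where(synth_scores >= threshold)[0]
```

In this sketch a synthetic sample is judged "biologically plausible" when the per-cluster forests cannot reliably separate it from real samples, which is one plausible reading of the abstract's rank-based inclusion rule; the paper's own ranking and threshold may differ.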
About the journal:
The Journal of Biomedical Informatics reflects a commitment to high-quality original research papers, reviews, and commentaries in the area of biomedical informatics methodology. Although we publish articles motivated by applications in the biomedical sciences (for example, clinical medicine, health care, population health, and translational bioinformatics), the journal emphasizes reports of new methodologies and techniques that have general applicability and that form the basis for the evolving science of biomedical informatics. Articles on medical devices; evaluations of implemented systems (including clinical trials of information technologies); or papers that provide insight into a biological process, a specific disease, or treatment options would generally be more suitable for publication in other venues. Papers on applications of signal processing and image analysis are often more suitable for biomedical engineering journals or other informatics journals, although we do publish papers that emphasize the information management and knowledge representation/modeling issues that arise in the storage and use of biological signals and images. System descriptions are welcome if they illustrate and substantiate the underlying methodology that is the principal focus of the report and an effort is made to address the generalizability and/or range of application of that methodology. Note also that, given the international nature of JBI, papers that deal with specific languages other than English, or with country-specific health systems or approaches, are acceptable for JBI only if they offer generalizable lessons that are relevant to the broad JBI readership, regardless of their country, language, culture, or health system.