Class-balanced negative training sets for improving classifier model predictions of enhancer-promoter interactions.

IF 3.3 · CAS Zone 3 (Biology) · JCR Q2 (Biochemical Research Methods)

BMC Bioinformatics Pub Date : 2025-06-02 DOI:10.1186/s12859-025-06171-8

Osamu Maruyama, Tsukasa Koga

Abstract

Background: Enhancers regulate gene expression by forming DNA loops that bring them into close proximity with their target gene promoters. The human genome contains hundreds of thousands of enhancers, vastly outnumbering its 20,000-25,000 protein-coding genes, which highlights the importance of enhancer-promoter interactions (EPIs) in gene regulation. Supervised learning models have been developed to predict EPIs, often using experimentally validated interacting enhancer-promoter pairs as positives and artificially generated pairs as negatives. However, the lack of reliable negative samples presents a challenge. Current methods randomly select pairs from unlabeled data, leading to class imbalance and reduced predictive performance. This imbalance, where individual enhancers and promoters are unevenly distributed between the positive and negative sets, keeps classifiers from learning meaningful patterns. Therefore, constructing more reliable negative samples is crucial for improving the accuracy of EPI predictions.
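To make the imbalance described above concrete, the following is a minimal sketch that counts how often each enhancer and promoter appears in the positive set versus a randomly drawn negative set. The function name `degree_imbalance`, the pair representation as `(enhancer, promoter)` tuples, and the toy identifiers are assumptions for illustration; the abstract does not specify how the authors quantify imbalance.

```python
from collections import Counter


def degree_imbalance(positives, negatives):
    """Return (enhancer_diff, promoter_diff): for each element,
    its count in positives minus its count in negatives.
    Zero everywhere means the two sets are class-balanced
    in the per-element sense used in this sketch."""
    pos_e = Counter(e for e, _ in positives)
    neg_e = Counter(e for e, _ in negatives)
    pos_p = Counter(p for _, p in positives)
    neg_p = Counter(p for _, p in negatives)
    # Counter returns 0 for missing keys, so elements seen in only
    # one of the two sets are handled without special cases.
    enh = {e: pos_e[e] - neg_e[e] for e in set(pos_e) | set(neg_e)}
    pro = {p: pos_p[p] - neg_p[p] for p in set(pos_p) | set(neg_p)}
    return enh, pro


# Hypothetical toy data: e1 occurs only in positives, e3 only in negatives.
positives = [("e1", "p1"), ("e1", "p2"), ("e2", "p1")]
random_negatives = [("e2", "p2"), ("e3", "p2"), ("e3", "p3")]

enh_diff, pro_diff = degree_imbalance(positives, random_negatives)
print(enh_diff)
print(pro_diff)
```

In this toy case a classifier could separate the classes simply by memorizing that `e1` is "always positive" and `e3` is "always negative", which is the kind of shortcut a class-balanced negative set is meant to remove.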

Results: We developed two methods to generate class-balanced negative training sets for EPI classifiers: one based on maximum flow and the other on Gibbs sampling. We evaluated these methods with the TargetFinder and TransEPI classifiers across five and six cell lines, respectively. The trained models were tested using a common negative test set. Our negative training sets significantly improved the prediction performance across several metrics, including precision, recall, and area under the receiver operating characteristic curve.
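The abstract does not give the details of the maximum-flow construction, so the following is only one plausible sketch of the idea, not the authors' implementation (see the CBOEP2 repository for that). It assumes a bipartite flow network in which each enhancer receives capacity equal to its number of positive pairs, each promoter sends the same amount to the sink, and each candidate negative pair is a unit-capacity edge. The function name, the `candidates` pool, and the use of a plain Edmonds-Karp search are all assumptions made for illustration.

```python
from collections import Counter, deque


def balanced_negatives_maxflow(positives, candidates):
    """Select negatives from `candidates` so that no enhancer or promoter
    is used more often than it appears in `positives` (assumed formulation,
    solved with a small Edmonds-Karp max flow)."""
    pos_set = set(positives)
    cand = [c for c in dict.fromkeys(candidates) if c not in pos_set]
    e_deg = Counter(e for e, _ in positives)
    p_deg = Counter(p for _, p in positives)

    source, sink = ("s",), ("t",)
    cap = {}  # residual capacities: cap[u][v]

    def add_edge(u, v, c):
        cap.setdefault(u, {})[v] = c
        cap.setdefault(v, {}).setdefault(u, 0)  # reverse residual edge

    for e, d in e_deg.items():
        add_edge(source, ("E", e), d)
    for p, d in p_deg.items():
        add_edge(("P", p), sink, d)
    usable = [(e, p) for e, p in cand if e in e_deg and p in p_deg]
    for e, p in usable:
        add_edge(("E", e), ("P", p), 1)

    while True:
        # Breadth-first search for a shortest augmenting path.
        parent = {source: None}
        queue = deque([source])
        while queue and sink not in parent:
            u = queue.popleft()
            for v, c in cap.get(u, {}).items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if sink not in parent:
            break
        path, v = [], sink
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        bottleneck = min(cap[u][w] for u, w in path)
        for u, w in path:
            cap[u][w] -= bottleneck
            cap[w][u] += bottleneck

    # A saturated unit edge means that candidate pair carries flow.
    return [(e, p) for e, p in usable if cap[("E", e)][("P", p)] == 0]


# Hypothetical toy data: three positives, all other pairings are candidates.
positives = [("e1", "p1"), ("e2", "p2"), ("e3", "p3")]
candidates = [(e, p) for e in ("e1", "e2", "e3") for p in ("p1", "p2", "p3")]
negatives = balanced_negatives_maxflow(positives, candidates)
print(negatives)
```

In this toy case the maximum flow saturates every enhancer and promoter, so each element appears exactly as often in the negatives as in the positives. With a real candidate pool, which would presumably be filtered further (for example by genomic distance, an assumption here), the flow may fall short of full balance, and the capacity bounds then act only as upper limits.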

Conclusions: Our findings demonstrate that carefully designed negative samples can enhance the performance of EPI classifiers. More advanced methods for generating negative EPIs should improve prediction accuracy further. The source code is available at https://github.com/maruyama-lab-design/CBOEP2.

Source journal

BMC Bioinformatics (Biology: Biochemical Research Methods)

CiteScore: 5.70

Self-citation rate: 3.30%

Articles published: 506

Review time: 4.3 months

About the journal: BMC Bioinformatics is an open access, peer-reviewed journal that considers articles on all aspects of the development, testing and novel application of computational and statistical methods for the modeling and analysis of all kinds of biological data, as well as other areas of computational biology. BMC Bioinformatics is part of the BMC series which publishes subject-specific journals focused on the needs of individual research communities across all areas of biology and medicine. We offer an efficient, fair and friendly peer review service, and are committed to publishing all sound science, provided that there is some advance in knowledge presented by the work.