Phenotype driven data augmentation methods for transcriptomic data.

IF 2.8 Q2 MATHEMATICAL & COMPUTATIONAL BIOLOGY

Bioinformatics advances; Pub Date: 2025-05-23; eCollection Date: 2025-01-01; DOI: 10.1093/bioadv/vbaf124

Nikita Janakarajan, Mara Graziani, María Rodríguez Martínez


Citations: 0

Abstract

Summary: The application of machine learning methods to biomedical problems has seen many successes. However, using transcriptomic data for supervised learning tasks is challenging because of its high dimensionality, low patient numbers, and class imbalances. Machine learning models tend to overfit these data and do not generalize well to out-of-distribution samples. Data augmentation strategies help alleviate this by introducing synthetic data points and acting as regularizers. However, existing approaches are either computationally intensive, require population parametric estimates, or generate insufficiently diverse samples. To address these challenges, we introduce two classes of phenotype-driven data augmentation approaches: signature-dependent and signature-independent. The signature-dependent methods assume the existence of distinct gene signatures describing some phenotype and are simple, non-parametric, and novel data augmentation methods. The signature-independent methods are a modification of the established Gamma-Poisson and Poisson sampling methods for gene expression data. As case studies, we apply our augmentation methods to transcriptomic data from colorectal and breast cancer. Through discriminative and generative experiments with external validation, we show that our methods improve patient stratification by 5-15% over other augmentation methods in their respective cases. The study additionally provides insights into the limited benefits of over-augmenting data.

Availability and implementation: Code for reproducibility is available on GitHub.
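For orientation, the established Gamma-Poisson and Poisson sampling baseline that the abstract refers to can be sketched roughly as follows. This is not the authors' modified signature-independent method, which the abstract does not specify. It is a minimal illustration that assumes raw counts in a samples-by-genes matrix and per-class method-of-moments estimates. The function name `gamma_poisson_augment` and its parameters are hypothetical.

```python
import numpy as np


def gamma_poisson_augment(counts, labels, n_new_per_class, seed=None):
    """Draw synthetic count profiles per class (illustrative sketch only).

    counts: (n_samples, n_genes) array of non-negative raw counts.
    labels: (n_samples,) array of class (phenotype) labels.
    Returns (new_counts, new_labels).
    """
    rng = np.random.default_rng(seed)
    counts = np.asarray(counts, dtype=float)
    labels = np.asarray(labels)
    new_x, new_y = [], []

    for cls in np.unique(labels):
        x = counts[labels == cls]
        mean = x.mean(axis=0)
        # With a single sample the variance is undefined; fall back to
        # var == mean, which reduces every gene to plain Poisson sampling.
        var = x.var(axis=0, ddof=1) if x.shape[0] > 1 else mean.copy()

        # Gamma-Poisson (negative binomial): var = mean + mean**2 / r.
        # Genes without overdispersion (var <= mean) use plain Poisson.
        excess = var - mean
        over = excess > 1e-8
        safe_excess = np.where(over, excess, 1.0)
        safe_mean = np.where(over, mean, 1.0)
        r = np.where(over, mean**2 / safe_excess, 1.0)      # Gamma shape
        scale = np.where(over, safe_excess / safe_mean, 1.0)  # mean / r

        # Sample a per-sample, per-gene rate, then the observed count.
        size = (n_new_per_class, x.shape[1])
        gamma_rate = rng.gamma(shape=r, scale=scale, size=size)
        rate = np.where(over, gamma_rate, mean)
        new_x.append(rng.poisson(rate))
        new_y.append(np.full(n_new_per_class, cls))

    return np.concatenate(new_x), np.concatenate(new_y)
```

Calling `gamma_poisson_augment(counts, labels, n_new_per_class=100, seed=0)` would return 100 synthetic profiles per phenotype class. Estimating parameters per class is what makes the sampling phenotype-aware in this sketch. Genes whose sample variance does not exceed their mean fall back to plain Poisson sampling.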




Source journal

Bioinformatics advances

CiteScore

1.60

Self-citation rate

0.00%

Articles published