Association of normalization, non-differentially expressed genes and data source with machine learning performance in intra-dataset or cross-dataset modelling of transcriptomic and clinical data.

ArXiv Pub Date : 2025-02-27

Fei Deng, Lanjing Zhang

Abstract

Cross-dataset testing is critical for examining a machine learning (ML) model's performance. However, most studies modelling transcriptomic and clinical data have conducted only intra-dataset testing. It is also unclear whether normalization and non-differentially expressed genes (NDEG) can improve the cross-dataset modelling performance of ML. We thus aimed to understand whether normalization, NDEG and data source are associated with ML performance in cross-dataset testing. We used the transcriptomic and clinical data shared by the lung adenocarcinoma cases in TCGA and ONCOSG. The best cross-dataset ML performance was reached using transcriptomic data alone and was statistically better than that reached using transcriptomic and clinical data together. The best balanced accuracy (BA), area under the curve (AUC) and accuracy were significantly better for ML algorithms trained on TCGA and tested on ONCOSG than for those trained on ONCOSG and tested on TCGA (p<0.05 for all). Normalization and NDEG greatly improved intra-dataset ML performance in both datasets, but not in cross-dataset testing. Strikingly, modelling the transcriptomic data of ONCOSG alone outperformed modelling transcriptomic and clinical data together, whereas including clinical data in TCGA did not significantly affect ML performance, suggesting limited clinical data value or an overwhelming influence of transcriptomic data in TCGA. Performance gains in intra-dataset testing were more pronounced for ML models trained on ONCOSG than for those trained on TCGA. Among the six ML models compared, the support vector machine was most frequently the best performer in both intra-dataset and cross-dataset testing. Therefore, our data show that data source, normalization and NDEG are associated with intra-dataset and cross-dataset ML performance in modelling transcriptomic and clinical data.
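
The distinction between intra-dataset and cross-dataset testing, and the balanced accuracy metric, can be illustrated with a minimal sketch. This is not the authors' pipeline: the data are synthetic, the nearest-centroid classifier is a toy stand-in for the ML models the abstract compares, and every name (`make_dataset`, `fit_centroids`, `evaluate`, `dataset_a`, `dataset_b`) is hypothetical. It only shows the evaluation protocol, assuming two cohorts that share the same features.

```python
import random
import statistics


def balanced_accuracy(y_true, y_pred):
    """Balanced accuracy: the mean of per-class recall."""
    recalls = []
    for c in sorted(set(y_true)):
        idx = [i for i, y in enumerate(y_true) if y == c]
        recalls.append(sum(y_pred[i] == c for i in idx) / len(idx))
    return sum(recalls) / len(recalls)


def fit_centroids(X, y):
    """Toy stand-in classifier: one mean vector (centroid) per class."""
    centroids = {}
    for c in set(y):
        members = [x for x, label in zip(X, y) if label == c]
        centroids[c] = [statistics.fmean(col) for col in zip(*members)]
    return centroids


def predict(centroids, X):
    """Assign each sample to the class with the nearest centroid."""
    def dist2(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))
    return [min(centroids, key=lambda c: dist2(x, centroids[c])) for x in X]


def make_dataset(n, shift, seed):
    """Synthetic two-class data; `shift` mimics a dataset-specific offset."""
    rng = random.Random(seed)
    X, y = [], []
    for i in range(n):
        label = i % 2
        X.append([rng.gauss(2.0 * label + shift, 1.0) for _ in range(5)])
        y.append(label)
    return X, y


def evaluate(train, test):
    """Fit on one dataset and score balanced accuracy on another."""
    model = fit_centroids(*train)
    X_test, y_test = test
    return balanced_accuracy(y_test, predict(model, X_test))


dataset_a = make_dataset(200, shift=0.0, seed=1)  # hypothetical cohort A
dataset_b = make_dataset(200, shift=0.5, seed=2)  # hypothetical cohort B

# Intra-dataset testing: train and test on disjoint halves of the same cohort.
train_a = (dataset_a[0][:100], dataset_a[1][:100])
test_a = (dataset_a[0][100:], dataset_a[1][100:])
print("intra A      :", evaluate(train_a, test_a))

# Cross-dataset testing, in both directions.
print("train A -> B :", evaluate(dataset_a, dataset_b))
print("train B -> A :", evaluate(dataset_b, dataset_a))
```

Scoring both directions separately matters because, as the abstract reports, performance can differ depending on which dataset is used for training and which for testing.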
