Normalization and Selecting Non-Differentially Expressed Genes Improve Machine Learning Modelling of Cross-Platform Transcriptomic Data.

Transactions on artificial intelligence. Pub Date: 2025-01-01. Epub Date: 2025-05-25. DOI: 10.53941/tai.2025.100005

Fei Deng, Catherine H Feng, Nan Gao, Lanjing Zhang

Full-text PDF (PMC): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12235674/pdf/


Abstract

Normalization is a critical step in quantitative analyses of biological processes. Recent work shows that cross-platform integration and normalization enable machine learning (ML) training on RNA microarray and RNA-seq data, but those studies did not use independent datasets. It is therefore unclear how to improve ML modelling performance on independent microarray-based and RNA-seq-based datasets. Inspired by the housekeeping genes commonly used in experimental biology, this study tests the hypothesis that non-differentially expressed genes (NDEG) may improve the normalization of transcriptomic data and, in turn, the cross-platform performance of ML models. Microarray and RNA-seq datasets of TCGA breast cancer were used as independent training and test datasets, respectively, to classify the molecular subtypes of breast cancer. NDEG (p > 0.85) and differentially expressed genes (DEG) (p < 0.05) were selected based on ANOVA p values and used for data normalization and classification, respectively. Models trained on data from one platform were tested on the other platform. Our data show that NDEG and DEG selection can effectively improve model classification performance. Normalization methods based on parametric statistics were inferior to those based on nonparametric statistics. In this study, the LOG_QN and LOG_QNZ normalization methods combined with the neural network classification model seem to achieve better performance. NDEG-based normalization therefore appears useful for cross-platform testing on completely independent datasets. However, more studies are required to examine whether NDEG-based normalization can improve ML classification performance in other datasets and other omic data types.
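
The gene-selection step described in the abstract can be illustrated with a minimal sketch. It assumes a samples-by-genes expression matrix and one class label per sample, and it runs a per-gene one-way ANOVA with SciPy's f_oneway. The function name select_genes_by_anova, the parameter names, and the synthetic data are hypothetical; only the two thresholds (p > 0.85 for NDEG, p < 0.05 for DEG) come from the abstract. The abstract does not specify how the NDEG set enters the LOG_QN or LOG_QNZ normalization, so that step is not shown, and this is not necessarily the authors' implementation.

```python
import numpy as np
from scipy.stats import f_oneway


def select_genes_by_anova(expr, labels, ndeg_p=0.85, deg_p=0.05):
    """Split genes into NDEG and DEG index arrays via per-gene one-way ANOVA.

    expr   : (n_samples, n_genes) expression matrix (assumed layout)
    labels : (n_samples,) class label per sample, e.g. molecular subtype
    """
    expr = np.asarray(expr, dtype=float)
    labels = np.asarray(labels)
    # One block of rows per class; axis=0 tests every gene (column) at once.
    groups = [expr[labels == c] for c in np.unique(labels)]
    _, pvals = f_oneway(*groups, axis=0)
    ndeg_idx = np.flatnonzero(pvals > ndeg_p)  # no evidence of class differences
    deg_idx = np.flatnonzero(pvals < deg_p)    # differ between classes
    return ndeg_idx, deg_idx, pvals


# Synthetic demonstration data (not from the study).
rng = np.random.default_rng(0)
demo_labels = np.repeat(["A", "B", "C"], 30)
demo_expr = rng.normal(size=(90, 200))
demo_expr[demo_labels == "B", :10] += 3.0  # hypothetical signal in 10 genes
ndeg_idx, deg_idx, pvals = select_genes_by_anova(demo_expr, demo_labels)
print(len(ndeg_idx), "NDEG;", len(deg_idx), "DEG")
```

Under the abstract's design, the NDEG indices would feed normalization and the DEG indices would serve as classifier features. To mirror the cross-platform setup, one would presumably select genes on the training platform only and reuse the same indices on the test platform.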
