Predictive biomarkers for embryotoxicity: a machine learning approach to mitigating multicollinearity in RNA-Seq.

IF 4.8 | Zone 2 (Medicine) | JCR Q1 (Toxicology)
Yixian Quah, Soontag Jung, Jireh Yi-Le Chan, Onju Ham, Ji-Seong Jeong, Sangyun Kim, Woojin Kim, Seung-Chun Park, Seung-Jin Lee, Wook-Joon Yu
Archives of Toxicology. Journal Article. Published 2024-09-06. DOI: 10.1007/s00204-024-03852-w (https://doi.org/10.1007/s00204-024-03852-w)
Citations: 0

Abstract

Multicollinearity, which arises when genes show strong co-expression patterns, is common in high-throughput expression data and can undermine the reliability of predictive models. This study examined multicollinearity among closely related genes in RNA-Seq data from embryoid bodies (EBs) perturbed with 5-fluorouracil, with the aim of identifying genes associated with embryotoxicity. Six genes (Dppa5a, Gdf3, Zfp42, Meis1, Hoxa2, and Hoxb1) emerged as candidates based on domain knowledge and were validated by qPCR in EBs perturbed by 39 test substances. We used correlation analysis and the variance inflation factor (VIF) to test for multicollinearity among the genes. Recursive feature elimination with cross-validation (RFECV) ranked Zfp42 and Hoxb1 as the top two of the seven features considered, identifying them as potential biomarkers for early embryotoxicity assessment. A t-test of the statistical significance of this two-feature prediction model yielded a p value of 0.0044, confirming that RFECV successfully reduced redundancy and multicollinearity. Our study presents a systematic methodology for applying machine learning techniques to transcriptomics data analysis. The approach aids the discovery of potential reporter gene candidates for embryotoxicity screening research and improves the accuracy and feasibility of the predictive model while reducing financial and time costs.
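
The two techniques named in the abstract, VIF screening and RFECV, can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the data are synthetic (not the study's qPCR measurements), the feature layout and the helper `variance_inflation_factors` are hypothetical, and logistic regression stands in for whatever estimator the authors actually used, which the abstract does not specify. NumPy and scikit-learn are assumed to be installed.

```python
import numpy as np
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold


def variance_inflation_factors(X):
    """Return VIF_j = 1 / (1 - R_j^2) for every column j of X.

    R_j^2 is the coefficient of determination obtained by regressing
    column j on all remaining columns plus an intercept.
    """
    X = np.asarray(X, dtype=float)
    n_samples, n_features = X.shape
    vifs = np.empty(n_features)
    for j in range(n_features):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        design = np.column_stack([np.ones(n_samples), others])
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        residuals = target - design @ coef
        r_squared = 1.0 - residuals.var() / target.var()
        vifs[j] = 1.0 / (1.0 - r_squared)
    return vifs


# Synthetic stand-in data (an assumption, not the study's measurements):
# two pairs of nearly collinear features plus one pure-noise feature.
rng = np.random.default_rng(0)
n = 120
signal_a = rng.normal(size=n)
signal_b = rng.normal(size=n)
X = np.column_stack([
    signal_a,
    signal_a + 0.1 * rng.normal(size=n),  # near-duplicate of column 0
    signal_b,
    signal_b + 0.1 * rng.normal(size=n),  # near-duplicate of column 2
    rng.normal(size=n),                   # unrelated noise
])
# Binary label driven by both underlying signals.
y = (signal_a + signal_b + 0.3 * rng.normal(size=n) > 0).astype(int)

# Step 1: flag multicollinearity. Large VIFs mark redundant features.
vifs = variance_inflation_factors(X)
print("VIF per feature:", np.round(vifs, 1))

# Step 2: recursive feature elimination with cross-validation.
selector = RFECV(
    estimator=LogisticRegression(max_iter=1000),
    step=1,
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
    scoring="accuracy",
)
selector.fit(X, y)
print("Selected feature mask:", selector.support_)
print("Feature ranking (1 = kept):", selector.ranking_)
```

In this sketch the collinear columns receive large VIFs while the noise column stays near 1. RFECV then drops the weakest feature at each step and keeps the subset with the best cross-validated score; which columns survive depends on the estimator and the fold splits.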


Source journal: Archives of Toxicology (Medicine, Toxicology)
CiteScore: 11.60
Self-citation rate: 4.90%
Articles published: 218
Review time: 1.5 months
About the journal: Archives of Toxicology provides up-to-date information on the latest advances in toxicology. The journal places particular emphasis on studies relating to defined effects of chemicals and mechanisms of toxicity, including toxic activities at the molecular level, in humans and experimental animals. Coverage includes new insights into analysis and toxicokinetics and into forensic toxicology. Review articles of general interest to toxicologists are an additional important feature of the journal.