Xueying Liu, Richard H Chapple, Declan Bennett, William C Wright, Ankita Sanjali, Erielle Culp, Yinwen Zhang, Min Pan, Paul Geeleher
CSI-GEP: A GPU-based unsupervised machine learning approach for recovering gene expression programs in atlas-scale single-cell RNA-seq data.
Cell Genomics, volume 5, issue 1, article 100739. Published 2025-01-08. Publication type: Journal Article.
DOI: https://doi.org/10.1016/j.xgen.2024.100739
Full text (PMC): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11770216/pdf/
Abstract
Exploratory analysis of single-cell RNA sequencing (scRNA-seq) typically relies on hard clustering over two-dimensional projections like uniform manifold approximation and projection (UMAP). However, such methods can severely distort the data and have many arbitrary parameter choices. Methods that can model scRNA-seq data as non-discrete "gene expression programs" (GEPs) can better preserve the data's structure, but currently, they are often not scalable, not consistent across repeated runs, and lack an established method for choosing key parameters. Here, we developed a GPU-based unsupervised learning approach, "consensus and scalable inference of gene expression programs" (CSI-GEP). We show that CSI-GEP can recover ground truth GEPs in real and simulated atlas-scale scRNA-seq datasets, significantly outperforming cutting-edge methods, including GPT-based neural networks. We applied CSI-GEP to a whole mouse brain atlas of 2.2 million cells, disentangling endothelial cell types missed by other methods, and to an integrated scRNA-seq atlas of human tumors and cell lines, discovering mesenchymal-like GEPs unique to cancer cells growing in culture.