An in-depth benchmark framework for evaluating single cell RNA-seq dropout imputation methods and the development of an improved algorithm afMF

Impact factor 7.9; Zone 1 (Medicine); Q1 (Medicine, Research & Experimental)
Jinghan Huang, Anson C. M. Chow, Nelson L. S. Tang, Sheung Chi Phillip Yam
Clinical and Translational Medicine, 15(4). Published 2025-03-22. DOI: 10.1002/ctm2.70283

Abstract

Dear Editor,

Inflated zeros (dropouts) in single-cell RNA-seq (scRNA-seq) data remain a challenge. Zeros can be imputed, but imputation is not commonly used in real applications because its benefits are uncertain and in-depth benchmarks across downstream analyses are lacking. Here, we performed two tasks: first, we developed an in-depth benchmark framework to compare imputation algorithms; second, we developed an improved algorithm, afMF. Our results indicated that matrix-theory-based algorithms such as afMF performed well and stably across applications and generally outperformed raw log-normalization and other methods. In contrast, more complicated methods were prone to overfitting and data distortion.

Imputation has been debated [1, 2]: downstream analyses could benefit from it [3–5], but false positives may be introduced, and zeros may carry important information too [6]. No definitive conclusion has been reached so far. Imputation algorithms have been under development for years, and several comparative studies of dropout imputation have been conducted [1, 7–9], but they had several obvious issues: (1) a lack of in-depth analyses, for example, automatic cell type annotation, pseudobulk DE analysis, GSEA, cell–cell communication, AUCell and SCENIC, and integration with spatial transcriptomics; (2) a limited number of datasets, dataset types and tested algorithms, that is, fewer than five or six datasets were used and the evaluated algorithms were developed several years ago; (3) the use of biased or unreasonable performance metrics, or a restriction to basic summary statistics; (4) heavy reliance on simulated datasets, which have been shown to be much simpler than real data and cannot reflect its complexity. These limitations are compounded by the lack of real datasets with known ground truth. At present, most imputation algorithms are either not used in real-world applications or are confined to a limited number of downstream applications (e.g., cell type clustering). A more thorough benchmark of the compatibility between imputation and key downstream applications is required.

Here, we evaluated the compatibility between prior imputation and various downstream tasks. Compatibility is most clearly an issue when the downstream algorithm has a built-in (in situ) imputation step or is designed for sparse data, because prior imputation may then be unnecessary or may worsen the results. Some researchers have used zero-inflated models instead of imputation, but such models may not fit scRNA-seq data well [10].

Motivated by a benchmark review [2], we developed an improved benchmark framework that addresses these issues by combining well-established metrics with several novel features (Figure 1). These features include: (1) more than 25 real (mixture, purified cell type, or time-course) or simulated datasets (Additional file 1 Table S1), many more than in other benchmark studies; (2) 21 leading or recent algorithms with acceptable scalability (Table S2), the most among benchmark studies; (3) a pre-screening test to select algorithms for further evaluation; (4) visualizations (gene expression violin plots, PCA/UMAP plots, cell–cell correlations); (5) differential expression (DE) analysis, including pseudobulk DE analysis; (6) enrichment analysis (GSEA); (7) automatic cell type annotation with SCINA and ScType; (8) pseudotime trajectory analysis with the popular tools Monocle3, Slingshot and DPT; (9) AUCell and SCENIC regulatory analysis; (10) cell–cell communication analysis with CellPhoneDB and CellChat; (11) integration of spatial transcriptomics with scRNA-seq (Seurat); (12) an improved imputation algorithm, ‘afMF’ (adaptive full Matrix Factorization) (Method S1). Together, these features make our study comprehensive and novel.

afMF is an improved matrix-theory-based algorithm that builds on an existing algorithm, ‘ALRA’. afMF differs from ALRA in that it uses an iterative process to optimize two low-rank matrices, which may account for the added benefits shown in these evaluations. Whereas ALRA employs randomized SVD, afMF uses full matrix factorization. Details of the algorithm can be found in Method S1.
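
To make the idea of iteratively optimizing two low-rank factors concrete, the following is a minimal sketch of alternating least squares for the rank-1 case. It is not afMF or ALRA: the function name `rank1_als`, the toy matrix and the iteration count are illustrative assumptions, and the actual algorithm is the one described in Method S1.

```python
def rank1_als(matrix, n_iter=20):
    """Fit matrix ~ u v^T by alternating least squares (illustrative only)."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    v = [1.0] * n_cols  # positive start keeps both factors positive for non-negative data
    u = [0.0] * n_rows
    errors = []
    for _ in range(n_iter):
        # Best u for fixed v: u_i = (row_i . v) / (v . v)
        vv = sum(x * x for x in v)
        u = [sum(matrix[i][j] * v[j] for j in range(n_cols)) / vv
             for i in range(n_rows)]
        # Best v for fixed u: v_j = (col_j . u) / (u . u)
        uu = sum(x * x for x in u)
        v = [sum(matrix[i][j] * u[i] for i in range(n_rows)) / uu
             for j in range(n_cols)]
        # Squared reconstruction error after this full sweep.
        errors.append(sum((matrix[i][j] - u[i] * v[j]) ** 2
                          for i in range(n_rows) for j in range(n_cols)))
    approx = [[u[i] * v[j] for j in range(n_cols)] for i in range(n_rows)]
    return approx, errors


# Hypothetical log-expression matrix (cells x genes); two zeros stand in for dropouts.
observed = [
    [1.0, 2.0, 1.0],
    [2.0, 0.0, 2.0],  # zero where the overall pattern suggests a positive value
    [3.0, 6.0, 0.0],  # zero where the overall pattern suggests a positive value
    [4.0, 8.0, 4.0],
]
approx, errors = rank1_als(observed)
```

Each half-step solves a least-squares problem exactly, so the reconstruction error never increases from one sweep to the next. In this toy case the fitted values at the two zero positions are positive, which is the sense in which a low-rank fit fills in dropouts; a real method also has to choose the rank and decide which zeros to leave untouched.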

To reduce variability from other factors that could obscure the true impact of imputation, such as preprocessing, dataset choice and annotation integrity, we (1) transformed all processed and imputed data to log space so that they are more comparable; (2) used as many diverse, high-quality datasets as possible to reduce dataset bias; (3) used datasets with matched bulk data or ‘gold standard’ annotations, for example, wet-lab experimental validation, CITE-seq with surface protein markers, or cell-cycle/time-course experiments with well-characterized checkpoints. We compared the imputation methods against the well-established Seurat log-normalization, whose output is also in log space. Further background on dropout imputation is given in Additional file 1 Note S1.
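
For readers unfamiliar with the log-normalized baseline, the sketch below shows the usual form of this transformation: each cell's counts are divided by the cell total, multiplied by a scale factor and passed through log1p. To our understanding this matches the default behaviour of Seurat's `LogNormalize` method (scale factor 10,000), but the Python function `log_normalize` and the toy counts are illustrative assumptions, not the code used in the study.

```python
import math


def log_normalize(counts, scale_factor=10_000):
    """Per-cell library-size normalization followed by log1p.

    counts: list of cells, each a list of raw gene counts.
    """
    normalized = []
    for cell in counts:
        total = sum(cell)
        if total == 0:
            # An empty cell stays all-zero instead of dividing by zero.
            normalized.append([0.0] * len(cell))
            continue
        normalized.append([math.log1p(c / total * scale_factor) for c in cell])
    return normalized


# Hypothetical raw counts: three cells, three genes.
raw = [[1, 3, 0], [0, 0, 0], [5, 5, 10]]
log_data = log_normalize(raw)
```

Zeros stay zero under this transformation, which is why the log-normalized data serve as the "no imputation" reference in log space.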

Based on our pre-screening results (Figure S1, Method S1), ten algorithms with stable performance (i.e., generally not worse than no imputation across all pre-screening evaluations) were selected for further evaluation (Table S2), because performance much worse than no imputation in any aspect may indicate that unwanted patterns and data distortion have been introduced.

The impact of imputation on basic data analysis and visualization was explored first (Additional file 2 Method S2 and Figures S2 to S5) and is discussed in Note S2. Taken together, only afMF, ALRA and scRMD avoided fabricating artefactual structure in PCA while providing better visualizations in gene expression violin plots, 2-D PCA and cell–cell correlations.

Differential expression (DE) between conditions was analyzed with three methods (Additional file 3 Method S3). With MAST and the rank sum test, higher p-value-based rank concordance between bulk and afMF-imputed DE results was observed for all data types (Figure 2A; Figures S6 and S7). These conclusions held when the genes were limited to the top 1000 bulk DEGs (Figure S8). The top 500 DEGs showed greater statistical significance with afMF and three other algorithms (Figure 2A; Figures S6 and S7). Only afMF and I-Impute showed generally lower false positive rates across all data types (Figure 2A and Figure S9). Higher Spearman correlations of logFC were observed between bulk and afMF-imputed results (Figure S10). However, imputation was incompatible with pseudobulk analysis using limma-trend (Figure S11), which suggests that pseudobulk aggregation may itself act as smoothing and thus greatly reduce the influence of dropouts on DE analysis. Overall, improvement in DE was found only for MAST and the rank sum test with afMF, not for pseudobulk analysis. Further description and interpretation are given in Note S3.
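
The smoothing effect of pseudobulk aggregation can be illustrated with a small sketch: summing counts over the cells of each sample removes most zeros before any DE test is run. The functions `pseudobulk` and `zero_fraction`, the sample labels and the counts are hypothetical and are not taken from the study's pipeline.

```python
def pseudobulk(counts, sample_ids):
    """Sum raw counts over all cells that share a sample label."""
    totals = {}
    for cell, sample in zip(counts, sample_ids):
        if sample not in totals:
            totals[sample] = [0] * len(cell)
        totals[sample] = [t + c for t, c in zip(totals[sample], cell)]
    return totals


def zero_fraction(rows):
    """Fraction of entries equal to zero across all rows."""
    values = [v for row in rows for v in row]
    return sum(1 for v in values if v == 0) / len(values)


# Hypothetical raw counts: six cells x four genes, from two samples.
cells = [
    [0, 2, 0, 1],
    [3, 0, 0, 0],
    [0, 1, 0, 2],
    [0, 0, 4, 0],
    [1, 0, 0, 0],
    [0, 3, 0, 0],
]
samples = ["patient_A", "patient_A", "patient_A",
           "patient_B", "patient_B", "patient_B"]
bulk = pseudobulk(cells, samples)

cell_level_zeros = zero_fraction(cells)            # 16 of 24 entries are zero
sample_level_zeros = zero_fraction(bulk.values())  # 2 of 8 entries are zero
```

With far fewer zeros left at the sample level, there is little for an imputation step to add, which is consistent with the incompatibility reported above.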

GSEA assesses the enrichment of DEGs in specific biological pathways using the results of DE analysis (Method S3). With either the signed -log10 P from MAST or the rank sum test, or logFC, as input, higher Spearman correlations between bulk and afMF-imputed GSEA results were observed for all data types (Figure 2B; Figures S12 to S14). These conclusions held when the analysis was limited to enrichment terms with bulk p < 0.05 (Figure S13). For pseudobulk DE results, afMF and four other algorithms increased the correlations only when logFC was used as input (Figure S15). Overall, afMF showed the most stable improvements. Further description is given in Note S3.
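
As an illustration of the first kind of GSEA input, the sketch below computes a direction-aware significance score, sign(logFC) × -log10(p), and ranks genes by it. This is our reading of "sign -log10P"; the function name, the p-value floor and the DE values are assumptions made for the example.

```python
import math


def signed_neg_log10_p(log_fc, p_value, floor=1e-300):
    """Direction-aware significance score: sign(logFC) * -log10(p)."""
    p = max(p_value, floor)  # avoid log10(0) when a p-value is reported as exactly 0
    sign = (log_fc > 0) - (log_fc < 0)
    return sign * -math.log10(p)


# gene -> (logFC, p-value); made-up numbers for illustration only.
de_results = {
    "GENE_A": (1.5, 1e-6),
    "GENE_B": (-2.0, 1e-4),
    "GENE_C": (0.2, 0.5),
}
ranked = sorted(de_results,
                key=lambda gene: signed_neg_log10_p(*de_results[gene]),
                reverse=True)
```

Strongly up-regulated genes end up at the top of the ranking and strongly down-regulated genes at the bottom, which is the ordering a pre-ranked enrichment analysis expects.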

Additional support for imputation can be gathered from cell-sorting datasets or protein assays, which may better reflect the ground truth (Additional file 4 Method S4). In datasets with matched bulk data, afMF-imputed data and most other imputed data showed higher relative and absolute Spearman correlations between same-cell-type single-cell or pseudobulk profiles and bulk profiles, as well as higher correlations of pairwise cell-type logFC between pseudobulk and bulk (Figure 2C and Figure S16). In a CITE-seq dataset with mRNA and surface protein measurements, higher Spearman correlations between selected mRNAs and surface proteins were observed with ALRA, afMF and others (Figure 2C and Figure S16). Further description is given in Note S4.
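
Spearman correlation is the recurring similarity measure in these comparisons, so a minimal sketch of it may help: rank both profiles (averaging ranks within ties, which matters because dropouts create many tied zeros) and take the Pearson correlation of the ranks. The helper names and the three toy profiles are hypothetical and do not represent data from the study.

```python
def average_ranks(values):
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def pearson(x, y):
    # Assumes neither input is constant (otherwise the denominator is zero).
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def spearman(x, y):
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical expression of five genes in one cell type.
bulk_profile = [0.1, 2.3, 5.0, 1.2, 7.7]
dense_profile = [0.0, 1.9, 4.1, 1.5, 9.0]   # same gene ordering as bulk
sparse_profile = [0.0, 0.0, 4.1, 1.5, 0.0]  # zeros break the ordering

rho_dense = spearman(bulk_profile, dense_profile)
rho_sparse = spearman(bulk_profile, sparse_profile)
```

Because only ranks enter the calculation, the measure is insensitive to the differing scales of bulk, single-cell and protein measurements, but tied zeros can pull the correlation down sharply, as the sparse toy profile shows.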

Classification (Additional file 5 Method S5) is useful for predicting unknown labels. With a Random Forest model, we observed higher classification accuracy and higher predicted probabilities for the correct cell type with afMF and nearly all other algorithms across all data types (Figure 2D; Figures S17 and S18). Marker genes can be used to identify different cell types (Method S5). Higher AUCs for marker genes discriminating cell status were found for afMF and five other algorithms (Figure 2D; Figures S17 and S18). Only afMF, kNN-smoothing and ALRA enhanced detection while controlling the false positive rate (Figure S18). Interestingly, the value of imputation for cell type annotation [11] may be underestimated (Method S5). With two automatic cell type annotation tools, SCINA and ScType, higher annotation accuracy, F1 scores and true prediction probabilities were observed for afMF and all other algorithms except DCA (Figure 2E and Figure S19). Notably, all of them resulted in lower rates of ‘unknown’ annotations. Further description is given in Note S5.
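
The marker-gene AUC mentioned above can be read as the probability that a randomly chosen cell of the target type expresses the marker more highly than a randomly chosen other cell, with ties counted as one half. The sketch below computes it directly from that definition; `marker_auc` and the expression values are hypothetical, and the study's exact procedure is the one in Method S5.

```python
def marker_auc(expr_target, expr_other):
    """Probability that a random target cell out-expresses a random other cell.

    Ties count as one half, which makes this equal to the ROC AUC.
    """
    wins = 0.0
    for a in expr_target:
        for b in expr_other:
            if a > b:
                wins += 1.0
            elif a == b:
                wins += 0.5
    return wins / (len(expr_target) * len(expr_other))


# Hypothetical marker expression in the target cell type vs all other cells.
raw_target = [0.0, 0.0, 2.1, 1.8]  # two target cells show dropout
raw_other = [0.0, 0.0, 0.0, 0.3]
auc_raw = marker_auc(raw_target, raw_other)
```

In this toy example the two dropout zeros in the target cells tie with the zeros in the other cells, so the AUC falls to 0.6875 even though every detected value in the target cells exceeds every value in the other cells.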

Clustering is an essential step for exploring subtypes (Additional file 6 Method S6). With the Louvain and K-means algorithms and four clustering metrics, afMF and MAGIC-log showed improvements across all metrics compared with no imputation (Figure 3A). In UMAP projections, afMF, ALRA and scRMD preserved a consistent cluster structure, whereas others (e.g., kNN-smoothing) generated unexpected patterns (Figures S20 and S21). Cell-cycle dynamics have been well studied (Method S6). In a cell-cycle dataset with ground truth, afMF, ALRA and others improved both the prediction accuracy and the statistical significance of comparisons of predicted cell-cycle scores between known cell-cycle phases (Figure 3B and Figure S22). More distinct separation between cell-cycle phases was also observed with afMF and ALRA in 2-D UMAP (Figure 3C). Further description is given in Note S6.

Trajectory inference enables the study of cell differentiation and development (Additional file 7 Method S7). In DPT trajectory analysis, afMF and three other algorithms improved pseudotime analysis (i.e., the correlation between known time and predicted pseudotime, and the pseudo-temporal score) and branch prediction (Figure 3D). In diffusion maps, afMF showed more continuous trajectories, whereas others showed less improvement or distorted the patterns (Figure S23). In Monocle3 trajectory analysis, only MAGIC clearly improved the results; afMF and AutoClass showed slight improvements, and the other algorithms had no or a negative influence. Slingshot trajectory analysis was incompatible with most imputation algorithms (Figures S24 and S25). Further description is given in Note S7.

AUCell scores the activity of pathways (e.g., the well-studied interferon (IFN) response) in each cell (Additional file 8 Method S8). In afMF- and ALRA-imputed data, increased percentages of monocytes with an activated IFN response were observed only in COVID-19 subjects and not in healthy controls, as expected (Figure 3E,F). In contrast, other algorithms produced false positives in controls. SCENIC incorporates AUCell to explore gene regulatory networks. Seven well-established cell-type-specific ‘regulons’ were selected (Method S8). afMF and MAGIC performed well: they increased the percentage of cells with activated regulons in the expected cell types while keeping levels in unrelated cell types consistent with no imputation (Figure 3G). Comparing all identified regulons (Z-score > 3) across all cell types showed that most selected algorithms recovered the raw patterns but also added some unique significant regulons (Figure 3H and Figure S26). Of note, these newly generated significant regulons should be validated in further experiments. Further description is given in Note S8.

CellPhoneDB and CellChat are tools for studying cell–cell communication (Method S8). Our results revealed abnormally large increases in the number of significant interactions after imputation in both CellPhoneDB and CellChat analyses (Figures S27 to S30). Although no ground truth is available to confirm this, we believe these are false positives, because the patterns are abnormal and many of the interactions are unique. Further description is given in Note S8.

Integration with scRNA-seq data is an important step in the study of spatial transcriptomics. With the Seurat integration pipeline, we observed clear recovery of the known spatial localization patterns of both neuronal and non-neuronal subsets when raw SCTransform data or MAGIC-imputed data were used as the reference (Figure S31). In contrast, ALRA, AutoClass and afMF either led to much weaker patterns (e.g., in the L4 and L5 PT/IT regions) or raised errors due to incompatibility between algorithms. Further description is given in Note S8.

Real scRNA-seq data are complex and difficult to simulate, but simulated datasets have the advantage of a known ground truth. With simulated datasets generated by Splatter/splatPop (Additional file 9 Method S9), afMF and ALRA generally performed better in most of the analyses (Figures S32 to S35). Further description is given in Note S9.

Good algorithms should have acceptable running time and memory usage (Additional file 10 Method S10). Most selected algorithms met these requirements, except for I-Impute (running time) (Figure 4A) and ccImpute and Bfimpute (memory usage on large datasets) (Figure 4B). Algorithm performance was rated relative to no imputation. In general, matrix-theory-based methods such as afMF and ALRA steadily improved performance across tasks, whereas others showed smaller or no improvements or were incompatible with some downstream tools (Figure 4C,D; Tables S3 and S4). Specifically, afMF ranked among the top algorithms in many evaluations, for example, DE analysis, GSEA, classification, biomarker prediction, automatic cell type annotation, clustering, DPT trajectory analysis, AUCell and SCENIC, SC-bulk profiling similarity and mRNA-surface protein correlation. Among the top matrix-theory-based algorithms, afMF outperformed ALRA in multiple evaluations (cell-level DE analysis, GSEA, classification, biomarker prediction, clustering and SC-bulk profiling similarity) (Table S4). In addition, MAGIC (smoothing) and AutoClass (deep learning) showed enhanced output in selected applications but produced false positives in others. We also found that most imputation methods are incompatible with certain downstream analyses, for example, cell–cell communication and pseudobulk limma DE analysis.
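
Running time and peak memory of a Python-based method can be recorded with the standard library, as in the sketch below. The wrapper `profile_call` and the stand-in `toy_imputer` are hypothetical; `tracemalloc` only tracks memory allocated through Python, so it would not capture the footprint of R packages or native code, and Method S10 remains the reference for how scalability was actually measured.

```python
import time
import tracemalloc


def profile_call(func, *args, **kwargs):
    """Return (result, seconds, peak_bytes) for a single call of func."""
    tracemalloc.start()
    start = time.perf_counter()
    try:
        result = func(*args, **kwargs)
    finally:
        seconds = time.perf_counter() - start
        _, peak_bytes = tracemalloc.get_traced_memory()
        tracemalloc.stop()
    return result, seconds, peak_bytes


def toy_imputer(matrix):
    # Stand-in for a real imputation method: zeros become the column mean.
    n_rows = len(matrix)
    col_means = [sum(row[j] for row in matrix) / n_rows
                 for j in range(len(matrix[0]))]
    return [[v if v != 0 else col_means[j] for j, v in enumerate(row)]
            for row in matrix]


result, seconds, peak_bytes = profile_call(toy_imputer, [[0, 2], [4, 0], [2, 4]])
```

Repeating such measurements over datasets of increasing size gives the kind of running-time and memory curves that a scalability comparison relies on.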

In this study, we developed an exhaustive benchmark framework for scRNA-seq imputation and an improved algorithm, afMF, to handle dropouts. afMF showed strong and stable performance while maintaining acceptable scalability. We hope this work can encourage the use of imputation in various downstream tasks as a complement to raw data analysis and promote new discoveries. Further discussion is provided in Note S10.

Jinghan Huang contributed to the design of the work, performed data curation, analysis, interpretation of data, and the creation of new software used in the work, and drafted and revised the work. Anson C. M. Chow contributed to the design of the algorithm and the creation of new software used in the work. Nelson L. S. Tang conceived the research, contributed to the design of the work, interpretation of data, and drafted and revised the work. Sheung Chi Phillip Yam contributed to the conception and design of the algorithm. All authors read and approved the final manuscript.

NT is the founding Director and a shareholder of the biotech start-up company Cytomics Ltd, in Hong Kong Science Park. AC was a part-time employee of Cytomics Ltd during the development of the afMF software.

Phillip Yam acknowledges financial support from HKGRF-14301321, with the project title “General Theory for Infinite Dimensional Stochastic Control: Mean Field and Some Classical Problems”, and HKGRF-14300123, with the project title “Well-posedness of Some Poisson-driven Mean Field Learning Models and their Applications”. He is also supported by a grant from the Germany/Hong Kong Joint Research Scheme funded by the Research Grants Council of Hong Kong and the German Academic Exchange Service of Germany (G-CUHK411/23), and is a visiting professor supported by The University of Texas at Dallas, Naveen Jindal School of Management.

The authors have nothing to report.


Source journal metrics: CiteScore 15.90; self-citation rate 1.90%; number of articles published 450; review time 4 weeks.

About the journal: Clinical and Translational Medicine (CTM) is an international, peer-reviewed, open-access journal dedicated to accelerating the translation of preclinical research into clinical applications and fostering communication between basic and clinical scientists. It highlights the clinical potential and application of various fields including biotechnologies, biomaterials, bioengineering, biomarkers, molecular medicine, omics science, bioinformatics, immunology, molecular imaging, drug discovery, regulation, and health policy. With a focus on the bench-to-bedside approach, CTM prioritizes studies and clinical observations that generate hypotheses relevant to patients and diseases, guiding investigations in cellular and molecular medicine. The journal encourages submissions from clinicians, researchers, policymakers, and industry professionals.