AnnSQL: a Python SQL-based package for fast large-scale single-cell genomics analysis using minimal computational resources.

Impact factor: 2.4 (JCR Q2, Mathematical & Computational Biology)

Bioinformatics Advances. Published: 2025-05-05. eCollection date: 2025-01-01. DOI: 10.1093/bioadv/vbaf105

Kenny Pavan, Arpiar Saunders

Bioinformatics Advances, volume 5, issue 1, article vbaf105. Full-text PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12098940/pdf/


Abstract

Summary: As single-cell genomics technologies continue to accelerate biological discovery, software tools that use elegant syntax and minimal computational resources to analyze atlas-scale datasets are increasingly needed. Here, we introduce AnnSQL, a Python package that constructs an AnnData-inspired database using the in-process DuckDB engine, enabling orders-of-magnitude performance improvements for parsing single-cell genomics datasets with the ease of SQL. We highlight AnnSQL functionality and demonstrate transformative runtime improvements by comparing AnnData and AnnSQL operations on a 4.4-million-cell single-nucleus RNA-seq dataset: AnnSQL-based operations ran in minutes on a laptop, whereas equivalent operations in AnnData or Seurat largely failed (or were ∼700× slower) on a high-performance computing cluster. AnnSQL lowers computational barriers for large-scale single-cell/nucleus RNA-seq analysis on a personal computer, while demonstrating a promising computational infrastructure extendable to complete single-cell workflows across various genome-wide measurements.

Availability and implementation: AnnSQL is a pip-installable package available at https://github.com/ArpiarSaundersLab/annsql, with documentation at https://docs.annsql.com.
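To make the idea of an "AnnData-inspired database" queried with SQL more concrete, the sketch below stores a toy expression matrix and its cell metadata as two SQL tables and computes mean expression per cell type with a single query. It is only an illustration under stated assumptions: it uses Python's built-in `sqlite3` module as a stand-in for DuckDB so that it runs without extra dependencies, it does not use the AnnSQL API, and the table layout (`X`, `obs`), the column names, and the data are hypothetical.

```python
import sqlite3


def build_toy_database():
    """Create an in-memory database with an AnnData-like layout.

    Assumption: one table ``X`` holds expression values (one row per cell,
    one column per gene) and one table ``obs`` holds per-cell metadata.
    sqlite3 is used here only as a stand-in for an in-process SQL engine.
    """
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE X (cell_id TEXT PRIMARY KEY, gene_a REAL, gene_b REAL)"
    )
    con.execute("CREATE TABLE obs (cell_id TEXT PRIMARY KEY, cell_type TEXT)")
    # Hypothetical toy data: four cells, two genes, two cell types.
    con.executemany(
        "INSERT INTO X VALUES (?, ?, ?)",
        [
            ("cell_1", 5.0, 0.0),
            ("cell_2", 3.0, 1.0),
            ("cell_3", 0.0, 4.0),
            ("cell_4", 0.0, 6.0),
        ],
    )
    con.executemany(
        "INSERT INTO obs VALUES (?, ?)",
        [
            ("cell_1", "neuron"),
            ("cell_2", "neuron"),
            ("cell_3", "astrocyte"),
            ("cell_4", "astrocyte"),
        ],
    )
    return con


def mean_expression_by_type(con):
    """Join expression values to metadata and average each gene per cell type."""
    query = """
        SELECT obs.cell_type, AVG(X.gene_a), AVG(X.gene_b)
        FROM X
        JOIN obs ON X.cell_id = obs.cell_id
        GROUP BY obs.cell_type
        ORDER BY obs.cell_type
    """
    return con.execute(query).fetchall()


con = build_toy_database()
result = mean_expression_by_type(con)
for cell_type, mean_a, mean_b in result:
    print(cell_type, mean_a, mean_b)
```

The point of this layout is that filtering, joining, and aggregating are pushed into the database engine instead of requiring the whole matrix in memory. For the actual AnnSQL interface, see the documentation linked above.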

