AnnSQL: A Python SQL-based package for fast large-scale single-cell genomics analysis using minimal computational resources.

bioRxiv: the preprint server for biology. Pub Date: 2025-03-22. DOI: 10.1101/2024.11.02.621676

Kenny Pavan, Arpiar Saunders



Abstract

As single-cell genomics technologies continue to accelerate biological discovery, software tools that use elegant syntax and minimal computational resources to analyze atlas-scale datasets are increasingly needed. Here we introduce AnnSQL, a Python package that constructs an AnnData-inspired database using the in-process DuckDB engine, enabling orders-of-magnitude performance enhancements for parsing single-cell genomics datasets with the ease of SQL. We highlight AnnSQL functionality and demonstrate transformative runtime improvements by comparing AnnData and AnnSQL operations on a 4.4-million-cell single-nucleus RNA-seq dataset: AnnSQL-based operations executed in minutes on a laptop, whereas equivalent AnnData operations largely failed (or were ∼700x slower) on a high-performance computing cluster. AnnSQL lowers computational barriers for large-scale single-cell/nucleus RNA-seq analysis on a personal computer, while demonstrating a promising computational infrastructure extendable to complete single-cell workflows across various genome-wide measurements.

Availability and implementation: AnnSQL is a pip-installable package available at https://github.com/ArpiarSaundersLab/annsql, with documentation at https://docs.annsql.com.
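To make the idea in the abstract concrete, the following is a minimal sketch of an AnnData-inspired relational layout queried with plain SQL. It does not use AnnSQL or DuckDB: Python's built-in sqlite3 module stands in for the in-process engine so the example runs without extra installs. The table names (`X`, `obs`), the gene and cell names, and the counts are all illustrative assumptions, not the package's actual schema or API.

```python
import sqlite3

# Toy data: 4 cells x 3 genes. All names and counts are hypothetical.
genes = ["gene_a", "gene_b", "gene_c"]
cells = {
    "cell_1": ("neuron", [5, 0, 2]),
    "cell_2": ("neuron", [3, 1, 0]),
    "cell_3": ("astrocyte", [0, 4, 1]),
    "cell_4": ("astrocyte", [0, 6, 3]),
}

# In-process, in-memory database (sqlite3 used here as a stand-in for DuckDB).
con = sqlite3.connect(":memory:")

# X: one row per cell, one column per gene (loosely mirrors AnnData's X matrix).
gene_cols = ", ".join(f"{g} REAL" for g in genes)
con.execute(f"CREATE TABLE X (cell_id TEXT PRIMARY KEY, {gene_cols})")

# obs: per-cell annotations (loosely mirrors AnnData's obs data frame).
con.execute("CREATE TABLE obs (cell_id TEXT PRIMARY KEY, cell_type TEXT)")

for cell_id, (cell_type, counts) in cells.items():
    con.execute("INSERT INTO X VALUES (?, ?, ?, ?)", (cell_id, *counts))
    con.execute("INSERT INTO obs VALUES (?, ?)", (cell_id, cell_type))

# Aggregate query: mean expression of gene_a per annotated cell type.
mean_by_type = con.execute(
    """
    SELECT obs.cell_type, AVG(X.gene_a) AS mean_gene_a
    FROM X JOIN obs ON X.cell_id = obs.cell_id
    GROUP BY obs.cell_type
    ORDER BY obs.cell_type
    """
).fetchall()

# Filter query: cells whose total count across the three genes is at least 7.
high_count_cells = [
    row[0]
    for row in con.execute(
        "SELECT cell_id FROM X "
        "WHERE gene_a + gene_b + gene_c >= 7 ORDER BY cell_id"
    )
]

print(mean_by_type)
print(high_count_cells)
```

The point of the sketch is only that filtering and aggregation become declarative SQL statements evaluated by the database engine rather than in-memory array operations. The runtime figures reported in the abstract concern AnnSQL on DuckDB and cannot be inferred from this toy example.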

