bamSliceR: a Bioconductor package for rapid, cross-cohort variant and allelic bias analysis.

bioRxiv: the preprint server for biology. Pub Date: 2024-11-27. DOI: 10.1101/2023.09.15.558026

Yizhou Peter Huang, Lauren Harmon, Eve Deering-Gardner, Xiaotu Ma, Josiah Harsh, Zhaoyu Xue, Hong Wen, Marcel Ramos, Sean Davis, Timothy J Triche


Abstract

The NCI Genomic Data Commons (GDC) provides controlled access to sequencing data from thousands of subjects, enabling large-scale study of impactful genetic alterations such as simple and complex germline and structural variants. However, efficient analysis requires significant computational resources and expertise, especially when re-calling variants from raw sequence reads. We thus developed bamSliceR, an R/Bioconductor package that builds upon the GenomicDataCommons package to extract aligned sequence reads from cross-GDC meta-cohorts, followed by targeted analysis of variants and effects (including transcript-aware variant annotation from transcriptome-aligned GDC RNA data). Here we demonstrate population-scale genomic and transcriptomic analyses with minimal compute burden via bamSliceR, identifying recurrent, clinically relevant sequence and structural variants in the TARGET AML and BEAT-AML cohorts. We then validate results in the (non-GDC) Leucegene cohort, demonstrating how the bamSliceR pipeline can be seamlessly applied to replicate findings in non-GDC cohorts. These variants directly yield clinically impactful and biologically testable hypotheses for mechanistic investigation. bamSliceR has been submitted to the Bioconductor project, where it is presently under review, and is available on GitHub at https://github.com/trichelab/bamSliceR.
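To make the idea of extracting only targeted regions of aligned reads more concrete, the following is a minimal Python sketch, not bamSliceR's own R interface. It assumes the GDC BAM-slicing endpoint `https://api.gdc.cancer.gov/slicing/view/<file UUID>` with repeated `region` query parameters; the helper names `build_slice_url` and `variant_allele_fraction`, the placeholder UUID, and the example coordinates are hypothetical and chosen only for illustration. The second helper shows the simple read-count ratio that underlies allelic-bias comparisons; the package's actual statistics may differ.

```python
from urllib.parse import urlencode

# Assumed GDC BAM-slicing endpoint; controlled-access files additionally
# need a user token sent in a request header (not shown here).
GDC_SLICING_ENDPOINT = "https://api.gdc.cancer.gov/slicing/view"


def build_slice_url(file_uuid, regions):
    """Build a slicing URL for one BAM file and a list of regions.

    Each region is a string such as "chr17:7668402-7687550" or "chr17".
    No network request is made; only the URL is constructed.
    """
    if not regions:
        raise ValueError("at least one region is required")
    # Repeat the 'region' key once per region; keep ':' unescaped.
    query = urlencode([("region", r) for r in regions], safe=":")
    return f"{GDC_SLICING_ENDPOINT}/{file_uuid}?{query}"


def variant_allele_fraction(ref_reads, alt_reads):
    """Fraction of reads at a site that support the alternate allele."""
    total = ref_reads + alt_reads
    if total == 0:
        raise ValueError("no reads cover this site")
    return alt_reads / total


if __name__ == "__main__":
    # Hypothetical placeholder UUID and illustrative coordinates.
    url = build_slice_url("FILE-UUID-PLACEHOLDER", ["chr17:7668402-7687550"])
    print(url)
    print(variant_allele_fraction(ref_reads=30, alt_reads=10))
```

Requesting only the regions of interest, rather than downloading whole BAM files, is what keeps storage and compute requirements small when the same query is repeated across many subjects.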


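bamSliceR: cross-cohort variant and allelic bias analysis for rare variants and rare diseases.

Rare diseases and conditions pose unique challenges for genetic epidemiologists precisely because cases and samples are scarce. In recent years, whole-genome and whole-transcriptome sequencing (WGS/WTS) have eased the study of rare genetic variants. Paired WGS and WTS data are ideal, but logistical and financial constraints often preclude generating them. As a result, many databases contain a patchwork of samples with either WGS or WTS data, and only a few samples have both. The NCI Genomic Data Commons facilitates controlled access to genomic and transcriptomic data from thousands of subjects, many of whom have unpaired sequencing results. Local reanalysis of expressed variants across the whole transcriptome requires substantial data storage, compute, and expertise. We developed the bamSliceR package to facilitate a rapid transition from aligned sequence reads to the characterization of expressed variants. bamSliceR uses the NCI Genomic Data Commons API to query genomic subregions of aligned sequence reads for samples identified through the robust Bioconductor ecosystem. We show how population-scale targeted genomic analyses can be completed in this way with orders of magnitude fewer resources and minimal compute burden. We present pilot results from applying bamSliceR to the TARGET pediatric AML and BEAT-AML projects, where identifying rare but recurrent somatic variants directly yields biologically testable hypotheses. bamSliceR and its documentation are freely available on GitHub at https://github.com/trichelab/bamSliceR.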

