Biostatistical Aspects of Whole Genome Sequencing Studies: Preprocessing and Quality Control

Raphael O. Betschart, Cristian Riccio, Domingo Aguilera-Garcia, Stefan Blankenberg, Linlin Guo, Holger Moch, Dagmar Seidl, Hugo Solleder, Felix Thalén, Alexandre Thiéry, Raphael Twerenbold, Tanja Zeller, Martin Zoche, Andreas Ziegler
DOI: 10.1002/bimj.202300278
Published: 2024-07-11
Full text: https://onlinelibrary.wiley.com/doi/10.1002/bimj.202300278

Abstract

Rapid advances in high-throughput DNA sequencing technologies have enabled large-scale whole genome sequencing (WGS) studies. Before association analysis between phenotypes and genotypes can be performed, the raw sequence data must undergo preprocessing and quality control (QC). Because many biostatisticians have not yet worked with WGS data, we first sketch Illumina's short-read sequencing technology. Second, we explain the general preprocessing pipeline for WGS studies. Third, we give an overview of important QC metrics applied to WGS data at four stages: on the raw data, after mapping and alignment, after variant calling, and after multisample variant calling. Fourth, we illustrate the QC with data from the GENEtic SequencIng Study Hamburg–Davos (GENESIS-HD), a study involving more than 9000 human whole genomes. All samples were sequenced on an Illumina NovaSeq 6000 with an average coverage of 35× using a PCR-free protocol. For QC, one Genome in a Bottle (GIAB) trio was sequenced in four replicates, and one GIAB sample was successfully sequenced 70 times in different runs. Fifth, we provide empirical data on the compression of raw data using the DRAGEN Original Read Archive (ORA). The most important quality metrics in the application were genetic similarity, sample cross-contamination, deviations from the expected Het/Hom ratio, relatedness, and coverage. The compression ratio of the raw files using DRAGEN ORA was 5.6:1, and compression time increased linearly with genome coverage. In summary, the preprocessing, joint calling, and QC of large WGS studies are feasible within a reasonable time, and efficient QC procedures are readily available.
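
Two of the metrics named in the abstract, the Het/Hom ratio and average coverage, can be illustrated with a short Python sketch. This is not the GENESIS-HD pipeline: the function names, the choice to count only homozygous-alternate calls in the denominator, and the median-absolute-deviation outlier rule are all assumptions made for illustration, and the read count and genome size in the usage note are round illustrative numbers.

```python
from collections import Counter
from statistics import median


def het_hom_ratio(genotypes):
    """Ratio of heterozygous to homozygous-alternate calls for one sample.

    `genotypes` is an iterable of VCF-style GT strings such as "0/1" or "1|1".
    Homozygous-reference, missing, and non-diploid calls are ignored
    (an assumption of this sketch).
    """
    counts = Counter()
    for gt in genotypes:
        alleles = gt.replace("|", "/").split("/")
        if "." in alleles or len(alleles) != 2:
            continue  # skip missing or non-diploid calls
        if alleles[0] != alleles[1]:
            counts["het"] += 1
        elif alleles[0] != "0":
            counts["hom_alt"] += 1
    if counts["hom_alt"] == 0:
        return float("nan")
    return counts["het"] / counts["hom_alt"]


def mean_coverage(n_reads, read_length, genome_size):
    """Average depth: total sequenced bases divided by genome length."""
    return n_reads * read_length / genome_size


def flag_outliers(ratios, n_mads=3.0):
    """Return sample IDs whose ratio lies more than `n_mads` median absolute
    deviations from the cohort median (a hypothetical rule, not the paper's).
    Expects finite ratios; filter out NaN values beforehand.
    """
    values = list(ratios.values())
    med = median(values)
    mad = median(abs(v - med) for v in values)
    return sorted(s for s, v in ratios.items() if abs(v - med) > n_mads * mad)
```

For example, `mean_coverage(700_000_000, 150, 3_000_000_000)` gives 35.0, which shows how a read count and read length translate into the 35× figure quoted above. In practice the genotype strings would come from the GT field of a per-sample or joint-called VCF.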
