High-resolution global diversity copy number variation maps and association with ctyper

Mark Chaisson, Walfred Ma
bioRxiv - Genomics, published 2024-08-11. DOI: 10.1101/2024.08.11.607269 (https://doi.org/10.1101/2024.08.11.607269)

Abstract

Genetic analysis of copy number variations (CNVs), especially in complex regions, is challenging because of reference bias and the ambiguous alignment of next-generation sequencing (NGS) reads to repetitive DNA. Consequently, aggregate copy numbers are typically analyzed, overlooking variation between gene copies. Pangenomes contain diverse sequences of gene copies and enable the study of sequence-resolved CNVs. We developed a method, ctyper, to discover sequence-resolved CNVs in NGS data by leveraging CNV genes from pangenomes. From 118 public assemblies, we constructed a database of 3,351 CNV genes, distinguishing each gene copy as a resolved allele. We used phylogenetic trees to organize alleles into highly similar allele-types that revealed events of linked small variants due to stratification, structural variation, conversion, and duplication. Saturation analysis showed that new samples share an average of 97.8% of their CNV alleles with the database. The ctyper method traces individual gene copies in NGS data to their nearest alleles in the database and identifies allele-specific copy numbers using multivariate linear regression on k-mer counts and phylogenetic clustering. Applying ctyper to 1000 Genomes Project (1kgp) samples showed that 99.3% of alleles are in Hardy-Weinberg equilibrium and gave a 97.6% F1 score on genotypes based on 641 1kgp trios. Leave-one-out analysis on 39 assemblies matched to 1kgp samples showed that 96.5% of variants in query sequences match the genotyped allele. Genotyping 1kgp data revealed 226 population-specific CNVs, including a conversion of SMN2 to SMN1 that potentially affects Spinal Muscular Atrophy diagnosis in African populations. Our results revealed two models of CNV: recent CNVs due to ongoing duplications and polymorphic CNVs from ancient paralogs missing from the reference. To measure the functional impact of CNVs, after merging allele-types, we conducted genome-wide Quantitative Trait Locus analysis on 451 1kgp samples with Geuvadis RNA-seq data. Combined with a linear mixed model, our genotypes enable the inference of relative expression levels of paralogs within a gene family. In a global evolutionary context, 150 out of 1,890 paralogs (7.94%) and 546 out of 16,628 orthologs (3.28%) had significantly different expression levels, suggesting divergent expression from the original genes. Specific examples include lower expression of the converted SMN and increased expression of translocated AMY2B (GTEx pancreas data). Our method enables large cohort studies on complex CNVs to uncover hidden health impacts and overcome reference bias.
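
The idea of estimating allele-specific copy numbers by linear regression on k-mer counts can be illustrated with a toy sketch. This is not ctyper's implementation: it uses ordinary least squares followed by simple rounding, skips the phylogenetic clustering step entirely, and assumes the per-copy sequencing depth is already known. The function names (`kmer_counts`, `solve_linear_system`, `estimate_copy_numbers`), the three short allele sequences, and the simulated sample are all hypothetical and chosen only for illustration.

```python
from collections import Counter


def kmer_counts(seq, k):
    """Count every k-mer (substring of length k) in a sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def solve_linear_system(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("allele k-mer profiles are not linearly independent")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def estimate_copy_numbers(allele_seqs, sample_kmers, depth, k):
    """Least-squares estimate of how many copies of each allele a sample carries.

    Assumption (not from the paper): `depth` is the known k-mer coverage
    contributed by a single gene copy.
    """
    names = sorted(allele_seqs)
    profiles = [kmer_counts(allele_seqs[name], k) for name in names]
    kmers = sorted(set().union(*profiles))
    # Design matrix: one row per k-mer, one column per database allele.
    design = [[profile[kmer] for profile in profiles] for kmer in kmers]
    # Observed k-mer counts, normalized to "per copy" units.
    observed = [sample_kmers.get(kmer, 0) / depth for kmer in kmers]
    n = len(names)
    # Normal equations: (M^T M) x = M^T y.
    gram = [[sum(row[i] * row[j] for row in design) for j in range(n)]
            for i in range(n)]
    rhs = [sum(row[i] * y for row, y in zip(design, observed)) for i in range(n)]
    coefficients = solve_linear_system(gram, rhs)
    # Copy numbers are non-negative integers, so round and clip.
    return {name: max(0, round(c)) for name, c in zip(names, coefficients)}


# Hypothetical toy database of three alleles of one gene.
K = 4
DEPTH = 10
alleles = {"A": "ACGTACGGTCA", "B": "ACGTTCGGTCA", "C": "ACGAACGGTGA"}

# Simulate a sample carrying two copies of A and one copy of B.
sample = Counter()
for name, copies in {"A": 2, "B": 1}.items():
    for kmer, count in kmer_counts(alleles[name], K).items():
        sample[kmer] += DEPTH * copies * count

estimate = estimate_copy_numbers(alleles, sample, DEPTH, K)
print(estimate)
```

Real data would add sequencing noise, k-mers shared with the rest of the genome, and query copies that match no database allele exactly, which is where a tree-aware rounding step such as the phylogenetic clustering mentioned in the abstract would matter.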