A Vision of How Low-Coverage Sequence Data Should Contribute to Genetic Evaluation in the Future.

IF 2.9; Zone 2, Agricultural and Forestry Sciences; Q1 AGRICULTURE, DAIRY & ANIMAL SCIENCE

Journal of animal science Pub Date : 2025-09-05 DOI:10.1093/jas/skaf294

R Mark Thallman,J E Borgert,Bailey N Engle,John W Keele,Warren M Snelling,Cedric Gondro,Larry A Kuehn


Citations: 0

Abstract


A Vision of How Low-Coverage Sequence Data Should Contribute to Genetic Evaluation in the Future.

Low-coverage sequencing refers to sequencing DNA of individuals to a low depth of coverage (e.g., 0.5X) and imputing that sequence to genomic sequence based on reference haplotypes from individuals sequenced to high depth of coverage (e.g., ≥ 10X). It has been proposed as an alternative to genotyping by SNP arrays. At least one commercial product based on it is available for agricultural species. Concerns limiting adoption in its current form are: 1) the cost of storing the huge volume of data it generates and 2) whether that additional data will result in improved accuracy of genetic evaluation. This work envisions future implementation of low-coverage sequencing to reduce storage costs and enhance genetic evaluations by leveraging the additional information in the full sequence of the pangenome to account for more genetic variation. We propose addressing the storage issue by representing genomic sequence of an individual in a pair of haplotype arrays with each element pointing to an enumerated haplotype of the sequence within one of approximately 50,000 defined genome segments. Assuming 60 million genomic variants, the infrastructure required to translate the identifier of any enumerated haplotype into its genomic sequence would require less than 10 gigabytes of binary storage. Each haplotype array element would require 2 bytes, so the marginal binary storage required to represent the genomic sequence of an individual would be about 200 kilobytes (KB), similar to the genotypes from a SNP array with 200,000 markers. This assumes no pedigree and no ambiguity of the imputation, though the latter is unrealistic. Strategies to minimize, and when necessary, to manage and efficiently represent ambiguity are proposed. The genomic sequence of an individual could be stored in about 1 KB (binary) if both parents have unambiguous sequence stored as described above. 
The proposed system for representing the pangenome includes algorithms for read mapping and imputation intended to leverage all known genetic variation in the target population. It is also designed to use sequencing reads generated for imputing genomic sequence of new individuals to identify unrecognized mutations, crossovers, and structural variants, thus continuously improving the genome representation, especially if widespread use of low-coverage sequencing in livestock industries is realized. This could make improved genetic merit and management of livestock feasible without computational burden.
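The storage figures quoted in the abstract can be checked with a short sketch. The Python below is illustrative only and is not the authors' implementation. The 50,000 segments, the 2-byte array elements, and the pair of haplotype arrays come from the abstract; the function names, the one-byte-per-genotype figure for the SNP array, and the crossover-list encoding of a transmitted haplotype are assumptions introduced here.

```python
from array import array

# Figures taken from the abstract.
N_SEGMENTS = 50_000             # approximate number of defined genome segments
BYTES_PER_ELEMENT = 2           # one 16-bit haplotype identifier per segment
ARRAYS_PER_INDIVIDUAL = 2       # a pair of haplotype arrays per individual

# Assumption (not from the abstract): one byte per SNP-array genotype.
SNP_MARKERS = 200_000
BYTES_PER_GENOTYPE = 1


def new_haplotype_array(n_segments: int = N_SEGMENTS) -> array:
    """Return a zero-filled array of unsigned 16-bit haplotype identifiers."""
    return array("H", [0]) * n_segments


def individual_storage_bytes(n_segments: int = N_SEGMENTS) -> int:
    """Marginal storage for one individual, ignoring pedigree and ambiguity."""
    return ARRAYS_PER_INDIVIDUAL * n_segments * BYTES_PER_ELEMENT


def snp_array_storage_bytes(n_markers: int = SNP_MARKERS) -> int:
    """Storage for SNP-array genotypes under the one-byte-per-genotype assumption."""
    return n_markers * BYTES_PER_GENOTYPE


def max_haplotypes_per_segment() -> int:
    """Number of distinct identifiers a 2-byte element can hold."""
    return 2 ** (8 * BYTES_PER_ELEMENT)


def gamete_from_parent(hap_a: array, hap_b: array,
                       start_on_a: bool, crossovers: list[int]) -> array:
    """Rebuild a transmitted haplotype array from a parent's two arrays.

    Hypothetical encoding: `crossovers` is a sorted list of segment indices
    at which copying switches to the parent's other haplotype.
    """
    sources = (hap_a, hap_b)
    current = 0 if start_on_a else 1
    out = array("H")
    prev = 0
    for cut in list(crossovers) + [len(hap_a)]:
        out.extend(sources[current][prev:cut])
        current = 1 - current   # switch to the other parental haplotype
        prev = cut
    return out
```

Under these assumptions, two arrays of 50,000 two-byte identifiers occupy 200,000 bytes, which matches the roughly 200 KB quoted in the abstract and equals the assumed size of 200,000 one-byte SNP genotypes. A 2-byte element can distinguish up to 65,536 enumerated haplotypes per segment. The crossover-based helper only hints at why an individual whose parents are both stored unambiguously might need far less storage; the abstract does not state how its 1 KB figure is reached.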


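Source journal: Journal of animal science (Agricultural and Forestry Sciences: Dairy and Animal Science)

CiteScore: 4.80

Self-citation rate: 12.10%

Articles published: 1589

Review time: 3 months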

About the journal: The Journal of Animal Science (JAS) is the premier journal for animal science and serves as the leading source of new knowledge and perspective in this area. JAS publishes more than 500 fully reviewed research articles, invited reviews, technical notes, and letters to the editor each year. Articles published in JAS encompass a broad range of research topics in animal production and fundamental aspects of genetics, nutrition, physiology, and preparation and utilization of animal products. Articles typically report research with beef cattle, companion animals, goats, horses, pigs, and sheep; however, studies involving other farm animals, aquatic and wildlife species, and laboratory animal species that address fundamental questions related to livestock and companion animal biology will be considered for publication.