A comparison of 27 Arabidopsis thaliana genomes and the path toward an unbiased characterization of genetic polymorphism
Anna A. Igolkina, Sebastian Vorbrugg, Fernando A. Rabanal, Hai-Jun Liu, Haim Ashkenazy, Aleksandra E. Kornienko, Joffrey Fitz, Max Collenberg, Christian Kubica, Almudena Mollá Morales, Benjamin Jaegle, Travis Wrightsman, Vitaly Voloshin, Alexander D. Bezlepsky, Victor Llaca, Viktoria Nizhynska, Ilka Reichardt, Ilja Bezrukov, Christa Lanz, Felix Bemm, Pádraic J. Flood, Sileshi Nemomissa, Angela Hancock, Ya-Long Guo, Paul Kersey, Detlef Weigel, Magnus Nordborg
Nature Genetics, volume 57, issue 9, pages 2289-2301. Published 2025-08-19.
DOI: 10.1038/s41588-025-02293-0
https://www.nature.com/articles/s41588-025-02293-0
Abstract
Making sense of whole-genome polymorphism data is challenging, but it is essential for overcoming the biases in SNP data. Here we analyze 27 genomes of Arabidopsis thaliana to illustrate these issues. Genome size variation is mostly due to tandem repeat regions that are difficult to assemble. However, while the rest of the genome varies little in length, it is full of structural variants, mostly due to transposon insertions. Because of this, the pangenome coordinate system grows rapidly with sample size and ultimately becomes 70% larger than the size of any single genome, even for n = 27. Finally, we show how short-read data are biased by read mapping. SNP calling is biased by the choice of reference genome, and both transcriptome and methylome profiling results are affected by mapping reads to a reference genome rather than to the genome of the assayed individual.
New concepts for comparing the genomes of 27 naturally inbred Arabidopsis thaliana accessions provide essential insights into obtaining a less biased view of whole-genome polymorphism.
About the journal:
Nature Genetics publishes the very highest quality research in genetics. It encompasses genetic and functional genomic studies on human and plant traits and on other model organisms. Current emphasis is on the genetic basis for common and complex diseases and on the functional mechanism, architecture and evolution of gene networks, studied by experimental perturbation.
Integrative genetic topics comprise, but are not limited to:
-Genes in the pathology of human disease
-Molecular analysis of simple and complex genetic traits
-Cancer genetics
-Agricultural genomics
-Developmental genetics
-Regulatory variation in gene expression
-Strategies and technologies for extracting function from genomic data
-Pharmacological genomics
-Genome evolution