GigaScience | Pub Date: 2025-01-06 | DOI: 10.1093/gigascience/giaf001
Yan Wang, Xiaopeng Hao, Chunhai Chen, Haigang Wang, Peng Gao, Xukui Yang, Xue Dong, Huibin Qin, Meng Li, Sen Hou, Jianbo Jian, Jianwu Chang, Jing Wu, Zhixin Mu
{"title":"Telomere-to-telomere genome of common bean (Phaseolus vulgaris L., YP4).","authors":"Yan Wang, Xiaopeng Hao, Chunhai Chen, Haigang Wang, Peng Gao, Xukui Yang, Xue Dong, Huibin Qin, Meng Li, Sen Hou, Jianbo Jian, Jianwu Chang, Jing Wu, Zhixin Mu","doi":"10.1093/gigascience/giaf001","DOIUrl":"10.1093/gigascience/giaf001","url":null,"abstract":"<p><strong>Background: </strong>Common bean is a significant grain legume in human diets. However, the lack of a complete reference genome for common beans has hindered efforts to improve agronomic cultivars.</p><p><strong>Findings: </strong>Herein, we present the first telomere-to-telomere (T2T) genome assembly of common bean (Phaseolus vulgaris L., YP4) using PacBio High-Fidelity reads, ONT ultra-long sequencing, and Hi-C technologies. The assembly resulted in a genome size of 560.30 Mb with an N50 of 55.11 Mb, exhibiting high completeness and accuracy (BUSCO score: 99.5%, quality value (QV): 54.86). The sequences were anchored into 11 chromosomes, with 20 of 22 telomeres identified, leading to the formation of 9 T2T pseudomolecules. Furthermore, we identified repetitive elements accounting for 61.20% of the genome and predicted 29,925 protein-coding genes. Phylogenetic analysis suggested an estimated divergence time of approximately 11.6 million years ago between P. vulgaris and Vigna angularis. 
Comparative genome analysis revealed the expanded gene families and variations between YP4 and G19833 associated with defense response.</p><p><strong>Conclusions: </strong>The T2T reference genome and genomic insights presented here are crucial for future genetic studies not only in common bean but also in other legumes.</p>","PeriodicalId":12581,"journal":{"name":"GigaScience","volume":"14 ","pages":""},"PeriodicalIF":11.8,"publicationDate":"2025-01-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12077395/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144077126","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
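The N50 quoted in the abstract above (55.11 Mb) is a standard contiguity statistic: sort the sequences from longest to shortest, accumulate their lengths, and report the length of the sequence at which the running total first reaches half of the assembly size. The following is a minimal sketch of that arithmetic. The function name `n50` and the toy lengths are illustrative assumptions, not values or code from the YP4 assembly.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths."""
    if not lengths:
        raise ValueError("lengths must not be empty")
    ordered = sorted(lengths, reverse=True)  # longest sequence first
    half = sum(ordered) / 2                  # half of the total assembly size
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            # This sequence is the one that pushes coverage to >= 50%.
            return length


# Toy contig lengths (arbitrary units), not real assembly data.
toy_lengths = [80, 70, 50, 40, 30, 20, 10]
print(n50(toy_lengths))
```

With these toy values the total is 300, so the running sum reaches the halfway mark of 150 at the second-longest sequence.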
GigaScience | Pub Date: 2025-01-06 | DOI: 10.1093/gigascience/giaf037
Abdullatif Al-Najim, Sven Hauns, Van Dinh Tran, Rolf Backofen, Omer S Alkhnbashi
{"title":"HVSeeker: a deep-learning-based method for identification of host and viral DNA sequences.","authors":"Abdullatif Al-Najim, Sven Hauns, Van Dinh Tran, Rolf Backofen, Omer S Alkhnbashi","doi":"10.1093/gigascience/giaf037","DOIUrl":"10.1093/gigascience/giaf037","url":null,"abstract":"<p><strong>Background: </strong>Bacteriophages are among the most abundant organisms on Earth, significantly impacting ecosystems and human society. The identification of viral sequences, especially novel ones, from mixed metagenomes is a critical first step in analyzing the viral components of host samples. This plays a key role in many downstream tasks. However, this is a challenging task due to the rapid evolution rate of viruses. The identification process typically involves two steps: distinguishing viral sequences from the host and identifying whether they come from novel viral genomes. Traditional metagenomic techniques that rely on sequence similarity with known entities often fall short, especially when dealing with short or novel genomes. Meanwhile, deep learning has demonstrated its efficacy across various domains, including the bioinformatics field.</p><p><strong>Results: </strong>We have developed HVSeeker, a host/virus seeker method based on deep learning, to distinguish between bacterial and phage sequences. HVSeeker consists of two separate models: one analyzing DNA sequences and the other focusing on proteins. In addition to the robust architecture of HVSeeker, three distinct preprocessing methods were introduced to enhance the learning process: padding, contig assembly, and sliding window. This method has shown promising results on sequences with various lengths, ranging from 200 to 1,500 base pairs. Tested on both NCBI and IMGVR databases, HVSeeker outperformed several methods from the literature such as Seeker, Rnn-VirSeeker, DeepVirFinder, and PPR-Meta. 
Moreover, when compared with other methods on benchmark datasets, HVSeeker has shown better performance, establishing its effectiveness in identifying unknown phage genomes.</p><p><strong>Conclusions: </strong>These results demonstrate the exceptional structure of HVSeeker, which encompasses both the preprocessing methods and the model design. The advancements provided by HVSeeker are significant for identifying viral genomes and developing new therapeutic approaches, such as phage therapy. Therefore, HVSeeker serves as an essential tool in prokaryotic and phage taxonomy, offering a crucial first step toward analyzing the host-viral component of samples by identifying the host and viral sequences in mixed metagenomes.</p>","PeriodicalId":12581,"journal":{"name":"GigaScience","volume":"14 ","pages":""},"PeriodicalIF":11.8,"publicationDate":"2025-01-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12080225/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144077444","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
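The abstract above names padding and sliding window among its preprocessing methods without giving details. The sketch below shows one common way such steps are applied to variable-length DNA sequences so that a model receives fixed-length inputs. The function names, the `N` pad character, and the window and step sizes are all assumptions made for illustration and should not be read as HVSeeker's actual settings or code.

```python
def pad_sequence(seq, target_len, pad_char="N"):
    """Right-pad seq to target_len; longer sequences are returned unchanged."""
    return seq + pad_char * max(0, target_len - len(seq))


def sliding_windows(seq, window, step):
    """Split seq into fixed-length windows, padding the final partial window."""
    if window <= 0 or step <= 0:
        raise ValueError("window and step must be positive")
    windows = []
    for start in range(0, len(seq), step):
        chunk = seq[start:start + window]
        windows.append(pad_sequence(chunk, window))
        if start + window >= len(seq):
            # The end of the sequence has been covered; stop here.
            break
    return windows


# Hypothetical 10-base read, overlapping windows of 4 with step 3.
print(sliding_windows("ACGTACGTAC", window=4, step=3))
# Non-overlapping windows; the last one is padded with "N".
print(sliding_windows("ACGTACGTAC", window=4, step=4))
```

Overlapping windows (step smaller than window) yield more training examples per sequence, while padding keeps short fragments usable without discarding them.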
GigaScience | Pub Date: 2025-01-06 | DOI: 10.1093/gigascience/giaf010
Tobias Bachmann, Karsten Mueller, Simon N A Kusnezow, Matthias L Schroeter, Paolo Piaggi, Christopher M Weise
{"title":"Cerebellocerebral connectivity predicts body mass index: a new open-source Python-based framework for connectome-based predictive modeling.","authors":"Tobias Bachmann, Karsten Mueller, Simon N A Kusnezow, Matthias L Schroeter, Paolo Piaggi, Christopher M Weise","doi":"10.1093/gigascience/giaf010","DOIUrl":"10.1093/gigascience/giaf010","url":null,"abstract":"<p><strong>Background: </strong>The cerebellum is one of the major central nervous structures consistently altered in obesity. Its role in higher cognitive function, parts of which are affected by obesity, is mediated through projections to and from the cerebral cortex. We therefore investigated the relationship between body mass index (BMI) and cerebellocerebral connectivity.</p><p><strong>Methods: </strong>We utilized the Human Connectome Project's Young Adults dataset, including functional magnetic resonance imaging (fMRI) and behavioral data, to perform connectome-based predictive modeling (CPM) restricted to cerebellocerebral connectivity of resting-state fMRI and task-based fMRI. We developed a Python-based open-source framework to perform CPM, a data-driven technique with built-in cross-validation to establish brain-behavior relationships. 
Significance was assessed with permutation analysis.</p><p><strong>Results: </strong>We found that (i) cerebellocerebral connectivity predicted BMI, (ii) task-general cerebellocerebral connectivity predicted BMI more reliably than resting-state fMRI and individual task-based fMRI separately, (iii) predictive networks derived this way overlapped with established functional brain networks (namely, frontoparietal networks, the somatomotor network, the salience network, and the default mode network), and (iv) there was an inverse overlap between networks predictive of BMI and networks predictive of cognitive measures adversely affected by overweight/obesity.</p><p><strong>Conclusions: </strong>Our results suggest obesity-specific alterations in cerebellocerebral connectivity, specifically with regard to task execution. With brain areas and brain networks relevant to task performance implicated, these alterations seem to reflect a neurobiological substrate for task performance adversely affected by obesity.</p>","PeriodicalId":12581,"journal":{"name":"GigaScience","volume":"14 ","pages":""},"PeriodicalIF":11.8,"publicationDate":"2025-01-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11899596/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143614577","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
GigaScience | Pub Date: 2025-01-06 | DOI: 10.1093/gigascience/giae031
{"title":"Correction to: Habitat suitability maps for Australian flora and fauna under CMIP6 climate scenarios.","authors":"","doi":"10.1093/gigascience/giae031","DOIUrl":"10.1093/gigascience/giae031","url":null,"abstract":"","PeriodicalId":12581,"journal":{"name":"GigaScience","volume":"14 ","pages":""},"PeriodicalIF":11.8,"publicationDate":"2025-01-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11880536/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143556459","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
GigaScience | Pub Date: 2025-01-06 | DOI: 10.1093/gigascience/giaf008
Tianchi Lu, Xueying Wang, Wan Nie, Miaozhe Huo, Shuaicheng Li
{"title":"TransHLA: a Hybrid Transformer model for HLA-presented epitope detection.","authors":"Tianchi Lu, Xueying Wang, Wan Nie, Miaozhe Huo, Shuaicheng Li","doi":"10.1093/gigascience/giaf008","DOIUrl":"10.1093/gigascience/giaf008","url":null,"abstract":"<p><strong>Background: </strong>Precise prediction of epitope presentation on human leukocyte antigen (HLA) molecules is crucial for advancing vaccine development and immunotherapy. Conventional HLA-peptide binding affinity prediction tools often focus on specific alleles and lack a universal approach for comprehensive HLA site analysis. This limitation hinders efficient filtering of invalid peptide segments.</p><p><strong>Results: </strong>We introduce TransHLA, a pioneering tool designed for epitope prediction across all HLA alleles, integrating Transformer and Residue CNN architectures. TransHLA utilizes the ESM2 large language model for sequence and structure embeddings, achieving high predictive accuracy. For HLA class I, it reaches an accuracy of 84.72% and an area under the curve (AUC) of 91.95% on IEDB test data. For HLA class II, it achieves 79.94% accuracy and an AUC of 88.14%. Our case studies using datasets like CEDAR and VDJdb demonstrate that TransHLA surpasses existing models in specificity and sensitivity for identifying immunogenic epitopes and neoepitopes.</p><p><strong>Conclusions: </strong>TransHLA significantly enhances vaccine design and immunotherapy by efficiently identifying broadly reactive peptides. 
Our resources, including data and code, are publicly accessible at https://github.com/SkywalkerLuke/TransHLA.</p>","PeriodicalId":12581,"journal":{"name":"GigaScience","volume":"14 ","pages":""},"PeriodicalIF":11.8,"publicationDate":"2025-01-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11878767/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143556462","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
GigaScience | Pub Date: 2025-01-06 | DOI: 10.1093/gigascience/giaf033
Chloe Engler Hart, Yojana Gadiya, Tobias Kind, Christoph A Krettler, Matthew Gaetz, Biswapriya B Misra, David Healey, August Allen, Viswa Colluru, Daniel Domingo-Fernández
{"title":"Defining the limits of plant chemical space: challenges and estimations.","authors":"Chloe Engler Hart, Yojana Gadiya, Tobias Kind, Christoph A Krettler, Matthew Gaetz, Biswapriya B Misra, David Healey, August Allen, Viswa Colluru, Daniel Domingo-Fernández","doi":"10.1093/gigascience/giaf033","DOIUrl":"10.1093/gigascience/giaf033","url":null,"abstract":"<p><p>The plant kingdom, encompassing nearly 400,000 known species, produces an immense diversity of metabolites, including primary compounds essential for survival and secondary metabolites specialized for ecological interactions. These metabolites constitute a vast and complex phytochemical space with significant potential applications in medicine, agriculture, and biotechnology. However, much of this chemical diversity remains unexplored, as only a fraction of plant species has been studied comprehensively. In this work, we estimate the size of the plant chemical space by leveraging large-scale metabolomics and literature datasets. We begin by examining the known chemical space, which, while containing at most several hundred thousand unique compounds, remains sparsely covered. Using data from over 1,000 plant species, we apply various mass spectrometry-based approaches (a formula prediction model, a de novo prediction model, a combination of library search and de novo prediction, and MS2 clustering) to estimate the number of unique structures. Our methods suggest that the number of unique compounds in the metabolomics dataset alone may already surpass existing estimates of plant chemical diversity. 
Finally, we project these findings across the entire plant kingdom, estimating that the total plant chemical space likely spans millions of compounds, if not more, with most still unexplored.</p>","PeriodicalId":12581,"journal":{"name":"GigaScience","volume":"14 ","pages":""},"PeriodicalIF":11.8,"publicationDate":"2025-01-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11970369/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143784433","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
GigaScience | Pub Date: 2025-01-06 | DOI: 10.1093/gigascience/giaf112
Moustafa Shokrof, Mohamed Abuelanin, C Titus Brown, Tamer A Mansour
{"title":"The Great Genotyper: a graph-based method for population genotyping of small and structural variants.","authors":"Moustafa Shokrof, Mohamed Abuelanin, C Titus Brown, Tamer A Mansour","doi":"10.1093/gigascience/giaf112","DOIUrl":"10.1093/gigascience/giaf112","url":null,"abstract":"<p><strong>Background: </strong>Long-read sequencing (LRS) enables high-quality structural variant (SV) discovery. SV genotypers utilize these precise call sets to improve the recall and precision of genotyping in short-read sequencing (SRS) samples. With the extensive growth in publicly available SRS datasets, it is now possible to calculate accurate population allele frequencies of SVs. However, reprocessing hundreds of terabytes of raw SRS data to genotype new variants is impractical for population-scale studies, a computational challenge known as the N+1 problem (i.e., the challenge of re-genotyping an entire cohort for one additional variant). Overcoming this computational bottleneck is essential for analyzing new SVs from the growing number of pangenomes, public genomic databases, and pathogenic variant discovery studies.</p><p><strong>Results: </strong>We propose the Great Genotyper, a population-scale genotyping workflow to address the N+1 problem. Applied to a human dataset, the workflow begins by preprocessing 4.2k short-read samples, totaling 183 TB of raw data, to create an 867-GB Counting Colored de Bruijn Graph (CCDG). The Great Genotyper uses this CCDG to genotype a list of phased or unphased variants, leveraging the CCDG population information to increase both precision and recall. The Great Genotyper offers the same accuracy as the state-of-the-art genotypers while achieving unprecedented performance. It took about 100 hours to genotype 4.5M variants across the 4.2k samples and calculate their population allele frequencies using 1 server with 32 cores and 145 GB of memory. The Great Genotyper opens the door to new ways to study SVs. 
For example, using the premade index, we demonstrate the Great Genotyper's application in finding pathogenic variants by calculating accurate allele frequency for novel SVs. Also, we used it to create a 4k reference panel by genotyping variants from the Human Pangenome Reference Consortium (HPRC). The new reference panel allows for SV imputation from genotyping microarrays. Moreover, we genotype the human GWAS Catalog and merge its variants with the 4k reference panel. We show 6,253 events of high linkage between the HPRC's SVs and nearby GWAS single-nucleotide polymorphisms, which can help in interpreting the effect of these SVs on gene functions. This analysis uncovers the detailed haplotype structure of the human fibrinogen locus and revives the pathogenic association of a 28-bp insertion in the FGA gene with thromboembolic disorders.</p><p><strong>Conclusion: </strong>The Great Genotyper solves the N+1 problem for population-scale genotyping of small and structural variants, offering both high accuracy and efficiency. Its ability to rapidly re-genotype large cohorts paves the road for several new studies of SVs.</p>","PeriodicalId":12581,"journal":{"name":"GigaScience","volume":"14 ","pages":""},"PeriodicalIF":11.8,"publicationDate":"2025-01-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12491952/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145212315","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
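The abstract above refers to calculating population allele frequencies from genotype calls. As a reminder of the underlying arithmetic only, the sketch below counts alternate alleles over all called alleles for diploid genotypes written in VCF-style `GT` notation (`0/1`, `1|1`, `./.`). The function name and the sample genotypes are hypothetical, and this is not the Great Genotyper's implementation, which works from a Counting Colored de Bruijn Graph.

```python
def alt_allele_frequency(genotypes):
    """Return the alternate-allele frequency for one variant, or None if uncalled.

    Each genotype is a VCF-style GT string such as "0/1" or "1|1".
    Missing alleles (".") are skipped; any non-"0" allele counts as alternate,
    so multi-allelic sites are collapsed into a single ALT class.
    """
    alt_count = 0
    called_count = 0
    for genotype in genotypes:
        # Treat phased ("|") and unphased ("/") separators the same way.
        for allele in genotype.replace("|", "/").split("/"):
            if allele == ".":
                continue
            called_count += 1
            if allele != "0":
                alt_count += 1
    if called_count == 0:
        return None
    return alt_count / called_count


# Four hypothetical samples; the last one has no call.
sample_genotypes = ["0/0", "0/1", "1|1", "./."]
print(alt_allele_frequency(sample_genotypes))
```

Here three of the six called alleles are alternate, so the frequency is 0.5; the uncalled sample does not contribute to the denominator.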
GigaScience | Pub Date: 2025-01-06 | DOI: 10.1093/gigascience/giaf118
Hasindu Gamaarachchi, Sasha Jenner, Hiruna Samarakoon, James M Ferguson, Ira W Deveson
{"title":"The enduring advantages of the SLOW5 file format for raw nanopore sequencing data.","authors":"Hasindu Gamaarachchi, Sasha Jenner, Hiruna Samarakoon, James M Ferguson, Ira W Deveson","doi":"10.1093/gigascience/giaf118","DOIUrl":"10.1093/gigascience/giaf118","url":null,"abstract":"<p><p>Nanopore sequencing is a widespread and important method in genomics science. The raw electrical current signal data from a typical nanopore sequencing experiment are large and complex. This can be stored in 2 alternative file formats that are presently supported: POD5 is a signal data file format used by default on instruments from Oxford Nanopore Technologies (ONT); SLOW5 is an open-source file format originally developed as an alternative to ONT's previous file format, which was known as FAST5. The choice of format may have important implications for the cost, speed, and simplicity of nanopore signal data analysis, management, and storage. To inform this choice, we present a comparative evaluation of POD5 versus SLOW5. We conducted benchmarking experiments assessing file size, analysis performance, and usability on a variety of different computer architectures. Binary SLOW5 (BLOW5) showed superior performance during sequential and nonsequential (random access) file reading on most systems, manifesting in faster, cheaper basecalling and other analysis, and we could find no instance in which POD5 file reading was significantly faster than BLOW5. We demonstrate that BLOW5 file writing is highly parallelizable, thereby meeting the demands of data acquisition on ONT instruments. Our analysis also identified differences in the complexity and stability of the software libraries for SLOW5 (slow5lib) and POD5 (pod5), including a large discrepancy in the number of underlying software dependencies, which may complicate the pod5 compilation process. 
In summary, many of the advantages originally conceived for SLOW5 remain relevant today, despite the replacement of FAST5 with POD5 as ONT's core file format.</p>","PeriodicalId":12581,"journal":{"name":"GigaScience","volume":"14 ","pages":""},"PeriodicalIF":11.8,"publicationDate":"2025-01-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12530089/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145307711","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
GigaScience | Pub Date: 2025-01-06 | DOI: 10.1093/gigascience/giae113
Yuqi Liu, Abdulkadir Elmas, Kuan-Lin Huang
{"title":"Mutation impact on mRNA versus protein expression across human cancers.","authors":"Yuqi Liu, Abdulkadir Elmas, Kuan-Lin Huang","doi":"10.1093/gigascience/giae113","DOIUrl":"10.1093/gigascience/giae113","url":null,"abstract":"<p><strong>Background: </strong>Cancer mutations are often assumed to alter proteins, thus promoting tumorigenesis. However, how mutations affect protein expression, in addition to gene expression, has rarely been systematically investigated. This is significant as mRNA and protein levels frequently show only moderate correlation, driven by factors such as translation efficiency and protein degradation. Proteogenomic datasets from large tumor cohorts provide an opportunity to systematically analyze the effects of somatic mutations on mRNA and protein abundance and identify mutations with distinct impacts on these molecular levels.</p><p><strong>Results: </strong>We conduct a comprehensive analysis of mutation impacts on mRNA- and protein-level expressions of 953 cancer cases with paired genomics and global proteomic profiling across 6 cancer types. Protein-level impacts are validated for 47.2% of the somatic expression quantitative trait loci (seQTLs), including CDH1 and MSH3 truncations, as well as other mutations from likely \"long-tail\" driver genes. Devising a statistical pipeline for identifying somatic protein-specific QTLs (spsQTLs), we reveal several gene mutations, including NF1 and MAP2K4 truncations and TP53 missenses showing disproportional influence on protein abundance not readily explained by transcriptomics. 
Cross-validating with data from massively parallel assays of variant effects (MAVE), TP53 missenses associated with high tumor TP53 proteins are more likely to be experimentally confirmed as functional.</p><p><strong>Conclusion: </strong>This study reveals that somatic mutations can exhibit distinct impacts on mRNA and protein levels, underscoring the necessity of integrating proteogenomic data to comprehensively identify functionally significant cancer mutations. These insights provide a framework for prioritizing mutations for further functional validation and therapeutic targeting.</p>","PeriodicalId":12581,"journal":{"name":"GigaScience","volume":"14 ","pages":""},"PeriodicalIF":11.8,"publicationDate":"2025-01-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11702362/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142947474","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
GigaScience | Pub Date: 2025-01-06 | DOI: 10.1093/gigascience/giae121
Zhihui Yuan, Maximilian Rembe, Martin Mascher, Nils Stein, Axel Himmelbach, Murukarthick Jayakodi, Andreas Börner, Klaus Oldach, Ahmed Jahoor, Jens Due Jensen, Julia Rudloff, Viktoria-Elisabeth Dohrendorf, Luisa Pauline Kuhfus, Emmanuelle Dyrszka, Matthieu Conte, Frederik Hinz, Salim Trouchaud, Jochen C Reif, Samira El Hanafi
{"title":"High-quality phenotypic and genotypic dataset of barley genebank core collection to unlock untapped genetic diversity.","authors":"Zhihui Yuan, Maximilian Rembe, Martin Mascher, Nils Stein, Axel Himmelbach, Murukarthick Jayakodi, Andreas Börner, Klaus Oldach, Ahmed Jahoor, Jens Due Jensen, Julia Rudloff, Viktoria-Elisabeth Dohrendorf, Luisa Pauline Kuhfus, Emmanuelle Dyrszka, Matthieu Conte, Frederik Hinz, Salim Trouchaud, Jochen C Reif, Samira El Hanafi","doi":"10.1093/gigascience/giae121","DOIUrl":"10.1093/gigascience/giae121","url":null,"abstract":"<p><strong>Background: </strong>Genebanks around the globe serve as valuable repositories of genetic diversity, offering not only access to a broad spectrum of plant material but also critical resources for enhancing crop resilience, advancing scientific research, and supporting global food security. To this end, traditional genebanks are evolving into biodigital resource centers where the integration of phenotypic and genotypic data for accessions can drive more informed decision-making, optimize resource allocation, and unlock new opportunities for plant breeding and research. However, the curation and availability of interoperable phenotypic and genotypic data for genebank accessions is still in its infancy and represents an obstacle to rapid scientific discoveries in this field. Therefore, effectively promoting FAIR (i.e., findable, accessible, interoperable, and reusable) access to these data is vital for maximizing the potential of genebanks and driving progress in agricultural innovation.</p><p><strong>Findings: </strong>Here we provide whole genome sequencing data of 812 barley (Hordeum vulgare L.) plant genetic resources and 298 European elite materials released between 1949 and 2021, as well as the phenotypic data for 4 disease resistance traits and 3 agronomic traits. 
The robustness of the investigated traits and the interoperability of genomic and phenotypic data were assessed in the current publication, aiming to make this panel publicly available as a resource for future genetic research in barley.</p><p><strong>Conclusions: </strong>The data showed broad phenotypic variability and high association mapping potential, offering a key resource for identifying genebank donors with untapped genes to advance barley breeding while safeguarding genetic diversity.</p>","PeriodicalId":12581,"journal":{"name":"GigaScience","volume":"14 ","pages":""},"PeriodicalIF":11.8,"publicationDate":"2025-01-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11811526/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143390809","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}