GigaScience | Pub Date: 2024-01-02 | DOI: 10.1093/gigascience/giae006
Yujie Huang, Longbiao Guo, Lingjuan Xie, Nianmin Shang, Dongya Wu, Chuyu Ye, Eduardo Carlos Rudell, Kazunori Okada, Qian-Hao Zhu, Beng-Kah Song, Daguang Cai, Aldo Merotto Junior, Lianyang Bai, Longjiang Fan
{"title":"A reference genome of Commelinales provides insights into the commelinids evolution and global spread of water hyacinth (Pontederia crassipes).","authors":"Yujie Huang, Longbiao Guo, Lingjuan Xie, Nianmin Shang, Dongya Wu, Chuyu Ye, Eduardo Carlos Rudell, Kazunori Okada, Qian-Hao Zhu, Beng-Kah Song, Daguang Cai, Aldo Merotto Junior, Lianyang Bai, Longjiang Fan","doi":"10.1093/gigascience/giae006","DOIUrl":"10.1093/gigascience/giae006","url":null,"abstract":"<p><p>Commelinales belongs to the commelinids clade, which also comprises Poales that includes the most important monocot species, such as rice, wheat, and maize. No reference genome of Commelinales is currently available. Water hyacinth (Pontederia crassipes or Eichhornia crassipes), a member of Commelinales, is one of the most devastating aquatic weeds, although it is also grown as an ornamental and medicinal plant. Here, we present a chromosome-scale reference genome of the tetraploid water hyacinth with a total length of 1.22 Gb (over 95% of the estimated size) across 8 pseudochromosome pairs. With the representative genomes, we reconstructed a phylogeny of the commelinids, which supported Zingiberales and Commelinales being sister lineages of Arecales and shed light on the controversial relationship of the orders. We also reconstructed ancestral karyotypes of the commelinids clade and confirmed the ancient commelinids genome having 8 chromosomes but not 5 as previously reported. Gene family analysis revealed contraction of disease-resistance genes during polyploidization of water hyacinth, likely a result of fitness requirement for its role as a weed. Genetic diversity analysis using 9 water hyacinth lines from 3 continents (South America, Asia, and Europe) revealed very closely related nuclear genomes and almost identical chloroplast genomes of the materials, and provided clues about the global dispersal of water hyacinth. The genomic resources of P. crassipes reported here contribute a crucial missing link of the commelinids species and offer novel insights into their phylogeny.</p>","PeriodicalId":12581,"journal":{"name":"GigaScience","volume":null,"pages":null},"PeriodicalIF":11.8,"publicationDate":"2024-01-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10938897/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140133765","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
GigaScience | Pub Date: 2024-01-02 | DOI: 10.1093/gigascience/giad111
Akshay Akshay, Mitali Katoch, Navid Shekarchizadeh, Masoud Abedi, Ankush Sharma, Fiona C Burkhard, Rosalyn M Adam, Katia Monastyrskaya, Ali Hashemi Gheinani
{"title":"Machine Learning Made Easy (MLme): a comprehensive toolkit for machine learning-driven data analysis.","authors":"Akshay Akshay, Mitali Katoch, Navid Shekarchizadeh, Masoud Abedi, Ankush Sharma, Fiona C Burkhard, Rosalyn M Adam, Katia Monastyrskaya, Ali Hashemi Gheinani","doi":"10.1093/gigascience/giad111","DOIUrl":"10.1093/gigascience/giad111","url":null,"abstract":"<p><strong>Background: </strong>Machine learning (ML) has emerged as a vital asset for researchers to analyze and extract valuable information from complex datasets. However, developing an effective and robust ML pipeline can present a real challenge, demanding considerable time and effort, thereby impeding research progress. Existing tools in this landscape require a profound understanding of ML principles and programming skills. Furthermore, users are required to engage in the comprehensive configuration of their ML pipeline to obtain optimal performance.</p><p><strong>Results: </strong>To address these challenges, we have developed a novel tool called Machine Learning Made Easy (MLme) that streamlines the use of ML in research, specifically focusing on classification problems at present. By integrating 4 essential functionalities (namely, Data Exploration, AutoML, CustomML, and Visualization), MLme fulfills the diverse requirements of researchers while eliminating the need for extensive coding efforts. To demonstrate the applicability of MLme, we conducted rigorous testing on 6 distinct datasets, each presenting unique characteristics and challenges. Our results consistently showed promising performance across different datasets, reaffirming the versatility and effectiveness of the tool. Additionally, by utilizing MLme's feature selection functionality, we successfully identified significant markers for CD8+ naive (BACH2), CD16+ (CD16), and CD14+ (VCAN) cell populations.</p><p><strong>Conclusion: </strong>MLme serves as a valuable resource for leveraging ML to facilitate insightful data analysis and enhance research outcomes, while alleviating concerns related to complex coding scripts. The source code and a detailed tutorial for MLme are available at https://github.com/FunctionalUrology/MLme.</p>","PeriodicalId":12581,"journal":{"name":"GigaScience","volume":null,"pages":null},"PeriodicalIF":11.8,"publicationDate":"2024-01-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10783149/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139416804","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
GigaScience | Pub Date: 2024-01-02 | DOI: 10.1093/gigascience/giad113
Sheeba Samuel, Daniel Mietchen
{"title":"Computational reproducibility of Jupyter notebooks from biomedical publications.","authors":"Sheeba Samuel, Daniel Mietchen","doi":"10.1093/gigascience/giad113","DOIUrl":"10.1093/gigascience/giad113","url":null,"abstract":"<p><strong>Background: </strong>Jupyter notebooks facilitate the bundling of executable code with its documentation and output in one interactive environment, and they represent a popular mechanism to document and share computational workflows, including for research publications. The reproducibility of computational aspects of research is a key component of scientific reproducibility but has not yet been assessed at scale for Jupyter notebooks associated with biomedical publications.</p><p><strong>Approach: </strong>We address computational reproducibility at 2 levels: (i) using fully automated workflows, we analyzed the computational reproducibility of Jupyter notebooks associated with publications indexed in the biomedical literature repository PubMed Central. We identified such notebooks by mining the article's full text, trying to locate them on GitHub, and attempting to rerun them in an environment as close to the original as possible. We documented reproduction success and exceptions and explored relationships between notebook reproducibility and variables related to the notebooks or publications. (ii) This study represents a reproducibility attempt in and of itself, using essentially the same methodology twice on PubMed Central over the course of 2 years, during which the corpus of Jupyter notebooks from articles indexed in PubMed Central has grown in a highly dynamic fashion.</p><p><strong>Results: </strong>Out of 27,271 Jupyter notebooks from 2,660 GitHub repositories associated with 3,467 publications, 22,578 notebooks were written in Python, including 15,817 that had their dependencies declared in standard requirement files and that we attempted to rerun automatically. For 10,388 of these, all declared dependencies could be installed successfully, and we reran them to assess reproducibility. Of these, 1,203 notebooks ran through without any errors, including 879 that produced results identical to those reported in the original notebook and 324 for which our results differed from the originally reported ones. Running the other notebooks resulted in exceptions.</p><p><strong>Conclusions: </strong>We zoom in on common problems and practices, highlight trends, and discuss potential improvements to Jupyter-related workflows associated with biomedical publications.</p>","PeriodicalId":12581,"journal":{"name":"GigaScience","volume":null,"pages":null},"PeriodicalIF":11.8,"publicationDate":"2024-01-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10783158/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139416803","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
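The notebook counts reported in the abstract above form a funnel, and the attrition at each stage is easier to see as percentages. The following is a minimal sketch that uses only the counts quoted in the abstract; the stage labels and the helper `stage_rates` are illustrative names chosen here, not anything taken from the paper or its code.

```python
# Counts as quoted in the abstract; labels are illustrative, not the authors' terms.
funnel = [
    ("all notebooks", 27271),
    ("Python notebooks", 22578),
    ("dependencies declared", 15817),
    ("dependencies installed", 10388),
    ("ran without errors", 1203),
    ("identical results", 879),
]
differing_results = 324  # error-free runs whose results differed from the original


def stage_rates(stages):
    """Return (label, count, share of previous stage, share of first stage)."""
    first = stages[0][1]
    prev = first
    rows = []
    for label, count in stages:
        rows.append((label, count, count / prev, count / first))
        prev = count
    return rows


for label, count, of_prev, of_all in stage_rates(funnel):
    print(f"{label:24s} {count:6d}  {of_prev:6.1%} of previous  {of_all:6.1%} of all")
```

Read this way, the two subgroups of error-free notebooks (879 identical, 324 differing) add up to the 1,203 reported, and the share of all notebooks that reproduced identically is a few percent.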
GigaScience | Pub Date: 2024-01-02 | DOI: 10.1093/gigascience/giae068
Danilo Bzdok, Guy Wolf, Jakub Kopal
{"title":"Harnessing population diversity: in search of tools of the trade.","authors":"Danilo Bzdok, Guy Wolf, Jakub Kopal","doi":"10.1093/gigascience/giae068","DOIUrl":"https://doi.org/10.1093/gigascience/giae068","url":null,"abstract":"<p><p>Big neuroscience datasets are not big small datasets when it comes to quantitative data analysis. Neuroscience has now witnessed the advent of many population cohort studies that deep-profile participants, yielding hundreds of measures, capturing dimensions of each individual's position in the broader society. Indeed, there is a rebalancing from small, strictly selected, and thus homogenized cohorts toward always larger, more representative, and thus diverse cohorts. This shift in cohort composition is prompting the revision of incumbent modeling practices. Major sources of population stratification increasingly overshadow the subtle effects that neuroscientists are typically studying. In our opinion, as we sample individuals from always wider diversity backgrounds, we will require a new stack of quantitative tools to realize diversity-aware modeling. We here take inventory of candidate analytical frameworks. Better incorporating driving factors behind population structure will allow refining our understanding of how brain-behavior relationships depend on human subgroups.</p>","PeriodicalId":12581,"journal":{"name":"GigaScience","volume":null,"pages":null},"PeriodicalIF":11.8,"publicationDate":"2024-01-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11427908/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142344886","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
GigaScience | Pub Date: 2024-01-02 | DOI: 10.1093/gigascience/giae070
Yuanting Shen, Lidan Tao, Rengang Zhang, Gang Yao, Minjie Zhou, Weibang Sun, Yongpeng Ma
{"title":"Genomic insights into endangerment and conservation of the garlic-fruit tree (Malania oleifera), a plant species with extremely small populations.","authors":"Yuanting Shen, Lidan Tao, Rengang Zhang, Gang Yao, Minjie Zhou, Weibang Sun, Yongpeng Ma","doi":"10.1093/gigascience/giae070","DOIUrl":"10.1093/gigascience/giae070","url":null,"abstract":"<p><strong>Background: </strong>Advanced whole-genome sequencing techniques enable covering nearly all genome nucleotide variations and thus can provide deep insights into protecting endangered species. However, the use of genomic data to design conservation strategies is still rare, particularly for endangered plants. Here we performed comprehensive conservation genomic analysis for Malania oleifera, an endangered tree species with a high amount of nervonic acid. We used whole-genome resequencing data of 165 samples, covering 16 populations across the entire distribution range, to investigate the causes of its extremely small population sizes and to evaluate the possible genomic offsets and changes in ecological niche suitability under future climate change.</p><p><strong>Results: </strong>Although M. oleifera maintains relatively high genetic diversity among endangered woody plants (θπ = 3.87 × 10⁻³), high levels of inbreeding have been observed, which have reduced genetic diversity in 3 populations (JM, NP, and BM2) and caused the accumulation of deleterious mutations. Repeated bottleneck events, recent inbreeding (∼490 years ago), and anthropogenic disturbance to wild habitats have aggravated the fragmentation of M. oleifera and made it endangered. Due to the significant effect of higher average annual temperature, populations at low altitudes exhibit a greater genomic offset. Furthermore, ecological niche modeling shows the suitable habitats for M. oleifera will decrease by 71.15% and 98.79% in 2100 under scenarios SSP126 and SSP585, respectively.</p><p><strong>Conclusions: </strong>The basic realizations concerning the threats to M. oleifera provide a scientific foundation for defining management and adaptive units, as well as prioritizing populations for genetic rescue. Meanwhile, we highlight the importance of integrating genomic offset and ecological niche modeling to make targeted conservation actions under future climate change. Overall, our study provides a paradigm for genomics-directed conservation.</p>","PeriodicalId":12581,"journal":{"name":"GigaScience","volume":null,"pages":null},"PeriodicalIF":11.8,"publicationDate":"2024-01-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11417964/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142283910","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
GigaScience | Pub Date: 2024-01-02 | DOI: 10.1093/gigascience/giae064
Yan Lu, Fang Luo, An Zhou, Cun Yi, Hao Chen, Jian Li, Yunhai Guo, Yuxiang Xie, Wei Zhang, Datao Lin, Yaming Yang, Zhongdao Wu, Yi Zhang, Shuhua Xu, Wei Hu
{"title":"Whole-genome sequencing of the invasive golden apple snail Pomacea canaliculata from Asia reveals rapid expansion and adaptive evolution.","authors":"Yan Lu, Fang Luo, An Zhou, Cun Yi, Hao Chen, Jian Li, Yunhai Guo, Yuxiang Xie, Wei Zhang, Datao Lin, Yaming Yang, Zhongdao Wu, Yi Zhang, Shuhua Xu, Wei Hu","doi":"10.1093/gigascience/giae064","DOIUrl":"10.1093/gigascience/giae064","url":null,"abstract":"<p><p>Pomacea canaliculata, an invasive species native to South America, is recognized for its broad geographic distribution and adaptability to a variety of ecological conditions. The details concerning the evolution and adaptation of P. canaliculata remain unclear due to a lack of whole-genome resequencing data. We examined 173 P. canaliculata genomes representing 17 geographic populations in East and Southeast Asia. Interestingly, P. canaliculata showed a higher level of genetic diversity than other mollusks, and our analysis suggested that the dispersal of P. canaliculata could have been driven by climate changes and human activities. Notably, we identified a set of genes associated with low temperature adaptation, including Csde1, a cold shock protein coding gene. Further RNA sequencing analysis and reverse transcription quantitative polymerase chain reaction experiments demonstrated the gene's dynamic pattern and biological functions during cold exposure. Moreover, both positive selection and balancing selection are likely to have contributed to the rapid environmental adaptation of P. canaliculata populations. In particular, genes associated with energy metabolism and stress response were undergoing positive selection, while a large number of immune-related genes showed strong signatures of balancing selection. Our study has advanced our understanding of the evolution of P. canaliculata and has provided a valuable resource concerning an invasive species.</p>","PeriodicalId":12581,"journal":{"name":"GigaScience","volume":null,"pages":null},"PeriodicalIF":11.8,"publicationDate":"2024-01-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11417965/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142283912","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
GigaScience | Pub Date: 2024-01-02 | DOI: 10.1093/gigascience/giae080
Shizhuo Zhang, Jiyun Han, Juntao Liu
{"title":"Protein-protein and protein-nucleic acid binding site prediction via interpretable hierarchical geometric deep learning.","authors":"Shizhuo Zhang, Jiyun Han, Juntao Liu","doi":"10.1093/gigascience/giae080","DOIUrl":"10.1093/gigascience/giae080","url":null,"abstract":"<p><p>Identification of protein-protein and protein-nucleic acid binding sites provides insights into biological processes related to protein functions and technical guidance for disease diagnosis and drug design. However, accurate predictions by computational approaches remain highly challenging due to the limited knowledge of residue binding patterns. The binding pattern of a residue should be characterized by the spatial distribution of its neighboring residues combined with their physicochemical information interaction, which previous methods have not achieved. Here, we design GraphRBF, a hierarchical geometric deep learning model to learn residue binding patterns from big data. To this end, GraphRBF describes physicochemical information interactions by designing an enhanced graph neural network and characterizes residue spatial distributions by introducing a prioritized radial basis function neural network. After training and testing, GraphRBF shows great improvements over existing state-of-the-art methods and strong interpretability of its learned representations. Applying GraphRBF to the SARS-CoV-2 omicron spike protein, it successfully identifies known epitopes of the protein. Moreover, it predicts multiple potential binding regions for new nanobodies or even new drugs with strong evidence. A user-friendly online server for GraphRBF is freely available at http://liulab.top/GraphRBF/server.</p>","PeriodicalId":12581,"journal":{"name":"GigaScience","volume":null,"pages":null},"PeriodicalIF":11.8,"publicationDate":"2024-01-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11528319/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142557605","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
GigaScience | Pub Date: 2024-01-02 | DOI: 10.1093/gigascience/giae032
Hengchao Wang, Dong Xu, Fan Jiang, Sen Wang, Anqi Wang, Hangwei Liu, Lihong Lei, Wanqiang Qian, Wei Fan
{"title":"The genomes of Dahlia pinnata, Cosmos bipinnatus, and Bidens alba in tribe Coreopsideae provide insights into polyploid evolution and inulin biosynthesis.","authors":"Hengchao Wang, Dong Xu, Fan Jiang, Sen Wang, Anqi Wang, Hangwei Liu, Lihong Lei, Wanqiang Qian, Wei Fan","doi":"10.1093/gigascience/giae032","DOIUrl":"10.1093/gigascience/giae032","url":null,"abstract":"<p><strong>Background: </strong>The Coreopsideae tribe, a subset of the Asteraceae family, encompasses economically vital genera like Dahlia, Cosmos, and Bidens, which are widely employed in medicine, horticulture, ecology, and food applications. Nevertheless, the lack of reference genomes hinders evolutionary and biological investigations in this tribe.</p><p><strong>Results: </strong>Here, we present 3 haplotype-resolved chromosome-level reference genomes of the tribe Coreopsideae, including 2 popular flowering plants (Dahlia pinnata and Cosmos bipinnatus) and 1 invasive weed plant (Bidens alba), with assembled genome sizes of 3.93 Gb, 1.02 Gb, and 1.87 Gb, respectively. We found that Gypsy transposable elements contribute most to the larger genome size of D. pinnata, and multiple chromosome rearrangements have occurred in tribe Coreopsideae. Besides the shared whole-genome duplication (WGD-2) in the Heliantheae alliance, our analyses showed that D. pinnata and B. alba each underwent an independent recent WGD-3 event: in D. pinnata, it is more likely to be a self-WGD, while in B. alba, it is from the hybridization of 2 ancestral species. Further, we identified key genes in the inulin metabolic pathway and found that the pseudogenization of 1-FEH1 and 1-FEH2 genes in D. pinnata and the deletion of 3 key residues of 1-FFT proteins in C. bipinnatus and B. alba may explain why D. pinnata produces much more inulin than the other 2 plants.</p><p><strong>Conclusions: </strong>Collectively, the genomic resources for the Coreopsideae tribe will promote phylogenomics in Asteraceae plants, facilitate ornamental molecular breeding improvements and inulin production, and help prevent invasive weeds.</p>","PeriodicalId":12581,"journal":{"name":"GigaScience","volume":null,"pages":null},"PeriodicalIF":11.8,"publicationDate":"2024-01-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11170221/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141310461","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
GigaScience | Pub Date: 2024-01-02 | DOI: 10.1093/gigascience/giae034
Xueyan Shen, Jie Hu, José M Yáñez, Giana Bastos Gomes, Zhi Weng Josiah Poon, Derick Foster, Jorge F Alarcon, Libin Shao, Xinyu Guo, Yunchang Shao, Roger Huerlimann, Chengze Li, Evan Goulden, Kelli Anderson, Guangyi Fan, Jose A Domingos
{"title":"Exploring the cobia (Rachycentron canadum) genome: unveiling putative male heterogametic regions and identification of sex-specific markers.","authors":"Xueyan Shen, Jie Hu, José M Yáñez, Giana Bastos Gomes, Zhi Weng Josiah Poon, Derick Foster, Jorge F Alarcon, Libin Shao, Xinyu Guo, Yunchang Shao, Roger Huerlimann, Chengze Li, Evan Goulden, Kelli Anderson, Guangyi Fan, Jose A Domingos","doi":"10.1093/gigascience/giae034","DOIUrl":"10.1093/gigascience/giae034","url":null,"abstract":"<p><strong>Background: </strong>Cobia (Rachycentron canadum) is the only member of the Rachycentridae family and exhibits considerable sexual dimorphism in growth rate. Sex determination in teleosts has been a long-standing basic biological question, and the molecular mechanisms of sex determination/differentiation in cobia are completely unknown.</p><p><strong>Results: </strong>Here, we report 2 high-quality, chromosome-level annotated male and female cobia genomes with assembly sizes of 586.51 Mb (contig/scaffold N50: 86.0 kb/24.3 Mb) and 583.88 Mb (79.9 kb/22.5 Mb), respectively. Synteny inference among perciform genomes revealed that cobia and the remora Echeneis naucrates were sister groups. Further, whole-genome resequencing of 31 males and 60 females, genome-wide association study, and sequencing depth analysis identified 3 short male-specific regions within a 10.7-kb continuous genomic region on male chromosome 18, which hinted at an undifferentiated sex chromosome system with a putative XX/XY mode of sex determination in cobia. Importantly, the only 2 genes within/between the male-specific regions, epoxide hydrolase 1 (ephx1, renamed cephx1y) and transcription factor 24 (tcf24, renamed ctcf24y), showed testis-specific/biased gene expression, whereas their counterparts cephx1x and ctcf24x, located in female chromosome 18, were similarly expressed in both sexes. In addition, male-specific PCR targeting the cephx1y gene revealed that this genomic feature is conserved in cobia populations from Panama, Brazil, Australia, and Japan.</p><p><strong>Conclusion: </strong>The first comprehensive genomic survey presented here is a valuable resource for future studies on cobia population structure and dynamics, conservation, and evolutionary history. Furthermore, it establishes evidence of putative male heterogametic regions with 2 genes playing a potential role in the sex determination of the species, and it provides further support for the rapid evolution of sex-determining mechanisms in teleost fish.</p>","PeriodicalId":12581,"journal":{"name":"GigaScience","volume":null,"pages":null},"PeriodicalIF":11.8,"publicationDate":"2024-01-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11240236/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141590090","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"GSC: efficient lossless compression of VCF files with fast query.","authors":"Xiaolong Luo, Yuxin Chen, Ling Liu, Lulu Ding, Yuxiang Li, Shengkang Li, Yong Zhang, Zexuan Zhu","doi":"10.1093/gigascience/giae046","DOIUrl":"10.1093/gigascience/giae046","url":null,"abstract":"<p><strong>Background: </strong>With the rise of large-scale genome sequencing projects, genotyping of thousands of samples has produced immense variant call format (VCF) files. It is becoming increasingly challenging to store, transfer, and analyze these voluminous files. Compression methods have been used to tackle these issues, aiming for both high compression ratio and fast random access. However, existing methods have not yet achieved a satisfactory compromise between these 2 objectives.</p><p><strong>Findings: </strong>To address the aforementioned issue, we introduce GSC (Genotype Sparse Compression), a specialized and refined lossless compression tool for VCF files. In benchmark tests conducted across various open-source datasets, GSC showcased exceptional performance in genotype data compression. Compared with the industry's most advanced tools (namely, GBC and GTC), GSC achieved compression ratios that were higher by 26.9% to 82.4% over GBC and GTC on the datasets, respectively. In lossless compression scenarios, a mode not supported by either GBC or GTC, GSC also demonstrated robust performance, with compression ratios 1.5× to 6.5× greater than general-purpose tools like gzip, zstd, and BCFtools. Achieving such high compression ratios did require some reasonable trade-offs, including longer decompression times, with GSC being 1.2× to 2× slower than GBC, yet 1.1× to 1.4× faster than GTC. Moreover, GSC maintained decompression query speeds that were equivalent to its competitors. In terms of RAM usage, GSC outperformed both counterparts. Overall, GSC's comprehensive performance surpasses that of the most advanced technologies.</p><p><strong>Conclusion: </strong>GSC balances high compression ratios with rapid data access, enhancing genomic data management. It supports seamless PLINK binary format conversion, simplifying downstream analysis.</p>","PeriodicalId":12581,"journal":{"name":"GigaScience","volume":null,"pages":null},"PeriodicalIF":11.8,"publicationDate":"2024-01-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11258903/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141727098","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}