GigaScience latest articles, page 8

ColoPola: A polarimetric imaging dataset for colorectal cancer detection.

IF 11.8, Zone 2 (Biology)

GigaScience Pub Date : 2025-01-06 DOI: 10.1093/gigascience/giaf120

Thi-Thu-Hien Pham, Quoc-Hoang-Quyen Vo, Thao-Vi Nguyen, The-Hiep Nguyen, Quoc-Hung Phan, Thanh-Hai Le

Abstract

Background: In recent years, polarimetric imaging has been developed for various biological applications, including tissue morphological characterization and cancer stage detection. However, to facilitate classification models based on the characteristics of polarization states, it is essential to develop a consistent and standardized dataset of polarimetric images.

Findings: This study presents a dataset of colorectal cancer polarimetric images designated as ColoPola, which is intended to facilitate research efforts in the field. The dataset consists of 572 sample slices (288 healthy and 284 malignant). For each slice, 36 polarimetric images corresponding to different polarization states are provided. Thus, ColoPola contains 20,592 polarimetric images, of which 10,368 correspond to healthy samples and 10,224 to malignant samples. To the best of the authors' knowledge, the dataset is the first of its kind for colorectal cancer images. The practical utility of the dataset is evaluated using 5 models: 3 models constructed from scratch (CNN, CNN_2, and EfficientFormerV2) and 2 pretrained models (DenseNet and EfficientNetV2). For each model, the input has a size of 224 × 224 × 36, corresponding to the width, height, and red channel value of the polarimetric images, respectively.

Conclusions: The results show that the CNN, CNN_2, EfficientFormerV2, DenseNet, and EfficientNetV2 models obtain F1 scores of 0.870, 0.862, 0.908, 0.903, and 0.965, respectively, on the testing set. Among the 5 models, EfficientNetV2 achieves the best performance, with all the performance metrics exceeding 0.95 for both the validation set and the testing set. Overall, the results suggest that ColoPola has significant potential as a polarimetric optical imaging-based diagnostic tool for colorectal cancer in clinical practice.

Volume: 14. Publication type: Journal Article. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12530094/pdf/
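The dataset sizes and the 36-channel model input described in the ColoPola abstract can be checked with a short sketch. The counts come from the abstract; the function names and the nested-list image layout are illustrative assumptions, not the authors' code.

```python
# Illustrative sketch only. Counts are taken from the abstract; the helper
# names and the nested-list image representation are assumptions.

def image_counts(healthy_slices, malignant_slices, states_per_slice):
    """Return the number of polarimetric images per class and in total."""
    return {
        "healthy": healthy_slices * states_per_slice,
        "malignant": malignant_slices * states_per_slice,
        "total": (healthy_slices + malignant_slices) * states_per_slice,
    }


def stack_states(images):
    """Stack single-channel images (each height x width) into one
    height x width x channels array, one channel per polarization state."""
    height, width = len(images[0]), len(images[0][0])
    for img in images:
        if len(img) != height or any(len(row) != width for row in img):
            raise ValueError("all images must share the same height and width")
    return [
        [[img[r][c] for img in images] for c in range(width)]
        for r in range(height)
    ]


counts = image_counts(288, 284, 36)
print(counts)
```

With 36 images of 224 × 224 pixels per slice, `stack_states` would yield the 224 × 224 × 36 input shape mentioned in the abstract.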

Citations: 0

Knowledge graph-based thought: a knowledge graph-enhanced LLM framework for pan-cancer question answering.

IF 11.8, Zone 2 (Biology)

GigaScience Pub Date : 2025-01-06 DOI: 10.1093/gigascience/giae082

Yichun Feng, Lu Zhou, Chao Ma, Yikai Zheng, Ruikun He, Yixue Li

Abstract

Background: In recent years, large language models (LLMs) have shown promise in various domains, notably in biomedical sciences. However, their real-world application is often limited by issues like erroneous outputs and hallucinatory responses.

Results: We developed the knowledge graph-based thought (KGT) framework, an innovative solution that integrates LLMs with knowledge graphs (KGs) to improve their initial responses by utilizing verifiable information from KGs, thus significantly reducing factual errors in reasoning. The KGT framework demonstrates strong adaptability and performs well across various open-source LLMs. Notably, KGT can facilitate the discovery of new uses for existing drugs through potential drug-cancer associations and can assist in predicting resistance by analyzing relevant biomarkers and genetic mechanisms. To evaluate the knowledge graph question answering task within biomedicine, we utilize a pan-cancer knowledge graph to develop a pan-cancer question answering benchmark, named pan-cancer question answering.

Conclusions: The KGT framework substantially improves the accuracy and utility of LLMs in the biomedical field. This study serves as a proof of concept, demonstrating its exceptional performance in biomedical question answering.

Volume: 14. Publication type: Journal Article. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11702363/pdf/

Citations: 0

Characteristics and filtering of low-frequency artificial short deletion variations based on nanopore sequencing.

IF 11.8, Zone 2 (Biology)

GigaScience Pub Date : 2025-01-06 DOI: 10.1093/gigascience/giaf018

Fuqiang Ye, Juanjuan Zhu, Xiaomin Zhang, Jiarong Zhang, Zihan Xie, Tingting Yang, Yifang Han, Xiaohong Yang, Zilin Ren, Ming Ni

Abstract

Background: Nanopore sequencing is characterized by high portability and long reads, albeit accompanied by systematic errors causing short deletions. Few tools can filter low-frequency artificial deletions, especially in single samples.

Results: To solve this problem, we first synthesized or purchased 17 DNA/RNA standards for nanopore sequencing with R9 and R10 flowcells to obtain benchmarking datasets. False-positive (FP) deletions were prevalent (75.86%-96.26%), while the majority (62.07%-79.68%) were located in homopolymeric regions. The 10-mer base-quality scores (Q scores) and sequencing speeds flanking the FP homopolymeric deletions marginally differed from the true-positive (TP) deletions. We thus investigated the raw current signals after normalizing them by length. We found more significant differences in current signals between the reads with and without FP deletions. Indexes including the MRPP A (Multiple Response Permutation Procedure, statistic A), the accumulative difference of normalized current signals, and the Q score were tested for the power of distinguishing between FP and TP deletions. MRPP A outperformed the other indexes in homopolymeric regions and achieved the highest accuracy of 76.73% for challenging 1-base homopolymeric deletions. When sequencing depth was low, the Q score performed better than MRPP A. We developed Delter (Deletion filter) to filter low-frequency FP deletions of nanopore sequencing in single samples, which removed 60.98% to 100% of artificial homopolymeric deletions in real samples.

Conclusions: Low-frequency artificial short deletion variations, especially the most challenging homopolymeric deletions, could be effectively filtered by Delter using normalized current signals or Q scores according to the employed sequencing strategies.

Volume: 14. Publication type: Journal Article. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11927395/pdf/

Citations: 0

A telomere-to-telomere phased genome of an octoploid strawberry reveals a receptor kinase conferring anthracnose resistance.

IF 11.8, Zone 2 (Biology)

GigaScience Pub Date : 2025-01-06 DOI: 10.1093/gigascience/giaf005

Hyeondae Han, Natalia Salinas, Christopher R Barbey, Yoon Jeong Jang, Zhen Fan, Sujeet Verma, Vance M Whitaker, Seonghee Lee

Abstract

Background: Cultivated strawberry (Fragaria × ananassa Duch.), an allo-octoploid species arising from at least 3 diploid progenitors, poses a challenge for genomic analysis due to its high levels of heterozygosity and the complex nature of its polyploid genome.

Results: This study developed the complete haplotype-phased genome sequence from a short-day strawberry, 'Florida Brilliance', without parental data, assembling 56 chromosomes from telomere to telomere. This assembly was achieved with high-fidelity long reads and high-throughput chromatin capture sequencing (Hi-C). The centromere core regions and 96,104 genes were annotated using long-read isoform RNA sequencing. Using the high-quality haplotype-phased reference genome, FaFB1, we identified the causal mutation within the gene encoding Leaf Rust 10 Disease-Resistance Locus Receptor-like Protein Kinase (LRK10) that confers resistance to anthracnose fruit rot (AFR). This disease is caused by the Colletotrichum acutatum species complex and results in significant economic losses in strawberry production. Comparison of resistant and susceptible haplotype assemblies and full-length transcript data revealed a 29-bp insertion at the first exon of the susceptible allele, leading to a premature stop codon and loss of gene function. The functional role of LRK10 in resistance to AFR was validated using a simplified Agrobacterium-based transformation method for transient gene expression analysis in strawberry fruits. Transient knockdown and overexpression of LRK10 in fruit indicate a key role for LRK10 in AFR resistance in strawberry.

Conclusions: The FaFB1 assembly along with other resources will be valuable for the discovery of additional candidate genes associated with disease resistance and fruit quality, which will not only advance our understanding of genes and their functions but also facilitate advancements in genome editing in strawberry.

Volume: 14. Publication type: Journal Article. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11899574/pdf/

Citations: 0

Healthy microbiome - moving towards functional interpretation.

IF 11.8, Zone 2 (Biology)

GigaScience Pub Date : 2025-01-06 DOI: 10.1093/gigascience/giaf015

Kinga Zielińska, Klas I Udekwu, Witold Rudnicki, Alina Frolova, Paweł P Łabaj

Abstract

Background: Microbiome-based disease prediction has significant potential as an early, noninvasive marker of multiple health conditions linked to dysbiosis of the human gut microbiota, thanks in part to decreasing sequencing and analysis costs. Microbiome health indices and other computational tools currently proposed in the field often are based on a microbiome's species richness and are completely reliant on taxonomic classification. A resurgent interest in a metabolism-centric, ecological approach has led to an increased understanding of microbiome metabolic and phenotypic complexity, revealing substantial restrictions of taxonomy-reliant approaches.

Findings: In this study, we introduce a new metagenomic health index developed as an answer to recent developments in microbiome definitions, in an effort to distinguish between healthy and unhealthy microbiomes, with a focus here on inflammatory bowel disease (IBD). The novelty of our approach is a shift from a traditional Linnaean phylogenetic classification toward a more holistic consideration of the metabolic functional potential underlying ecological interactions between species. Based on well-explored data cohorts, we compare our method and its performance with the most comprehensive indices to date, the taxonomy-based Gut Microbiome Health Index (GMHI) and the high-dimensional principal component analysis (hiPCA) methods, as well as with the standard taxon- and function-based Shannon entropy scoring. After demonstrating better performance than the other methods on the initially targeted IBD cohorts, we retrain our index on an additional 27 datasets obtained from different clinical conditions and validate our index's ability to distinguish between healthy and disease states using a variety of complementary benchmarking approaches. Finally, we demonstrate its superiority over the GMHI and the hiPCA on a longitudinal COVID-19 cohort and highlight the distinct robustness of our method to sequencing depth.

Conclusions: Overall, we emphasize the potential of this metagenomic approach and advocate a shift toward functional approaches to better understand and assess microbiome health as well as provide directions for future index enhancements. Our method, q2-predict-dysbiosis (Q2PD), is freely available (https://github.com/Kizielins/q2-predict-dysbiosis).

Volume: 14. Publication type: Journal Article. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11927397/pdf/
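The abstract uses Shannon entropy scoring as a standard baseline. A minimal sketch of that baseline, assuming a sample is given as a list of non-negative abundances (of taxa or of functions) and using the natural logarithm, could look like this. It is not the Q2PD index itself, and the function name is hypothetical.

```python
import math


def shannon_entropy(abundances):
    """Shannon entropy H = -sum(p_i * ln p_i) of an abundance profile.

    `abundances` holds non-negative counts or relative abundances for the
    features (taxa or functions) of one sample; zero entries are skipped.
    """
    if any(a < 0 for a in abundances):
        raise ValueError("abundances must be non-negative")
    total = sum(abundances)
    if total <= 0:
        raise ValueError("at least one abundance must be positive")
    entropy = 0.0
    for a in abundances:
        if a > 0:
            p = a / total
            entropy -= p * math.log(p)
    return entropy


# An even community scores higher than one dominated by a single feature.
print(shannon_entropy([25, 25, 25, 25]))
print(shannon_entropy([97, 1, 1, 1]))
```

The same function applies to taxon-based and function-based profiles; only the meaning of the features changes.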

Citations: 0

A large collection of bioinformatics question-query pairs over federated knowledge graphs: methodology and applications.

IF 11.8, Zone 2 (Biology)

GigaScience Pub Date : 2025-01-06 DOI: 10.1093/gigascience/giaf045

Jerven Bolleman, Vincent Emonet, Adrian Altenhoff, Amos Bairoch, Marie-Claude Blatter, Alan Bridge, Séverine Duvaud, Elisabeth Gasteiger, Dmitry Kuznetsov, Sébastien Moretti, Pierre-Andre Michel, Anne Morgat, Marco Pagni, Nicole Redaschi, Monique Zahn-Zabal, Tarcisio Mendes de Farias, Ana Claudia Sima

Abstract

Background: In recent decades, several life science resources have structured data using the same framework and made these accessible using the same query language to facilitate interoperability. Knowledge graphs have seen increased adoption in bioinformatics due to their advantages for representing data in a generic graph format. For example, yummydata.org catalogs more than 60 knowledge graphs accessible through SPARQL, a technical query language. Although SPARQL allows powerful, expressive queries, even across physically distributed knowledge graphs, formulating such queries is a challenge for most users. Therefore, to guide users in retrieving the relevant data, many of these resources provide representative examples. These examples can also be an important source of information for machine learning (for example, machine-learning algorithms for translating natural language questions to SPARQL), if a sufficiently large number of examples are provided and published in a common, machine-readable, and standardized format across different resources.

Findings: We introduce a large collection of human-written natural language questions and their corresponding SPARQL queries over federated bioinformatics knowledge graphs (KGs) collected for several years across different research groups at the SIB Swiss Institute of Bioinformatics. The collection comprises more than 1,000 example questions and queries, including almost 100 federated queries. We propose a methodology to uniformly represent the examples with minimal metadata, based on existing standards. Furthermore, we introduce an extensive set of open-source applications, including query graph visualizations and smart query editors, easily reusable by KG maintainers who adopt the proposed methodology.

Conclusions: We encourage the community to adopt and extend the proposed methodology, towards richer KG metadata and improved Semantic Web services. URL: https://github.com/sib-swiss/sparql-examples.

Volume: 14. Publication type: Journal Article. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12083453/pdf/

Citations: 0

SpatialSNV: A novel method for identifying and analyzing spatially resolved SNVs in tumor microenvironments.

IF 11.8, Zone 2 (Biology)

GigaScience Pub Date : 2025-01-06 DOI: 10.1093/gigascience/giaf065

Yi Liu, Fan Zhu, Xinxing Li, Xiangyu Guan, Yong Hou, Yu Feng, Xuan Dong, Young Li

Abstract

Background: The dynamics of single-nucleotide variants (SNVs) play a critical role in understanding tumor development, yet their influence on shaping tumor microenvironments remains largely unexplored. Spatial transcriptomics offers an opportunity to map SNVs within the tumor context, potentially uncovering new insights into tumor microenvironment dynamics.

Results: This study developed SpatialSNV for identifying effective SNVs across tumor sections using multiple spatial transcriptomics platforms. The analysis revealed that SNVs reflect regional tumor evolutionary traces and extend beyond RNA expression changes. The tumor margins exhibited a distinct mutational profile, with novel SNVs diminishing in a distance-dependent manner from the tumor boundary. These mutations were significantly linked to inflammatory and hypoxic microenvironments. Furthermore, spatially correlated SNV groups were identified, exhibiting distinct spatial patterns and implicating specific roles in tumor-immune system crosstalk. Among these, critical SNVs such as S100A11 L40P in colorectal cancer were identified as tumor region-specific mutations. This mutation, located within exonic nonsynonymous regions, may produce neoantigens presented by HLAs, marking it as a potential therapeutic target.

Conclusions: SpatialSNV represents a promising framework for unraveling the mechanisms underlying tumor-immune crosstalk within the tumor microenvironment by leveraging spatial transcriptomics and SNV-based tissue domain characterization. This approach is designed to be scalable, integrative, and adaptable, making it accessible to researchers aiming to explore tumor heterogeneity and identify therapeutic targets.

Volume: 14. Publication type: Journal Article. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12166308/pdf/

Citations: 0

Deterministic succession patterns in the rumen and fecal microbiome associate with host metabolic shifts in peripartum dairy cattle.

IF 11.8, Zone 2 (Biology)

GigaScience Pub Date : 2025-01-06 DOI: 10.1093/gigascience/giaf042

Shuo Wang, Fanlin Kong, Dongwen Dai, Chen Li, Yangyi Hao, Erdan Wang, Zhijun Cao, Yajing Wang, Wei Wang, Shengli Li

Abstract

Background: Metabolic disorders in peripartum ruminants affect health and productivity, with gut microbiota playing a key role in host metabolism. Therefore, our study aimed to characterize the gut microbiota of peripartum dairy cows to better understand the relationship between metabolic phenotypes and the rumen and fecal microbiomes during the peripartum period.

Results: In a longitudinal study of 91 peripartum cows, we analyzed rumen and fecal microbiomes via 16S rRNA and metagenomic sequencing across six time points. By using enterotype classification, ecological model, and random forest analysis, we identified distinct deterministic succession patterns in the rumen and fecal microbiomes (rumen: rapid transition-transition-stable; hindgut: stable-transition-stable). Key microbes, such as Succiniclasticum and Bifidobacterium, were found to drive microbial succession by balancing stochastic and deterministic processes. Notably, we observed that changes in gut microbiota succession patterns significantly influenced metabolic phenotypes (e.g., serum non-esterified fatty acid, glucose, and insulin levels). Mediation analysis suggested that specific gut microbes (e.g., Prevotella sp900315525 in the rumen and Alistipes sp015059845 in the hindgut) and metabolic pathways (e.g., glucose-related pathway) were associated with host metabolic phenotypes.

Conclusions: Overall, utilizing a large gut microbiome dataset and enterotype- and ecological model-based microbiome analyses, we comprehensively elucidated the succession and assembly of the gut microbiota in peripartum dairy cows. We further confirmed that changes in gut microbiota succession patterns were significantly related to the metabolic phenotypes of peripartum dairy cows. These findings provide valuable insights for developing health management strategies for peripartum ruminants.

Volume: 14. Publication type: Journal Article. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12087452/pdf/

Citations: 0

PISAD: reference-free intraspecies sample anomalies detection tool based on k-mer counting.

IF 11.8, Zone 2 (Biology)

GigaScience Pub Date : 2025-01-06 DOI: 10.1093/gigascience/giaf061

Zhantian Xu, Fan Nie, Jianxin Wang

Abstract

Background: Genomic sequencing research often requires the simultaneous analysis of heterogeneous data types across single or multiple individuals, introducing a substantial risk of sample swaps (e.g., labeling errors). Existing methods primarily rely on reference information, requiring the preselection of informative variant sites with a population allele frequency around 0.5, which may be insufficient or unavailable for nonmodel organisms. As research expands to encompass a growing number of new species, a robust quality control tool will become increasingly important.

Findings: We developed PISAD (Phased Intraspecies Sample Anomalies Detection), a tool for validating sample identities in whole-genome sequencing (WGS) data without requiring reference information. It uses a 2-stage approach: first, it performs rapid, reference-free single nucleotide polymorphism (SNP) calling on low-error-rate data from the target individual to create a variant sketch; then, it assesses the concordance of other samples on this sketch to verify relationships. We assessed the performance and efficiency of PISAD on Homo sapiens, Bos taurus, Gallus gallus, Arctia plantaginis, and Pyrus species.

Conclusions: Our evaluation showed that PISAD achieves a lower data coverage requirement (0.5×) compared to the reference-based tool ntsm and is broadly applicable to multiple diploid species.

Volume: 14. Publication type: Journal Article. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12202988/pdf/
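The k-mer counting that the PISAD title refers to can be illustrated with a small, reference-free sketch. This is a deliberately simplified stand-in: it counts canonical k-mers and measures how many k-mers of a given sketch appear in another sample, whereas the abstract describes SNP calling to build a variant sketch. All function names and the concordance measure are assumptions for illustration, not PISAD's implementation.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def count_kmers(reads, k):
    """Count canonical k-mers over all reads, skipping k-mers with non-ACGT bases."""
    counts = Counter()
    for read in reads:
        seq = read.upper()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if set(kmer) <= set("ACGT"):
                counts[canonical(kmer)] += 1
    return counts


def shared_fraction(sketch, sample_counts):
    """Fraction of sketch k-mers (assumed canonical) observed in a sample."""
    if not sketch:
        return 0.0
    seen = sum(1 for kmer in sketch if sample_counts[kmer] > 0)
    return seen / len(sketch)


target = count_kmers(["ACGTTGCA", "TTGCAAGG"], k=4)
other = count_kmers(["ACGTTGCA"], k=4)
print(shared_fraction(set(target), other))
```

Canonical k-mers make the counts independent of which strand a read came from, which is why reads need no alignment to a reference before they are compared.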

Citations: 0

Analysis-ready VCF at Biobank scale using Zarr.

IF 11.8, Zone 2 (Biology)

GigaScience Pub Date : 2025-01-06 DOI: 10.1093/gigascience/giaf049

Eric Czech, Will Tyler, Tom White, Ben Jeffery, Timothy R Millar, Benjamin Elsworth, Jérémy Guez, Jonny Hancox, Konrad J Karczewski, Alistair Miles, Sam Tallman, Per Unneberg, Rafal Wojdyla, Shadi Zabad, Jeff Hammerbacher, Jerome Kelleher

Abstract

Background: Variant Call Format (VCF) is the standard file format for interchanging genetic variation data and associated quality control metrics. The usual row-wise encoding of the VCF data model (either as text or packed binary) emphasizes efficient retrieval of all data for a given variant, but accessing data on a field or sample basis is inefficient. The Biobank-scale datasets currently available consist of hundreds of thousands of whole genomes and hundreds of terabytes of compressed VCF. Row-wise data storage is fundamentally unsuitable and a more scalable approach is needed.

Results: Zarr is a format for storing multidimensional data that is widely used across the sciences, and is ideally suited to massively parallel processing. We present the VCF Zarr specification, an encoding of the VCF data model using Zarr, along with fundamental software infrastructure for efficient and reliable conversion at scale. We show how this format is far more efficient than standard VCF-based approaches, and competitive with specialized methods for storing genotype data in terms of compression ratios and single-threaded calculation performance. We present case studies on subsets of 3 large human datasets (Genomics England: n = 78,195; Our Future Health: n = 651,050; All of Us: n = 245,394) along with whole genome datasets for Norway Spruce (n = 1,063) and SARS-CoV-2 (n = 4,484,157). We demonstrate the potential for VCF Zarr to enable a new generation of high-performance and cost-effective applications via illustrative examples using cloud computing and GPUs.

Conclusions: Large row-encoded VCF files are a major bottleneck for current research, and storing and processing these files incurs a substantial cost. The VCF Zarr specification, building on widely used, open-source technologies, has the potential to greatly reduce these costs, and may enable a diverse ecosystem of next-generation tools for analysing genetic variation data directly from cloud-based object stores, while maintaining compatibility with existing file-oriented workflows.

Volume: 14. Publication type: Journal Article. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12127038/pdf/
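The contrast the abstract draws between row-wise and field-wise access can be sketched in plain Python. This toy example does not use the Zarr library or the VCF Zarr specification; the field names, the `-1` marker for a missing allele, and both helper functions are assumptions chosen for illustration.

```python
# Toy illustration of row-wise vs column-wise layout. Field names and the -1
# missing-allele marker are assumptions, not the VCF Zarr specification.

# Row-wise: every variant record carries all of its fields together.
rows = [
    {"pos": 101, "qual": 50.0, "genotypes": [(0, 0), (0, 1), (1, 1)]},
    {"pos": 250, "qual": 31.5, "genotypes": [(0, 0), (0, 0), (-1, -1)]},
]


def to_columns(records):
    """Regroup row-wise records into one list per field (column-wise)."""
    return {field: [rec[field] for rec in records] for field in records[0]}


def alt_allele_frequency(genotype_column):
    """Per-variant frequency of non-reference alleles, ignoring missing calls."""
    frequencies = []
    for calls in genotype_column:
        alleles = [a for call in calls for a in call if a >= 0]
        if alleles:
            frequencies.append(sum(1 for a in alleles if a > 0) / len(alleles))
        else:
            frequencies.append(float("nan"))
    return frequencies


columns = to_columns(rows)
# Only the genotype column is read; "pos" and "qual" are never touched.
print(alt_allele_frequency(columns["genotypes"]))
```

In a row-wise file, the same calculation would have to parse every field of every record. Storing each field as its own array, as a columnar layout does, lets a computation read only the data it needs.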

Citations: 0