arXiv - QuanBio - Genomics: Latest Papers

scGHSOM: Hierarchical clustering and visualization of single-cell and CRISPR data using growing hierarchical SOM
arXiv - QuanBio - Genomics Pub Date : 2024-07-24 DOI: arxiv-2407.16984
Shang-Jung Wen, Jia-Ming Chang, Fang Yu
Abstract: High-dimensional single-cell data poses significant challenges in identifying underlying biological patterns due to the complexity and heterogeneity of cellular states. We propose a comprehensive gene-cell dependency visualization via unsupervised clustering, Growing Hierarchical Self-Organizing Map (GHSOM), specifically designed for analyzing high-dimensional single-cell data like single-cell sequencing and CRISPR screens. GHSOM is applied to cluster samples in a hierarchical structure such that the self-growth structure of clusters satisfies the required variations between and within clusters. We propose a novel Significant Attributes Identification Algorithm to identify features that distinguish clusters. This algorithm pinpoints attributes with minimal variation within a cluster but substantial variation between clusters. These key attributes can then be used for targeted data retrieval and downstream analysis. Furthermore, we present two innovative visualization tools: Cluster Feature Map and Cluster Distribution Map. The Cluster Feature Map highlights the distribution of specific features across the hierarchical structure of GHSOM clusters. This allows for rapid visual assessment of cluster uniqueness based on chosen features. The Cluster Distribution Map depicts leaf clusters as circles on the GHSOM grid, with circle size reflecting cluster data size and color customizable to visualize features like cell type or other attributes. We apply our analysis to three single-cell datasets and one CRISPR dataset (cell-gene database) and evaluate clustering methods with internal and external CH and ARI scores. GHSOM performs well, being the best performer in internal evaluation (CH=4.2). In external evaluation, GHSOM has the third-best performance of all methods.
Citations: 0
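The scGHSOM abstract above describes picking attributes with small variation inside a cluster and large variation between clusters. The following is a minimal sketch of that idea only; it is not the authors' Significant Attributes Identification Algorithm. The scoring rule (variance of the cluster means divided by the mean within-cluster variance), the function name `rank_features`, and the toy data are all assumptions made for illustration.

```python
from statistics import mean, pvariance


def rank_features(rows, labels):
    """Return feature indices, best first, by between/within variance ratio.

    rows   : list of samples, each a list of numeric feature values
    labels : cluster label for each sample (same order as rows)
    NOTE: this ratio is an illustrative assumption, not the paper's method.
    """
    # Group samples by their cluster label.
    clusters = {}
    for row, label in zip(rows, labels):
        clusters.setdefault(label, []).append(row)

    eps = 1e-12  # avoids division by zero for perfectly constant features
    scores = []
    for j in range(len(rows[0])):
        # Spread of the cluster means: large when clusters differ on feature j.
        between = pvariance([mean(r[j] for r in members)
                             for members in clusters.values()])
        # Average spread inside each cluster: small when clusters are tight.
        within = mean(pvariance([r[j] for r in members])
                      for members in clusters.values())
        scores.append((between / (within + eps), j))
    return [j for _, j in sorted(scores, reverse=True)]


# Toy data: feature 0 separates the two clusters, feature 1 does not.
rows = [[0.0, 5.0], [0.1, 1.0], [10.0, 5.2], [10.1, 0.9]]
labels = [0, 0, 1, 1]
ranking = rank_features(rows, labels)
```

A ratio is used here so that a feature is rewarded only when it is both tight within clusters and well separated between them; any real pipeline would need to choose thresholds and handle clusters with a single member.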
GV-Rep: A Large-Scale Dataset for Genetic Variant Representation Learning
arXiv - QuanBio - Genomics Pub Date : 2024-07-24 DOI: arxiv-2407.16940
Zehui Li, Vallijah Subasri, Guy-Bart Stan, Yiren Zhao, Bo Wang
Abstract: Genetic variants (GVs) are defined as differences in the DNA sequences among individuals and play a crucial role in diagnosing and treating genetic diseases. The rapid decrease in next generation sequencing cost has led to an exponential increase in patient-level GV data. This growth poses a challenge for clinicians who must efficiently prioritize patient-specific GVs and integrate them with existing genomic databases to inform patient management. To address the interpretation of GVs, genomic foundation models (GFMs) have emerged. However, these models lack standardized performance assessments, leading to considerable variability in model evaluations. This poses the question: How effectively do deep learning methods classify unknown GVs and align them with clinically-verified GVs? We argue that representation learning, which transforms raw data into meaningful feature spaces, is an effective approach for addressing both indexing and classification challenges. We introduce a large-scale Genetic Variant dataset, named GV-Rep, featuring variable-length contexts and detailed annotations, designed for deep learning models to learn GV representations across various traits, diseases, tissue types, and experimental contexts. Our contributions are three-fold: (i) Construction of a comprehensive dataset with 7 million records, each labeled with characteristics of the corresponding variants, alongside additional data from 17,548 gene knockout tests across 1,107 cell types, 1,808 variant combinations, and 156 unique clinically verified GVs from real-world patients. (ii) Analysis of the structure and properties of the dataset. (iii) Experimentation on the dataset with pre-trained GFMs. The results show a significant gap between GFMs' current capabilities and accurate GV representation. We hope this dataset will help advance genomic deep learning to bridge this gap.
Citations: 0
LSTM Autoencoder-based Deep Neural Networks for Barley Genotype-to-Phenotype Prediction
arXiv - QuanBio - Genomics Pub Date : 2024-07-21 DOI: arxiv-2407.16709
Guanjin Wang, Junyu Xuan, Penghao Wang, Chengdao Li, Jie Lu
Abstract: Artificial Intelligence (AI) has emerged as a key driver of precision agriculture, facilitating enhanced crop productivity, optimized resource use, farm sustainability, and informed decision-making. Also, the expansion of genome sequencing technology has greatly increased crop genomic resources, deepening our understanding of genetic variation and enhancing desirable crop traits to optimize performance in various environments. There is increasing interest in using machine learning (ML) and deep learning (DL) algorithms for genotype-to-phenotype prediction due to their excellence in capturing complex interactions within large, high-dimensional datasets. In this work, we propose a new LSTM autoencoder-based model for barley genotype-to-phenotype prediction, specifically for flowering time and grain yield estimation, which could potentially help optimize yields and management practices. Our model outperformed the other baseline methods, demonstrating its potential in handling complex high-dimensional agricultural datasets and enhancing crop phenotype prediction performance.
Citations: 0
A Benchmark Dataset for Multimodal Prediction of Enzymatic Function Coupling DNA Sequences and Natural Language
arXiv - QuanBio - Genomics Pub Date : 2024-07-21 DOI: arxiv-2407.15888
Yuchen Zhang, Ratish Kumar Chandrakant Jha, Soumya Bharadwaj, Vatsal Sanjaykumar Thakkar, Adrienne Hoarfrost, Jin Sun
Abstract: Predicting gene function from its DNA sequence is a fundamental challenge in biology. Many deep learning models have been proposed to embed DNA sequences and predict their enzymatic function, leveraging information in public databases linking DNA sequences to an enzymatic function label. However, much of the scientific community's knowledge of biological function is not represented in these categorical labels, and is instead captured in unstructured text descriptions of mechanisms, reactions, and enzyme behavior. These descriptions are often captured alongside DNA sequences in biological databases, albeit in an unstructured manner. Deep learning models that predict enzymatic function are likely to benefit from incorporating this multi-modal data encoding scientific knowledge of biological function. There is, however, no dataset designed for machine learning algorithms to leverage this multi-modal information. Here we propose a novel dataset and benchmark suite that enables the exploration and development of large multi-modal neural network models on gene DNA sequences and natural language descriptions of gene function. We present baseline performance on benchmarks for both unsupervised and supervised tasks that demonstrate the difficulty of this modeling objective, while demonstrating the potential benefit of incorporating multi-modal data types in function prediction compared to DNA sequences alone. Our dataset is at: https://hoarfrost-lab.github.io/BioTalk/.
Citations: 0
SpaDiT: Diffusion Transformer for Spatial Gene Expression Prediction using scRNA-seq
arXiv - QuanBio - Genomics Pub Date : 2024-07-18 DOI: arxiv-2407.13182
Xiaoyu Li, Fangfang Zhu, Wenwen Min
Abstract: The rapid development of spatial transcriptomics (ST) technologies is revolutionizing our understanding of the spatial organization of biological tissues. Current ST methods, categorized into next-generation sequencing-based (seq-based) and fluorescence in situ hybridization-based (image-based) methods, offer innovative insights into the functional dynamics of biological tissues. However, these methods are limited by their cellular resolution and the quantity of genes they can detect. To address these limitations, we propose SpaDiT, a deep learning method that utilizes a diffusion generative model to integrate scRNA-seq and ST data for the prediction of undetected genes. By employing a Transformer-based diffusion model, SpaDiT not only accurately predicts unknown genes but also effectively generates the spatial structure of ST genes. We have demonstrated the effectiveness of SpaDiT through extensive experiments on both seq-based and image-based ST data. SpaDiT significantly contributes to ST gene prediction methods with its innovative approach. Compared to eight leading baseline methods, SpaDiT achieved state-of-the-art performance across multiple metrics, highlighting its substantial bioinformatics contribution.
Citations: 0
Single-cell 3D genome reconstruction in the haploid setting using rigidity theory
arXiv - QuanBio - Genomics Pub Date : 2024-07-15 DOI: arxiv-2407.10700
Sean Dewar, Georg Grasegger, Kaie Kubjas, Fatemeh Mohammadi, Anthony Nixon
Abstract: This article considers the problem of 3-dimensional genome reconstruction for single-cell data, and the uniqueness of such reconstructions in the setting of haploid organisms. We consider multiple graph models as representations of this problem, and use techniques from graph rigidity theory to determine identifiability. Biologically, our models come from Hi-C data, microscopy data, and combinations thereof. Mathematically, we use unit ball and sphere packing models, as well as models consisting of distance and inequality constraints. In each setting, we describe and/or derive new results on realisability and uniqueness. We then propose a 3D reconstruction method based on semidefinite programming and apply it to synthetic and real data sets using our models.
Citations: 0
OmniGenome: Aligning RNA Sequences with Secondary Structures in Genomic Foundation Models
arXiv - QuanBio - Genomics Pub Date : 2024-07-15 DOI: arxiv-2407.11242
Heng Yang, Ke Li
Abstract: The structures of RNA sequences play a vital role in various cellular processes, while existing genomic foundation models (FMs) have struggled with precise sequence-structure alignment, due to the complexity of exponential combinations of nucleotide bases. In this study, we introduce OmniGenome, a foundation model that addresses this critical challenge of sequence-structure alignment in RNA FMs. OmniGenome bridges the sequences with secondary structures using structure-contextualized modeling, enabling hard in-silico genomic tasks that existing FMs cannot handle, e.g., RNA design tasks. The results on two comprehensive genomic benchmarks show that OmniGenome achieves state-of-the-art performance on complex RNA subtasks. For example, OmniGenome solved 74% of complex puzzles, compared to SpliceBERT, which solved only 3% of the puzzles. Besides, OmniGenome solves most of the puzzles within 1 hour, while the existing methods usually allocate 24 hours for each puzzle. Overall, OmniGenome establishes wide genomic application cases and offers profound insights into biological mechanisms from the perspective of sequence-structure alignment.
Citations: 0
CellAgent: An LLM-driven Multi-Agent Framework for Automated Single-cell Data Analysis
arXiv - QuanBio - Genomics Pub Date : 2024-07-13 DOI: arxiv-2407.09811
Yihang Xiao, Jinyi Liu, Yan Zheng, Xiaohan Xie, Jianye Hao, Mingzhi Li, Ruitao Wang, Fei Ni, Yuxiao Li, Jintian Luo, Shaoqing Jiao, Jiajie Peng
Abstract: Single-cell RNA sequencing (scRNA-seq) data analysis is crucial for biological research, as it enables the precise characterization of cellular heterogeneity. However, manual manipulation of various tools to achieve desired outcomes can be labor-intensive for researchers. To address this, we introduce CellAgent (http://cell.agent4science.cn/), an LLM-driven multi-agent framework, specifically designed for the automatic processing and execution of scRNA-seq data analysis tasks, providing high-quality results with no human intervention. Firstly, to adapt general LLMs to the biological field, CellAgent constructs LLM-driven biological expert roles - planner, executor, and evaluator - each with specific responsibilities. Then, CellAgent introduces a hierarchical decision-making mechanism to coordinate these biological experts, effectively driving the planning and step-by-step execution of complex data analysis tasks. Furthermore, we propose a self-iterative optimization mechanism, enabling CellAgent to autonomously evaluate and optimize solutions, thereby guaranteeing output quality. We evaluate CellAgent on a comprehensive benchmark dataset encompassing dozens of tissues and hundreds of distinct cell types. Evaluation results consistently show that CellAgent effectively identifies the most suitable tools and hyperparameters for single-cell analysis tasks, achieving optimal performance. This automated framework dramatically reduces the workload for science data analyses, bringing us into the "Agent for Science" era.
Citations: 0
FastImpute: A Baseline for Open-source, Reference-Free Genotype Imputation Methods -- A Case Study in PRS313
arXiv - QuanBio - Genomics Pub Date : 2024-07-12 DOI: arxiv-2407.09355
Aaron Ge, Jeya Balasubramanian, Xueyao Wu, Peter Kraft, Jonas S. Almeida
Abstract: Genotype imputation enhances genetic data by predicting missing SNPs using reference haplotype information. Traditional methods leverage linkage disequilibrium (LD) to infer untyped SNP genotypes, relying on the similarity of LD structures between genotyped target sets and fully sequenced reference panels. Recently, reference-free deep learning-based methods have emerged, offering a promising alternative by predicting missing genotypes without external databases, thereby enhancing privacy and accessibility. However, these methods often produce models with tens of millions of parameters, leading to challenges such as the need for substantial computational resources to train and inefficiency for client-side deployment. Our study addresses these limitations by introducing a baseline for a novel genotype imputation pipeline that supports client-side imputation models generalizable across any genotyping chip and genomic region. This approach enhances patient privacy by performing imputation directly on edge devices. As a case study, we focus on PRS313, a polygenic risk score comprising 313 SNPs used for breast cancer risk prediction. Utilizing consumer genetic panels such as 23andMe, our model democratizes access to personalized genetic insights by allowing 23andMe users to obtain their PRS313 score. We demonstrate that simple linear regression can significantly improve the accuracy of PRS313 scores when calculated using SNPs imputed from consumer gene panels, such as 23andMe. Our linear regression model achieved an R^2 of 0.86, compared to 0.33 without imputation and 0.28 with simple imputation (substituting missing SNPs with the minor allele frequency). These findings suggest that popular SNP analysis libraries could benefit from integrating linear regression models for genotype imputation, providing a viable and lightweight alternative to reference-based imputation.
Citations: 0
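The FastImpute abstract above reports that a simple linear regression can impute untyped SNPs well enough to improve a polygenic risk score. The sketch below only illustrates the general mechanics under strong assumptions: a single typed SNP predicts a single untyped SNP, genotypes are coded as 0/1/2 allele dosages, and the score is a weighted sum of dosages. The abstract does not state the model's inputs, so the function names, the toy training data, and the two effect weights are hypothetical and are not taken from PRS313.

```python
def fit_line(x, y):
    """Ordinary least squares for y ~ slope * x + intercept (one predictor)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def polygenic_score(dosages, weights):
    """Weighted sum of allele dosages (one weight per SNP)."""
    return sum(d * w for d, w in zip(dosages, weights))


# Hypothetical training individuals with both SNPs observed (dosages 0/1/2).
typed = [0, 1, 2, 2, 0, 1]
untyped = [0, 1, 2, 1, 0, 1]
slope, intercept = fit_line(typed, untyped)

# A new individual whose panel reports only the typed SNP.
user_typed = 2
imputed = slope * user_typed + intercept  # fractional dosage estimate

# Hypothetical effect weights for the two SNPs.
score = polygenic_score([user_typed, imputed], [0.10, 0.25])
```

The imputed value is deliberately left as a fractional dosage rather than rounded to 0, 1, or 2, since a weighted-sum score can consume expected dosages directly.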
High-Performance Sorting-Based k-mer Counting in Distributed Memory with Flexible Hybrid Parallelism
arXiv - QuanBio - Genomics Pub Date : 2024-07-10 DOI: arxiv-2407.07718
Yifan Li, Giulia Guidi
Abstract: In generating large quantities of DNA data, high-throughput sequencing technologies require advanced bioinformatics infrastructures for efficient data analysis. k-mer counting, the process of quantifying the frequency of DNA subsequences of fixed length k, is a fundamental step in various bioinformatics pipelines, including genome assembly and protein prediction. Due to the growing volume of data, the scaling of the counting process is critical. In the literature, distributed memory software uses hash tables, which exhibit poor cache friendliness and consume excessive memory. They often also lack support for flexible parallelism, which makes integration into existing bioinformatics pipelines difficult. In this work, we propose HySortK, a highly efficient sorting-based distributed memory k-mer counter. HySortK reduces the communication volume through a carefully designed communication scheme and domain-specific optimization strategies. Furthermore, we introduce an abstract task layer for flexible hybrid parallelism to address load imbalances in different scenarios. HySortK achieves a 2-10x speedup compared to the GPU baseline on 4 and 8 nodes. Compared to state-of-the-art CPU software, HySortK achieves up to 2x speedup while reducing peak memory usage by 30% on 16 nodes. Finally, we integrated HySortK into an existing genome assembly pipeline and achieved up to 1.8x speedup, proving its flexibility and practicality in real-world scenarios.
Citations: 0
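The HySortK abstract above contrasts sorting-based k-mer counting with hash-table counting. As a rough single-process illustration of the sorting idea only (no distributed memory, no communication scheme, and none of HySortK's optimizations), the sketch below extracts every k-mer, sorts them, and counts runs of equal neighbors. The function name and the toy reads are assumptions; reverse-complement canonicalization, which real counters often apply, is left out.

```python
from itertools import groupby


def count_kmers_by_sorting(reads, k):
    """Count k-mers by sorting them and measuring runs of equal values.

    reads : iterable of DNA strings
    k     : k-mer length; reads shorter than k contribute nothing
    Returns a list of (kmer, count) pairs in lexicographic order.
    """
    # 1. Extract every length-k window from every read.
    kmers = [read[i:i + k]
             for read in reads
             for i in range(len(read) - k + 1)]
    # 2. Sort so identical k-mers become adjacent in memory.
    kmers.sort()
    # 3. One linear pass: each run of equal k-mers yields one count.
    return [(kmer, sum(1 for _ in run)) for kmer, run in groupby(kmers)]


counts = count_kmers_by_sorting(["ACGTAC", "GTAC"], k=3)
```

After sorting, counting is a sequential scan over contiguous data, which is the access pattern the abstract's cache-friendliness argument relies on; a hash table instead touches memory in an effectively random order.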