arXiv - QuanBio - Genomics: Latest Papers, Page 6

QuST-LLM: Integrating Large Language Models for Comprehensive Spatial Transcriptomics Analysis

arXiv - QuanBio - Genomics Pub Date : 2024-06-20 DOI: arxiv-2406.14307

Chao Hui Huang

Citations: 0

A mapping-free NLP-based technique for sequence search in Nanopore long-reads

arXiv - QuanBio - Genomics Pub Date : 2024-06-20 DOI: arxiv-2406.14187

Tomasz Strzoda, Lourdes Cruz-Garcia, Mustafa Najim, Christophe Badie, Joanna Polanska

Abstract: In unforeseen situations, such as nuclear power plant or civilian radiation accidents, there is a need for effective and computationally inexpensive methods to determine the expression level of a selected gene panel, allowing for rough dose estimates in thousands of donors. A new-generation in-situ mapper, fast and of low energy consumption, working at the level of single nanopore output, is in demand. We aim to create a sequence identification tool that utilizes Natural Language Processing (NLP) techniques and ensures a high level of negative predictive value (NPV) compared to the classical approach. The training dataset consisted of RNASeq data from 6 samples. Of the multiple NLP models tested, the best configuration analyses the entire sequence and uses a word length of 3 base pairs with one-word neighbor on each side. For the considered FDXR gene, the achieved mean balanced accuracy (BACC) was 98.29% and NPV 99.25%, compared to minimap2's performance in a cross-validation scenario. Reducing the dictionary from 1024 to 145 changed BACC to 96.49% and the NPV to 98.15%. The obtained NLP model, validated on an external independent genome sequencing dataset, gave NPV of 99.64% for the complete and 95.87% for the reduced dictionary. The salmon-estimated read counts differed from the classical approach on average by 3.48% for the complete dictionary and by 5.82% for the reduced one. We conclude that for long Oxford Nanopore reads, an NLP-based approach can successfully replace classical mapping in case of emergency. The developed NLP model can be easily retrained to identify selected transcripts and/or work with various long-read sequencing techniques. The results of the study clearly demonstrate the potential of applying techniques known from classical text processing to nucleotide sequences and represent a significant advancement in this field of science.

Citations: 0
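As a rough illustration of the Nanopore sequence-search entry above, the sketch below splits a read into 3-base words, pairs each word with one neighboring word on each side, and computes the two reported metrics (BACC and NPV) from confusion-matrix counts. The function names, the non-overlapping tokenization, and any example counts are assumptions made for illustration; the abstract does not specify them, and this is not the authors' implementation.

```python
from typing import List, Tuple


def tokenize(sequence: str, word_len: int = 3) -> List[str]:
    """Split a nucleotide sequence into consecutive words of `word_len` bases.

    Assumption: words do not overlap and a trailing partial word is dropped.
    """
    seq = sequence.upper()
    return [seq[i:i + word_len] for i in range(0, len(seq) - word_len + 1, word_len)]


def context_windows(words: List[str], neighbors: int = 1) -> List[Tuple[str, List[str]]]:
    """Pair each word with up to `neighbors` words on each side of it."""
    pairs = []
    for i, word in enumerate(words):
        left = words[max(0, i - neighbors):i]
        right = words[i + 1:i + 1 + neighbors]
        pairs.append((word, left + right))
    return pairs


def balanced_accuracy(tp: int, fp: int, tn: int, fn: int) -> float:
    """BACC = mean of sensitivity (TP rate) and specificity (TN rate)."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return (sensitivity + specificity) / 2


def negative_predictive_value(tn: int, fn: int) -> float:
    """NPV = share of negative calls that are truly negative."""
    return tn / (tn + fn)
```

A high NPV matters in this setting because a read rejected by the model is discarded without being mapped, so false negatives translate directly into lost counts for the target gene.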

RNA-FrameFlow: Flow Matching for de novo 3D RNA Backbone Design

arXiv - QuanBio - Genomics Pub Date : 2024-06-19 DOI: arxiv-2406.13839

Rishabh Anand, Chaitanya K. Joshi, Alex Morehead, Arian R. Jamasb, Charles Harris, Simon V. Mathis, Kieran Didi, Bryan Hooi, Pietro Liò

Abstract: We introduce RNA-FrameFlow, the first generative model for 3D RNA backbone design. We build upon SE(3) flow matching for protein backbone generation and establish protocols for data preparation and evaluation to address unique challenges posed by RNA modeling. We formulate RNA structures as a set of rigid-body frames and associated loss functions which account for larger, more conformationally flexible RNA backbones (13 atoms per nucleotide) vs. proteins (4 atoms per residue). Toward tackling the lack of diversity in 3D RNA datasets, we explore training with structural clustering and cropping augmentations. Additionally, we define a suite of evaluation metrics to measure whether the generated RNA structures are globally self-consistent (via inverse folding followed by forward folding) and locally recover RNA-specific structural descriptors. The most performant version of RNA-FrameFlow generates locally realistic RNA backbones of 40-150 nucleotides, over 40% of which pass our validity criteria as measured by a self-consistency TM-score >= 0.45, at which two RNAs have the same global fold. Open-source code: https://github.com/rish-16/rna-backbone-design

Citations: 0
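The self-consistency check described in the RNA-FrameFlow entry above (inverse folding, then forward folding, then a TM-score comparison against the generated backbone) can be sketched as follows. The three callables are hypothetical placeholders passed in by the caller, not functions from the authors' repository; the number of sequences sampled per backbone and the use of the best score are also assumptions. Only the 0.45 threshold comes from the abstract.

```python
from typing import Any, Callable, List, Sequence


def self_consistency_score(
    backbone: Any,
    inverse_fold: Callable[[Any, int], List[str]],  # hypothetical: backbone -> sequences
    forward_fold: Callable[[str], Any],              # hypothetical: sequence -> structure
    tm_score: Callable[[Any, Any], float],           # hypothetical: structure similarity
    n_sequences: int = 8,                            # assumed sample count
) -> float:
    """Return the best TM-score between a backbone and its re-folded sequences."""
    sequences = inverse_fold(backbone, n_sequences)
    return max(tm_score(backbone, forward_fold(seq)) for seq in sequences)


def validity_rate(scores: Sequence[float], threshold: float = 0.45) -> float:
    """Fraction of generated backbones whose score meets the threshold."""
    if not scores:
        return 0.0
    return sum(score >= threshold for score in scores) / len(scores)
```

Passing the folding and scoring tools in as arguments keeps the sketch independent of any particular inverse-folding or structure-prediction model.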

PathoLM: Identifying pathogenicity from the DNA sequence through the Genome Foundation Model

arXiv - QuanBio - Genomics Pub Date : 2024-06-19 DOI: arxiv-2406.13133

Sajib Acharjee Dip, Uddip Acharjee Shuvo, Tran Chau, Haoqiu Song, Petra Choi, Xuan Wang, Liqing Zhang

Abstract: Pathogen identification is pivotal in diagnosing, treating, and preventing diseases, crucial for controlling infections and safeguarding public health. Traditional alignment-based methods, though widely used, are computationally intense and reliant on extensive reference databases, often failing to detect novel pathogens due to their low sensitivity and specificity. Similarly, conventional machine learning techniques, while promising, require large annotated datasets and extensive feature engineering and are prone to overfitting. Addressing these challenges, we introduce PathoLM, a cutting-edge pathogen language model optimized for the identification of pathogenicity in bacterial and viral sequences. Leveraging the strengths of pre-trained DNA models such as the Nucleotide Transformer, PathoLM requires minimal data for fine-tuning, thereby enhancing pathogen detection capabilities. It effectively captures a broader genomic context, significantly improving the identification of novel and divergent pathogens. We developed a comprehensive data set comprising approximately 30 species of viruses and bacteria, including ESKAPEE pathogens, seven notably virulent bacterial strains resistant to antibiotics. Additionally, we curated a species classification dataset centered specifically on the ESKAPEE group. In comparative assessments, PathoLM dramatically outperforms existing models like DciPatho, demonstrating robust zero-shot and few-shot capabilities. Furthermore, we expanded PathoLM-Sp for ESKAPEE species classification, where it showed superior performance compared to other advanced deep learning methods, despite the complexities of the task.

Citations: 0

skandiver: a divergence-based analysis tool for identifying intercellular mobile genetic elements

arXiv - QuanBio - Genomics Pub Date : 2024-06-17 DOI: arxiv-2406.12064

Xiaolei Brian Zhang, Grace Oualline, Jim Shaw, Yun William Yu

Abstract: Mobile genetic elements (MGEs) are as ubiquitous in nature as they are varied in type, ranging from viral insertions to transposons to incorporated plasmids. Horizontal transfer of MGEs across bacterial species may also pose a significant threat to global health due to their capability to harbour antibiotic resistance genes. However, despite cheap and rapid whole genome sequencing, the varied nature of MGEs makes it difficult to fully characterize them, and existing methods for detecting MGEs often don't agree on what should count. In this manuscript, we first define and argue in favor of a divergence-based characterization of mobile genetic elements. Using that paradigm, we present skandiver, a tool designed to efficiently detect MGEs from whole genome assemblies without the need for gene annotation or markers. skandiver determines mobile elements via genome fragmentation, average nucleotide identity (ANI), and divergence time. By building on the scalable skani software for ANI computation, skandiver can query hundreds of complete assemblies against >65,000 representative genomes in a few minutes and 19 GB memory, providing a scalable and efficient method for elucidating mobile element profiles in incomplete, uncharacterized genomic sequences. For isolated and integrated large plasmids (>10kbp), skandiver's recall was 48% and 47%, MobileElementFinder was 59% and 17%, and geNomad was 86% and 32%, respectively. For isolated large plasmids, skandiver's recall (48%) is lower than state-of-the-art reference-based methods geNomad (86%) and MobileElementFinder (59%). However, skandiver achieves higher recall on integrated plasmids and, unlike other methods, without comparing against a curated database, making skandiver suitable for discovery of novel MGEs. Availability: https://github.com/YoukaiFromAccounting/skandiver

Citations: 0

pVACview: an interactive visualization tool for efficient neoantigen prioritization and selection

arXiv - QuanBio - Genomics Pub Date : 2024-06-11 DOI: arxiv-2406.06985

Huiming Xia, My Hoang, Evelyn Schmidt, Susanna Kiwala, Joshua McMichael, Zachary L. Skidmore, Bryan Fisk, Jonathan J. Song, Jasreet Hundal, Thomas Mooney, Jason R. Walker, S. Peter Goedegebuure, Christopher A. Miller, William E. Gillanders, Obi L. Griffith, Malachi Griffith

Abstract: Neoantigen targeting therapies including personalized vaccines have shown promise in the treatment of cancers. Accurate identification/prioritization of neoantigens is highly relevant to designing clinical trials, predicting treatment response, and understanding mechanisms of resistance. With the advent of massively parallel sequencing technologies, it is now possible to predict neoantigens based on patient-specific variant information. However, numerous factors must be considered when prioritizing neoantigens for use in personalized therapies. Complexities such as alternative transcript annotations, various binding, presentation and immunogenicity prediction algorithms, and variable peptide lengths/registers all potentially impact the neoantigen selection process. While computational tools generate numerous algorithmic predictions for neoantigen characterization, results from these pipelines are difficult to navigate and require extensive knowledge of the underlying tools for accurate interpretation. Due to the intricate nature and number of salient neoantigen features, presenting all relevant information to facilitate candidate selection for downstream applications is a difficult challenge that current tools fail to address. We have created pVACview, the first interactive tool designed to aid in the prioritization and selection of neoantigen candidates for personalized neoantigen therapies. pVACview has a user-friendly and intuitive interface where users can upload, explore, select and export their neoantigen candidates. The tool allows users to visualize candidates using variant, transcript and peptide information. pVACview will allow researchers to analyze and prioritize neoantigen candidates with greater efficiency and accuracy in basic and translational settings. The application is available as part of the pVACtools pipeline at pvactools.org and as an online server at pvacview.org.

Citations: 0

Enhanced Gene Selection in Single-Cell Genomics: Pre-Filtering Synergy and Reinforced Optimization

arXiv - QuanBio - Genomics Pub Date : 2024-06-11 DOI: arxiv-2406.07418

Weiliang Zhang, Zhen Meng, Dongjie Wang, Min Wu, Kunpeng Liu, Yuanchun Zhou, Meng Xiao

Abstract: Recent advancements in single-cell genomics necessitate precision in gene panel selection to interpret complex biological data effectively. Those methods aim to streamline the analysis of scRNA-seq data by focusing on the most informative genes that contribute significantly to the specific analysis task. Traditional selection methods, which often rely on expert domain knowledge, embedded machine learning models, or heuristic-based iterative optimization, are prone to biases and inefficiencies that may obscure critical genomic signals. Recognizing the limitations of traditional methods, we aim to transcend these constraints with a refined strategy. In this study, we introduce an iterative gene panel selection strategy that is applicable to clustering tasks in single-cell genomics. Our method uniquely integrates results from other gene selection algorithms, providing valuable preliminary boundaries or prior knowledge as initial guides in the search space to enhance the efficiency of our framework. Furthermore, we incorporate the stochastic nature of the exploration process in reinforcement learning (RL) and its capability for continuous optimization through reward-based feedback. This combination mitigates the biases inherent in the initial boundaries and harnesses RL's adaptability to refine and target gene panel selection dynamically. To illustrate the effectiveness of our method, we conducted detailed comparative experiments, case studies, and visualization analysis.

Citations: 0

Data mining method of single-cell omics data to evaluate a pure tissue environmental effect on gene expression level

arXiv - QuanBio - Genomics Pub Date : 2024-06-11 DOI: arxiv-2406.06969

Daigo Okada, Jianshen Zhu, Kan Shota, Yuuki Nishimura, Kazuya Haraguchi

Abstract: While single-cell RNA-seq enables the investigation of the celltype effect on the transcriptome, the pure tissue environmental effect has not been well investigated. The bias in the combination of tissue and celltype in the body has made it difficult to evaluate the effect of pure tissue environment by omics data mining. It is important to prevent statistical confounding among discrete variables such as celltype, tissue, and other categorical variables when evaluating the effects of these variables. We propose a novel method to enumerate suitable analysis units of variables for estimating the effects of tissue environment by extending the maximal biclique enumeration problem for bipartite graphs to k-partite hypergraphs. We applied the proposed method to a large mouse single-cell transcriptome dataset of Tabula Muris Senis to evaluate pure tissue environmental effects on gene expression. Data mining using the proposed method revealed pure tissue environment effects on gene expression and its age-related change among adipose sub-tissues. The method proposed in this study helps evaluate the effects of discrete variables in exploratory data mining of large-scale genomics datasets.

Citations: 0
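To make the biclique idea in the entry above concrete: in a bipartite graph linking tissues to the celltypes observed in them, a maximal biclique is a set of tissues and a set of celltypes in which every combination is present and to which no further tissue or celltype can be added. Such a block is free of the tissue/celltype imbalance the abstract describes. The brute-force sketch below covers only the ordinary bipartite case, not the paper's k-partite hypergraph extension; the function name and the toy data are illustrative assumptions, and the approach is exponential, so it suits only tiny inputs.

```python
from itertools import combinations
from typing import Dict, FrozenSet, Set, Tuple

Biclique = Tuple[FrozenSet[str], FrozenSet[str]]


def maximal_bicliques(adj: Dict[str, Set[str]]) -> Set[Biclique]:
    """Enumerate maximal bicliques of a bipartite graph by brute force.

    `adj` maps each left vertex (e.g. a tissue) to the set of right
    vertices (e.g. celltypes) it is connected to.
    """
    left = sorted(adj)
    found: Set[Biclique] = set()
    for size in range(1, len(left) + 1):
        for subset in combinations(left, size):
            # Right vertices shared by every left vertex in the subset.
            common = set.intersection(*(adj[u] for u in subset))
            if not common:
                continue
            # Close the left side: add every vertex that also sees `common`.
            closed_left = frozenset(u for u in left if common <= adj[u])
            found.add((closed_left, frozenset(common)))
    return found


# Toy data (hypothetical): which celltypes were observed in which tissue.
observed = {
    "fat": {"T cell", "B cell"},
    "lung": {"T cell", "B cell", "macrophage"},
    "liver": {"macrophage"},
}
bicliques = maximal_bicliques(observed)
```

In the toy data, the pair ({fat, lung}, {T cell, B cell}) is one such block: both celltypes occur in both tissues, so a tissue comparison within it is not confounded by celltype composition.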

STimage-1K4M: A histopathology image-gene expression dataset for spatial transcriptomics

arXiv - QuanBio - Genomics Pub Date : 2024-06-10 DOI: arxiv-2406.06393

Jiawen Chen, Muqing Zhou, Wenrong Wu, Jinwei Zhang, Yun Li, Didong Li

Abstract: Recent advances in multi-modal algorithms have driven and been driven by the increasing availability of large image-text datasets, leading to significant strides in various fields, including computational pathology. However, in most existing medical image-text datasets, the text typically provides high-level summaries that may not sufficiently describe sub-tile regions within a large pathology image. For example, an image might cover an extensive tissue area containing cancerous and healthy regions, but the accompanying text might only specify that this image is a cancer slide, lacking the nuanced details needed for in-depth analysis. In this study, we introduce STimage-1K4M, a novel dataset designed to bridge this gap by providing genomic features for sub-tile images. STimage-1K4M contains 1,149 images derived from spatial transcriptomics data, which captures gene expression information at the level of individual spatial spots within a pathology image. Specifically, each image in the dataset is broken down into smaller sub-image tiles, with each tile paired with 15,000-30,000 dimensional gene expressions. With 4,293,195 pairs of sub-tile images and gene expressions, STimage-1K4M offers unprecedented granularity, paving the way for a wide range of advanced research in multi-modal data analysis and innovative applications in computational pathology, and beyond.

Citations: 0

GenBench: A Benchmarking Suite for Systematic Evaluation of Genomic Foundation Models

arXiv - QuanBio - Genomics Pub Date : 2024-06-01 DOI: arxiv-2406.01627

Zicheng Liu, Jiahui Li, Siyuan Li, Zelin Zang, Cheng Tan, Yufei Huang, Yajing Bai, Stan Z. Li

Abstract: The Genomic Foundation Model (GFM) paradigm is expected to facilitate the extraction of generalizable representations from massive genomic data, thereby enabling their application across a spectrum of downstream applications. Despite advancements, the lack of an evaluation framework makes it difficult to ensure equitable assessment due to experimental settings, model intricacy, benchmark datasets, and reproducibility challenges. In the absence of standardization, comparative analyses risk becoming biased and unreliable. To surmount this impasse, we introduce GenBench, a comprehensive benchmarking suite specifically tailored for evaluating the efficacy of Genomic Foundation Models. GenBench offers a modular and expandable framework that encapsulates a variety of state-of-the-art methodologies. We systematically evaluate datasets spanning diverse biological domains, with particular emphasis on both short-range and long-range genomic tasks, beginning with the three most important DNA tasks, which cover Coding Region, Non-Coding Region, Genome Structure, etc. Moreover, we provide a nuanced analysis of the interplay between model architecture and dataset characteristics on task-specific performance. Our findings reveal an interesting observation: independent of the number of parameters, the discernible difference in preference between the attention-based and convolution-based models on short- and long-range tasks may provide insights into the future design of GFM.

Citations: 0