arXiv - QuanBio - Genomics: Latest Papers

Pseudo-perplexity in One Fell Swoop for Protein Fitness Estimation
arXiv - QuanBio - Genomics Pub Date : 2024-07-09 DOI: arxiv-2407.07265
Pranav Kantroo, Günter P. Wagner, Benjamin B. Machta
Abstract: Protein language models trained on the masked language modeling objective learn to predict the identity of hidden amino acid residues within a sequence using the remaining observable sequence as context. They do so by embedding the residues into a high-dimensional space that encapsulates the relevant contextual cues. These embedding vectors serve as an informative context-sensitive representation that not only aids with the defined training objective, but can also be used for other tasks by downstream models. We propose a scheme to use the embeddings of an unmasked sequence to estimate the corresponding masked probability vectors for all the positions in a single forward pass through the language model. This One Fell Swoop (OFS) approach allows us to efficiently estimate the pseudo-perplexity of the sequence, a measure of the model's uncertainty in its predictions, that can also serve as a fitness estimate. We find that ESM2 OFS pseudo-perplexity performs nearly as well as the true pseudo-perplexity at fitness estimation, and more notably it defines a new state of the art on the ProteinGym Indels benchmark. The strong performance of the fitness measure prompted us to investigate if it could be used to detect the elevated stability reported in reconstructed ancestral sequences. We find that this measure ranks ancestral reconstructions as more fit than extant sequences. Finally, we show that the computational efficiency of the technique allows for the use of Monte Carlo methods that can rapidly explore functional sequence space.
Citations: 0
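For readers unfamiliar with the metric named in the entry above: pseudo-perplexity is commonly defined as the exponential of the mean negative log-probability that a masked language model assigns to each true residue when that position is hidden. The sketch below computes it from a table of per-position probability vectors. It does not reproduce the paper's OFS estimator (how those vectors are obtained in one forward pass is not described in the abstract); `pseudo_perplexity`, `masked_probs`, and the toy four-letter alphabet are illustrative assumptions, not names from the paper.

```python
import math


def pseudo_perplexity(masked_probs, sequence, alphabet):
    """Return exp(-(1/L) * sum_i log p_i[x_i]) for a sequence of length L.

    masked_probs[i] is assumed to be the model's probability vector for
    position i (one entry per symbol in `alphabet`) with position i hidden.
    """
    if len(masked_probs) != len(sequence):
        raise ValueError("need one probability vector per position")
    log_likelihood = 0.0
    for position, residue in enumerate(sequence):
        p_true = masked_probs[position][alphabet.index(residue)]
        # math.log raises ValueError if the true residue has probability 0.
        log_likelihood += math.log(p_true)
    return math.exp(-log_likelihood / len(sequence))


# Toy input: a model that is maximally uncertain over a 4-letter alphabet.
uniform = [[0.25, 0.25, 0.25, 0.25] for _ in range(3)]
print(pseudo_perplexity(uniform, "ACA", "ACGT"))  # approximately 4.0
```

Lower values mean the model is more confident in the observed residues, which is why the abstract can treat the quantity as a fitness proxy.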
Metagenomic analysis reveals shared and distinguishing features in horse and donkey gut microbiome and maternal resemblance of the microbiota in hybrid equids
arXiv - QuanBio - Genomics Pub Date : 2024-07-06 DOI: arxiv-2407.05076
Yihang Zhou
Abstract: Mammalian gut microbiomes are essential for host functions like digestion, immunity, and nutrient utilization. This study examines the gut microbiome of horses, donkeys, and their hybrids, mules and hinnies, to explore the role of microbiomes in hybrid vigor. We performed whole-genome sequencing on rectal microbiota from 18 equids, generating detailed microbiome assemblies. Our analysis revealed significant differences between horse and donkey microbiomes, with hybrids showing a pronounced maternal resemblance. Notably, Firmicutes were more abundant in the horse-maternal group, while Fibrobacteres were richer in the donkey-maternal group, indicating distinct digestive processes. Functional annotations indicated metabolic differences, such as protein synthesis in horses and energy metabolism in donkeys. Machine learning predictions of probiotic species highlighted potential health benefits for each maternal group. This study provides a high-resolution view of the equid gut microbiome, revealing significant taxonomic and metabolic differences influenced by maternal lineage, and offers insights into microbial contributions to hybrid vigor.
Citations: 0
Metagenomic analysis revealed significant changes in cattle rectum microbiome and antimicrobial resistome under fescue toxicosis
arXiv - QuanBio - Genomics Pub Date : 2024-07-06 DOI: arxiv-2407.05055
Yihang Zhou
Abstract: Fescue toxicity causes reduced growth and reproductive issues in cattle grazing endophyte-infected tall fescue. To characterize the gut microbiota and its response to fescue toxicosis, we collected fecal samples before and after a 30-day toxic fescue seed supplementation from eight Angus Simmental pregnant cows and heifers. We sequenced the 16 metagenomes using the whole-genome shotgun approach and generated 157 Gbp of metagenomic sequences. Through de novo assembly and annotation, we obtained a 13.1 Gbp reference contig assembly and identified 22 million microbial genes for cattle rectum microbiota. We discovered a significant reduction of microbial diversity after toxic seed treatment (P<0.01), suggesting dysbiosis of the microbiome. Six bacterial families and 31 species were significantly increased in the fecal microbiota (P-adj<0.05), including members of the top abundant rumen core taxa. This global elevation of rumen microbes in the rectum microbiota suggests a potential impairment of rumen microbiota under fescue toxicosis. Among these, Ruminococcaceae bacterium P7, an important species accounting for ~2% of rumen microbiota, was the most impacted with a 16-fold increase from 0.17% to 2.8% in feces (P<0.01). We hypothesized that rumen Ruminococcaceae bacterium P7 re-adapted to the large intestine environment under toxic fescue stress, causing this dramatic increase in abundance. Functional enrichment analysis revealed that the overrepresented pathways shifted from energy metabolism to antimicrobial resistance and DNA replication. In conclusion, we discovered dramatic microbiota alterations in composition, abundance, and functional capacities under fescue toxicosis, and our results suggest Ruminococcaceae bacterium P7 as a potential biomarker for fescue toxicosis management.
Citations: 0
Dy-mer: An Explainable DNA Sequence Representation Scheme using Sparse Recovery
arXiv - QuanBio - Genomics Pub Date : 2024-07-06 DOI: arxiv-2407.12051
Zhiyuan Peng, Yuanbo Tang, Yang Li
Abstract: DNA sequences encode vital genetic and biological information, yet these unfixed-length sequences cannot serve as the input of common data mining algorithms. Hence, various representation schemes have been developed to transform DNA sequences into fixed-length numerical representations. However, these schemes face difficulties in learning high-quality representations due to the complexity and sparsity of DNA data. Additionally, DNA sequences are inherently noisy because of mutations. While several schemes have been proposed for their effectiveness, they often lack semantic structure, making it difficult for biologists to validate and leverage the results. To address these challenges, we propose Dy-mer, an explainable and robust DNA representation scheme based on sparse recovery. Leveraging the underlying semantic structure of DNA, we modify the traditional sparse recovery to capture recurring patterns indicative of biological functions by representing frequent K-mers as basis vectors and reconstructing each DNA sequence through simple concatenation. Experimental results demonstrate that Dy-mer achieves state-of-the-art performance in DNA promoter classification, yielding a remarkable 13% increase in accuracy. Moreover, its inherent explainability facilitates DNA clustering and motif detection, enhancing its utility in biological research.
Citations: 0
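The entry above mentions using frequent K-mers as basis vectors. As a rough illustration of that first step only, the sketch below counts overlapping K-mers across a set of sequences, keeps the most frequent ones as a dictionary, and then describes a sequence by how often each dictionary K-mer occurs in it. This is a simple counting stand-in, not the paper's sparse-recovery or concatenation procedure; `frequent_kmers` and `count_vector` are hypothetical helper names.

```python
from collections import Counter


def frequent_kmers(sequences, k, top_n):
    """Return the top_n most frequent overlapping k-mers across all sequences."""
    counts = Counter()
    for seq in sequences:
        for start in range(len(seq) - k + 1):
            counts[seq[start:start + k]] += 1
    return [kmer for kmer, _ in counts.most_common(top_n)]


def count_vector(sequence, basis):
    """Fixed-length vector: occurrences of each basis k-mer in the sequence."""
    index = {kmer: i for i, kmer in enumerate(basis)}
    vector = [0] * len(basis)
    for k in {len(kmer) for kmer in basis}:
        for start in range(len(sequence) - k + 1):
            window = sequence[start:start + k]
            if window in index:
                vector[index[window]] += 1
    return vector


basis = frequent_kmers(["ACGACG", "ACGT"], k=3, top_n=1)
print(basis, count_vector("ACGACG", basis))
```

Because every output vector has one entry per dictionary K-mer, sequences of different lengths map to vectors of the same length, which is the property the abstract says downstream algorithms need.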
Semantically Rich Local Dataset Generation for Explainable AI in Genomics
arXiv - QuanBio - Genomics Pub Date : 2024-07-03 DOI: arxiv-2407.02984
Pedro Barbosa, Rosina Savisaar, Alcides Fonseca
Abstract: Black box deep learning models trained on genomic sequences excel at predicting the outcomes of different gene regulatory mechanisms. Therefore, interpreting these models may provide novel insights into the underlying biology, supporting downstream biomedical applications. Due to their complexity, interpretable surrogate models can only be built for local explanations (e.g., a single instance). However, accomplishing this requires generating a dataset in the neighborhood of the input, which must maintain syntactic similarity to the original data while introducing semantic variability in the model's predictions. This task is challenging due to the complex sequence-to-function relationship of DNA. We propose using Genetic Programming to generate datasets by evolving perturbations in sequences that contribute to their semantic diversity. Our custom, domain-guided individual representation effectively constrains syntactic similarity, and we provide two alternative fitness functions that promote diversity with no computational effort. Applied to the RNA splicing domain, our approach quickly achieves good diversity and significantly outperforms a random baseline in exploring the search space, as shown by our proof-of-concept, short RNA sequence. Furthermore, we assess its generalizability and demonstrate scalability to larger sequences, resulting in an approximately 30% improvement over the baseline.
Citations: 0
SeqMate: A Novel Large Language Model Pipeline for Automating RNA Sequencing
arXiv - QuanBio - Genomics Pub Date : 2024-07-02 DOI: arxiv-2407.03381
Devam Mondal, Atharva Inamdar
Abstract: RNA sequencing techniques, like bulk RNA-seq and Single Cell (sc) RNA-seq, are critical tools for the biologist looking to analyze the genetic activity/transcriptome of a tissue or cell during an experimental procedure. Platforms like Illumina's next-generation sequencing (NGS) are used to produce the raw data for this experimental procedure. This raw FASTQ data must then be prepared via a complex series of data manipulations by bioinformaticians. This process currently takes place on an unwieldy textual user interface like a terminal/command line that requires the user to install and import multiple program packages, preventing the untrained biologist from initiating data analysis. Open-source platforms like Galaxy have produced a more user-friendly pipeline, yet the visual interface remains cluttered and highly technical, remaining uninviting for the natural scientist. To address this, SeqMate is a user-friendly tool that allows for one-click analytics by utilizing the power of a large language model (LLM) to automate both data preparation and analysis (differential expression, trajectory analysis, etc.). Furthermore, by utilizing the power of generative AI, SeqMate is also capable of analyzing such findings and producing written reports of upregulated/downregulated/user-prompted genes with sources cited from known repositories like PubMed, PDB, and UniProt.
Citations: 0
CGRclust: Chaos Game Representation for Twin Contrastive Clustering of Unlabelled DNA Sequences
arXiv - QuanBio - Genomics Pub Date : 2024-07-01 DOI: arxiv-2407.02538
Fatemeh Alipour, Kathleen A. Hill, Lila Kari
Abstract: This study proposes CGRclust, a novel combination of unsupervised twin contrastive clustering of Chaos Game Representations (CGR) of DNA sequences, with convolutional neural networks (CNNs). To the best of our knowledge, CGRclust is the first method to use unsupervised learning for image classification (herein applied to two-dimensional CGR images) for clustering datasets of DNA sequences. CGRclust overcomes the limitations of traditional sequence classification methods by leveraging unsupervised twin contrastive learning to detect distinctive sequence patterns, without requiring DNA sequence alignment or biological/taxonomic labels. CGRclust accurately clustered twenty-five diverse datasets, with sequence lengths ranging from 664 bp to 100 kbp, including mitochondrial genomes of fish, fungi, and protists, as well as viral whole genome assemblies and synthetic DNA sequences. Compared with three recent clustering methods for DNA sequences (DeLUCS, iDeLUCS, and MeShClust v3.0), CGRclust is the only method that surpasses 81.70% accuracy across all four taxonomic levels tested for mitochondrial DNA genomes of fish. Moreover, CGRclust also consistently demonstrates superior performance across all the viral genomic datasets. The high clustering accuracy of CGRclust on these twenty-five datasets, which vary significantly in terms of sequence length, number of genomes, number of clusters, and level of taxonomy, demonstrates its robustness, scalability, and versatility.
Citations: 0
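The entry above relies on two-dimensional Chaos Game Representation images. A minimal sketch of how such an image can be built is given below: start at the centre of the unit square and, for each nucleotide, move halfway toward that nucleotide's corner, then bin the visited points into a grid. The corner layout used here (A bottom-left, C top-left, G top-right, T bottom-right) is one common convention and is an assumption, as are the names `cgr_points` and `cgr_image`; the abstract does not specify the paper's exact construction or image resolution.

```python
# Assumed corner layout; other layouts are also in use.
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}


def cgr_points(sequence):
    """Return the chaos-game trajectory of a DNA sequence in the unit square."""
    x, y = 0.5, 0.5
    points = []
    for base in sequence.upper():
        if base not in CORNERS:
            continue  # skip ambiguous symbols such as N (a choice made here)
        corner_x, corner_y = CORNERS[base]
        x, y = (x + corner_x) / 2, (y + corner_y) / 2
        points.append((x, y))
    return points


def cgr_image(sequence, resolution=8):
    """Bin the trajectory into a resolution x resolution grid of counts."""
    grid = [[0] * resolution for _ in range(resolution)]
    for x, y in cgr_points(sequence):
        col = min(int(x * resolution), resolution - 1)
        row = min(int(y * resolution), resolution - 1)
        grid[row][col] += 1
    return grid


for row in cgr_image("ACGTACGGTTAC", resolution=4):
    print(row)
```

The resulting count grid is a fixed-size image regardless of sequence length, which is what makes it usable as CNN input without sequence alignment.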
MegIS: High-Performance, Energy-Efficient, and Low-Cost Metagenomic Analysis with In-Storage Processing
arXiv - QuanBio - Genomics Pub Date : 2024-06-27 DOI: arxiv-2406.19113
Nika Mansouri Ghiasi, Mohammad Sadrosadati, Harun Mustafa, Arvid Gollwitzer, Can Firtina, Julien Eudine, Haiyu Mao, Joël Lindegger, Meryem Banu Cavlak, Mohammed Alser, Jisung Park, Onur Mutlu
Abstract: Metagenomics has led to significant advances in many fields. Metagenomic analysis commonly involves the key tasks of determining the species present in a sample and their relative abundances. These tasks require searching large metagenomic databases. Metagenomic analysis suffers from significant data movement overhead due to moving large amounts of low-reuse data from the storage system. In-storage processing can be a fundamental solution for reducing this overhead. However, designing an in-storage processing system for metagenomics is challenging because existing approaches to metagenomic analysis cannot be directly implemented in storage effectively due to the hardware limitations of modern SSDs. We propose MegIS, the first in-storage processing system designed to significantly reduce the data movement overhead of the end-to-end metagenomic analysis pipeline. MegIS is enabled by our lightweight design that effectively leverages and orchestrates processing inside and outside the storage system. We address in-storage processing challenges for metagenomics via specialized and efficient 1) task partitioning, 2) data/computation flow coordination, 3) storage technology-aware algorithmic optimizations, 4) data mapping, and 5) lightweight in-storage accelerators. MegIS's design is flexible, capable of supporting different types of metagenomic input datasets, and can be integrated into various metagenomic analysis pipelines. Our evaluation shows that MegIS outperforms the state-of-the-art performance- and accuracy-optimized software metagenomic tools by 2.7× to 37.2× and 6.9× to 100.2×, respectively, while matching the accuracy of the accuracy-optimized tool. MegIS achieves 1.5× to 5.1× speedup compared to the state-of-the-art metagenomic hardware-accelerated (using processing-in-memory) tool, while achieving significantly higher accuracy.
Citations: 0
Online t-SNE for single-cell RNA-seq
arXiv - QuanBio - Genomics Pub Date : 2024-06-21 DOI: arxiv-2406.14842
Hui Ma, Kai Chen
Abstract: Due to the sequential sample arrival, changing experiment conditions, and evolution of knowledge, the demand to continually visualize evolving structures of sequential and diverse single-cell RNA-sequencing (scRNA-seq) data becomes indispensable. However, as one of the state-of-the-art visualization and analysis methods for scRNA-seq, t-distributed stochastic neighbor embedding (t-SNE) merely visualizes static scRNA-seq data offline and fails to meet the demand well. To address these challenges, we introduce online t-SNE to seamlessly integrate sequential scRNA-seq data. Online t-SNE achieves this by leveraging the embedding space of old samples, exploring the embedding space of new samples, and aligning the two embedding spaces on the fly. Consequently, online t-SNE dramatically enables the continual discovery of new structures and high-quality visualization of new scRNA-seq data without retraining from scratch. We showcase the formidable visualization capabilities of online t-SNE across diverse sequential scRNA-seq datasets.
Citations: 0
GenoTEX: A Benchmark for Evaluating LLM-Based Exploration of Gene Expression Data in Alignment with Bioinformaticians
arXiv - QuanBio - Genomics Pub Date : 2024-06-21 DOI: arxiv-2406.15341
Haoyang Liu, Haohan Wang
Abstract: Recent advancements in machine learning have significantly improved the identification of disease-associated genes from gene expression datasets. However, these processes often require extensive expertise and manual effort, limiting their scalability. Large Language Model (LLM)-based agents have shown promise in automating these tasks due to their increasing problem-solving abilities. To support the evaluation and development of such methods, we introduce GenoTEX, a benchmark dataset for the automatic exploration of gene expression data, involving the tasks of dataset selection, preprocessing, and statistical analysis. GenoTEX provides annotated code and results for solving a wide range of gene identification problems, in a full analysis pipeline that follows the standard of computational genomics. These annotations are curated by human bioinformaticians who carefully analyze the datasets to ensure accuracy and reliability. To provide baselines for these tasks, we present GenoAgents, a team of LLM-based agents designed with context-aware planning, iterative correction, and domain expert consultation to collaboratively explore gene datasets. Our experiments with GenoAgents demonstrate the potential of LLM-based approaches in genomics data analysis, while error analysis highlights the challenges and areas for future improvement. We propose GenoTEX as a promising resource for benchmarking and enhancing AI-driven methods for genomics data analysis. We make our benchmark publicly available at https://github.com/Liu-Hy/GenoTex.
Citations: 0