{"title":"Pseudo-perplexity in One Fell Swoop for Protein Fitness Estimation","authors":"Pranav Kantroo, Günter P. Wagner, Benjamin B. Machta","doi":"arxiv-2407.07265","DOIUrl":"https://doi.org/arxiv-2407.07265","url":null,"abstract":"Protein language models trained on the masked language modeling objective learn to predict the identity of hidden amino acid residues within a sequence using the remaining observable sequence as context. They do so by embedding the residues into a high dimensional space that encapsulates the relevant contextual cues. These embedding vectors serve as an informative context-sensitive representation that not only aids with the defined training objective, but can also be used for other tasks by downstream models. We propose a scheme to use the embeddings of an unmasked sequence to estimate the corresponding masked probability vectors for all the positions in a single forward pass through the language model. This One Fell Swoop (OFS) approach allows us to efficiently estimate the pseudo-perplexity of the sequence, a measure of the model's uncertainty in its predictions, that can also serve as a fitness estimate. We find that ESM2 OFS pseudo-perplexity performs nearly as well as the true pseudo-perplexity at fitness estimation, and more notably it defines a new state of the art on the ProteinGym Indels benchmark. The strong performance of the fitness measure prompted us to investigate if it could be used to detect the elevated stability reported in reconstructed ancestral sequences. We find that this measure ranks ancestral reconstructions as more fit than extant sequences. Finally, we show that the computational efficiency of the technique allows for the use of Monte Carlo methods that can rapidly explore functional sequence space.","PeriodicalId":501070,"journal":{"name":"arXiv - QuanBio - Genomics","volume":"36 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-07-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141587005","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Metagenomic analysis reveals shared and distinguishing features in horse and donkey gut microbiome and maternal resemblance of the microbiota in hybrid equids","authors":"Yihang Zhou","doi":"arxiv-2407.05076","DOIUrl":"https://doi.org/arxiv-2407.05076","url":null,"abstract":"Mammalian gut microbiomes are essential for host functions like digestion, immunity, and nutrient utilization. This study examines the gut microbiome of horses, donkeys, and their hybrids, mules and hinnies, to explore the role of microbiomes in hybrid vigor. We performed whole-genome sequencing on rectal microbiota from 18 equids, generating detailed microbiome assemblies. Our analysis revealed significant differences between horse and donkey microbiomes, with hybrids showing a pronounced maternal resemblance. Notably, Firmicutes were more abundant in the horse-maternal group, while Fibrobacteres were richer in the donkey-maternal group, indicating distinct digestive processes. Functional annotations indicated metabolic differences, such as protein synthesis in horses and energy metabolism in donkeys. Machine learning predictions of probiotic species highlighted potential health benefits for each maternal group. This study provides a high-resolution view of the equid gut microbiome, revealing significant taxonomic and metabolic differences influenced by maternal lineage, and offers insights into microbial contributions to hybrid vigor.","PeriodicalId":501070,"journal":{"name":"arXiv - QuanBio - Genomics","volume":"368 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-07-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141574475","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Metagenomic analysis revealed significant changes in cattle rectum microbiome and antimicrobial resistome under fescue toxicosis","authors":"Yihang Zhou","doi":"arxiv-2407.05055","DOIUrl":"https://doi.org/arxiv-2407.05055","url":null,"abstract":"Fescue toxicity causes reduced growth and reproductive issues in cattle grazing endophyte-infected tall fescue. To characterize the gut microbiota and its response to fescue toxicosis, we collected fecal samples before and after a 30-days toxic fescue seeds supplementation from eight Angus Simmental pregnant cows and heifers. We sequenced the 16 metagenomes using the whole-genome shotgun approach and generated 157 Gbp of metagenomic sequences. Through de novo assembly and annotation, we obtained a 13.1 Gbp reference contig assembly and identified 22 million microbial genes for cattle rectum microbiota. We discovered a significant reduction of microbial diversity after toxic seed treatment (P<0.01), suggesting dysbiosis of the microbiome. Six bacterial families and 31 species are significantly increased in the fecal microbiota (P-adj<0.05), including members of the top abundant rumen core taxa. This global elevation of rumen microbes in the rectum microbiota suggests a potential impairment of rumen microbiota under fescue toxicosis. Among these, Ruminococcaceae bacterium P7, an important species accounting for ~2% of rumen microbiota, was the most impacted with a 16-fold increase from 0.17% to 2.8% in feces (P<0.01). We hypothesized that rumen Ruminococcaceae bacterium P7 re-adapted to the large intestine environment under toxic fescue stress, causing this dramatic increase in abundance. Functional enrichment analysis revealed that the overrepresented pathways shifted from energy metabolism to antimicrobial resistance and DNA replication. In conclusion, we discovered dramatic microbiota alterations in composition, abundance, and functional capacities under fescue toxicosis, and our results suggest Ruminococcaceae bacterium P7 as a potential biomarker for fescue toxicosis management.","PeriodicalId":501070,"journal":{"name":"arXiv - QuanBio - Genomics","volume":"16 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-07-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141574476","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Dy-mer: An Explainable DNA Sequence Representation Scheme using Sparse Recovery","authors":"Zhiyuan Peng, Yuanbo Tang, Yang Li","doi":"arxiv-2407.12051","DOIUrl":"https://doi.org/arxiv-2407.12051","url":null,"abstract":"DNA sequences encode vital genetic and biological information, yet these unfixed-length sequences cannot serve as the input of common data mining algorithms. Hence, various representation schemes have been developed to transform DNA sequences into fixed-length numerical representations. However, these schemes face difficulties in learning high-quality representations due to the complexity and sparsity of DNA data. Additionally, DNA sequences are inherently noisy because of mutations. While several schemes have been proposed for their effectiveness, they often lack semantic structure, making it difficult for biologists to validate and leverage the results. To address these challenges, we propose Dy-mer, an explainable and robust DNA representation scheme based on sparse recovery. Leveraging the underlying semantic structure of DNA, we modify the traditional sparse recovery to capture recurring patterns indicative of biological functions by representing frequent K-mers as basis vectors and reconstructing each DNA sequence through simple concatenation. Experimental results demonstrate that Dy-mer achieves state-of-the-art performance in DNA promoter classification, yielding a remarkable 13% increase in accuracy. Moreover, its inherent explainability facilitates DNA clustering and motif detection, enhancing its utility in biological research.","PeriodicalId":501070,"journal":{"name":"arXiv - QuanBio - Genomics","volume":"14 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-07-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141745045","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Semantically Rich Local Dataset Generation for Explainable AI in Genomics","authors":"Pedro Barbosa, Rosina Savisaar, Alcides Fonseca","doi":"arxiv-2407.02984","DOIUrl":"https://doi.org/arxiv-2407.02984","url":null,"abstract":"Black box deep learning models trained on genomic sequences excel at predicting the outcomes of different gene regulatory mechanisms. Therefore, interpreting these models may provide novel insights into the underlying biology, supporting downstream biomedical applications. Due to their complexity, interpretable surrogate models can only be built for local explanations (e.g., a single instance). However, accomplishing this requires generating a dataset in the neighborhood of the input, which must maintain syntactic similarity to the original data while introducing semantic variability in the model's predictions. This task is challenging due to the complex sequence-to-function relationship of DNA. We propose using Genetic Programming to generate datasets by evolving perturbations in sequences that contribute to their semantic diversity. Our custom, domain-guided individual representation effectively constrains syntactic similarity, and we provide two alternative fitness functions that promote diversity with no computational effort. Applied to the RNA splicing domain, our approach quickly achieves good diversity and significantly outperforms a random baseline in exploring the search space, as shown by our proof-of-concept, short RNA sequence. Furthermore, we assess its generalizability and demonstrate scalability to larger sequences, resulting in a ≈30% improvement over the baseline.","PeriodicalId":501070,"journal":{"name":"arXiv - QuanBio - Genomics","volume":"16 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-07-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141546556","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"SeqMate: A Novel Large Language Model Pipeline for Automating RNA Sequencing","authors":"Devam Mondal, Atharva Inamdar","doi":"arxiv-2407.03381","DOIUrl":"https://doi.org/arxiv-2407.03381","url":null,"abstract":"RNA sequencing techniques, like bulk RNA-seq and Single Cell (sc) RNA-seq, are critical tools for the biologist looking to analyze the genetic activity/transcriptome of a tissue or cell during an experimental procedure. Platforms like Illumina's next-generation sequencing (NGS) are used to produce the raw data for this experimental procedure. This raw FASTQ data must then be prepared via a complex series of data manipulations by bioinformaticians. This process currently takes place on an unwieldy textual user interface like a terminal/command line that requires the user to install and import multiple program packages, preventing the untrained biologist from initiating data analysis. Open-source platforms like Galaxy have produced a more user-friendly pipeline, yet the visual interface remains cluttered and highly technical, remaining uninviting for the natural scientist. To address this, SeqMate is a user-friendly tool that allows for one-click analytics by utilizing the power of a large language model (LLM) to automate both data preparation and analysis (differential expression, trajectory analysis, etc). Furthermore, by utilizing the power of generative AI, SeqMate is also capable of analyzing such findings and producing written reports of upregulated/downregulated/user-prompted genes with sources cited from known repositories like PubMed, PDB, and Uniprot.","PeriodicalId":501070,"journal":{"name":"arXiv - QuanBio - Genomics","volume":"47 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-07-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141574480","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"CGRclust: Chaos Game Representation for Twin Contrastive Clustering of Unlabelled DNA Sequences","authors":"Fatemeh Alipour, Kathleen A. Hill, Lila Kari","doi":"arxiv-2407.02538","DOIUrl":"https://doi.org/arxiv-2407.02538","url":null,"abstract":"This study proposes CGRclust, a novel combination of unsupervised twin contrastive clustering of Chaos Game Representations (CGR) of DNA sequences, with convolutional neural networks (CNNs). To the best of our knowledge, CGRclust is the first method to use unsupervised learning for image classification (herein applied to two-dimensional CGR images) for clustering datasets of DNA sequences. CGRclust overcomes the limitations of traditional sequence classification methods by leveraging unsupervised twin contrastive learning to detect distinctive sequence patterns, without requiring DNA sequence alignment or biological/taxonomic labels. CGRclust accurately clustered twenty-five diverse datasets, with sequence lengths ranging from 664 bp to 100 kbp, including mitochondrial genomes of fish, fungi, and protists, as well as viral whole genome assemblies and synthetic DNA sequences. Compared with three recent clustering methods for DNA sequences (DeLUCS, iDeLUCS, and MeShClust v3.0.), CGRclust is the only method that surpasses 81.70% accuracy across all four taxonomic levels tested for mitochondrial DNA genomes of fish. Moreover, CGRclust also consistently demonstrates superior performance across all the viral genomic datasets. The high clustering accuracy of CGRclust on these twenty-five datasets, which vary significantly in terms of sequence length, number of genomes, number of clusters, and level of taxonomy, demonstrates its robustness, scalability, and versatility.","PeriodicalId":501070,"journal":{"name":"arXiv - QuanBio - Genomics","volume":"33 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141546553","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"MegIS: High-Performance, Energy-Efficient, and Low-Cost Metagenomic Analysis with In-Storage Processing","authors":"Nika Mansouri Ghiasi, Mohammad Sadrosadati, Harun Mustafa, Arvid Gollwitzer, Can Firtina, Julien Eudine, Haiyu Mao, Joël Lindegger, Meryem Banu Cavlak, Mohammed Alser, Jisung Park, Onur Mutlu","doi":"arxiv-2406.19113","DOIUrl":"https://doi.org/arxiv-2406.19113","url":null,"abstract":"Metagenomics has led to significant advances in many fields. Metagenomic analysis commonly involves the key tasks of determining the species present in a sample and their relative abundances. These tasks require searching large metagenomic databases. Metagenomic analysis suffers from significant data movement overhead due to moving large amounts of low-reuse data from the storage system. In-storage processing can be a fundamental solution for reducing this overhead. However, designing an in-storage processing system for metagenomics is challenging because existing approaches to metagenomic analysis cannot be directly implemented in storage effectively due to the hardware limitations of modern SSDs. We propose MegIS, the first in-storage processing system designed to significantly reduce the data movement overhead of the end-to-end metagenomic analysis pipeline. MegIS is enabled by our lightweight design that effectively leverages and orchestrates processing inside and outside the storage system. We address in-storage processing challenges for metagenomics via specialized and efficient 1) task partitioning, 2) data/computation flow coordination, 3) storage technology-aware algorithmic optimizations, 4) data mapping, and 5) lightweight in-storage accelerators. MegIS's design is flexible, capable of supporting different types of metagenomic input datasets, and can be integrated into various metagenomic analysis pipelines. Our evaluation shows that MegIS outperforms the state-of-the-art performance- and accuracy-optimized software metagenomic tools by 2.7×-37.2× and 6.9×-100.2×, respectively, while matching the accuracy of the accuracy-optimized tool. MegIS achieves 1.5×-5.1× speedup compared to the state-of-the-art metagenomic hardware-accelerated (using processing-in-memory) tool, while achieving significantly higher accuracy.","PeriodicalId":501070,"journal":{"name":"arXiv - QuanBio - Genomics","volume":"96 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141509674","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Online t-SNE for single-cell RNA-seq","authors":"Hui Ma, Kai Chen","doi":"arxiv-2406.14842","DOIUrl":"https://doi.org/arxiv-2406.14842","url":null,"abstract":"Due to the sequential sample arrival, changing experiment conditions, and evolution of knowledge, the demand to continually visualize evolving structures of sequential and diverse single-cell RNA-sequencing (scRNA-seq) data becomes indispensable. However, as one of the state-of-the-art visualization and analysis methods for scRNA-seq, t-distributed stochastic neighbor embedding (t-SNE) merely visualizes static scRNA-seq data offline and fails to meet the demand well. To address these challenges, we introduce online t-SNE to seamlessly integrate sequential scRNA-seq data. Online t-SNE achieves this by leveraging the embedding space of old samples, exploring the embedding space of new samples, and aligning the two embedding spaces on the fly. Consequently, online t-SNE dramatically enables the continual discovery of new structures and high-quality visualization of new scRNA-seq data without retraining from scratch. We showcase the formidable visualization capabilities of online t-SNE across diverse sequential scRNA-seq datasets.","PeriodicalId":501070,"journal":{"name":"arXiv - QuanBio - Genomics","volume":"44 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-06-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141509802","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"GenoTEX: A Benchmark for Evaluating LLM-Based Exploration of Gene Expression Data in Alignment with Bioinformaticians","authors":"Haoyang Liu, Haohan Wang","doi":"arxiv-2406.15341","DOIUrl":"https://doi.org/arxiv-2406.15341","url":null,"abstract":"Recent advancements in machine learning have significantly improved the identification of disease-associated genes from gene expression datasets. However, these processes often require extensive expertise and manual effort, limiting their scalability. Large Language Model (LLM)-based agents have shown promise in automating these tasks due to their increasing problem-solving abilities. To support the evaluation and development of such methods, we introduce GenoTEX, a benchmark dataset for the automatic exploration of gene expression data, involving the tasks of dataset selection, preprocessing, and statistical analysis. GenoTEX provides annotated code and results for solving a wide range of gene identification problems, in a full analysis pipeline that follows the standard of computational genomics. These annotations are curated by human bioinformaticians who carefully analyze the datasets to ensure accuracy and reliability. To provide baselines for these tasks, we present GenoAgents, a team of LLM-based agents designed with context-aware planning, iterative correction, and domain expert consultation to collaboratively explore gene datasets. Our experiments with GenoAgents demonstrate the potential of LLM-based approaches in genomics data analysis, while error analysis highlights the challenges and areas for future improvement. We propose GenoTEX as a promising resource for benchmarking and enhancing AI-driven methods for genomics data analysis. We make our benchmark publicly available at https://github.com/Liu-Hy/GenoTex.","PeriodicalId":501070,"journal":{"name":"arXiv - QuanBio - Genomics","volume":"4 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-06-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141509803","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}