Bioinformatics (Oxford, England): latest articles

Efficient storage and regression computation for population-scale genome sequencing studies.
Bioinformatics (Oxford, England) Pub Date : 2025-02-11 DOI: 10.1093/bioinformatics/btaf067
Manuel A Rivas, Christopher Chang
Motivation: The growing availability of large-scale population biobanks has the potential to significantly advance our understanding of human health and disease. However, the massive computational and storage demands of whole genome sequencing (WGS) data pose serious challenges, particularly for underfunded institutions or researchers in developing countries. This disparity in resources can limit equitable access to cutting-edge genetic research.
Results: We present novel algorithms and regression methods that dramatically reduce both computation time and storage requirements for WGS studies, with particular attention to rare variant representation. By integrating these approaches into PLINK 2.0, we demonstrate substantial gains in efficiency without compromising analytical accuracy. In an exome-wide association analysis of 19.4 million variants for the body mass index phenotype in 125,077 individuals (All of Us project data), we reduced runtime from 695.35 minutes (about 11.6 hours) on a single machine to 1.57 minutes with 30 GB of memory and 50 threads (or 8.67 minutes with 4 threads). Additionally, the framework supports multi-phenotype analyses, further enhancing its flexibility.
Availability: Our optimized methods are fully integrated into PLINK 2.0 and can be accessed at https://www.cog-genomics.org/plink/2.0/.
Citations: 0
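The runtime figures quoted in the abstract above can be turned into speedup factors with a few lines of arithmetic. The sketch below only restates the three reported timings; the variable and function names are illustrative and do not come from PLINK 2.0.

```python
# Timings reported in the abstract, in minutes.
baseline_min = 695.35        # original single-machine run
fast_50_threads_min = 1.57   # optimized run, 50 threads, 30 GB of memory
fast_4_threads_min = 8.67    # optimized run, 4 threads


def speedup(before_min, after_min):
    """Return how many times faster the second run is than the first."""
    return before_min / after_min


def minutes_to_hours(minutes):
    """Convert a duration in minutes to hours."""
    return minutes / 60.0


print(f"baseline: {minutes_to_hours(baseline_min):.2f} h")
print(f"50 threads: {speedup(baseline_min, fast_50_threads_min):.1f}x faster")
print(f"4 threads: {speedup(baseline_min, fast_4_threads_min):.1f}x faster")
```

On these numbers the 50-thread run is roughly 440 times faster than the baseline and the 4-thread run roughly 80 times faster, so most of the gain appears to come from the algorithms rather than from thread count alone.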
ImmunoTar - integrative prioritization of cell surface targets for cancer immunotherapy.
Bioinformatics (Oxford, England) Pub Date : 2025-02-11 DOI: 10.1093/bioinformatics/btaf060
Rawan Shraim, Brian Mooney, Karina L Conkrite, Amber K Hamilton, Gregg B Morin, Poul H Sorensen, John M Maris, Sharon J Diskin, Ahmet Sacan
Motivation: Cancer remains a leading cause of mortality globally. Recent improvements in survival have been facilitated by the development of targeted and less toxic immunotherapies, such as chimeric antigen receptor (CAR)-T cells and antibody-drug conjugates (ADCs). These therapies, effective in treating both pediatric and adult patients with solid and hematological malignancies, rely on the identification of cancer-specific surface protein targets. While technologies like RNA sequencing and proteomics exist to survey these targets, identifying optimal targets for immunotherapies remains a challenge in the field.
Results: To address this challenge, we developed ImmunoTar, a novel computational tool designed to systematically prioritize candidate immunotherapeutic targets. ImmunoTar integrates user-provided RNA-sequencing or proteomics data with quantitative features from multiple public databases, selected based on predefined criteria, to generate a score representing each gene's suitability as an immunotherapeutic target. We validated ImmunoTar using three distinct cancer datasets, demonstrating its effectiveness in identifying both known and novel targets across various cancer phenotypes. By compiling diverse data into a unified platform, ImmunoTar enables comprehensive evaluation of surface proteins, streamlining target identification and empowering researchers to efficiently allocate resources, thereby accelerating the development of effective cancer immunotherapies.
Availability: Code and data to run and test ImmunoTar are available at https://github.com/sacanlab/immunotar.
Supplementary information: Supplementary data are available at Bioinformatics online.
Citations: 0
APNet, an explainable sparse deep learning model to discover differentially active drivers of severe COVID-19.
Bioinformatics (Oxford, England) Pub Date : 2025-02-08 DOI: 10.1093/bioinformatics/btaf063
George I Gavriilidis, Vasileios Vasileiou, Stella Dimitsaki, Georgios Karakatsoulis, Antonis Giannakakis, Georgios A Pavlopoulos, Fotis Psomopoulos
Motivation: Computational analyses of bulk and single-cell omics provide translational insights into complex diseases, such as COVID-19, by revealing molecules, cellular phenotypes, and signalling patterns that contribute to unfavourable clinical outcomes. Current in silico approaches dovetail differential abundance, biostatistics, and machine learning, but often overlook non-linear proteomic dynamics, like post-translational modifications, and provide limited biological interpretability beyond feature ranking.
Results: We introduce APNet, a novel computational pipeline that combines differential activity analysis based on SJARACNe co-expression networks with PASNet, a biologically informed sparse deep learning model, to perform explainable predictions for COVID-19 severity. The APNet driver-pathway network ingests SJARACNe co-regulation and classification weights to aid result interpretation and hypothesis generation. APNet outperforms alternative models in patient classification across three COVID-19 proteomic datasets, identifying predictive drivers and pathways, including some confirmed in single-cell omics, and highlighting under-explored biomarker circuitries in COVID-19.
Availability and implementation: APNet's R and Python scripts and Cytoscape methodologies are available at https://github.com/BiodataAnalysisGroup/APNet.
Supplementary information: Supplementary information can be accessed in Zenodo (10.5281/zenodo.14680520).
Citations: 0
COBREXA 2: tidy and scalable construction of complex metabolic models.
Bioinformatics (Oxford, England) Pub Date : 2025-02-08 DOI: 10.1093/bioinformatics/btaf056
Miroslav Kratochvíl, St Elmo Wilken, Oliver Ebenhöh, Reinhard Schneider, Venkata P Satagopam
Summary: Constraint-based metabolic models offer a scalable framework to investigate biological systems using optimality principles. Construction and simulation of detailed models that utilize multiple kinds of constraint systems pose a significant coding overhead, complicating implementation of new types of analyses. We present an improved version of the constraint-based metabolic modeling package COBREXA, which utilizes a hierarchical model construction framework that decouples the implemented analysis algorithms into independent, yet re-combinable, building blocks. By removing the need to re-implement modeling components, assembly of complex metabolic models is simplified, which we demonstrate on use-cases of resource-balanced models and enzyme-constrained flux balance models of interacting bacterial communities. Notably, these models show improved predictive capabilities in both monoculture and community settings. In perspective, the re-usable model-building components in COBREXA 2 provide a sustainable way to handle increasingly complex models in constraint-based modeling.
Availability and implementation: COBREXA 2 is available from https://github.com/COBREXA/COBREXA.jl and from Julia package repositories. COBREXA 2 works on all major operating systems and computer architectures. Documentation is available at https://cobrexa.github.io/COBREXA.jl/.
Supplementary information: Supplementary data are available at Bioinformatics online.
Citations: 0
NMFProfiler: A multi-omics integration method for samples stratified in groups.
Bioinformatics (Oxford, England) Pub Date : 2025-02-08 DOI: 10.1093/bioinformatics/btaf066
Aurélie Mercadié, Éléonore Gravier, Gwendal Josse, Isabelle Fournier, Cécile Viodé, Nathalie Vialaneix, Céline Brouard
Motivation: The development of high-throughput sequencing enabled the massive production of "omics" data for various applications in biology. By simultaneously analyzing paired datasets collected on the same samples, integrative statistical approaches allow researchers to get a global picture of such systems and to highlight existing relationships between various molecular types and levels. Here, we introduce NMFProfiler, an integrative supervised NMF that accounts for the stratification of samples into groups of biological interest.
Results: NMFProfiler was shown to successfully extract signatures characterizing groups, with performance comparable to or better than state-of-the-art approaches. In particular, NMFProfiler was used in a clinical study on Atopic Dermatitis (AD) and to analyze a multi-omic cancer dataset. In the first case, it successfully identified signatures combining known AD protein biomarkers and novel transcriptomic biomarkers. In the second, it was able to extract signatures significantly associated with cancer survival.
Availability: NMFProfiler is released as a Python package, NMFProfiler (v0.3.0), available on PyPI.
Supplementary information: Supplementary Table S1 and Supplementary material are available at Bioinformatics online.
Citations: 0
MolFCL: predicting molecular properties through chemistry-guided contrastive and prompt learning.
Bioinformatics (Oxford, England) Pub Date : 2025-02-08 DOI: 10.1093/bioinformatics/btaf061
Xiang Tang, Qichang Zhao, Jianxin Wang, Guihua Duan
Motivation: Accurately identifying and predicting molecular properties is a crucial task in molecular machine learning, and the key lies in how to extract effective molecular representations. Contrastive learning opens new avenues for representation learning, and a large amount of unlabeled data enables the model to generalize to the huge chemical space. However, existing contrastive learning-based models face two challenges: (1) existing methods destroy the original molecular environment and ignore chemical prior information, and (2) there is a lack of prior knowledge to guide the prediction of molecular properties.
Results: In this work, we propose a molecular property prediction framework called MolFCL, which consists of fragment-based contrastive learning and functional group-based prompt learning. Specifically, we introduced fragment-fragment interactions for the first time in the contrastive learning framework and designed a fragment-based augmented molecular graph that integrates the original chemical environment and fragment reactions. Furthermore, we proposed a novel functional group-based prompt learning during fine-tuning, which is the first to incorporate functional group knowledge and the corresponding atomic signals, to improve molecular representation and provide interpretable analyses. The results show that MolFCL outperforms state-of-the-art baseline models on 23 molecular property prediction datasets. Moreover, visualizations show that MolFCL can learn to embed molecules into representations that can distinguish chemical properties. MolFCL can give higher weight to functional groups consistent with chemical knowledge during the prediction of molecular properties, which makes the model interpretable. Overall, MolFCL is a practically useful tool for molecular property prediction and assists drug scientists in designing drugs more effectively.
Availability and implementation: MolFCL is available at https://github.com/tangxiangcsu/MolFCL.
Supplementary information: Supplementary data are available at Bioinformatics online.
Citations: 0
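To make the contrastive-learning idea in the MolFCL abstract concrete, here is a minimal sketch of an InfoNCE-style loss over pairs of embeddings. The abstract does not state which loss MolFCL uses, so InfoNCE is an assumption chosen because it is a common contrastive objective; the two-dimensional vectors are hand-made stand-ins for encoder outputs of a molecule and its augmented view, and `info_nce` and `temperature` are illustrative names.

```python
import math


def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(anchors, positives, temperature=0.1):
    """Mean InfoNCE loss: anchors[i] should match positives[i],
    while every other positives[j] acts as a negative."""
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, p) / temperature for p in positives]
        # Log-sum-exp with the maximum subtracted for numerical stability.
        top = max(logits)
        log_denominator = top + math.log(sum(math.exp(x - top) for x in logits))
        total += log_denominator - logits[i]
    return total / len(anchors)


# Hypothetical embeddings: two molecules and their augmented views.
originals = [[1.0, 0.0], [0.0, 1.0]]
augmented = [[0.9, 0.1], [0.1, 0.9]]

aligned_loss = info_nce(originals, augmented)
# Pairing each molecule with the other molecule's view should be penalized.
mismatched_loss = info_nce(originals, list(reversed(augmented)))
print(aligned_loss, mismatched_loss)
```

The loss is small when each embedding sits closest to its own augmented view and large when the pairing is wrong, which is the signal a contrastive pre-training stage optimizes.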
ParaSurf: A Surface-Based Deep Learning Approach for Paratope-Antigen Interaction Prediction.
Bioinformatics (Oxford, England) Pub Date : 2025-02-08 DOI: 10.1093/bioinformatics/btaf062
Angelos-Michael Papadopoulos, Apostolos Axenopoulos, Anastasia Iatrou, Kostas Stamatopoulos, Federico Alvarez, Petros Daras
Motivation: Identifying antibody binding sites is crucial for developing vaccines and therapeutic antibodies, processes that are time-consuming and costly. Accurate prediction of the paratope's binding site can speed up development by improving our understanding of antibody-antigen interactions.
Results: We present ParaSurf, a deep learning model that significantly enhances paratope prediction by incorporating both surface geometric and non-geometric factors. Trained and tested on three prominent antibody-antigen benchmarks, ParaSurf achieves state-of-the-art results across nearly all metrics. Unlike models restricted to the variable region, ParaSurf demonstrates the ability to accurately predict binding scores across the entire Fab region of the antibody. Additionally, we conducted an extensive analysis using the largest of the three datasets employed, focusing on three key components: (1) a detailed evaluation of paratope prediction for each Complementarity-Determining Region loop, (2) the performance of models trained exclusively on the heavy chain, and (3) the results of training models solely on the light chain without incorporating data from the heavy chain.
Availability and implementation: Source code for ParaSurf, along with the datasets used, preprocessing pipeline, and trained model weights, is freely available at https://github.com/aggelos-michael-papadopoulos/ParaSurf.
Supplementary information: Supplementary data are available at Bioinformatics online.
Citations: 0
DeepES: Deep learning-based enzyme screening to identify orphan enzyme genes.
Bioinformatics (Oxford, England) Pub Date : 2025-02-06 DOI: 10.1093/bioinformatics/btaf053
Keisuke Hirota, Felix Salim, Takuji Yamada
Motivation: Progress in sequencing technology has led to the determination of large numbers of protein sequences, and large enzyme databases are now available. Although many computational tools for enzyme annotation have been developed, sequence information is unavailable for many enzymes, known as orphan enzymes. These orphan enzymes hinder sequence similarity-based functional annotation, leading to gaps in understanding the association between sequences and enzymatic reactions.
Results: Therefore, we developed DeepES, a deep learning-based tool for enzyme screening to identify orphan enzyme genes, focusing on biosynthetic gene clusters and reaction class. DeepES uses protein sequences as inputs and evaluates whether the input genes contain biosynthetic gene clusters of interest by integrating the outputs of the binary classifier for each reaction class. The validation results suggested that DeepES can capture functional similarity between protein sequences, and it can be implemented to explore orphan enzyme genes. By applying DeepES to 4744 metagenome-assembled genomes, we identified candidate genes for 236 orphan enzymes, including those involved in short-chain fatty acid production as a characteristic pathway in human gut bacteria.
Availability: DeepES is available at https://github.com/yamada-lab/DeepES. Model weights and the candidate genes are available at Zenodo (https://doi.org/10.5281/zenodo.11123900).
Supplementary information: Supplementary data are available at Bioinformatics online.
Citations: 0
ESPClust: Unsupervised identification of modifiers for the effect size profile in omics association studies.
Bioinformatics (Oxford, England) Pub Date : 2025-02-06 DOI: 10.1093/bioinformatics/btaf065
Francisco J Pérez-Reche, Nathan J Cheetham, Ruth C E Bowyer, Ellen J Thompson, Francesca Tettamanzi, Cristina Menni, Claire J Steves
Motivation: High-throughput omics technologies have revolutionised the identification of associations between individual traits and underlying biological characteristics, but still use 'one effect-size fits all' approaches. While covariates are often used, their potential as effect modifiers often remains unexplored.
Results: We propose ESPClust, a novel unsupervised method designed to identify covariates that modify the effect size of associations between sets of omics variables and outcomes. By extending the concept of moderators to encompass multiple exposures, ESPClust analyses the effect size profile (ESP) to identify regions in covariate space with different ESP, enabling the discovery of subpopulations with distinct associations. Applying ESPClust to synthetic data, insulin resistance and COVID-19 symptom manifestation, we demonstrate its versatility and ability to uncover nuanced effect size modifications that traditional analyses may overlook. By integrating information from multiple exposures, ESPClust identifies effect size modifiers in datasets that are too small for traditional univariate stratified analyses. This method provides a robust framework for understanding complex omics data and holds promise for personalised medicine.
Availability and implementation: The ESPClust source code is available at https://github.com/fjpreche/ESPClust.git. It can be installed via Python package repositories as `pip install ESPClust==1.1.0`.
Supplementary information: Supplementary data are available at Bioinformatics online.
Citations: 0
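The notion of an effect size profile that changes across covariate space, as described in the ESPClust abstract above, can be illustrated with a toy calculation. This is not the ESPClust algorithm or its API: the sketch simply fits one least-squares slope per exposure inside two hand-picked covariate windows and compares the resulting profiles. The synthetic records, the `age` covariate, the exposure names `e1` and `e2`, and all function names are assumptions made for illustration.

```python
import math


def ols_slope(xs, ys):
    """Least-squares slope of ys regressed on xs (one exposure at a time)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


def effect_size_profile(records, exposures, window):
    """Vector of per-exposure slopes for records whose covariate is in [lo, hi)."""
    lo, hi = window
    subset = [r for r in records if lo <= r["age"] < hi]
    outcomes = [r["outcome"] for r in subset]
    return [ols_slope([r[e] for r in subset], outcomes) for e in exposures]


# Synthetic data: the effect of e1 flips sign at age 50, the effect of e2 does not.
# Exposures follow a full factorial design, so they are uncorrelated in each window.
records = []
for age in (25, 35, 55, 65):
    beta_e1 = 1.0 if age < 50 else -1.0
    for e1 in range(3):
        for e2 in range(3):
            records.append(
                {"age": age, "e1": e1, "e2": e2, "outcome": beta_e1 * e1 + 0.5 * e2}
            )

young = effect_size_profile(records, ["e1", "e2"], (20, 50))
older = effect_size_profile(records, ["e1", "e2"], (50, 80))
# Euclidean distance between profiles: a large value flags age as a modifier.
distance = math.sqrt(sum((a - b) ** 2 for a, b in zip(young, older)))
print(young, older, distance)
```

A clustering step over many windows would generalize this two-window comparison; here the windows are fixed by hand to keep the example short.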
Embed-Search-Align: DNA sequence alignment using transformer models.
Bioinformatics (Oxford, England) Pub Date : 2025-02-06 DOI: 10.1093/bioinformatics/btaf041
Pavan Holur, K C Enevoldsen, Shreyas Rajesh, Lajoyce Mboning, Thalia Georgiou, Louis-S Bouchard, Matteo Pellegrini, Vwani Roychowdhury
Motivation: DNA sequence alignment, an important genomic task, involves assigning short DNA reads to the most probable locations on an extensive reference genome. Conventional methods tackle this challenge in two steps: genome indexing followed by efficient search to locate likely positions for given reads. Building on the success of Large Language Models (LLMs) in encoding text into embeddings, where the distance metric captures semantic similarity, recent efforts have encoded DNA sequences into vectors using Transformers and have shown promising results in tasks involving classification of short DNA sequences. Performance at sequence classification tasks does not, however, guarantee sequence alignment, where it is necessary to conduct a genome-wide search to align every read successfully, a significantly longer-range task by comparison.
Results: We bridge this gap by developing an "Embed-Search-Align" (ESA) framework, where a novel Reference-Free DNA Embedding (RDE) Transformer model generates vector embeddings of reads and fragments of the reference in a shared vector space; the read-fragment distance metric is then used as a surrogate for sequence similarity. ESA introduces: (1) contrastive loss for self-supervised training of DNA sequence representations, facilitating rich reference-free, sequence-level embeddings, and (2) a DNA vector store to enable search across fragments on a global scale. RDE is 99% accurate when aligning 250-length reads onto a human reference genome of 3 gigabases (single-haploid), rivaling conventional algorithmic sequence alignment methods such as Bowtie and BWA-Mem. RDE far exceeds the performance of 6 recent DNA-Transformer model baselines, such as Nucleotide Transformer and Hyena-DNA, and shows task transfer across chromosomes and species.
Availability and information: Please see https://anonymous.4open.science/r/dna2vec-7E4E/readme.md.
Supplementary information: Please see attached file.
Citations: 0
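The embed-then-search pattern described in the Embed-Search-Align abstract can be sketched on a toy scale. The code below is not the RDE Transformer: as a clearly labelled stand-in it embeds sequences as k-mer count vectors, stores one vector per reference fragment in a plain list, and retrieves the fragment closest to a read by cosine similarity. The fragment length, stride, k-mer size, random reference, and all function names are assumptions, and the final fine-grained alignment step within the retrieved fragment is omitted.

```python
import math
import random
from collections import Counter

K = 4  # k-mer size for the stand-in embedding (assumption)


def embed(seq):
    """Stand-in embedding: sparse k-mer count vector, not a learned model."""
    return Counter(seq[i:i + K] for i in range(len(seq) - K + 1))


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * v[kmer] for kmer, count in u.items())  # missing keys count as 0
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v)


def build_store(reference, frag_len, stride):
    """Vector store: (start position, embedding) for overlapping fragments."""
    last_start = len(reference) - frag_len
    return [
        (start, embed(reference[start:start + frag_len]))
        for start in range(0, last_start + 1, stride)
    ]


def search(store, read, top_k=1):
    """Return the top_k (similarity, start) pairs for a read."""
    query = embed(read)
    scored = sorted(((cosine(query, vec), start) for start, vec in store), reverse=True)
    return scored[:top_k]


# Toy reference genome generated with a fixed seed for repeatability.
rng = random.Random(0)
reference = "".join(rng.choice("ACGT") for _ in range(400))
store = build_store(reference, frag_len=40, stride=20)

read = reference[120:160]  # a read drawn from a known position
best_score, best_start = search(store, read)[0]
print(best_start, round(best_score, 3))
```

A linear scan over the store is enough for a few fragments; at genome scale an approximate nearest-neighbour index would replace it, and a learned embedding would replace the k-mer counts so that reads with mismatches still land near their source fragment.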