Bioinformatics advances: latest articles

Evaluation of search-enabled pretrained Large Language Models on retrieval tasks for the PubChem database.
IF 2.4
Bioinformatics advances Pub Date : 2025-03-24 eCollection Date: 2025-01-01 DOI: 10.1093/bioadv/vbaf064
Ash Sze, Soha Hassoun
Motivation: Databases are indispensable in biological and biomedical research, hosting vast amounts of structured and unstructured data, facilitating the organization, retrieval, and analysis of complex data. Database access, however, remains a manual, tedious, and sometimes overwhelming task. The availability of Large Language Models (LLMs) has the potential to play a transformative role in accessing databases.
Results: We investigate in this study the current state of using a pretrained, search-enabled LLM (ChatGPT-4o) for data retrieval from PubChem, a flagship database that plays a critical role in biological and biomedical research. We evaluate eight PubChem access protocols that were previously documented. We develop a methodology for adopting the protocols into an LLM-prompt, where we supplement the prompt with additional context through iterative prompt refinement as needed. To further evaluate the LLM capabilities, we instruct the LLM to perform the retrieval. We quantitatively and qualitatively show that instructing ChatGPT-4o to generate programmatic access is more likely to yield the correct answers. We provide insightful future directions in developing LLMs for database access.
Availability and implementation: All text used to prompt ChatGPT-4o is provided in the manuscript.
Volume 5, issue 1, article vbaf064. Publication type: Journal Article. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12073969/pdf/
Citations: 0
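The abstract above reports that asking the LLM to generate programmatic access tends to give more reliable answers than asking it to retrieve data directly. As a rough illustration of what such programmatic access can look like, here is a minimal sketch against PubChem's public PUG REST interface. It is not taken from the paper: the helper names `build_property_url` and `fetch_properties` are hypothetical, and the example covers only a lookup of compound properties by name.

```python
from urllib.parse import quote

# Base address of PubChem's PUG REST service.
PUG_REST_BASE = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def build_property_url(name, properties, output="JSON"):
    """Build a PUG REST URL that looks up compound properties by name.

    Hypothetical helper for illustration; not from the paper.
    """
    if not properties:
        raise ValueError("at least one property is required")
    prop_list = ",".join(properties)
    # Percent-encode the compound name so spaces and slashes are safe in a path.
    encoded_name = quote(name, safe="")
    return f"{PUG_REST_BASE}/compound/name/{encoded_name}/property/{prop_list}/{output}"


def fetch_properties(name, properties, timeout=30):
    """Query PubChem and return the list of property records (needs network)."""
    import json
    from urllib.request import urlopen

    with urlopen(build_property_url(name, properties), timeout=timeout) as resp:
        payload = json.load(resp)
    # PUG REST wraps property results in PropertyTable -> Properties.
    return payload["PropertyTable"]["Properties"]
```

Calling `fetch_properties("aspirin", ["MolecularFormula", "MolecularWeight"])` would issue a single HTTP request. Keeping URL construction separate from the network call makes the request easy to inspect before it is sent.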
TITINdb2-expanding annotation and structural information for protein variants in the giant sarcomeric protein titin.
IF 2.4
Bioinformatics advances Pub Date : 2025-03-22 eCollection Date: 2025-01-01 DOI: 10.1093/bioadv/vbaf062
Timir Weston, Joseph Ng, Oriol Gracia Carmona, Mathias Gautel, Franca Fraternali
Summary: We present TITINdb2, an update to the TITINdb database previously constructed to facilitate the identification of pathogenic missense variants in the giant protein titin, which are associated with a variety of skeletal and cardiac myopathies. The database and web portal have been substantially revised and include the following new features: (i) an increase in computational annotation from 4 to 20 variant impact predictors, available through a new custom data table dialogue; (ii) through structural coverage of single domains with AlphaFold2 predicted models; (iii) newly predicted domain-domain interface annotations; (iv) an expanded in silico saturation mutagenesis incorporating four variant impact predictors; (v) a comprehensive overhaul of available data, including population data sources and variants reported pathogenic in the literature; and (vi) a curated mapping of existing protein, transcript, and chromosomal sequence positions and a new variant conversion tool to translate variants in one format to any other format.
Availability and implementation: The database is accessible via titindb.kcl.ac.uk/TITINdb/.
Volume 5, issue 1, article vbaf062. Publication type: Journal Article. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12017618/pdf/
Citations: 0
Adaptive adjustment of profile HMM significance thresholds improves functional and metabolic insights into microbial genomes.
IF 2.4
Bioinformatics advances Pub Date : 2025-03-21 eCollection Date: 2025-01-01 DOI: 10.1093/bioadv/vbaf039
Kathryn Kananen, Iva Veseli, Christian J Quiles Pérez, Samuel E Miller, A Murat Eren, Patrick H Bradley
Motivation: Gene function annotation in microbial genomes and metagenomes is a fundamental in silico first step toward understanding metabolic potential and determinants of fitness. The Kyoto Encyclopedia of Genes and Genomes publishes a curated list of profile hidden Markov models to identify orthologous gene families (KOfams) with roles in metabolism. However, the computational tools that rely upon KOfams yield different annotations for the same set of genomes, leading to different downstream biological inferences.
Results: Here, we apply three open-source software tools that can annotate KOfams to genomes of phylogenetically diverse bacterial families from host-associated and free-living biomes. We use multiple computational approaches to benchmark these methods and investigate individual case studies where they differ. Our results show that despite their fundamental similarities, these methods have different annotation rates and quality. In particular, a method that adaptively tunes sequence similarity thresholds substantially improves sensitivity while maintaining high accuracy. We observe particularly large improvements for protein families with few reference sequences, or when annotating genomes from nonmodel organisms (such as gut-dwelling Lachnospiraceae). Our findings show that small improvements in annotation workflows can maximize the utility of existing databases and meaningfully improve in silico characterizations of microbial metabolism.
Availability and implementation: Anvi'o is available at https://anvio.org under the GNU GPL license. Scripts and workflow are available at https://github.com/pbradleylab/2023-anvio-comparison under the MIT license.
Volume 5, issue 1, article vbaf039. Publication type: Journal Article. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11964587/pdf/
Citations: 0
peptidy: a light-weight Python library for peptide representation in machine learning.
IF 2.4
Bioinformatics advances Pub Date : 2025-03-21 eCollection Date: 2025-01-01 DOI: 10.1093/bioadv/vbaf058
Rıza Özçelik, Laura van Weesep, Sarah de Ruiter, Francesca Grisoni
Motivation: Peptides are widely used in applications ranging from drug discovery to food technologies. Machine learning has become increasingly prominent in accelerating the search for new peptides, and user-friendly computational tools can further enhance these efforts.
Results: In this work, we introduce peptidy, a lightweight Python library that facilitates converting peptides (expressed as amino acid sequences) to numerical representations suited to machine learning. peptidy is free from external dependencies, integrates seamlessly into modern Python environments, and supports a range of encoding strategies suitable for both predictive and generative machine learning approaches. Additionally, peptidy supports peptides with post-translational modifications, such as phosphorylation, acetylation, and methylation, thereby extending the functionality of existing Python packages for peptides and proteins.
Availability and implementation: peptidy is freely available with a permissive license on GitHub at the following URL: https://github.com/molML/peptidy.
Volume 5, issue 1, article vbaf058. Publication type: Journal Article. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11961219/pdf/
Citations: 0
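To make "converting peptides to numerical representations" concrete, the sketch below shows one common encoding strategy, padded one-hot encoding, in dependency-free Python. It deliberately does not use or imitate the peptidy API. The function name `one_hot_encode`, the alphabet ordering, and the zero-padding convention are assumptions made for illustration only, and post-translational modifications are not handled.

```python
# The 20 standard amino acids in one-letter code (ordering is an arbitrary choice).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot_encode(sequence, max_length):
    """Encode a peptide as a max_length x 20 matrix of 0/1 values.

    Hypothetical helper, not part of peptidy. Positions beyond the end of
    the peptide are padded with all-zero rows.
    """
    if len(sequence) > max_length:
        raise ValueError("sequence is longer than max_length")
    rows = []
    for aa in sequence:
        if aa not in AA_INDEX:
            raise ValueError(f"unknown amino acid: {aa!r}")
        row = [0] * len(AMINO_ACIDS)
        row[AA_INDEX[aa]] = 1  # mark the column for this residue
        rows.append(row)
    # Pad so every peptide yields a matrix of the same shape.
    padding = max_length - len(sequence)
    rows.extend([0] * len(AMINO_ACIDS) for _ in range(padding))
    return rows
```

Padding to a fixed `max_length` gives every peptide the same matrix shape, which is what most predictive models expect as input.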
ShinyTHOR app: Shiny-built tumor high-throughput omics-based roadmap.
IF 2.4
Bioinformatics advances Pub Date : 2025-03-21 eCollection Date: 2025-01-01 DOI: 10.1093/bioadv/vbaf061
Eduardo Navarrete-Bencomo, Anthony Vladimir Campos Segura, Orlando R Sevillano, Ana Mayanga, José Luis Buleje Sono, César Alexander Ortiz Rojas, Alexis Germán Murillo Carrasco
Motivation: The Cancer Cell Line Encyclopedia (CCLE), launched in 2008, systematically organizes multi-omic and pharmacological data from over 1000 cancer cell lines with molecular dependency maps accessible through the DepMap tool. However, DepMap lacks tools for systematic comparison of mRNAs, miRNAs, proteins, and metabolites, as well as their links to drug responses and gene signatures. Extracting this data externally requires bioinformatics expertise, limiting access for wet-lab researchers.
Results: We developed ShinyTHOR, a web app for intuitive access to multi-omic (transcriptomic, metabolomic, methylomic, proteomic, and miRNomic) and drug-related data. It integrates datasets from CCLE, miRTarBase, circInteractome, and the Genomics of Drug Sensitivity in Cancer. ShinyTHOR includes six modules: (1) identifying top/bottom ten cell lines by marker expression or drug IC50, (2) single-sample Gene Set Enrichment Analysis (ssGSEA), (3) multi-analyte expression evaluation, (4) miRNA-target interactions across cancer types, (5) miRNA impact on mRNA translation via protein levels, and (6) circRNA/miRNA modulation. To validate its utility, we applied ShinyTHOR to a gastric cancer prognosis study (GES7).
Availability and implementation: ShinyTHOR is freely accessible for non-commercial use at https://alexismurillo.shinyapps.io/ShinyThor, with source code available at https://github.com/Murillo22/ShinyThor.
Volume 5, issue 1, article vbaf061. Publication type: Journal Article. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12085240/pdf/
Citations: 0
AntiFold: improved structure-based antibody design using inverse folding.
IF 2.4
Bioinformatics advances Pub Date : 2025-03-21 eCollection Date: 2025-01-01 DOI: 10.1093/bioadv/vbae202
Magnus Haraldson Høie, Alissa M Hummer, Tobias H Olsen, Broncio Aguilar-Sanjuan, Morten Nielsen, Charlotte M Deane
Summary: The design and optimization of antibodies requires an intricate balance across multiple properties. Protein inverse folding models, capable of generating diverse sequences folding into the same structure, are promising tools for maintaining structural integrity during antibody design. Here, we present AntiFold, an antibody-specific inverse folding model, fine-tuned from ESM-IF1 on solved and predicted antibody structures. AntiFold outperforms existing inverse folding tools on sequence recovery across complementarity-determining regions, with designed sequences showing high structural similarity to their solved counterpart. It additionally achieves stronger correlations when predicting antibody-antigen binding affinity in a zero-shot manner. AntiFold assigns low probabilities to mutations that disrupt antigen binding, synergizing with protein language model residue probabilities, and demonstrates promise for guiding antibody optimization while retaining structure-related properties.
Availability and implementation: AntiFold is freely available under the BSD 3-Clause as a web server (https://opig.stats.ox.ac.uk/webapps/antifold/) and pip-installable package (https://github.com/oxpig/AntiFold).
Volume 5, issue 1, article vbae202. Publication type: Journal Article. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11961221/pdf/
Citations: 0
Zepyros: a webserver to evaluate the shape complementarity of protein-protein interfaces.
IF 2.4
Bioinformatics advances Pub Date : 2025-03-20 eCollection Date: 2025-01-01 DOI: 10.1093/bioadv/vbaf051
Mattia Miotto, Lorenzo Di Rienzo, Leonardo Bo', Giancarlo Ruocco, Edoardo Milanetti
Motivation: Shape complementarity of molecular surfaces at the interfaces is a well-known characteristic of protein-protein binding regions, and it is critical in influencing the stability of the complex. Measuring such complementarity is of great importance for a number of theoretical and practical implications; however, only a limited number of tools are currently available to efficiently and rapidly assess it.
Results: Here, we introduce Zepyros (ZErnike Polynomials analYsis of pROtein Shapes), a webserver for fast measurement of the shape complementarity between two molecular interfaces of a given protein-protein complex using structural information. Zepyros is implemented as a publicly available tool with a user-friendly interface.
Availability and implementation: Our server can be found at the following link (all major browsers supported): https://zepyros.bio-groups.com.
Volume 5, issue 1, article vbaf051. Publication type: Journal Article. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11968322/pdf/
Citations: 0
Biological databases in the age of generative artificial intelligence.
IF 2.4
Bioinformatics advances Pub Date : 2025-03-20 eCollection Date: 2025-01-01 DOI: 10.1093/bioadv/vbaf044
Mihai Pop, Teresa K Attwood, Judith A Blake, Philip E Bourne, Ana Conesa, Terry Gaasterland, Lawrence Hunter, Carl Kingsford, Oliver Kohlbacher, Thomas Lengauer, Scott Markel, Yves Moreau, William S Noble, Christine Orengo, B F Francis Ouellette, Laxmi Parida, Natasa Przulj, Teresa M Przytycka, Shoba Ranganathan, Russell Schwartz, Alfonso Valencia, Tandy Warnow
Summary: Modern biological research critically depends on public databases. The introduction and propagation of errors within and across databases can lead to wasted resources as scientists are led astray by bad data or have to conduct expensive validation experiments. The emergence of generative artificial intelligence systems threatens to compound this problem owing to the ease with which massive volumes of synthetic data can be generated. We provide an overview of several key issues that occur within the biological data ecosystem and make several recommendations aimed at reducing data errors and their propagation. We specifically highlight the critical importance of improved educational programs aimed at biologists and life scientists that emphasize best practices in data engineering. We also argue for increased theoretical and empirical research on data provenance, error propagation, and on understanding the impact of errors on analytic pipelines. Furthermore, we recommend enhanced funding for the stewardship and maintenance of public biological databases.
Availability and implementation: Not applicable.
Volume 5, issue 1, article vbaf044. Publication type: Journal Article. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11964588/pdf/
Citations: 0
S2Map: a novel computational platform for identifying secretio-types through cell secretion-signal map.
IF 2.4
Bioinformatics advances Pub Date : 2025-03-20 eCollection Date: 2025-01-01 DOI: 10.1093/bioadv/vbaf059
Zongliang Yue, Lang Zhou, Peizhen Sun, Xuejia Kang, Fengyuan Huang, Pengyu Chen
Motivation: Cell communication is predominantly governed by secreted proteins, whose diverse secretion patterns often signify underlying physiological irregularities. Understanding these secreted signals at an individual cell level is crucial for gaining insights into regulatory mechanisms involving various molecular agents. To elucidate the array of cell secretion signals, which encompass different types of biomolecular secretion cues from individual immune cells, we introduce the secretion-signal map (S2Map).
Results: S2Map is an online interactive analytical platform designed to explore and interpret distinct cell secretion-signal patterns visually. It incorporates two innovative qualitative metrics, the signal inequality index and the signal coverage index, which are exquisitely sensitive in measuring dissymmetry and diffusion of signals in temporal data. S2Map's innovation lies in its depiction of signals through time-series analysis with multi-layer visualization. We tested the SII and SCI performance in distinguishing the simulated signal diffusion models. S2Map hosts a repository for the single-cell's secretion-signal data for exploring cell secretio-types, a new cell phenotyping based on the cell secretion signal pattern. We anticipate that S2Map will be a powerful tool to delve into the complexities of physiological systems, providing insights into the regulation of protein production, such as cytokines at the remarkable resolution of single cells.
Availability and implementation: The S2Map server is publicly accessible via https://au-s2map.streamlit.app/.
Volume 5, issue 1, article vbaf059. Publication type: Journal Article. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11972122/pdf/
Citations: 0
Aggregating residue-level protein language model embeddings with optimal transport.
IF 2.4
Bioinformatics advances Pub Date : 2025-03-20 eCollection Date: 2025-01-01 DOI: 10.1093/bioadv/vbaf060
Navid NaderiAlizadeh, Rohit Singh
Motivation: Protein language models (PLMs) have emerged as powerful approaches for mapping protein sequences into embeddings suitable for various applications. As protein representation schemes, PLMs generate per-token (i.e. per-residue) representations, resulting in variable-sized outputs based on protein length. This variability poses a challenge for protein-level prediction tasks that require uniform-sized embeddings for consistent analysis across different proteins. Previous work has typically used average pooling to summarize token-level PLM outputs, but it is unclear whether this method effectively prioritizes the relevant information across token-level representations.
Results: We introduce a novel method utilizing optimal transport to convert variable-length PLM outputs into fixed-length representations. We conceptualize per-token PLM outputs as samples from a probabilistic distribution and employ sliced-Wasserstein distances to map these samples against a reference set, creating a Euclidean embedding in the output space. The resulting embedding is agnostic to the length of the input and represents the entire protein. We demonstrate the superiority of our method over average pooling for several downstream prediction tasks, particularly with constrained PLM sizes, enabling smaller-scale PLMs to match or exceed the performance of average-pooled larger-scale PLMs. Our aggregation scheme is especially effective for longer protein sequences by capturing essential information that might be lost through average pooling.
Availability and implementation: Our implementation code can be found at https://github.com/navid-naderi/PLM_SWE.
Volume 5, issue 1, article vbaf060. Publication type: Journal Article. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11961220/pdf/
Citations: 0
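The idea in the abstract above, turning a variable number of per-residue vectors into one fixed-length vector by comparing them to a reference set along random one-dimensional projections, can be sketched in plain Python. This is a simplified illustration and not the authors' PLM_SWE implementation. The function names, the fixed random reference and slices, and the linear quantile interpolation are all assumptions made for this example; a practical version would typically treat the reference points and slices as trainable parameters.

```python
import random


def make_slices(dim, num_slices, seed=0):
    """Draw random unit vectors (slices) used for 1-D projections."""
    rng = random.Random(seed)
    slices = []
    for _ in range(num_slices):
        d = [rng.gauss(0, 1) for _ in range(dim)]
        norm = sum(x * x for x in d) ** 0.5
        slices.append([x / norm for x in d])
    return slices


def _project_sorted(vectors, direction):
    """Project each vector onto the direction and sort the 1-D values."""
    return sorted(sum(v * d for v, d in zip(vec, direction)) for vec in vectors)


def _quantiles(sorted_values, m):
    """Linearly interpolate m evenly spaced quantiles from sorted values."""
    n = len(sorted_values)
    out = []
    for k in range(m):
        pos = k * (n - 1) / (m - 1) if m > 1 else (n - 1) / 2
        lo = int(pos)
        hi = min(lo + 1, n - 1)
        frac = pos - lo
        out.append(sorted_values[lo] * (1 - frac) + sorted_values[hi] * frac)
    return out


def swe_pool(tokens, reference, slices):
    """Pool per-token vectors into a vector of length len(slices) * len(reference).

    In each slice, the sorted token projections are resampled to the
    reference size and the sorted reference projections are subtracted,
    which is the 1-D optimal transport displacement to the reference.
    """
    m = len(reference)
    embedding = []
    for direction in slices:
        token_q = _quantiles(_project_sorted(tokens, direction), m)
        ref_sorted = _project_sorted(reference, direction)
        embedding.extend(t - r for t, r in zip(token_q, ref_sorted))
    return embedding
```

Unlike average pooling, which keeps only the mean of the token vectors, this output retains information about how the tokens are distributed along each slice. The output length depends only on the number of slices and reference points, not on protein length.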