Protein2Text: Resampling Mechanism to Translate Protein Sequences into Human-Interpretable Text.

Proceedings of the conference. Association for Computational Linguistics. North American Chapter. Meeting. Pub Date: 2025-04-01. DOI: 10.18653/v1/2025.naacl-industry.68

Ala Jararweh, Oladimeji Macaulay, David Arredondo, Yue Hu, Luis Tafoya, Kushal Virupakshappa, Avinash Sahu

Volume 2025, pages 918-937. Full text: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12281053/pdf/


Abstract

Proteins play critical roles in biological systems, yet 99.7% of the more than 227 million known protein sequences remain uncharacterized due to the limitations of experimental methods. To assist experimentalists in narrowing down hypotheses and accelerating protein characterization, we present Protein2Text, a multimodal large language model that interprets protein sequences and generates informative text to address open-ended questions about protein functions and attributes. By integrating a resampling mechanism within an adapted LLaVA framework, our model maps protein sequences into a language-compatible space, enhancing its capability to handle diverse and complex queries. Trained on a newly curated dataset derived from PubMed articles and evaluated on four benchmarks, including in-domain and cross-domain evaluations, Protein2Text outperforms several existing models in open-ended question-answering tasks. Our work also highlights the limitations of current evaluation metrics when applied to template-based approaches: such metrics may produce misleading results, which underscores the need for unbiased assessment methods. Our model weights, evaluation datasets, and evaluation scripts are publicly available at https://github.com/alaaj27/Protein2Text.git.
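
The concern about evaluation metrics and template-based outputs can be illustrated with a minimal sketch. It assumes a simple unigram-overlap F1 as a stand-in for lexical-overlap metrics in general; the abstract does not name the metrics examined, and the function `token_f1`, the reference answer, and both model outputs below are hypothetical and not taken from the paper or its benchmarks.

```python
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Unigram-overlap F1 between two lowercased, whitespace-tokenized strings.

    Illustrative stand-in for a lexical-overlap metric (an assumption,
    not the metric used in the paper).
    """
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Multiset intersection counts each shared token at most as often
    # as it appears in both strings.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# Hypothetical reference answer and two hypothetical model outputs.
reference = "This protein is located in the mitochondrion"
templated_wrong = "This protein is located in the nucleus"  # shares the template, wrong fact
free_form_right = "It localizes to mitochondria"            # right fact, different wording

score_wrong = token_f1(templated_wrong, reference)
score_right = token_f1(free_form_right, reference)
print(f"templated but wrong: {score_wrong:.2f}")
print(f"free-form but right: {score_right:.2f}")
```

In this toy case the answer that reuses the reference template scores far higher than the factually correct free-form answer, which is one way an overlap-based score could mislead when outputs follow a fixed template.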
