Scaling down for efficiency: Medium-sized protein language models perform well at transfer learning on realistic datasets.

bioRxiv: the preprint server for biology. Pub Date: 2025-01-28. DOI: 10.1101/2024.11.22.624936

Luiz C Vieira, Morgan L Handojo, Claus O Wilke



Abstract

Protein language models (pLMs) can offer deep insights into evolutionary and structural properties of proteins. While larger models, such as the 15-billion-parameter ESM-2 model, promise to capture more complex patterns in sequence space, they also present practical challenges due to their high dimensionality and high computational cost. We systematically evaluated the performance of various pLMs across multiple biological datasets to assess the impact of model size on transfer learning. Surprisingly, we found that larger models do not necessarily outperform smaller ones, in particular when data is limited. Medium-sized models, such as ESM-2 650M and ESM C 600M, demonstrated consistently good performance, falling only slightly behind their larger counterparts (ESM-2 15B and ESM C 6B) despite being many times smaller. Additionally, we compared various methods of compressing embeddings prior to transfer learning, and we found that mean embeddings consistently outperformed other compression methods. In summary, ESM C 600M with mean embeddings offers an optimal balance between performance and efficiency, making it a practical and scalable choice for transfer learning in realistic biological applications.
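The abstract does not spell out how mean embeddings are computed. The following is a minimal sketch, assuming that a mean embedding is the average of a protein's per-residue vectors over sequence positions, so that proteins of different lengths map to feature vectors of the same size. The function name `mean_embedding` and the toy numbers are illustrative only and are not taken from the paper; real pLM outputs would have hundreds or thousands of dimensions per residue.

```python
from typing import List, Sequence


def mean_embedding(per_residue: Sequence[Sequence[float]]) -> List[float]:
    """Average per-residue vectors (length x dim) into one vector (dim)."""
    if not per_residue:
        raise ValueError("sequence has no residues")
    dim = len(per_residue[0])
    if any(len(row) != dim for row in per_residue):
        raise ValueError("all residue vectors must have the same dimension")
    length = len(per_residue)
    # Average each embedding dimension across all sequence positions.
    return [sum(row[j] for row in per_residue) / length for j in range(dim)]


# Toy stand-ins for pLM outputs (hypothetical values, 3 dimensions each).
short_protein = [[1.0, 2.0, 3.0], [3.0, 4.0, 5.0]]                    # 2 residues
long_protein = [[0.0, 0.0, 6.0], [3.0, 0.0, 0.0], [0.0, 3.0, 0.0]]    # 3 residues

# Each protein becomes one fixed-length row, regardless of its length.
features = [mean_embedding(p) for p in (short_protein, long_protein)]
print(features)
```

The resulting fixed-length rows could then serve as input features for a downstream regression or classification model, which is the transfer learning setting the abstract describes. In practice the same averaging would typically be done on array or tensor outputs rather than nested lists.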


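Scaling down for efficiency: Medium-sized transformer models for transfer learning on protein sequences.

Protein language models, such as the transformer-based Evolutionary Scale Modeling 2 (ESM2), can offer deep insights into the evolutionary and structural properties of proteins. While larger models, such as ESM2 15B, promise to capture more complex patterns in sequence space, they also present practical challenges due to their high dimensionality and high computational cost. We systematically evaluated the performance of all ESM2 models across many biological datasets to determine the impact of model size on transfer learning. Surprisingly, larger models do not always outperform smaller ones, especially when data is limited. Medium-sized models, such as ESM2 650M, showed consistent performance, falling only slightly behind the 15B-parameter model despite being more than 20 times smaller. In addition, we compared various embedding compression methods to identify the most effective one, and we found that mean embeddings consistently outperformed other compression methods. Our results show that ESM2 650M with mean embeddings offers an optimal balance between performance and efficiency, making it a practical and scalable choice for transfer learning across a variety of biological applications.

This study challenges the common belief that larger language models always yield better results, including in protein biochemistry. By systematically comparing transformer models of different sizes on transfer learning tasks, we show that medium-sized models, such as ESM2 650M, frequently perform as well as larger variants, particularly when data is limited. These findings provide a more efficient strategy for machine-learning-based protein analysis and promote the broader adoption of AI in biology. Smaller, more efficient models help democratize advanced machine learning tools, making them more accessible to researchers with limited computational resources.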

