Genomic Language Models: Opportunities and Challenges

ArXiv Pub Date: 2024-09-22

Gonzalo Benegas, Chengzhong Ye, Carlos Albors, Jianan Canal Li, Yun S Song

Abstract

Large language models (LLMs) are having transformative impacts across a wide range of scientific fields, particularly in the biomedical sciences. Just as the goal of Natural Language Processing is to understand sequences of words, a major objective in biology is to understand biological sequences. Genomic Language Models (gLMs), which are LLMs trained on DNA sequences, have the potential to significantly advance our understanding of genomes and how DNA elements at various scales interact to give rise to complex functions. To showcase this potential, we highlight key applications of gLMs, including functional constraint prediction, sequence design, and transfer learning. Despite notable recent progress, however, developing effective and efficient gLMs presents numerous challenges, especially for species with large, complex genomes. Here, we discuss major considerations for developing and evaluating gLMs.
