LipidBERT: A Lipid Language Model Pre-trained on METiS de novo Lipid Library
Tianhao Yu, Cai Yao, Zhuorui Sun, Feng Shi, Lin Zhang, Kangjie Lyu, Xuan Bai, Andong Liu, Xicheng Zhang, Jiali Zou, Wenshou Wang, Chris Lai, Kai Wang
arXiv:2408.06150 (arXiv - QuanBio - Biomolecules), published 2024-08-12

In this study, we generate and maintain a database of 10 million virtual lipids through METiS's in-house de novo lipid generation algorithms and lipid virtual screening techniques. These virtual lipids serve as a corpus for pre-training, lipid representation learning, and downstream task knowledge transfer, culminating in state-of-the-art LNP property prediction performance.
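The paper does not disclose how its virtual lipids are represented or tokenized. As a hedged illustration of how such a library could be turned into pre-training input, the sketch below trains a small subword vocabulary over SMILES strings; the file name metis_lipids.txt, the BPE scheme, and all hyperparameters are assumptions for illustration, not details from the paper.

```python
# Minimal sketch: building a vocabulary over a virtual-lipid SMILES corpus.
# Assumes one SMILES string per line in "metis_lipids.txt" (hypothetical file).
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
trainer = trainers.BpeTrainer(
    vocab_size=1024,  # small; SMILES alphabets are compact
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],
)
tokenizer.train(["metis_lipids.txt"], trainer)
tokenizer.save("lipid_tokenizer.json")
```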
We propose LipidBERT, a BERT-like model pre-trained with a masked language modeling (MLM) objective and various secondary tasks. Additionally, we compare the performance of embeddings generated by LipidBERT and by PhatGPT, our GPT-like lipid generation model, on downstream tasks.
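A minimal sketch of the BERT-style MLM pre-training step, continuing from the tokenizer sketch above, is given below. LipidBERT's model size, masking rate, and secondary tasks are not specified in the abstract; a small BertForMaskedLM with the standard 15% masking stands in here.

```python
# Hedged sketch of MLM pre-training on lipid SMILES; sizes are placeholders.
import torch
from transformers import (BertConfig, BertForMaskedLM,
                          DataCollatorForLanguageModeling,
                          PreTrainedTokenizerFast)

tok = PreTrainedTokenizerFast(
    tokenizer_file="lipid_tokenizer.json",  # produced by the sketch above
    pad_token="[PAD]", unk_token="[UNK]", cls_token="[CLS]",
    sep_token="[SEP]", mask_token="[MASK]",
)
config = BertConfig(vocab_size=tok.vocab_size, hidden_size=256,
                    num_hidden_layers=4, num_attention_heads=4)
model = BertForMaskedLM(config)

# Standard BERT masking: 15% of tokens selected, 80/10/10 mask/random/keep.
collator = DataCollatorForLanguageModeling(tok, mlm=True, mlm_probability=0.15)

smiles = ["CCCCCCCCCCCC(=O)OCCN(C)C"]  # toy stand-in for the 10M-lipid library
batch = collator([tok(s, truncation=True, max_length=128) for s in smiles])
loss = model(**batch).loss   # cross-entropy on the masked positions
loss.backward()              # an optimizer step would follow in a full loop
```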
The proposed bilingual LipidBERT model operates in two languages: the language of ionizable lipid pre-training, using in-house dry-lab lipid structures, and the language of LNP fine-tuning, utilizing in-house LNP wet-lab data. This dual capability positions LipidBERT as a key AI-based filter for future screening tasks, including new versions of the METiS de novo lipid libraries and, more importantly, candidates for in vivo testing of organ-targeting LNPs.
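How the wet-lab fine-tuning "language" is encoded is likewise not public. Continuing the sketches above, the code below assumes a single scalar LNP property regressed from the mean-pooled encoder states; the same pooled embeddings, kept frozen, are also what one would compare against PhatGPT's embeddings in the downstream-task comparison mentioned earlier. The head architecture and the toy label are assumptions.

```python
# Hedged sketch of fine-tuning the pre-trained encoder on a wet-lab label.
import torch
import torch.nn as nn
from transformers import BertModel

class LNPPropertyRegressor(nn.Module):
    """Pre-trained lipid encoder plus a linear head for one LNP property."""
    def __init__(self, encoder):
        super().__init__()
        self.encoder = encoder
        self.head = nn.Linear(encoder.config.hidden_size, 1)

    def forward(self, input_ids, attention_mask):
        hidden = self.encoder(input_ids=input_ids,
                              attention_mask=attention_mask).last_hidden_state
        mask = attention_mask.unsqueeze(-1).float()
        pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1)  # mean over real tokens
        return self.head(pooled).squeeze(-1)

encoder = BertModel(config)  # in practice, load the MLM-pre-trained weights
model = LNPPropertyRegressor(encoder)
opt = torch.optim.AdamW(model.parameters(), lr=1e-5)

batch = tok(smiles, padding=True, return_tensors="pt")
target = torch.tensor([0.7])  # toy wet-lab label (e.g., a normalized readout)
loss = nn.functional.mse_loss(
    model(batch["input_ids"], batch["attention_mask"]), target)
loss.backward()
opt.step()
```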
To the best of our knowledge, this is the first successful demonstration of the capability of a pre-trained language model on virtual lipids and its effectiveness in downstream tasks using wet-lab data. This work showcases the clever utilization of METiS's in-house de novo lipid library, as well as the power of dry-wet lab integration.