Tianhao Yu, Cai Yao, Zhuorui Sun, Feng Shi, Lin Zhang, Kangjie Lyu, Xuan Bai, Andong Liu, Xicheng Zhang, Jiali Zou, Wenshou Wang, Chris Lai, Kai Wang
{"title":"LipidBERT: A Lipid Language Model Pre-trained on METiS de novo Lipid Library","authors":"Tianhao Yu, Cai Yao, Zhuorui Sun, Feng Shi, Lin Zhang, Kangjie Lyu, Xuan Bai, Andong Liu, Xicheng Zhang, Jiali Zou, Wenshou Wang, Chris Lai, Kai Wang","doi":"arxiv-2408.06150","DOIUrl":null,"url":null,"abstract":"In this study, we generate and maintain a database of 10 million virtual\nlipids through METiS's in-house de novo lipid generation algorithms and lipid\nvirtual screening techniques. These virtual lipids serve as a corpus for\npre-training, lipid representation learning, and downstream task knowledge\ntransfer, culminating in state-of-the-art LNP property prediction performance.\nWe propose LipidBERT, a BERT-like model pre-trained with the Masked Language\nModel (MLM) and various secondary tasks. Additionally, we compare the\nperformance of embeddings generated by LipidBERT and PhatGPT, our GPT-like\nlipid generation model, on downstream tasks. The proposed bilingual LipidBERT\nmodel operates in two languages: the language of ionizable lipid pre-training,\nusing in-house dry-lab lipid structures, and the language of LNP fine-tuning,\nutilizing in-house LNP wet-lab data. This dual capability positions LipidBERT\nas a key AI-based filter for future screening tasks, including new versions of\nMETiS de novo lipid libraries and, more importantly, candidates for in vivo\ntesting for orgran-targeting LNPs. To the best of our knowledge, this is the\nfirst successful demonstration of the capability of a pre-trained language\nmodel on virtual lipids and its effectiveness in downstream tasks using web-lab\ndata. This work showcases the clever utilization of METiS's in-house de novo\nlipid library as well as the power of dry-wet lab integration.","PeriodicalId":501022,"journal":{"name":"arXiv - QuanBio - Biomolecules","volume":"11 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-08-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - QuanBio - Biomolecules","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2408.06150","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0
Abstract
In this study, we generate and maintain a database of 10 million virtual lipids through METiS's in-house de novo lipid generation algorithms and lipid virtual screening techniques. These virtual lipids serve as a corpus for pre-training, lipid representation learning, and downstream task knowledge transfer, culminating in state-of-the-art LNP property prediction performance. We propose LipidBERT, a BERT-like model pre-trained with the Masked Language Model (MLM) objective and various secondary tasks. Additionally, we compare the performance of embeddings generated by LipidBERT and PhatGPT, our GPT-like lipid generation model, on downstream tasks. The proposed bilingual LipidBERT model operates in two languages: the language of ionizable lipid pre-training, using in-house dry-lab lipid structures, and the language of LNP fine-tuning, utilizing in-house LNP wet-lab data. This dual capability positions LipidBERT as a key AI-based filter for future screening tasks, including new versions of METiS de novo lipid libraries and, more importantly, candidates for in vivo testing of organ-targeting LNPs. To the best of our knowledge, this is the first successful demonstration of a language model pre-trained on virtual lipids and of its effectiveness in downstream tasks using wet-lab data. This work showcases the effective utilization of METiS's in-house de novo lipid library as well as the power of dry-wet lab integration.
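To make the described workflow concrete, below is a minimal, illustrative sketch of BERT-style MLM pre-training on lipid-like strings using Hugging Face `transformers`. Everything in it is an assumption for illustration: the example SMILES strings, tokenizer choice, model size, and output paths are placeholders, and the actual LipidBERT corpus, tokenizer, and secondary pre-training tasks are not public.

```python
# Illustrative sketch only: a small BERT masked-language-model pre-training loop
# on hypothetical SMILES-like lipid strings. Not the authors' implementation.
from transformers import (
    BertConfig,
    BertForMaskedLM,
    BertTokenizerFast,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)
from datasets import Dataset

# Hypothetical virtual-lipid corpus; a real corpus would hold millions of entries
# and would likely use a chemistry-aware (e.g. character-level SMILES) tokenizer.
virtual_lipids = [
    "CCCCCCCCCCOC(=O)CCN(C)CCC(=O)OCCCCCCCCCC",
    "CCCCCCCC/C=C\\CCCCCCCC(=O)OCCN(C)C",
]

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")  # placeholder vocab
dataset = Dataset.from_dict({"text": virtual_lipids}).map(
    lambda ex: tokenizer(ex["text"], truncation=True, max_length=128),
    batched=True,
    remove_columns=["text"],
)

# A deliberately small BERT configuration trained from scratch.
config = BertConfig(
    vocab_size=tokenizer.vocab_size,
    hidden_size=256,
    num_hidden_layers=4,
    num_attention_heads=4,
)
model = BertForMaskedLM(config)

# Standard 15% random token masking for the MLM objective.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="lipidbert-mlm-sketch",
        num_train_epochs=1,
        per_device_train_batch_size=2,
        report_to=[],
    ),
    train_dataset=dataset,
    data_collator=collator,
)
trainer.train()
```

After such pre-training, the abstract's downstream step would correspond to extracting the encoder's embeddings (or attaching a regression head) and fine-tuning on wet-lab LNP property measurements; the details of those secondary tasks and fine-tuning data are specific to the paper and not reproduced here.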