VitroBert: modeling DILI by pretraining BERT on in vitro data

IF 5.7 2区 化学 Q1 CHEMISTRY, MULTIDISCIPLINARY
Muhammad Arslan Masood, Anamya Ajjolli Nagaraja, Katia Belaid, Natalie Mesens, Hugo Ceulemans, Samuel Kaski, Dorota Herman, Markus Heinonen
{"title":"VitroBert: modeling DILI by pretraining BERT on in vitro data","authors":"Muhammad Arslan Masood,&nbsp;Anamya Ajjolli Nagaraja,&nbsp;Katia Belaid,&nbsp;Natalie Mesens,&nbsp;Hugo Ceulemans,&nbsp;Samuel Kaski,&nbsp;Dorota Herman,&nbsp;Markus Heinonen","doi":"10.1186/s13321-025-01048-7","DOIUrl":null,"url":null,"abstract":"<div><p>Drug-induced liver injury (DILI) presents a significant challenge due to its complexity, small datasets, and severe class imbalance. While unsupervised pretraining is a common approach to learn molecular representations for downstream tasks, it often lacks insights into how molecules interact with biological systems. We therefore introduce VitroBERT, a bidirectional encoder representations from transformers (BERT) model pretrained on large-scale in vitro assay profiles to generate biologically informed molecular embeddings. When leveraged to predict in vivo DILI endpoints, these embeddings delivered up to a 29% improvement in biochemistry-related tasks and a 16% gain in histopathology endpoints compared to unsupervised pretraining (MolBERT). However, no significant improvement was observed in clinical tasks. Furthermore, to address the critical issue of class imbalance, we evaluated multiple loss functions-including BCE, weighted BCE, Focal loss, and weighted Focal loss-and identified weighted Focal loss as the most effective. Our findings demonstrate the potential of integrating biological context into molecular models and highlight the importance of selecting appropriate loss functions in improving model performance of highly imbalanced DILI-related tasks. </p></div>","PeriodicalId":617,"journal":{"name":"Journal of Cheminformatics","volume":"17 1","pages":""},"PeriodicalIF":5.7000,"publicationDate":"2025-08-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://jcheminf.biomedcentral.com/counter/pdf/10.1186/s13321-025-01048-7","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of Cheminformatics","FirstCategoryId":"92","ListUrlMain":"https://link.springer.com/article/10.1186/s13321-025-01048-7","RegionNum":2,"RegionCategory":"化学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"CHEMISTRY, MULTIDISCIPLINARY","Score":null,"Total":0}
引用次数: 0

Abstract

Drug-induced liver injury (DILI) presents a significant challenge due to its complexity, small datasets, and severe class imbalance. While unsupervised pretraining is a common approach to learn molecular representations for downstream tasks, it often lacks insights into how molecules interact with biological systems. We therefore introduce VitroBERT, a bidirectional encoder representations from transformers (BERT) model pretrained on large-scale in vitro assay profiles to generate biologically informed molecular embeddings. When leveraged to predict in vivo DILI endpoints, these embeddings delivered up to a 29% improvement in biochemistry-related tasks and a 16% gain in histopathology endpoints compared to unsupervised pretraining (MolBERT). However, no significant improvement was observed in clinical tasks. Furthermore, to address the critical issue of class imbalance, we evaluated multiple loss functions-including BCE, weighted BCE, Focal loss, and weighted Focal loss-and identified weighted Focal loss as the most effective. Our findings demonstrate the potential of integrating biological context into molecular models and highlight the importance of selecting appropriate loss functions in improving model performance of highly imbalanced DILI-related tasks.

VitroBert:通过在体外数据上预训练BERT来建模DILI
药物性肝损伤(DILI)由于其复杂性、小数据集和严重的分类不平衡而面临重大挑战。虽然无监督预训练是学习下游任务分子表示的常用方法,但它通常缺乏对分子如何与生物系统相互作用的了解。因此,我们介绍了VitroBERT,这是一种双向编码器表示,来自变压器(BERT)模型,在大规模体外分析谱上进行预训练,以产生生物学信息的分子嵌入。当用于预测体内DILI终点时,与无监督预训练相比,这些嵌入在生物化学相关任务中提高了29%,在组织病理学终点上提高了16% (MolBERT)。然而,在临床任务中没有观察到明显的改善。此外,为了解决类不平衡的关键问题,我们评估了多个损失函数,包括BCE、加权BCE、Focal loss和加权Focal loss,并确定加权Focal loss是最有效的。我们的研究结果证明了将生物学背景整合到分子模型中的潜力,并强调了选择适当的损失函数在提高高度不平衡的dili相关任务的模型性能方面的重要性。我们介绍vitrobert -一个通过结合生物监督扩展传统分子预训练的一般框架。我们的研究结果表明,富含体外相互作用的分子嵌入优于仅基于化学数据的分子嵌入。这些发现强化了体外数据在获取生物学相关信息方面的价值,并强调了该模型利用这些特征来模拟DILI风险的能力。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 求助全文
来源期刊
Journal of Cheminformatics
Journal of Cheminformatics CHEMISTRY, MULTIDISCIPLINARY-COMPUTER SCIENCE, INFORMATION SYSTEMS
CiteScore
14.10
自引率
7.00%
发文量
82
审稿时长
3 months
期刊介绍: Journal of Cheminformatics is an open access journal publishing original peer-reviewed research in all aspects of cheminformatics and molecular modelling. Coverage includes, but is not limited to: chemical information systems, software and databases, and molecular modelling, chemical structure representations and their use in structure, substructure, and similarity searching of chemical substance and chemical reaction databases, computer and molecular graphics, computer-aided molecular design, expert systems, QSAR, and data mining techniques.
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:604180095
Book学术官方微信