VitroBert: modeling DILI by pretraining BERT on in vitro data

IF 5.7 | Zone 2 (Chemistry) | Q1 CHEMISTRY, MULTIDISCIPLINARY

Journal of Cheminformatics Pub Date : 2025-08-06 DOI:10.1186/s13321-025-01048-7

Muhammad Arslan Masood, Anamya Ajjolli Nagaraja, Katia Belaid, Natalie Mesens, Hugo Ceulemans, Samuel Kaski, Dorota Herman, Markus Heinonen


Citations: 0

Abstract

Drug-induced liver injury (DILI) is difficult to model because of its complexity, small datasets, and severe class imbalance. Unsupervised pretraining is a common approach to learning molecular representations for downstream tasks, but it often lacks insight into how molecules interact with biological systems. We therefore introduce VitroBERT, a bidirectional encoder representations from transformers (BERT) model pretrained on large-scale in vitro assay profiles to generate biologically informed molecular embeddings. When used to predict in vivo DILI endpoints, these embeddings delivered up to a 29% improvement on biochemistry-related tasks and a 16% gain on histopathology endpoints compared with unsupervised pretraining (MolBERT). However, no significant improvement was observed on clinical tasks. Furthermore, to address the critical issue of class imbalance, we evaluated multiple loss functions (binary cross-entropy (BCE), weighted BCE, focal loss, and weighted focal loss) and identified weighted focal loss as the most effective. Our findings demonstrate the potential of integrating biological context into molecular models and highlight the importance of selecting an appropriate loss function for improving model performance on highly imbalanced DILI-related tasks.
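The four loss functions named in the abstract can be viewed as special cases of a single expression, which the following minimal sketch illustrates. It is not the authors' implementation: the function name `weighted_focal_loss`, the use of a single positive-class weight `pos_weight`, and the focusing parameter `gamma` are assumptions made for illustration, following the standard focal loss form `-w * (1 - p_t)^gamma * log(p_t)`. The weighting scheme and hyperparameters used in the paper are not stated in the abstract.

```python
import numpy as np


def weighted_focal_loss(y_true, p_pred, pos_weight=1.0, gamma=0.0, eps=1e-7):
    """Mean binary loss covering four variants (illustrative sketch only).

    gamma == 0, pos_weight == 1  -> plain BCE
    gamma == 0, pos_weight != 1  -> weighted BCE
    gamma  > 0, pos_weight == 1  -> focal loss
    gamma  > 0, pos_weight != 1  -> weighted focal loss
    """
    y = np.asarray(y_true, dtype=float)
    # Clip probabilities so that log() never receives exactly 0.
    p = np.clip(np.asarray(p_pred, dtype=float), eps, 1.0 - eps)
    # p_t is the probability the model assigns to the true class.
    p_t = np.where(y == 1, p, 1.0 - p)
    # Positive (minority) examples are scaled by pos_weight (assumed scheme).
    w = np.where(y == 1, pos_weight, 1.0)
    # (1 - p_t)^gamma down-weights examples that are already well classified.
    return float(np.mean(-w * (1.0 - p_t) ** gamma * np.log(p_t)))


if __name__ == "__main__":
    # Toy, made-up labels and predictions: one positive among ten compounds.
    y = np.array([1, 0, 0, 0, 0, 0, 0, 0, 0, 0])
    p = np.array([0.3, 0.1, 0.2, 0.05, 0.1, 0.4, 0.1, 0.2, 0.1, 0.05])
    ratio = (y == 0).sum() / (y == 1).sum()  # negatives per positive
    print("BCE:           ", weighted_focal_loss(y, p))
    print("weighted BCE:  ", weighted_focal_loss(y, p, pos_weight=ratio))
    print("focal:         ", weighted_focal_loss(y, p, gamma=2.0))
    print("weighted focal:", weighted_focal_loss(y, p, pos_weight=ratio, gamma=2.0))
```

In this sketch the class weight raises the contribution of the rare positive class, while the focusing term shrinks the contribution of easy examples; combining the two is one plausible reading of why a weighted focal loss could help on highly imbalanced endpoints.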


We introduce VitroBERT, a general framework that extends conventional molecular pretraining by incorporating biological supervision. Our results show that molecular embeddings enriched with in vitro interactions outperform embeddings based on chemical data alone. These findings reinforce the value of in vitro data for capturing biologically relevant information and highlight the model's ability to use these features to model DILI risk.



Source journal

Journal of Cheminformatics CHEMISTRY, MULTIDISCIPLINARY-COMPUTER SCIENCE, INFORMATION SYSTEMS

CiteScore

14.10

Self-citation rate

7.00%

Review time

3 months

Journal description: Journal of Cheminformatics is an open access journal publishing original peer-reviewed research in all aspects of cheminformatics and molecular modelling. Coverage includes, but is not limited to: chemical information systems, software, and databases; molecular modelling; chemical structure representations and their use in structure, substructure, and similarity searching of chemical substance and chemical reaction databases; computer and molecular graphics; computer-aided molecular design; expert systems; QSAR; and data mining techniques.