Evaluating the generalizability of graph neural networks for predicting collision cross section

IF 5.7 2区化学 Q1 CHEMISTRY, MULTIDISCIPLINARY

Journal of Cheminformatics Pub Date : 2024-08-29 DOI:10.1186/s13321-024-00899-w

Chloe Engler Hart, António José Preto, Shaurya Chanana, David Healey, Tobias Kind, Daniel Domingo-Fernández

{"title":"Evaluating the generalizability of graph neural networks for predicting collision cross section","authors":"Chloe Engler Hart, António José Preto, Shaurya Chanana, David Healey, Tobias Kind, Daniel Domingo-Fernández","doi":"10.1186/s13321-024-00899-w","DOIUrl":null,"url":null,"abstract":"<div><p>Ion Mobility coupled with Mass Spectrometry (IM-MS) is a promising analytical technique that enhances molecular characterization by measuring collision cross-section (CCS) values, which are indicative of the molecular size and shape. However, the effective application of CCS values in structural analysis is still constrained by the limited availability of experimental data, necessitating the development of accurate machine learning (ML) models for in silico predictions. In this study, we evaluated state-of-the-art Graph Neural Networks (GNNs), trained to predict CCS values using the largest publicly available dataset to date. Although our results confirm the high accuracy of these models within chemical spaces similar to their training environments, their performance significantly declines when applied to structurally novel regions. This discrepancy raises concerns about the reliability of in silico CCS predictions and underscores the need for releasing further publicly available CCS datasets. To mitigate this, we introduce Mol2CCS which demonstrates how generalization can be partially improved by extending models to account for additional features such as molecular fingerprints, descriptors, and the molecule types. Lastly, we also show how confidence models can support by enhancing the reliability of the CCS estimates.</p><p><b>Scientific contribution</b></p><p>We have benchmarked state-of-the-art graph neural networks for predicting collision cross section. Our work highlights the accuracy of these models when trained and predicted in similar chemical spaces, but also how their accuracy drops when evaluated in structurally novel regions. Lastly, we conclude by presenting potential approaches to mitigate this issue.</p></div>","PeriodicalId":617,"journal":{"name":"Journal of Cheminformatics","volume":"16 1","pages":""},"PeriodicalIF":5.7000,"publicationDate":"2024-08-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://jcheminf.biomedcentral.com/counter/pdf/10.1186/s13321-024-00899-w","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of Cheminformatics","FirstCategoryId":"92","ListUrlMain":"https://link.springer.com/article/10.1186/s13321-024-00899-w","RegionNum":2,"RegionCategory":"化学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"CHEMISTRY, MULTIDISCIPLINARY","Score":null,"Total":0}

引用次数: 0

Abstract

Ion Mobility coupled with Mass Spectrometry (IM-MS) is a promising analytical technique that enhances molecular characterization by measuring collision cross-section (CCS) values, which are indicative of the molecular size and shape. However, the effective application of CCS values in structural analysis is still constrained by the limited availability of experimental data, necessitating the development of accurate machine learning (ML) models for in silico predictions. In this study, we evaluated state-of-the-art Graph Neural Networks (GNNs), trained to predict CCS values using the largest publicly available dataset to date. Although our results confirm the high accuracy of these models within chemical spaces similar to their training environments, their performance significantly declines when applied to structurally novel regions. This discrepancy raises concerns about the reliability of in silico CCS predictions and underscores the need for releasing further publicly available CCS datasets. To mitigate this, we introduce Mol2CCS which demonstrates how generalization can be partially improved by extending models to account for additional features such as molecular fingerprints, descriptors, and the molecule types. Lastly, we also show how confidence models can support by enhancing the reliability of the CCS estimates.

Scientific contribution

We have benchmarked state-of-the-art graph neural networks for predicting collision cross section. Our work highlights the accuracy of these models when trained and predicted in similar chemical spaces, but also how their accuracy drops when evaluated in structurally novel regions. Lastly, we conclude by presenting potential approaches to mitigate this issue.

查看原文本刊更多论文

评估图神经网络预测碰撞截面的通用性

离子迁移率与质谱联用（IM-MS）是一种很有前途的分析技术，它通过测量碰撞截面（CCS）值来提高分子表征能力，而碰撞截面值是分子大小和形状的指标。然而，由于实验数据有限，CCS 值在结构分析中的有效应用仍然受到限制，因此有必要开发精确的机器学习（ML）模型进行硅学预测。在本研究中，我们评估了最先进的图神经网络（GNN），并使用迄今为止最大的公开可用数据集对其进行了训练，以预测 CCS 值。尽管我们的结果证实了这些模型在与其训练环境相似的化学空间内具有很高的准确性，但当它们应用于结构新颖的区域时，其性能却明显下降。这种差异引起了人们对硅学 CCS 预测可靠性的担忧，并强调了进一步发布公开 CCS 数据集的必要性。为了缓解这一问题，我们引入了 Mol2CCS，它展示了如何通过扩展模型来考虑分子指纹、描述符和分子类型等额外特征，从而部分提高通用性。最后，我们还展示了置信模型如何通过提高 CCS 估计值的可靠性来提供支持。科学贡献我们对用于预测碰撞截面的最先进图神经网络进行了基准测试。我们的工作强调了这些模型在类似化学空间中进行训练和预测时的准确性，但也强调了在结构新颖的区域中进行评估时其准确性是如何下降的。最后，我们提出了缓解这一问题的潜在方法。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文求助全文

来源期刊

Journal of Cheminformatics CHEMISTRY, MULTIDISCIPLINARY-COMPUTER SCIENCE, INFORMATION SYSTEMS

CiteScore

14.10

自引率

7.00%

发文量

审稿时长

3 months

期刊介绍： Journal of Cheminformatics is an open access journal publishing original peer-reviewed research in all aspects of cheminformatics and molecular modelling. Coverage includes, but is not limited to: chemical information systems, software and databases, and molecular modelling, chemical structure representations and their use in structure, substructure, and similarity searching of chemical substance and chemical reaction databases, computer and molecular graphics, computer-aided molecular design, expert systems, QSAR, and data mining techniques.