Accelerated prediction of molecular properties for per- and polyfluoroalkyl substances using graph neural networks with adjacency-free message passing

IF 7.3, Zone 2 (Environmental Science and Ecology), Q1 ENVIRONMENTAL SCIENCES
Hector Medina, Rachel Drake, Carson Farmer
Environmental Pollution, volume 382, Article 126705. Published 2025-06-30. DOI: 10.1016/j.envpol.2025.126705
https://www.sciencedirect.com/science/article/pii/S0269749125010784
Citations: 0

Abstract

The chemical space of molecular contaminants is vast, necessitating methods and tools that accelerate the computation of molecular properties, support the study of interactions, and ultimately aid the engineering of technological solutions for environmental remediation and exposome reduction. Graph neural networks (GNNs) offer a promising approach because their structure mirrors that of molecular graphs and they can learn complex relationships through graph-based representations. However, GNN-based model training can be computationally expensive, especially for large molecular datasets. In this work, we evaluated the predictive performance of a novel Graph-Enhanced multilayer perceptron (GE-MLP) on molecular properties of per- and polyfluoroalkyl substances (PFAS) and compared it against two traditional GNN-based architectures, namely Graph Convolutional Networks (GCN) and Graph Attention Networks (GAT). The GE-MLP architecture, which incorporates structural information into a dense neural network framework, was trained and validated on a dataset of 15,000 PFAS, generated using tight-binding methods and calibrated against experimental results. The targeted properties were electron affinity (EA), ionization potential (IP), and HOMO–LUMO gap (HL). In contrast to traditional graph-based architectures, GE-MLP operates in a purely feedforward manner, embedding structural information through molecular fingerprints and node-level descriptors in place of adjacency-based message passing. Our findings reinforce the usefulness of graph-based architectures, compared with traditional machine learning (ML) models, in predicting molecular properties of complex contaminants such as PFAS.
Furthermore, the GE-MLP emerged as a strong GNN-based contender, achieving the highest predictive performance for IP, which suggests that integrating structural information into dense neural networks via atomic and fingerprint-based molecular descriptors offers a viable alternative to adjacency-based message-passing methods. Finally, our GE-MLP provides a computationally efficient alternative to other GNN-based methods owing to savings in model training, offering a scalable, message-passing-free approach to molecular property prediction while retaining structural awareness. Future work includes expanding the dataset to 3.5 million fluorinated compounds to improve generalization, as well as architectural improvements, including transfer learning, topological embeddings, and hybrid models, to further advance predictive accuracy and applicability.
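
The contrast the abstract draws between adjacency-based message passing and a purely feedforward model can be illustrated with a small sketch. This is not the authors' implementation: the toy molecule, the descriptor values, the 8-bit fingerprint, the random weights, and the choice of mean pooling to combine node-level descriptors are all assumptions made for illustration. The GCN layer uses the standard symmetric-normalized propagation rule, while the GE-MLP-style path never touches the adjacency matrix.

```python
import math
import random

def relu(v):
    return [max(0.0, x) for x in v]

def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def random_matrix(rows, cols, rng):
    """Untrained placeholder weights (hypothetical, for shape checking only)."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

def gcn_layer(H, A, W):
    """One adjacency-based layer: relu(D^-1/2 (A + I) D^-1/2 H W)."""
    n = len(A)
    A_hat = [[A[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in A_hat]
    A_norm = [[A_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
              for i in range(n)]
    return [relu(row) for row in matmul(matmul(A_norm, H), W)]

def ge_mlp_input(H, fingerprint):
    """Adjacency-free input: mean-pooled node descriptors plus fingerprint bits.
    Mean pooling is an assumption; the abstract does not state the pooling used."""
    n = len(H)
    pooled = [sum(col) / n for col in zip(*H)]
    return pooled + list(fingerprint)

def mlp_forward(x, layers):
    """Plain feedforward pass: relu on hidden layers, linear output layer."""
    h = [x]
    for i, W in enumerate(layers):
        h = matmul(h, W)
        if i < len(layers) - 1:
            h = [relu(h[0])]
    return h[0]

rng = random.Random(0)

# Toy 4-atom fragment (C bonded to F, F, C); values are illustrative only.
# Node descriptors: [atomic number / 10, degree / 4, is_fluorine]
node_features = [
    [0.6, 0.75, 0.0],  # C0, three bonds
    [0.9, 0.25, 1.0],  # F1
    [0.9, 0.25, 1.0],  # F2
    [0.6, 0.25, 0.0],  # C3
]
adjacency = [
    [0, 1, 1, 1],
    [1, 0, 0, 0],
    [1, 0, 0, 0],
    [1, 0, 0, 0],
]
fingerprint = [1, 0, 1, 1, 0, 0, 1, 0]  # hypothetical 8-bit fingerprint

# Adjacency-based path: needs the graph structure at every layer.
gcn_out = gcn_layer(node_features, adjacency, random_matrix(3, 4, rng))

# GE-MLP-style path: structure enters only through the input vector.
x = ge_mlp_input(node_features, fingerprint)
layers = [random_matrix(len(x), 16, rng), random_matrix(16, 3, rng)]
prediction = mlp_forward(x, layers)  # three outputs standing in for EA, IP, HL

print(len(x), len(prediction))
```

Because the feedforward path sees one fixed-length vector per molecule, it needs no per-layer neighbor aggregation, which is in line with the training savings the abstract reports. The pooled input is also unchanged when atoms are reordered, so no canonical atom ordering is required in this sketch.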

Source journal: Environmental Pollution (Environmental Sciences)
CiteScore: 16.00
Self-citation rate: 6.70%
Articles published: 2082
Review time: 2.9 months
About the journal: Environmental Pollution is an international peer-reviewed journal that publishes high-quality research papers and review articles covering all aspects of environmental pollution and its impacts on ecosystems and human health. Subject areas include, but are not limited to:
• Sources and occurrences of pollutants that are clearly defined and measured in environmental compartments, food and food-related items, and human bodies;
• Interlinks between contaminant exposure and biological, ecological, and human health effects, including those of climate change;
• Contaminants of emerging concern (including but not limited to antibiotic-resistant microorganisms or genes, microplastics/nanoplastics, electronic wastes, light, and noise) and/or their biological, ecological, or human health effects;
• Laboratory and field studies on the remediation/mitigation of environmental pollution via new techniques and with clear links to biological, ecological, or human health effects;
• Modeling of pollution processes, patterns, or trends that is of clear environmental and/or human health interest;
• New techniques that measure and examine environmental occurrences, transport, behavior, and effects of pollutants within the environment or the laboratory, provided that they can be clearly used to address problems within regional or global environmental compartments.