Sort & Slice: a simple and superior alternative to hash-based folding for extended-connectivity fingerprints

IF 7.1, Zone 2 (Chemistry), Q1 CHEMISTRY, MULTIDISCIPLINARY

Journal of Cheminformatics Pub Date : 2024-12-03 DOI:10.1186/s13321-024-00932-y

Markus Dablander, Thierry Hanser, Renaud Lambiotte, Garrett M. Morris



Abstract

Extended-connectivity fingerprints (ECFPs) are a ubiquitous tool in current cheminformatics and molecular machine learning, and one of the most prevalent molecular feature extraction techniques used for chemical prediction. Atom features learned by graph neural networks can be aggregated to compound-level representations using a large spectrum of graph pooling methods. In contrast, sets of detected ECFP substructures are by default transformed into bit vectors using only a simple hash-based folding procedure. We introduce a general mathematical framework for the vectorisation of structural fingerprints via a formal operation called substructure pooling that encompasses hash-based folding, algorithmic substructure selection, and a wide variety of other potential techniques. We go on to describe Sort & Slice, an easy-to-implement and bit-collision-free alternative to hash-based folding for the pooling of ECFP substructures. Sort & Slice first sorts ECFP substructures according to their relative prevalence in a given set of training compounds and then slices away all but the L most frequent substructures which are subsequently used to generate a binary fingerprint of desired length, L. We computationally compare the performance of hash-based folding, Sort & Slice, and two advanced supervised substructure-selection schemes (filtering and mutual-information maximisation) for ECFP-based molecular property prediction. Our results indicate that, despite its technical simplicity, Sort & Slice robustly (and at times substantially) outperforms traditional hash-based folding as well as the other investigated substructure-pooling methods across distinct prediction tasks, data splitting techniques, machine-learning models and ECFP hyperparameters. We thus recommend that Sort & Slice canonically replace hash-based folding as the default substructure-pooling technique to vectorise ECFPs for supervised molecular machine learning.
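The Sort & Slice procedure described in the abstract can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes each compound has already been reduced to a set of integer ECFP substructure identifiers (for example by a cheminformatics toolkit), and the names `fit_sort_and_slice` and `featurise` are hypothetical. Breaking ties by identifier value is also an assumption made here only to keep the output deterministic.

```python
from collections import Counter
from typing import Callable, Dict, Iterable, List, Set


def fit_sort_and_slice(
    train_sets: Iterable[Set[int]], length: int
) -> Callable[[Set[int]], List[int]]:
    """Learn a Sort & Slice vectoriser from training compounds.

    train_sets: one set of substructure identifiers per training compound
                (assumed to be precomputed ECFP identifiers).
    length:     desired fingerprint length L.
    """
    # Count in how many training compounds each substructure occurs.
    counts: Counter = Counter()
    for ids in train_sets:
        counts.update(set(ids))

    # Sort by decreasing prevalence; ties broken by identifier (assumption).
    ranked = sorted(counts, key=lambda s: (-counts[s], s))

    # Slice: keep only the L most frequent substructures, one bit each.
    position: Dict[int, int] = {s: i for i, s in enumerate(ranked[:length])}

    def featurise(ids: Set[int]) -> List[int]:
        # Each kept substructure owns exactly one bit, so no two
        # substructures can ever share a position.
        vec = [0] * length
        for s in ids:
            i = position.get(s)
            if i is not None:
                vec[i] = 1
        return vec

    return featurise


# Toy data: three training compounds with made-up identifiers.
train = [{1, 2, 3}, {1, 2}, {1, 4}]
featurise = fit_sort_and_slice(train, length=3)
new_compound_vector = featurise({2, 4, 9})
```

In this toy case substructure 1 occurs in three compounds, 2 in two, and 3 and 4 in one each, so the kept substructures are 1, 2 and 3. Substructures that were sliced away (4) or never seen in training (9) are simply ignored when a new compound is vectorised.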

Scientific contribution

A general mathematical framework for the vectorisation of structural fingerprints called substructure pooling; and the technical description and computational evaluation of Sort & Slice, a conceptually simple and bit-collision-free method for the pooling of ECFP substructures that robustly and markedly outperforms classical hash-based folding at molecular property prediction.
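To make the term "bit collision" concrete, the sketch below shows how hash-based folding can map two different substructures to the same bit. It assumes the common modulo-based variant of folding, in which an integer substructure identifier is reduced modulo the fingerprint length; the function name `fold` and the identifiers used are hypothetical.

```python
from typing import List, Set


def fold(ids: Set[int], length: int) -> List[int]:
    """Fold integer substructure identifiers into a bit vector.

    Assumes modulo-based folding: identifier s sets bit (s % length).
    """
    vec = [0] * length
    for s in ids:
        vec[s % length] = 1
    return vec


# Identifiers 5 and 13 are distinct substructures, but 5 % 8 == 13 % 8,
# so with length 8 they are indistinguishable after folding.
only_a = fold({5}, 8)
only_b = fold({13}, 8)
both = fold({5, 13}, 8)
```

All three vectors here are identical, so a downstream model cannot tell which of the two substructures is present. A method that assigns each retained substructure its own position avoids this ambiguity by construction.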


Source journal

Journal of Cheminformatics (CHEMISTRY, MULTIDISCIPLINARY; COMPUTER SCIENCE, INFORMATION SYSTEMS)

CiteScore

14.10

Self-citation rate

7.00%

Publication volume

Review time

3 months

Journal description: Journal of Cheminformatics is an open access journal publishing original peer-reviewed research in all aspects of cheminformatics and molecular modelling. Coverage includes, but is not limited to: chemical information systems, software and databases, and molecular modelling, chemical structure representations and their use in structure, substructure, and similarity searching of chemical substance and chemical reaction databases, computer and molecular graphics, computer-aided molecular design, expert systems, QSAR, and data mining techniques.