The Black Hole Strategy: Gravity-Based Representative Sampling for Frugal Graph Learning on Metal-Organic Framework Networks
Mehrdad Jalali, A D Dinga Wonanke, Pascal Friederich, Christof Wöll
Journal of Chemical Information and Modeling, published 2025-10-01
DOI: https://doi.org/10.1021/acs.jcim.5c01518
Citation count: 0
Abstract
The expansion of large-scale materials databases has facilitated the development of graph-based representations that encode structural and functional similarities as edges in data-driven networks. These representations enable machine learning models to leverage both local features and global relationships. However, densely connected datasets often introduce redundancy and noise, increasing computational cost without improving performance. Here, we introduce the Black Hole Strategy, a gravity-based representative sampling method that constructs compact, informative subsets from large materials datasets while preserving essential structural and property diversity. Using metal-organic frameworks (MOFs) as a case study, we demonstrate that graph neural networks (GraphSAGE, GCN, and GAT) trained on Black Hole-sparsified datasets match or exceed the classification and regression performance of full-dataset models while using significantly fewer data points, less memory, and less training time. Analysis of class-level confusion matrices confirms that critical structure-property relationships, such as pore-limiting diameter (PLD), persist under substantial sparsification. An ablation study on gravity score weights validates the balanced formulation and robustness of the approach. Topological and efficiency benchmarks further demonstrate that the method preserves modularity, diversity, and connectivity across sparsification levels. These findings establish the Black Hole Strategy as a principled and frugal approach for machine learning in materials science, enabling efficient, interpretable, and scalable discovery workflows.

Importantly, this work contributes to the objectives of the FAIRmat consortium, which aims to develop a FAIR data infrastructure for condensed matter physics and materials science. Our approach advances FAIR (Findable, Accessible, Interoperable, Reusable) data practices through optimized sampling techniques that enhance data management, reusability, and interoperability in materials informatics.
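
The abstract does not give the gravity score formula, so the following is only a minimal sketch of what score-based sparsification of a similarity graph could look like, not the authors' implementation. It assumes a hypothetical score that is a weighted sum of a node's normalized connection strength and a normalized per-node "mass"; the names `gravity_scores`, `sparsify`, `mass`, `alpha`, and `beta` are illustrative assumptions, and the toy graph is invented for demonstration.

```python
from typing import Dict, Hashable

# node -> {neighbor: similarity weight}; edges are stored in both directions
Graph = Dict[Hashable, Dict[Hashable, float]]


def _normalize(values: Dict[Hashable, float]) -> Dict[Hashable, float]:
    """Min-max scale values to [0, 1]; a constant input maps to 0.0."""
    lo, hi = min(values.values()), max(values.values())
    span = hi - lo
    return {k: (v - lo) / span if span > 0 else 0.0 for k, v in values.items()}


def gravity_scores(
    graph: Graph,
    mass: Dict[Hashable, float],
    alpha: float = 0.5,
    beta: float = 0.5,
) -> Dict[Hashable, float]:
    """ASSUMED score: alpha * normalized strength + beta * normalized mass.

    'strength' is the sum of a node's edge weights. 'mass' is a hypothetical
    per-node importance value (for example, derived from a material property).
    """
    strength = {n: sum(nbrs.values()) for n, nbrs in graph.items()}
    s_norm = _normalize(strength)
    m_norm = _normalize(mass)
    return {n: alpha * s_norm[n] + beta * m_norm[n] for n in graph}


def sparsify(graph: Graph, scores: Dict[Hashable, float], keep_fraction: float) -> Graph:
    """Keep the top-scoring fraction of nodes and the edges among them."""
    k = max(1, round(keep_fraction * len(graph)))
    # Sort by descending score; break ties by node name for determinism.
    ranked = sorted(graph, key=lambda n: (-scores[n], str(n)))
    kept = set(ranked[:k])
    return {
        n: {m: w for m, w in graph[n].items() if m in kept}
        for n in graph
        if n in kept
    }


if __name__ == "__main__":
    # Invented toy data: five nodes with similarity-weighted edges.
    toy_graph: Graph = {
        "A": {"B": 0.9, "C": 0.8, "D": 0.7},
        "B": {"A": 0.9, "C": 0.6},
        "C": {"A": 0.8, "B": 0.6},
        "D": {"A": 0.7, "E": 0.2},
        "E": {"D": 0.2},
    }
    toy_mass = {"A": 1.0, "B": 0.5, "C": 0.4, "D": 0.9, "E": 0.1}
    scores = gravity_scores(toy_graph, toy_mass)
    print(sparsify(toy_graph, scores, keep_fraction=0.6))
```

In this sketch, varying `alpha` and `beta` is a loose analogue of the ablation on gravity score weights mentioned in the abstract, and `keep_fraction` plays the role of the sparsification level. The induced subgraph could then be passed to a graph neural network in place of the full graph.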
Journal description:
The Journal of Chemical Information and Modeling publishes papers reporting new methodology and/or important applications in the fields of chemical informatics and molecular modeling. Specific topics include the representation and computer-based searching of chemical databases, molecular modeling, computer-aided molecular design of new materials, catalysts, or ligands, development of new computational methods or efficient algorithms for chemical software, and biopharmaceutical chemistry including analyses of biological activity and other issues related to drug discovery.