Machine learning-driven prediction of organic solar cell performance: a data-centric approach to molecular design
Victor dos Reis Rodrigues, Víctor de Souza Assumção Bonfim, Demétrio Antônio da Silva Filho
Journal of Molecular Modeling, volume 31, issue 11. Published 2025-10-15. DOI: 10.1007/s00894-025-06514-5
https://link.springer.com/article/10.1007/s00894-025-06514-5
Abstract
Context
Organic solar cells (OSCs) offer a promising route toward flexible and sustainable energy technologies, yet predictive modeling of device parameters remains challenging due to the chemical diversity of donor–acceptor systems and morphology-dependent effects. In this work, we present the first systematic demonstration of using autoencoder-compressed molecular fingerprints with tree-based machine learning models to predict key OSC performance metrics, namely power conversion efficiency (PCE), open-circuit voltage (\(V_{oc}\)), short-circuit current (\(J_{sc}\)), and fill factor (FF), from a broad experimental dataset of approximately 2500 donor–acceptor pairs, including both fullerene and non-fullerene acceptors. These compact models, trained on compressed descriptors of only 32 dimensions, achieved strong predictive accuracy (Pearson \(r > 0.95\), \(\mathrm{MAE} < 0.4\), \(\mathrm{RMSE} < 0.95\)) while remaining lightweight enough to run on standard computing hardware. As a complementary result, some k-nearest neighbor (k-NN) models achieved near-perfect correlations (\(r \sim 0.99\)) and small errors overall (\(\mathrm{MAE} < 0.044\) and \(\mathrm{RMSE} < 0.4\)), demonstrating the surprising strength of simple, instance-based learners when sufficient descriptive features are available. Supporting analyses reveal that fullerene datasets are more easily modeled than chemically diverse non-fullerene sets, that fingerprints encode substantial structural information, and that kernel density analyses identify critical ranges of molecular weight and energy offsets for high-efficiency devices. Collectively, this study establishes compressed fingerprint descriptors as a powerful, computationally inexpensive foundation for predictive modeling in OSCs, while also showcasing the unexpected efficacy of k-NN models trained on conventional descriptors. Together, these approaches provide a scalable path toward high-throughput prediction and guided molecular design of next-generation organic photovoltaic materials.
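The three accuracy figures quoted above (Pearson r, MAE, RMSE) can be computed from paired measured and predicted values as in the minimal sketch below. The function names and the sample PCE values are hypothetical and chosen only for illustration; they are not data or code from the study.

```python
import math


def pearson_r(y_true, y_pred):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(y_true)
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    return cov / math.sqrt(var_t * var_p)


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(
        sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    )


# Hypothetical PCE values in percent, invented for this example.
measured = [8.0, 10.0, 12.0, 14.0]
predicted = [8.5, 9.5, 12.5, 13.5]

print(mae(measured, predicted))        # 0.5
print(rmse(measured, predicted))       # 0.5
print(pearson_r(measured, predicted))  # about 0.976
```

RMSE weights large deviations more heavily than MAE, so reporting both, as the abstract does, gives a rough sense of how much the error is driven by outliers.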
Methods
The dataset used in this work comprises approximately 2500 experimentally characterized donor–acceptor pairs from bulk heterojunction OSCs. These include both fullerene and non-fullerene acceptor systems. For each pair, the database provides electronic descriptors, polymerization-related metrics, and the SMILES representations of the donor and acceptor molecules. Molecular fingerprints were computed from the SMILES strings using the RDKit and CDK cheminformatics toolkits. A variety of machine learning models were explored, including feedforward neural networks, autoencoders for feature compression, tree-based ensemble methods, and kernel-based regression algorithms. Hyperparameter tuning was carried out with Optuna and BayesSearchCV to optimize model performance.
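To illustrate the feature-compression step, the sketch below trains a small autoencoder that maps binary fingerprint-like vectors to a 32-dimensional code. Only the 32-dimensional latent size comes from the text; everything else is an assumption made for this example: random bit vectors stand in for real RDKit or CDK fingerprints, and the architecture (one tanh encoder layer, a linear decoder, plain gradient descent in NumPy) is not the authors' network.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumption: 200 random 256-bit vectors stand in for real fingerprints.
n_samples, n_bits, n_latent = 200, 256, 32
X = (rng.random((n_samples, n_bits)) < 0.1).astype(float)

# Hypothetical architecture: tanh encoder layer followed by a linear decoder.
W_enc = rng.normal(0.0, 0.1, (n_bits, n_latent))
b_enc = np.zeros(n_latent)
W_dec = rng.normal(0.0, 0.1, (n_latent, n_bits))
b_dec = np.zeros(n_bits)


def encode(x):
    """Map fingerprints to the 32-dimensional compressed representation."""
    return np.tanh(x @ W_enc + b_enc)


def reconstruction_loss(x):
    """Squared reconstruction error summed over bits, averaged over samples."""
    x_hat = encode(x) @ W_dec + b_dec
    return float(np.mean(np.sum((x_hat - x) ** 2, axis=1)))


initial_loss = reconstruction_loss(X)

learning_rate = 0.01
for _ in range(200):
    # Forward pass
    Z = encode(X)
    X_hat = Z @ W_dec + b_dec
    # Backward pass for the loss defined in reconstruction_loss
    grad_out = 2.0 * (X_hat - X) / n_samples
    grad_W_dec = Z.T @ grad_out
    grad_b_dec = grad_out.sum(axis=0)
    grad_pre = (grad_out @ W_dec.T) * (1.0 - Z ** 2)  # tanh derivative
    grad_W_enc = X.T @ grad_pre
    grad_b_enc = grad_pre.sum(axis=0)
    # Full-batch gradient descent update (in place)
    W_dec -= learning_rate * grad_W_dec
    b_dec -= learning_rate * grad_b_dec
    W_enc -= learning_rate * grad_W_enc
    b_enc -= learning_rate * grad_b_enc

final_loss = reconstruction_loss(X)
compressed = encode(X)  # shape (200, 32): inputs for a downstream regressor
```

In a workflow like the one described, the rows of `compressed` would replace the full-length fingerprints as inputs to a tree-based regressor, which is what keeps the downstream models small.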
About the journal:
The Journal of Molecular Modeling focuses on "hardcore" modeling, publishing high-quality research and reports. Founded in 1995 as a purely electronic journal, it has adapted its format to include a full-color print edition, and adjusted its aims and scope to fit the fast-changing field of molecular modeling, with a particular focus on three-dimensional modeling.
Today, the journal covers all aspects of molecular modeling including life science modeling; materials modeling; new methods; and computational chemistry.
Topics include computer-aided molecular design; rational drug design, de novo ligand design, receptor modeling and docking; cheminformatics, data analysis, visualization and mining; computational medicinal chemistry; homology modeling; simulation of peptides, DNA and other biopolymers; quantitative structure-activity relationships (QSAR) and ADME-modeling; modeling of biological reaction mechanisms; and combined experimental and computational studies in which calculations play a major role.