Band Gap and Reorganization Energy Prediction of Conducting Polymers by the Integration of Machine Learning and Density Functional Theory
Tugba Haciefendioglu and Erol Yildirim*
Journal of Chemical Information and Modeling, vol. 65, no. 11, pp. 5360–5369. Published 2025-05-28. DOI: 10.1021/acs.jcim.5c00345
https://pubs.acs.org/doi/10.1021/acs.jcim.5c00345
The performance and reliability of machine learning (ML)-quantitative structure–property relationship (QSPR) models depend on the quality, size, and diversity of the data set used for model training. In this study, we manually curated a large-scale data set containing 3120 donor–acceptor (D–A) conjugated polymers (CPs) by selecting the 60 most widely used donors and the 52 most widely used acceptors. This data set serves as a valuable resource for ML-based prediction of key electronic properties such as band gap energy (E_g) and hole reorganization energy (λ_h), calculated using density functional theory (DFT) to advance organic photovoltaics (OPV). Beyond data set construction, we systematically investigated how different descriptor and fingerprint types affect the performance of the ML model. Recognizing that not all features contribute equally to model performance, we conducted an in-depth analysis to identify the most informative descriptors for the fundamental optoelectronic properties. Our findings show that kernel partial least-squares (KPLS) regression utilizing radial and molprint2D fingerprints achieved the highest accuracy in predicting E_g, with R² values of 0.899 and 0.897, respectively. For λ_h prediction, models integrating electronic descriptors such as frontier orbital energy levels significantly improved performance, achieving an R² value of 0.830. This study provides a comprehensive investigation of how different descriptors influence model performance in OPV research. By analyzing why certain models succeed while others fail, our findings offer insight into feature selection and data set optimization for accurate target property prediction in organic electronics. The developed ML models provide a predictive framework for high-performance OPV materials design, significantly reducing the reliance on labor-intensive experimental procedures and computationally expensive first-principles calculations.
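The data set size and the accuracy metric quoted in the abstract can be illustrated with a minimal Python sketch. It assumes the 3120 polymers arise from pairing each of the 60 donors with each of the 52 acceptors (60 × 52 = 3120, which matches the stated count, though the abstract does not spell out the pairing rule) and that R² is the usual coefficient of determination. The labels `D1`…`D60` and `A1`…`A52` and the function names are hypothetical placeholders, not identifiers from the study.

```python
from itertools import product


def enumerate_pairs(donors, acceptors):
    """Return every (donor, acceptor) combination as a list of tuples."""
    return list(product(donors, acceptors))


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when all reference values are equal")
    return 1.0 - ss_res / ss_tot


# Hypothetical placeholder labels; the real study uses actual chemical units.
donors = [f"D{i}" for i in range(1, 61)]      # 60 donors
acceptors = [f"A{j}" for j in range(1, 53)]   # 52 acceptors

pairs = enumerate_pairs(donors, acceptors)
print(len(pairs))  # 3120
```

Here `r_squared` would take the DFT-calculated values as `y_true` and the ML predictions as `y_pred`; a value of 1 means perfect agreement, and 0 means the model does no better than predicting the mean.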
About the journal:
The Journal of Chemical Information and Modeling publishes papers reporting new methodology and/or important applications in the fields of chemical informatics and molecular modeling. Specific topics include the representation and computer-based searching of chemical databases, molecular modeling, computer-aided molecular design of new materials, catalysts, or ligands, development of new computational methods or efficient algorithms for chemical software, and biopharmaceutical chemistry including analyses of biological activity and other issues related to drug discovery.
Astute chemists, computer scientists, and information specialists look to this monthly’s insightful research studies, programming innovations, and software reviews to keep current with advances in this integral, multidisciplinary field.
As a subscriber you’ll stay abreast of database search systems, use of graph theory in chemical problems, substructure search systems, pattern recognition and clustering, analysis of chemical and physical data, molecular modeling, graphics and natural language interfaces, bibliometric and citation analysis, and synthesis design and reactions databases.