Data-Driven Approach Considering Imbalance in Data Sets and Experimental Conditions for Exploration of Photocatalysts

IF 3.7 | Zone 3 (Chemistry) | JCR Q2 (Chemistry, Multidisciplinary)
Wataru Takahara, Ryuto Baba, Yosuke Harashima*, Tomoaki Takayama, Shogo Takasuka, Yuichi Yamaguchi, Akihiko Kudo* and Mikiya Fujii*
ACS Omega, 2025, 10 (15), 14626–14639. Published 2025-04-10. DOI: 10.1021/acsomega.4c06997
https://pubs.acs.org/doi/10.1021/acsomega.4c06997
Citations: 0

Abstract


In the field of data-driven material development, an imbalance in data sets where data points are concentrated in certain regions often causes difficulties in building regression models when machine learning methods are applied. One example of inorganic functional materials facing such difficulties is photocatalysts. Therefore, advanced data-driven approaches are expected to help efficiently develop novel photocatalytic materials even if an imbalance exists in data sets. We propose a two-stage machine learning model aimed at handling imbalanced data sets without data thinning. In this study, we used two types of data sets that exhibit the imbalance: the Materials Project data set (openly shared due to its public domain data) and the in-house metal-sulfide photocatalyst data set (not openly shared due to the confidentiality of experimental data). This two-stage machine learning model consists of the following two parts: the first regression model, which predicts the target quantitatively, and the second classification model, which determines the reliability of the values predicted by the first regression model. We also propose a search scheme for variables related to the experimental conditions based on the proposed two-stage machine learning model. This scheme is designed for photocatalyst exploration, taking experimental conditions into account as the optimal set of variables for these conditions is unknown. The proposed two-stage machine learning model improves the prediction accuracy of the target compared with that of the one-stage model.
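
The two-stage idea in the abstract (a regression model followed by a classifier that judges whether each regression output can be trusted) can be illustrated with a small toy sketch. Everything below is an assumption made for illustration and is not taken from the paper: the synthetic one-dimensional data, the least-squares regressor, the k-nearest-neighbour classifier, the function names, and in particular the rule that a prediction counts as "reliable" when its absolute error is within `tolerance`. The abstract does not say how the authors define reliability or which models they use.

```python
from statistics import mean


def fit_linear(xs, ys):
    """Stage 1 (assumed model): ordinary least squares fit of y = a * x + b."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, my - a * mx


def knn_is_reliable(x, train_xs, train_labels, k=3):
    """Stage 2 (assumed model): majority vote of the k nearest training points."""
    nearest = sorted(range(len(train_xs)), key=lambda i: abs(train_xs[i] - x))[:k]
    return 2 * sum(train_labels[i] for i in nearest) > k


def make_data(shift):
    """Hypothetical imbalanced data: 100 points in a dense, well-behaved region
    and 10 points in a sparse region with large deterministic scatter."""
    dense_x = [(i + shift) / 100 for i in range(100)]
    sparse_x = [2.0 + 0.2 * (j + shift) for j in range(10)]
    dense_y = [2 * x for x in dense_x]
    sparse_y = [2 * x + (2 if j % 2 == 0 else -2) for j, x in enumerate(sparse_x)]
    return dense_x + sparse_x, dense_y + sparse_y


def two_stage_demo(tolerance=0.5):
    train_x, train_y = make_data(shift=0.0)
    test_x, test_y = make_data(shift=0.5)

    # Stage 1: fit the regressor on all training data (no thinning).
    a, b = fit_linear(train_x, train_y)

    # Assumed labelling rule: a training prediction is "reliable" when its
    # absolute error is within the tolerance.
    labels = [abs(y - (a * x + b)) <= tolerance for x, y in zip(train_x, train_y)]

    # Stage 2: flag each test prediction, then compare the mean absolute error
    # over all predictions with that over the predictions flagged as reliable.
    errors = [abs(y - (a * x + b)) for x, y in zip(test_x, test_y)]
    flags = [knn_is_reliable(x, train_x, labels) for x in test_x]
    kept = [e for e, ok in zip(errors, flags) if ok]
    return mean(errors), mean(kept), len(kept)


if __name__ == "__main__":
    mae_all, mae_reliable, n_kept = two_stage_demo()
    print(f"MAE over all predictions: {mae_all:.3f}")
    print(f"MAE over {n_kept} predictions flagged reliable: {mae_reliable:.3f}")
```

On this toy data the classifier rejects the predictions in the sparse, scattered region, so the error over the retained predictions is lower than the error over all of them. Labelling from in-sample residuals keeps the sketch short; out-of-fold residuals would likely be a safer choice on real data.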

Source journal: ACS Omega (Chemical Engineering: General Chemical Engineering)
CiteScore: 6.60
Self-citation rate: 4.90%
Articles published: 3945
Review time: 2.4 months
About the journal: ACS Omega is an open-access global publication for scientific articles that describe new findings in chemistry and interfacing areas of science, without any perceived evaluation of immediate impact.