Andrew F. Zahrt, Jeremy J. Henle, Scott E. Denmark*
ACS Combinatorial Science, 2020, 22 (11), 586–591. Published 2020-10-01. DOI: 10.1021/acscombsci.0c00118. https://pubs.acs.org/doi/10.1021/acscombsci.0c00118
Cautionary Guidelines for Machine Learning Studies with Combinatorial Datasets
Regression modeling is becoming increasingly prevalent in organic chemistry as a tool for reaction outcome prediction and mechanistic interrogation. Frequently, to acquire the requisite amount of data for such studies, researchers employ combinatorial datasets to maximize the number of data points while limiting the number of discrete chemical entities required. An often-overlooked problem in modeling studies using combinatorial datasets is the tendency to fit on patterns in the datasets (i.e., the presence or absence of a reactant or catalyst) rather than to identify meaningful trends between descriptors and the response variable. Consequently, the generality and interpretability of such models suffer. This report illustrates these well-known pitfalls in a case study, demonstrates the necessary control experiments to identify when this property will be problematic, and suggests how to perform further validation to assess general applicability and interpretability of models trained using combinatorial datasets.
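The generality concern raised in the abstract can be illustrated with a small sketch. This is an illustration only, not the authors' protocol: the component names (`cat0`, `sub0`, and so on) and all function names are hypothetical. It builds a full-factorial catalyst × substrate set and compares a naive split, in which every test reaction is made of components already present in training, with a split that holds out whole components, so that a model cannot score well simply by recognizing which reactant or catalyst is present.

```python
from itertools import product


def build_reactions(catalysts, substrates):
    """Pair every catalyst with every substrate (full-factorial design)."""
    return list(product(catalysts, substrates))


def naive_split(reactions, step=4):
    """Send every `step`-th reaction to the test set, the rest to training."""
    test = reactions[::step]
    train = [r for i, r in enumerate(reactions) if i % step != 0]
    return train, test


def component_holdout_split(reactions, held_catalysts, held_substrates):
    """Keep held-out components out of training entirely.

    Training uses only reactions with no held-out component; the test set
    uses only reactions whose catalyst AND substrate are both held out.
    Reactions mixing seen and unseen components are left out of both sets.
    """
    held_catalysts = set(held_catalysts)
    held_substrates = set(held_substrates)
    train = [(c, s) for c, s in reactions
             if c not in held_catalysts and s not in held_substrates]
    test = [(c, s) for c, s in reactions
            if c in held_catalysts and s in held_substrates]
    return train, test


def seen_fraction(train, test):
    """Fraction of test reactions whose catalyst and substrate both occur in training."""
    if not test:
        return 0.0
    train_catalysts = {c for c, _ in train}
    train_substrates = {s for _, s in train}
    seen = sum(1 for c, s in test
               if c in train_catalysts and s in train_substrates)
    return seen / len(test)


# Hypothetical component labels, for illustration only.
catalysts = [f"cat{i}" for i in range(6)]
substrates = [f"sub{j}" for j in range(5)]
reactions = build_reactions(catalysts, substrates)

naive_train, naive_test = naive_split(reactions)
hold_train, hold_test = component_holdout_split(
    reactions, held_catalysts={"cat4", "cat5"}, held_substrates={"sub4"}
)

print("naive split, seen fraction:", seen_fraction(naive_train, naive_test))
print("component holdout, seen fraction:", seen_fraction(hold_train, hold_test))
```

A seen fraction near 1 means the test score reflects interpolation among known components. Performance on a component-holdout test set is a more direct check of whether the descriptors, rather than component identity, carry the predictive signal.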
About the journal:
The Journal of Combinatorial Chemistry has been relaunched as ACS Combinatorial Science under the leadership of new Editor-in-Chief M.G. Finn of The Scripps Research Institute. The journal features an expanded scope and will build upon the legacy of the Journal of Combinatorial Chemistry, a highly cited leader in the field.