Two-step hybrid modeling for variable selection and estimation: An application to quantitative structure activity relationship study
Henrietta Ebele Oranye, Fidelis Ifeanyi Ugwuowo, Kingsley Chinedu Arum
Journal of Chemometrics 38(1), published 2023-10-23. DOI: 10.1002/cem.3522
https://onlinelibrary.wiley.com/doi/10.1002/cem.3522
Abstract
In this study, we developed a simple two-step procedure for effective parameter estimation and prediction in quantitative structure–activity relationship (QSAR) studies. The first step selects the important molecular descriptors using random forest regression. The second step predicts the biological activity of the chemical compounds from the selected descriptors using the following estimators: ridge regression, jackknife ridge, Liu regression, jackknife Liu, Kibria–Lukman, and jackknife Kibria–Lukman. We conducted a simulation study and a real-life analysis on a QSAR data set with 2540 descriptors after preprocessing. The estimators are compared by cross-validation error, and the estimator with the minimum cross-validation error is considered best. The results indicate that performing jackknife estimation after random forest selection is preferred.
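The two-step procedure described in the abstract can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the synthetic data, the function names, the number of retained descriptors, and the tuning values `k` and `d` are all assumptions, and the estimators use commonly cited closed forms (each written as (I - B) times the OLS estimate, with the jackknifed version taken as (I - B²) times the OLS estimate), which may differ from the paper's exact definitions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold


def select_descriptors(X, y, n_keep, seed=0):
    """Step 1: rank descriptors by random forest importance, keep the top n_keep."""
    forest = RandomForestRegressor(n_estimators=200, random_state=seed)
    forest.fit(X, y)
    order = np.argsort(forest.feature_importances_)[::-1]
    return np.sort(order[:n_keep])


def shrinkage_estimators(X, y, k=1.0, d=0.5):
    """Step 2: closed-form estimators built from the OLS fit (X, y assumed centered)."""
    p = X.shape[1]
    S = X.T @ X
    I = np.eye(p)
    b_ols = np.linalg.solve(S, X.T @ y)
    # Each estimator equals (I - B) @ b_ols for the matrix B given here.
    B = {
        "ridge": k * np.linalg.inv(S + k * I),
        "liu": (1.0 - d) * np.linalg.inv(S + I),
        "kibria_lukman": 2.0 * k * np.linalg.inv(S + k * I),
    }
    coefs = {}
    for name, b_mat in B.items():
        coefs[name] = (I - b_mat) @ b_ols
        # Assumed jackknife form: (I - B^2) @ b_ols.
        coefs["jackknife_" + name] = (I - b_mat @ b_mat) @ b_ols
    return coefs


def cv_errors(X, y, k=1.0, d=0.5, n_splits=5, seed=0):
    """Mean squared prediction error of each estimator over K folds."""
    errors = {}
    folds = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for train, test in folds.split(X):
        # Center with training-fold means only, then predict the held-out fold.
        x_mean, y_mean = X[train].mean(axis=0), y[train].mean()
        coefs = shrinkage_estimators(X[train] - x_mean, y[train] - y_mean, k, d)
        for name, b in coefs.items():
            pred = (X[test] - x_mean) @ b + y_mean
            errors.setdefault(name, []).append(np.mean((y[test] - pred) ** 2))
    return {name: float(np.mean(v)) for name, v in errors.items()}


# Synthetic stand-in for a descriptor matrix (hypothetical, not the paper's data).
rng = np.random.default_rng(0)
n, p = 120, 40
X = rng.normal(size=(n, p))
X[:, 1] = X[:, 0] + 0.05 * rng.normal(size=n)  # a nearly collinear descriptor pair
y = 3.0 * X[:, 0] + 2.0 * X[:, 1] - 2.0 * X[:, 5] + rng.normal(size=n)

selected = select_descriptors(X, y, n_keep=8)
errors = cv_errors(X[:, selected], y)
best = min(errors, key=errors.get)
print("selected descriptors:", selected)
print("estimator with minimum cross-validation error:", best)
```

In this sketch the descriptors are selected once on the full data before cross-validation, which keeps the example short but can make the cross-validation error optimistic; repeating the selection inside each fold avoids that. The values of `k` and `d` are fixed here for brevity, whereas in practice they would be estimated or tuned.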
About the journal:
The Journal of Chemometrics is devoted to the rapid publication of original scientific papers, reviews and short communications on fundamental and applied aspects of chemometrics. It also provides a forum for the exchange of information on meetings and other news relevant to the growing community of scientists who are interested in chemometrics and its applications. Short, critical review papers are a particularly important feature of the journal, in view of the multidisciplinary readership at which it is aimed.