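Estimation of Random Accuracy and its Use in Validation of Predictive Quality of Classification Models within Predictive Challenges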

IF 0.7 · Zone 4 (Chemistry) · JCR Q4 · CHEMISTRY, MULTIDISCIPLINARY
B. Lučić, Jadranko Batista, V. Bojović, M. Lovrić, Ana Sovic Krzic, D. Bešlo, Damir Nadramija, D. Vikić-Topić
Croatica Chemica Acta. Published 2019-07-29. DOI: 10.5562/cca3551 (https://doi.org/10.5562/cca3551)
Citations: 25

Abstract

This paper analyses the shortcomings of Pearson's correlation coefficient as a measure of the accuracy of properties predicted by a model. Two cases are discussed that often arise when a model is applied to a new external set of compounds. First, the correlation coefficient is insensitive to the systematic error that must be expected when predicting properties of a novel external set, since such a set is not a random sample drawn from the training set. Second, an external set can be arbitrarily large or small, and the measured values of the target variable, which are not known in advance, can have an arbitrary and uneven distribution. Under these conditions the correlation coefficient can overstate the agreement between predicted and experimental values and lead to an overly optimistic conclusion about the model's predictive ability. For these reasons, the standard error of prediction (root-mean-square error) is suggested as a better measure of a model's predictive quality. For classification models, the difference between the real accuracy and the most probable random accuracy ranks models well by predictive quality and has an obvious interpretation.
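The two points in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the function names and the toy data are hypothetical, and the abstract does not give a formula for the "most probable random accuracy". The sketch assumes it can be approximated by the chance agreement computed from the marginal class totals of a two-class confusion matrix (the same quantity used in Cohen's kappa).

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def rmse(observed, predicted):
    """Root-mean-square error of prediction."""
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)


def accuracy_above_chance(tp, fn, fp, tn):
    """Return (accuracy, chance accuracy, difference) for a 2x2 confusion matrix.

    ASSUMPTION: chance accuracy is estimated from the marginal totals,
    i.e. the agreement expected if predictions were shuffled at random.
    """
    n = tp + fn + fp + tn
    accuracy = (tp + tn) / n
    chance = ((tp + fn) * (tp + fp) + (tn + fp) * (tn + fn)) / n ** 2
    return accuracy, chance, accuracy - chance


# Toy regression data: every prediction is off by a constant +2.0
# (a purely systematic error).
observed = [1.0, 2.0, 3.0, 4.0, 5.0]
shifted = [v + 2.0 for v in observed]
print(pearson_r(observed, shifted))  # 1.0: the correlation ignores the offset
print(rmse(observed, shifted))       # 2.0: the RMSE reports it

# Toy classification data: 90 negatives, 10 positives,
# and a model that always predicts "negative".
print(accuracy_above_chance(tp=0, fn=10, fp=0, tn=90))  # (0.9, 0.9, 0.0)
```

In the regression example the correlation is perfect even though every prediction is wrong by two units, while the RMSE reflects the error directly. In the classification example the 90 % accuracy is exactly what the class imbalance alone would give, so the difference from chance is zero.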
Source journal: Croatica Chemica Acta (Chemistry, Multidisciplinary)
CiteScore: 0.60
Self-citation rate: 0.00%
Articles published: 3
Review time: 18 months
About the journal: Croatica Chemica Acta (Croat. Chem. Acta, CCA) is an international journal of the Croatian Chemical Society publishing scientific articles of general interest to chemistry.