Continuous Distributions and Measures of Statistical Accuracy for Structured Expert Judgment
Guus Rongen, Gabriela F. Nane, Oswaldo Morales-Napoles, Roger M. Cooke
FUTURES & FORESIGHT SCIENCE, 7(2), published 2025-05-05
DOI: 10.1002/ffo2.70009 (https://onlinelibrary.wiley.com/doi/10.1002/ffo2.70009)
Abstract
This study evaluates five scoring rules, or measures of statistical accuracy, for assessing uncertainty estimates from expert judgment studies and model forecasts. These rules, namely the Continuous Ranked Probability Score (CRPS), the Kolmogorov-Smirnov (KS), Cramér-von Mises (CvM), and Anderson-Darling (AD) statistics, and the chi-square test, were applied to 6864 expert uncertainty estimates from 49 Classical Model (CM) studies. We compared their sensitivity to various biases and their ability to serve as performance-based weights for expert estimates. Additionally, the piecewise-uniform and Metalog distributions were evaluated for their representation of expert estimates, because four of the five rules require interpolating the experts' quantile estimates. Simulating biased estimates reveals varying sensitivity of the considered test statistics to these biases. Expert weights derived using one measure of statistical accuracy were evaluated with the other measures to assess their performance. The main conclusions are: (1) CRPS overlooks important biases, while chi-square and AD behave similarly, as do KS and CvM; (2) all measures except CRPS agree that performance weighting is superior to equal weighting with respect to statistical accuracy; and (3) neither distribution can effectively predict the position of a removed quantile estimate. These insights clarify the behavior of different scoring rules for combining uncertainty estimates from experts or models, and extend the knowledge base for best practices.
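To make the comparison concrete, below is a minimal Python sketch, not the authors' code, of how the five measures could be computed for a single expert. It assumes CM-style elicitation (5%, 50%, and 95% quantiles per calibration question), interpolates each assessment as a piecewise-uniform distribution on a fixed support [L, U], and omits the CM's intrinsic-range overshoot rule; all function names and the toy data are hypothetical.

```python
# Hedged sketch: five measures of statistical accuracy for one expert's
# quantile assessments, under a piecewise-uniform interpolation assumption.
import numpy as np
from scipy import stats


def piecewise_uniform_cdf(x, quantiles, L, U):
    """CDF of the piecewise-uniform interpolation of (L, q05, q50, q95, U)."""
    knots = np.array([L, *quantiles, U])
    probs = np.array([0.0, 0.05, 0.50, 0.95, 1.0])
    return np.interp(x, knots, probs)


def pit_values(assessments, realizations, L, U):
    """Probability integral transform F_i(obs_i) per question: a statistically
    accurate expert yields values that look like draws from Uniform(0, 1)."""
    return np.array([piecewise_uniform_cdf(obs, q, L, U)
                     for q, obs in zip(assessments, realizations)])


def anderson_darling_uniform(u):
    """AD statistic against Uniform(0, 1); scipy.stats.anderson does not
    cover the uniform case, so apply the standard formula directly."""
    u = np.sort(np.clip(u, 1e-12, 1.0 - 1e-12))
    n = len(u)
    i = np.arange(1, n + 1)
    return -n - np.mean((2 * i - 1) * (np.log(u) + np.log(1.0 - u[::-1])))


def crps_piecewise_uniform(quantiles, obs, L, U, grid=2001):
    """CRPS = integral over x of (F(x) - 1{x >= obs})^2, approximated here
    by a simple Riemann sum on a regular grid."""
    x = np.linspace(L, U, grid)
    F = piecewise_uniform_cdf(x, quantiles, L, U)
    return float(np.mean((F - (x >= obs)) ** 2) * (U - L))


# Toy data: 10 calibration questions answered by one overconfident expert.
rng = np.random.default_rng(seed=1)
medians = rng.uniform(3.0, 7.0, size=10)
assessments = [(m - 0.5, m, m + 0.5) for m in medians]   # too-narrow 90% bands
realizations = medians + rng.normal(0.0, 1.5, size=10)   # truth scatters wider
L, U = 0.0, 10.0

u = pit_values(assessments, realizations, L, U)
print("KS  :", stats.kstest(u, "uniform").statistic)
print("CvM :", stats.cramervonmises(u, "uniform").statistic)
print("AD  :", anderson_darling_uniform(u))

# Chi-square over the four inter-quantile bins, expected mass (.05,.45,.45,.05)
counts, _ = np.histogram(u, bins=[0.0, 0.05, 0.50, 0.95, 1.0])
expected = len(u) * np.array([0.05, 0.45, 0.45, 0.05])
print("chi2:", stats.chisquare(counts, f_exp=expected).statistic)

print("CRPS (first question):",
      crps_piecewise_uniform(assessments[0], realizations[0], L, U))
```

Swapping `piecewise_uniform_cdf` for a Metalog fit of the same three quantiles would reproduce the paper's second interpolation choice. Note that the four distribution tests all operate on the PIT values pooled across questions, whereas CRPS scores each question's full predictive CDF individually, which is consistent with the paper's observation that CRPS and the PIT-based measures respond differently to systematic biases.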