How Flawed is ECE? An Analysis via Logit Smoothing

arXiv publication date: 2024-02-15. DOI: 10.48550/arXiv.2402.10046

Muthu Chidambaram, Holden Lee, Colin McSwiggen, Semon Rezchikov

Citations: 0

Abstract

Informally, a model is calibrated if its predictions are correct with a probability that matches the confidence of the prediction. By far the most common method in the literature for measuring calibration is the expected calibration error (ECE). Recent work, however, has pointed out drawbacks of ECE, such as the fact that it is discontinuous in the space of predictors. In this work, we ask: how fundamental are these issues, and what are their impacts on existing results? Towards this end, we completely characterize the discontinuities of ECE with respect to general probability measures on Polish spaces. We then use the nature of these discontinuities to motivate a novel continuous, easily estimated miscalibration metric, which we term Logit-Smoothed ECE (LS-ECE). By comparing the ECE and LS-ECE of pre-trained image classification models, we show in initial experiments that binned ECE closely tracks LS-ECE, indicating that the theoretical pathologies of ECE may be avoidable in practice.
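The discontinuity mentioned above can be seen in a toy example. The example is our own construction and is not taken from the paper. It assumes a binary-label setting, where the population ECE of a predictor f is E[ |E[Y | f(X)] - f(X)| ]. For top-label ECE in classification, the same formulas apply with the top-class confidence in place of the probability and a 0/1 correctness indicator in place of the label. The helper names `exact_ece` and `binned_ece` are hypothetical, and the sketch does not implement the paper's LS-ECE. Half the inputs have label 1 and half have label 0. A predictor that always outputs 0.5 is perfectly calibrated, with ECE 0. Nudging its outputs to 0.5 + eps and 0.5 - eps on the two halves makes ECE jump to about 0.5 - eps, however small eps is. Binned ECE places both nudged values in the same bin and so stays near 0. Binned ECE has jumps of its own, though, whenever a prediction crosses a bin edge.

```python
from collections import defaultdict


def exact_ece(probs, labels):
    """Unbinned ECE for a predictor with finitely many output values:
    group samples by their exact predicted probability (hypothetical helper)."""
    n = len(probs)
    groups = defaultdict(list)
    for p, y in zip(probs, labels):
        groups[p].append(y)
    # Weighted average of |empirical frequency of Y=1 given f = p  -  p|.
    return sum(len(ys) / n * abs(sum(ys) / len(ys) - p) for p, ys in groups.items())


def binned_ece(probs, labels, n_bins=15):
    """Binned ECE with equal-width bins on [0, 1] (hypothetical helper)."""
    n = len(probs)
    bins = defaultdict(list)
    for p, y in zip(probs, labels):
        # Map p in [0, 1] to a bin index; p == 1.0 goes into the last bin.
        idx = min(int(p * n_bins), n_bins - 1)
        bins[idx].append((p, y))
    ece = 0.0
    for members in bins.values():
        mean_prob = sum(p for p, _ in members) / len(members)
        freq = sum(y for _, y in members) / len(members)
        ece += len(members) / n * abs(freq - mean_prob)
    return ece


if __name__ == "__main__":
    eps = 1e-3
    labels = [1] * 100 + [0] * 100
    f0 = [0.5] * 200                                   # calibrated constant predictor
    f_eps = [0.5 + eps] * 100 + [0.5 - eps] * 100      # tiny perturbation of f0
    print("exact ECE: ", exact_ece(f0, labels), exact_ece(f_eps, labels))
    print("binned ECE:", binned_ece(f0, labels), binned_ece(f_eps, labels))
```

Shrinking `eps` toward 0 moves `f_eps` arbitrarily close to `f0`, but the unbinned ECE stays near 0.5. This is the kind of discontinuity in the space of predictors that the abstract refers to.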

Source journal: arXiv