Addressing the Binning Problem in Calibration Assessment through Scalar Annotations

Zhengping Jiang, Anqi Liu, Benjamin Van Durme
{"title":"Addressing the Binning Problem in Calibration Assessment through Scalar Annotations","authors":"Zhengping Jiang, Anqi Liu, Benjamnin Van Durme","doi":"10.1162/tacl_a_00636","DOIUrl":null,"url":null,"abstract":"Abstract Computational linguistics models commonly target the prediction of discrete—categorical—labels. When assessing how well-calibrated these model predictions are, popular evaluation schemes require practitioners to manually determine a binning scheme: grouping labels into bins to approximate true label posterior. The problem is that these metrics are sensitive to binning decisions. We consider two solutions to the binning problem that apply at the stage of data annotation: collecting either distributed (redundant) labels or direct scalar value assignment. In this paper, we show that although both approaches address the binning problem by evaluating instance-level calibration, direct scalar assignment is significantly more cost-effective. We provide theoretical analysis and empirical evidence to support our proposal for dataset creators to adopt scalar annotation protocols to enable a higher-quality assessment of model calibration.","PeriodicalId":506323,"journal":{"name":"Transactions of the Association for Computational Linguistics","volume":"8 1","pages":"120-136"},"PeriodicalIF":0.0000,"publicationDate":"2024-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Transactions of the Association for Computational Linguistics","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1162/tacl_a_00636","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

Abstract

Computational linguistics models commonly target the prediction of discrete (categorical) labels. When assessing how well-calibrated these model predictions are, popular evaluation schemes require practitioners to manually determine a binning scheme: grouping labels into bins to approximate the true label posterior. The problem is that these metrics are sensitive to binning decisions. We consider two solutions to the binning problem that apply at the stage of data annotation: collecting either distributed (redundant) labels or direct scalar value assignments. In this paper, we show that although both approaches address the binning problem by evaluating instance-level calibration, direct scalar assignment is significantly more cost-effective. We provide theoretical analysis and empirical evidence to support our proposal for dataset creators to adopt scalar annotation protocols to enable a higher-quality assessment of model calibration.
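
To make the binning sensitivity the abstract refers to concrete, the following minimal Python sketch computes expected calibration error (ECE), the most common binned calibration metric, on simulated predictions under several bin counts. This is an illustration only, not the authors' code; the function name, the toy data, and the overconfidence assumption are all made up for the example.

import numpy as np

def expected_calibration_error(confidences, correct, n_bins):
    # Standard binned ECE: split [0, 1] into equal-width confidence bins and
    # take the bin-size-weighted average gap between mean confidence and
    # empirical accuracy inside each bin.
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
            ece += in_bin.mean() * gap
    return ece

# Toy data (assumed, not from the paper): 1,000 predictions from a model that
# is slightly overconfident, so true accuracy trails reported confidence.
rng = np.random.default_rng(0)
conf = rng.uniform(0.5, 1.0, size=1000)
correct = (rng.random(1000) < conf - 0.05).astype(float)

# The same fixed set of predictions yields different ECE estimates as the
# number of bins changes, which is the sensitivity at issue.
for n_bins in (5, 10, 15, 20, 50):
    print(f"{n_bins:>3} bins  ECE = {expected_calibration_error(conf, correct, n_bins):.4f}")

Running this prints a different ECE value for each bin count even though the underlying predictions never change, which is the kind of binning-dependent evaluation the paper argues can be avoided when annotations provide scalar targets that allow calibration to be assessed at the instance level.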