Does calibration mean what they say it means; or, the reference class problem rises again

IF 1.1 · Zone 1 (Philosophy) · 0 PHILOSOPHY

PHILOSOPHICAL STUDIES Pub Date : 2025-05-05 DOI:10.1007/s11098-025-02322-y

Lily Hu


Citations: 0

Abstract

Discussions of statistical criteria for fairness commonly convey the normative significance of calibration within groups by invoking what risk scores “mean.” On the Same Meaning picture, group-calibrated scores “mean the same thing” (on average) across individuals from different groups and accordingly, guard against disparate treatment of individuals based on group membership. My contention is that calibration guarantees no such thing. Since concrete actual people belong to many groups, calibration cannot ensure the kind of consistent score interpretation that the Same Meaning picture implies matters for fairness, unless calibration is met within every group to which an individual belongs. Alas only perfect predictors may meet this bar. The Same Meaning picture thus commits a reference class fallacy by inferring from calibration within some group to the “meaning” or evidential value of an individual’s score, because they are a member of that group. The reference class answer it presumes does not only lack justification; it is very likely wrong. I then show that the reference class problem besets not just calibration but other group statistical criteria that claim a close connection to fairness. Reflecting on the origins of this oversight opens a wider lens onto the predominant methodology in algorithmic fairness based on stylized cases.
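The abstract's central claim, that calibration within one grouping does not carry over to the other groups an individual belongs to, can be illustrated with a small numerical sketch. The toy population below is hypothetical and is not taken from the paper: the attribute names `G` and `H`, the cell counts, and the single score of 0.5 are all assumptions chosen so that the arithmetic is easy to check by hand.

```python
from collections import defaultdict

# Hypothetical toy population (not from the paper).
# Each cell: (G group, H group, number of people, number with a positive outcome).
# Everyone receives the same risk score of 0.5.
CELLS = [
    ("g0", "h0", 10, 8),
    ("g0", "h1", 10, 2),
    ("g1", "h0", 10, 8),
    ("g1", "h1", 10, 2),
]
SCORE = 0.5


def observed_rate(attribute_index):
    """Observed positive rate among people scored SCORE, per group of one attribute.

    attribute_index 0 groups by G, attribute_index 1 groups by H.
    """
    totals = defaultdict(lambda: [0, 0])  # group -> [people, positives]
    for cell in CELLS:
        group = cell[attribute_index]
        totals[group][0] += cell[2]
        totals[group][1] += cell[3]
    return {group: positives / people for group, (people, positives) in totals.items()}


rates_by_g = observed_rate(0)  # matches SCORE in both G groups
rates_by_h = observed_rate(1)  # departs from SCORE in both H groups

for name, rates in (("G", rates_by_g), ("H", rates_by_h)):
    for group, rate in sorted(rates.items()):
        status = "calibrated" if rate == SCORE else "miscalibrated"
        print(f"{name}={group}: score {SCORE}, observed rate {rate} ({status})")
```

In this sketch the score of 0.5 matches the observed rate within each `G` group, yet the same people, regrouped by `H`, have observed rates of 0.8 and 0.2. A person in both `g0` and `h0` therefore has a score that is "correct" relative to one reference class and wrong relative to another, which is the shape of the problem the abstract describes.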




Source journal

PHILOSOPHICAL STUDIES (PHILOSOPHY)

CiteScore: 2.60

Self-citation rate: 7.70%

Articles published: 127

Journal description: Philosophical Studies was founded in 1950 by Herbert Feigl and Wilfrid Sellars to provide a periodical dedicated to work in analytic philosophy. The journal remains devoted to the publication of papers in exclusively analytic philosophy. Papers applying formal techniques to philosophical problems are welcome. The principal aim is to publish articles that are models of clarity and precision in dealing with significant philosophical issues. It is intended that readers of the journal will be kept abreast of the central issues and problems of contemporary analytic philosophy.

Double-blind review procedure: The journal follows a double-blind reviewing procedure. Authors are therefore requested to place their name and affiliation on a separate page. Self-identifying citations and references in the article text should either be avoided or left blank when manuscripts are first submitted. Authors are responsible for reinserting self-identifying citations and references when manuscripts are prepared for final submission.