Conformal validation: A deferral policy using uncertainty quantification with a human-in-the-loop for model validation

Paul Horton, Alexandru Florea, Brandon Stringfield

Machine Learning with Applications, Vol. 22, Article 100733 (2025). DOI: 10.1016/j.mlwa.2025.100733
Abstract
Validating performance is a key challenge facing the adoption of machine learning models in high-risk applications. Current validation methods assess performance marginally, i.e., averaged over the entire testing dataset, which can fail to identify regions of the distribution with insufficient performance. In this paper, we propose Conformal Validation, a systems-based approach that incorporates a calibrated form of uncertainty quantification, using a conformal prediction framework, into the validation process to reduce performance gaps. Specifically, the policy defers the subset of observations for which the predictive model is most uncertain and provides a human with informative prediction sets to make the ancillary decision. We evaluate this policy on an image classification task where images are distorted with varying levels of Gaussian blur as a quantifiable measure of added difficulty. The model is compared to human performance on the most difficult observations, i.e., those where the model is most uncertain, to simulate the scenario in which a human is the alternative decision-maker. We evaluate performance on three arms: the model alone, humans with access to the set of classes in which the model is most confident, and humans alone. The deferral policy is simple to understand, applicable to any predictive model, and easy to implement, while, in this case, keeping humans in the loop for improved trustworthiness. Conformal Validation incorporates a risk assessment that is conditioned on the prediction set length and can be tuned to the needs of the application.
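To make the deferral mechanism concrete, the sketch below shows a minimal split-conformal procedure that builds prediction sets from held-out calibration data and defers any observation whose set exceeds a chosen size. The function names, the softmax-based nonconformity score, and the `max_set_size` threshold are illustrative assumptions for exposition, not the authors' implementation.

```python
import numpy as np

def conformal_prediction_sets(cal_probs, cal_labels, test_probs, alpha=0.1):
    """Split-conformal prediction sets for a generic classifier.

    cal_probs:  (n_cal, K) softmax scores on a held-out calibration set
    cal_labels: (n_cal,)   true labels for the calibration set
    test_probs: (n_test, K) softmax scores on the test set
    alpha:      target miscoverage rate (e.g., 0.1 for ~90% coverage)
    """
    n = len(cal_labels)
    # Nonconformity score: one minus the softmax probability of the true class.
    cal_scores = 1.0 - cal_probs[np.arange(n), cal_labels]
    # Finite-sample-corrected quantile of the calibration scores.
    q_level = np.ceil((n + 1) * (1 - alpha)) / n
    qhat = np.quantile(cal_scores, min(q_level, 1.0), method="higher")
    # Include every class whose nonconformity score is below the threshold.
    return test_probs >= 1.0 - qhat  # boolean mask of shape (n_test, K)

def deferral_policy(pred_sets, max_set_size=1):
    """Defer observations whose prediction set is larger than max_set_size."""
    set_sizes = pred_sets.sum(axis=1)
    defer = set_sizes > max_set_size
    return defer, set_sizes
```

In this sketch, deferred observations would be routed to the human decision-maker along with their prediction sets, while the remainder are accepted from the model; tightening `alpha` or `max_set_size` trades model coverage against the human workload.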