What is the Right Notion of Distance between Predict-then-Optimize Tasks?
Paula Rodriguez-Diaz, Lingkai Kong, Kai Wang, David Alvarez-Melis, Milind Tambe
arXiv - CS - Machine Learning, 2024-09-11. https://doi.org/arxiv-2409.06997
Abstract
Comparing datasets is a fundamental task in machine learning, essential for various learning paradigms, from evaluating train and test datasets for model generalization to using dataset similarity for detecting data drift. While traditional notions of dataset distances offer principled measures of similarity, their utility has largely been assessed through prediction error minimization. However, in Predict-then-Optimize (PtO) frameworks, where predictions serve as inputs for downstream optimization tasks, model performance is measured through decision regret minimization rather than prediction error minimization. In this work, we (i) show that traditional dataset distances, which rely solely on feature and label dimensions, lack informativeness in the PtO context, and (ii) propose a new dataset distance that incorporates the impacts of downstream decisions. Our results show that this decision-aware dataset distance effectively captures adaptation success in PtO contexts, providing a PtO adaptation bound in terms of dataset distance. Empirically, we show that our proposed distance measure accurately predicts transferability across three different PtO tasks from the literature.
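The abstract's central distinction, prediction error versus decision regret, can be made concrete with a toy example. The sketch below is illustrative only and is not taken from the paper: the route-choice task, cost vectors, and perturbations are invented assumptions. It shows two predictions with identical mean squared error but different decision regret, which is the sense in which purely feature- and label-based comparisons can be uninformative for PtO.

```python
import numpy as np

# Toy Predict-then-Optimize setup (hypothetical, for illustration):
# pick one of three routes by minimizing *predicted* cost, then pay
# the *true* cost of the chosen route. Decision regret is the extra
# true cost incurred relative to the true optimum.

c_true = np.array([1.0, 1.1, 3.0])  # true route costs

def decide(c_hat):
    # downstream optimization: choose the cheapest predicted route
    return int(np.argmin(c_hat))

def regret(c_hat, c):
    # true cost of the induced decision minus the true optimal cost
    return float(c[decide(c_hat)] - c.min())

def mse(c_hat, c):
    return float(np.mean((c_hat - c) ** 2))

# Two predictions with identical prediction error...
c_hat_a = c_true + np.array([0.0, 0.2, 0.0])  # perturbs a route never chosen
c_hat_b = c_true + np.array([0.2, 0.0, 0.0])  # perturbs the optimal route

print(mse(c_hat_a, c_true), regret(c_hat_a, c_true))  # ~0.0133, regret 0.0
print(mse(c_hat_b, c_true), regret(c_hat_b, c_true))  # ~0.0133, regret 0.1
```

Prediction `c_hat_a` distorts a route the optimizer never selects, so its regret is zero; `c_hat_b` has the same MSE but flips the argmin and incurs positive regret. A decision-aware dataset distance, as proposed in the paper, is designed to be sensitive to exactly this difference.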