A Path to Simpler Models Starts With Noise

Advances in Neural Information Processing Systems, published 2023-12-01

Lesia Semenova, Harry Chen, Ronald Parr, Cynthia Rudin

Volume 36, pages 3362-3401. Full-text PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10993912/pdf/


Abstract


A Path to Simpler Models Starts With Noise.

The Rashomon set is the set of models that perform approximately equally well on a given dataset, and the Rashomon ratio is the fraction of all models in a given hypothesis space that are in the Rashomon set. Rashomon ratios are often large for tabular datasets in criminal justice, healthcare, lending, education, and in other areas, which has practical implications about whether simpler models can attain the same level of accuracy as more complex models. An open question is why Rashomon ratios often tend to be large. In this work, we propose and study a mechanism of the data generation process, coupled with choices usually made by the analyst during the learning process, that determines the size of the Rashomon ratio. Specifically, we demonstrate that noisier datasets lead to larger Rashomon ratios through the way that practitioners train models. Additionally, we introduce a measure called pattern diversity, which captures the average difference in predictions between distinct classification patterns in the Rashomon set, and motivate why it tends to increase with label noise. Our results explain a key aspect of why simpler models often tend to perform as well as black box models on complex, noisier datasets.
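The definitions in the abstract can be made concrete with a small sketch. Everything in it is an illustrative assumption rather than the authors' setup: the hypothesis space is a toy family of one-dimensional decision stumps, the Rashomon set is taken as all stumps within `slack` misclassifications of the best stump, and pattern diversity is computed as the mean normalized Hamming distance over unordered pairs of distinct prediction patterns (the paper's exact normalization may differ). The sketch only illustrates the quantities; it does not reproduce the paper's analysis of how training choices interact with noise.

```python
from itertools import combinations


def predict(xs, threshold, polarity):
    """Prediction pattern of one stump: `polarity` where x >= threshold, else the other label."""
    return tuple(polarity if x >= threshold else 1 - polarity for x in xs)


def hypothesis_space(xs):
    """Toy hypothesis space (an assumption): stumps with thresholds at each data point
    plus one past the maximum, in both polarities."""
    thresholds = sorted(set(xs)) + [max(xs) + 1]
    return [(t, s) for t in thresholds for s in (0, 1)]


def hamming(a, b):
    """Number of positions where two label sequences disagree."""
    return sum(p != q for p, q in zip(a, b))


def rashomon_set(xs, ys, slack):
    """Models whose error count is within `slack` mistakes of the best model."""
    models = hypothesis_space(xs)
    errors = {m: hamming(predict(xs, *m), ys) for m in models}
    best = min(errors.values())
    return [m for m in models if errors[m] <= best + slack]


def rashomon_ratio(xs, ys, slack):
    """Fraction of the hypothesis space that falls in the Rashomon set."""
    return len(rashomon_set(xs, ys, slack)) / len(hypothesis_space(xs))


def pattern_diversity(xs, ys, slack):
    """Mean normalized Hamming distance between distinct prediction patterns
    in the Rashomon set (assumed normalization; 0.0 if fewer than two patterns)."""
    patterns = sorted({predict(xs, *m) for m in rashomon_set(xs, ys, slack)})
    pairs = list(combinations(patterns, 2))
    if not pairs:
        return 0.0
    return sum(hamming(p, q) for p, q in pairs) / (len(pairs) * len(xs))


if __name__ == "__main__":
    xs = list(range(10))
    clean = [0] * 5 + [1] * 5          # perfectly separable labels
    noisy = list(clean)
    noisy[2], noisy[7] = 1, 0          # two hand-picked label flips
    for name, ys in (("clean", clean), ("noisy", noisy)):
        print(name, rashomon_ratio(xs, ys, 1), pattern_diversity(xs, ys, 1))
```

On this hand-built example the noisy labels happen to give both a larger Rashomon ratio and a larger pattern diversity than the clean labels, but a ten-point toy dataset with a fixed hypothesis space is not evidence for the general claim.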
