{"title":"Programming Languages Techniques for Controlling Generalization Errors in Adaptive Data Analysis","authors":"Marco Gaboardi","doi":"10.1145/3479394.3479395","DOIUrl":null,"url":null,"abstract":"Data analysts aim at guaranteeing that the result of a data analysis run on sample data does not differ too much from the result one would achieve by running the analysis over the entire population. To achieve this goal, they have developed several techniques to control the generalization errors of their data analyses. In this talk, I will discuss how programming language techniques can help data analysts to design adaptive data analyses with low generalization error. An adaptive data analysis can be seen as a process composed by multiple queries interrogating some data, where the choice of which query to run next may rely on the results of previous queries. When queries are arbitrarily composed, the different errors can propagate through the chain of different queries and bring high generalization errors. To address this issue, data analysts are designing several techniques that not only guarantee bounds on the generalization errors of single queries, but that also guarantee bounds on the generalization error of the composed analyses. In my talk, I will first present a programming model for adaptive data analyses based on a simple imperative programming language that is suitable to integrate different techniques that can be used for controlling the generalization error. I will then introduce a program analysis for this language that, given an input program implementing an adaptive data analysis, generates an upper bound on the total number of queries that the data analysis will run, and more interestingly also an upper bound on the depth of the chain of queries implemented by the input program. 
These two measures can be used to select the right technique to guarantee a bound on the generalization error of the input data analysis. I will also discuss how such program analysis could also be potentially extended to higher order functional programs. I will then discuss limitations and potential future works. Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s). PPDP 2021, September 6–8, 2021, Tallinn, Estonia © 2021 Copyright held by the owner/author(s). ACM ISBN 978-1-4503-8689-0/21/09. https://doi.org/10.1145/3479394.3479395","PeriodicalId":242361,"journal":{"name":"23rd International Symposium on Principles and Practice of Declarative Programming","volume":"37 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2021-09-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"23rd International Symposium on Principles and Practice of Declarative Programming","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3479394.3479395","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
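Citations: 0
Programming Languages Techniques for Controlling Generalization Errors in Adaptive Data Analysis
Data analysts aim to guarantee that the result of a data analysis run on sample data does not differ too much from the result they would obtain by running the analysis over the entire population. To achieve this goal, they have developed several techniques to control the generalization errors of their data analyses. In this talk, I will discuss how programming language techniques can help data analysts design adaptive data analyses with low generalization error. An adaptive data analysis can be seen as a process composed of multiple queries interrogating some data, where the choice of which query to run next may rely on the results of previous queries. When queries are arbitrarily composed, their errors can propagate through the chain of queries and lead to high generalization error. To address this issue, data analysts are designing techniques that guarantee bounds not only on the generalization errors of single queries but also on the generalization error of the composed analysis. In my talk, I will first present a programming model for adaptive data analyses based on a simple imperative programming language that is suitable for integrating the different techniques that can be used to control the generalization error. I will then introduce a program analysis for this language that, given an input program implementing an adaptive data analysis, generates an upper bound on the total number of queries that the data analysis will run and, more interestingly, an upper bound on the depth of the chain of queries implemented by the input program. These two measures can be used to select the right technique to guarantee a bound on the generalization error of the input data analysis. I will also discuss how such a program analysis could potentially be extended to higher-order functional programs. I will then discuss limitations and potential future work.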
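The two measures named in the abstract, the total number of queries and the depth of the chain of queries, can be illustrated on a small example. The sketch below is not the program analysis from the talk, which works statically on programs in an imperative language. It only computes both quantities for an explicit, hand-written dependency graph. The query names `q1` to `q5`, the `DEPENDS_ON` table, and the convention of counting depth in queries are all assumptions made for illustration, and the graph is assumed to be acyclic.

```python
from functools import lru_cache

# Hypothetical analysis: each query maps to the earlier queries whose
# results it depends on. Names and structure are illustrative only.
DEPENDS_ON = {
    "q1": [],
    "q2": [],
    "q3": ["q1", "q2"],  # chosen after seeing the answers to q1 and q2
    "q4": ["q3"],        # chosen after seeing the answer to q3
    "q5": ["q1"],
}


def total_queries(deps):
    """Return the number of queries in the analysis."""
    return len(deps)


def chain_depth(deps):
    """Return the length, in queries, of the longest dependency chain.

    Assumes the dependency graph is acyclic.
    """

    @lru_cache(maxsize=None)
    def depth(query):
        # A query with no dependencies forms a chain of length 1.
        return 1 + max((depth(p) for p in deps[query]), default=0)

    return max((depth(q) for q in deps), default=0)


print(total_queries(DEPENDS_ON))  # 5
print(chain_depth(DEPENDS_ON))    # 3 (q1 -> q3 -> q4)
```

In this example the analysis runs five queries, but its longest chain of dependent queries has length three, which is the kind of gap between the two measures that the abstract points to when it singles out the depth bound.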