Author: Marco Gaboardi
Venue: 23rd International Symposium on Principles and Practice of Declarative Programming (PPDP 2021), September 6–8, 2021, Tallinn, Estonia
Published: 2021-09-06
DOI: https://doi.org/10.1145/3479394.3479395
ACM ISBN: 978-1-4503-8689-0/21/09
© 2021 Copyright held by the owner/author(s). Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s).
Citations: 0
Programming Languages Techniques for Controlling Generalization Errors in Adaptive Data Analysis
Data analysts aim to guarantee that the result of a data analysis run on sample data does not differ too much from the result one would obtain by running the analysis over the entire population. To achieve this goal, they have developed several techniques to control the generalization error of their data analyses. In this talk, I will discuss how programming language techniques can help data analysts design adaptive data analyses with low generalization error. An adaptive data analysis can be seen as a process composed of multiple queries interrogating some data, where the choice of which query to run next may depend on the results of previous queries. When queries are composed arbitrarily, errors can propagate through the chain of queries and lead to high generalization error. To address this issue, data analysts are designing techniques that guarantee bounds not only on the generalization error of single queries but also on the generalization error of the composed analysis. In my talk, I will first present a programming model for adaptive data analyses based on a simple imperative programming language that is suitable for integrating different techniques for controlling the generalization error. I will then introduce a program analysis for this language that, given an input program implementing an adaptive data analysis, generates an upper bound on the total number of queries that the data analysis will run and, more interestingly, an upper bound on the depth of the chain of queries implemented by the input program. These two measures can be used to select the right technique to guarantee a bound on the generalization error of the input data analysis. I will also discuss how such a program analysis could potentially be extended to higher-order functional programs. I will then discuss limitations and potential future work.
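
To make the two measures named in the abstract concrete, the following is a minimal Python sketch, not the program analysis from the talk. It assumes the analysis has already been written down as an explicit, acyclic dependency map (a hypothetical representation chosen here for illustration), in which each query lists the earlier queries whose results influenced its choice. The function names `query_count` and `adaptivity_depth` are likewise illustrative. The talk's analysis instead derives upper bounds on these quantities statically from an imperative program.

```python
from typing import Dict, List

# Hypothetical representation: query name -> names of the earlier queries
# whose results were used to choose it. Assumed to be acyclic.
Dependencies = Dict[str, List[str]]


def query_count(deps: Dependencies) -> int:
    """Total number of queries the analysis runs."""
    return len(deps)


def adaptivity_depth(deps: Dependencies) -> int:
    """Length of the longest chain of queries in which each query
    depends on the result of the previous one."""
    memo: Dict[str, int] = {}

    def depth(q: str) -> int:
        # A query with no dependencies has depth 1; otherwise it sits
        # one level below its deepest dependency.
        if q not in memo:
            memo[q] = 1 + max((depth(p) for p in deps[q]), default=0)
        return memo[q]

    return max((depth(q) for q in deps), default=0)


# Two-round analysis: four independent queries, then one final query
# chosen after looking at all four answers.
two_round: Dependencies = {f"q{i}": [] for i in range(4)}
two_round["final"] = [f"q{i}" for i in range(4)]

# Fully adaptive analysis: every query depends on the one before it.
chain: Dependencies = {
    "q0": [],
    "q1": ["q0"],
    "q2": ["q1"],
    "q3": ["q2"],
    "q4": ["q3"],
}

print(query_count(two_round), adaptivity_depth(two_round))  # 5 2
print(query_count(chain), adaptivity_depth(chain))          # 5 5
```

Both example analyses run five queries, yet their chain depths differ (2 versus 5), which illustrates why the abstract treats the depth bound as a separate measure from the total query count.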