The Cascading Analysts Algorithm

Proceedings of the 2018 International Conference on Management of Data. Pub Date: 2018-05-27. DOI: 10.1145/3183713.3183745

M. Ruhl, Mukund Sundararajan, Qiqi Yan


Citations: 8

Abstract

We study changes in metrics that are defined on a Cartesian product of trees. Such metrics occur naturally in many practical applications, where a global metric (such as revenue) can be broken down along several hierarchical dimensions (such as location, gender, etc.). Given a change in such a metric, our goal is to identify a small set of non-overlapping data segments that account for a majority of the change. An organization interested in improving the metric can then focus its attention on these data segments. Our key contribution is an algorithm that naturally mimics the operation of a hierarchical organization of analysts. The algorithm has been successfully applied within Google's ad platform (AdWords) to help Google's advertisers triage the performance of their advertising campaigns, and within Google Analytics to help website developers understand their traffic. We empirically analyze the runtime and quality of the algorithm by comparing it against benchmarks for a census dataset. We prove theoretical, worst-case bounds on the performance of the algorithm. For instance, we show that the algorithm is optimal for two dimensions, and has an approximation ratio of $\log^{d-2}(n+1)$ for $d \geq 3$ dimensions, where $n$ is the number of input data segments. For the advertising application, we can show that our algorithm is a 2-approximation. To characterize the hardness of the problem, we study data patterns called conflicts. These allow us to construct hard instances of the problem, to derive a lower bound of $1.144^{d-2}$ (again for $d \geq 3$) for our algorithm, and to show that the problem is NP-hard; this justifies our focus on approximation.
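To make the problem statement concrete, the following is a minimal sketch of the selection task on a toy two-dimensional example. It is not the Cascading Analysts algorithm, which the abstract does not spell out; it is an exhaustive-search baseline that only scales to tiny inputs. The data, the function names, and the scoring rule (sum of absolute changes of the chosen segments) are all assumptions made for illustration. Tree nodes are written as path tuples, with `()` as the root of a dimension.

```python
from itertools import combinations, product

# Hypothetical leaf-level metric changes, keyed by (location path, device path).
leaf_changes = {
    (("US", "CA"), ("mobile",)): -40,
    (("US", "CA"), ("desktop",)): 5,
    (("US", "NY"), ("mobile",)): -35,
    (("US", "NY"), ("desktop",)): 4,
    (("DE",), ("mobile",)): 30,
    (("DE",), ("desktop",)): 2,
}

def prefixes(path):
    """All ancestors of a tree node, including the root () and the node itself."""
    return [path[:i] for i in range(len(path) + 1)]

def is_prefix(a, b):
    """True if node a is an ancestor of, or equal to, node b."""
    return b[:len(a)] == a

def overlaps(s, t):
    """Two segments can share a leaf cell only if, in every dimension,
    one node is an ancestor of (or equal to) the other."""
    return all(is_prefix(a, b) or is_prefix(b, a) for a, b in zip(s, t))

def segment_changes(leaves):
    """Aggregate leaf changes up to every segment of the product of trees."""
    totals = {}
    for cell, delta in leaves.items():
        for seg in product(*(prefixes(p) for p in cell)):
            totals[seg] = totals.get(seg, 0) + delta
    return totals

def best_segments(leaves, k):
    """Exhaustively pick at most k pairwise non-overlapping segments that
    maximize the sum of absolute changes (assumed objective)."""
    totals = segment_changes(leaves)
    best, best_score = (), 0
    for size in range(1, k + 1):
        for combo in combinations(totals, size):
            if any(overlaps(s, t) for s, t in combinations(combo, 2)):
                continue
            score = sum(abs(totals[s]) for s in combo)
            if score > best_score:
                best, best_score = combo, score
    return best, best_score

if __name__ == "__main__":
    chosen, score = best_segments(leaf_changes, k=2)
    for seg in chosen:
        print(seg, segment_changes(leaf_changes)[seg])
    print("explained:", score)
```

The search enumerates every subset of up to k segments, so its cost grows combinatorially with the number of segments; given that the abstract shows the problem is NP-hard, an approximation algorithm is the practical alternative at realistic sizes.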

