Estimating rates of rare events with multiple hierarchies through scalable log-linear models

Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Pub Date: 2010-07-25. DOI: 10.1145/1835804.1835834

D. Agarwal, Rahul Agrawal, Rajiv Khanna, Nagaraj Kota


Citations: 81

Abstract

We consider the problem of estimating rates of rare events for high-dimensional, multivariate categorical data where several dimensions are hierarchical. Such problems are routine in several data mining applications including computational advertising, our main focus in this paper. We propose LMMH, a novel log-linear modeling method that scales to massive data applications with billions of training records and several million potential predictors in a map-reduce framework. Our method exploits correlations in aggregates observed at multiple resolutions when working with multiple hierarchies; stable estimates at coarser resolutions provide informative prior information to improve estimates at finer resolutions. Beyond prediction accuracy and scalability, our method has a built-in variable screening procedure based on a "spike and slab prior" that provides parsimony by removing non-informative predictors without hurting predictive accuracy. We perform large-scale experiments on data from real computational advertising applications and illustrate our approach on datasets with several billion records and hundreds of millions of predictors. Extensive comparisons with other benchmark methods show significant improvements in prediction accuracy.
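
The idea that a stable coarse-level estimate can act as a prior for a noisy fine-level one can be illustrated with a much simpler model than LMMH. The sketch below is not the authors' method: it assumes a plain Gamma-Poisson model in which each fine-level rate gets a Gamma prior centred on its parent's rate, so the posterior mean is a weighted blend of the two. The names `shrunk_rate`, `prior_strength`, and the advertiser/ad data are hypothetical and chosen only for illustration.

```python
def shrunk_rate(events, trials, parent_rate, prior_strength):
    """Posterior-mean rate under a Gamma prior centred on the parent rate.

    Assumption: events ~ Poisson(rate * trials) and
    rate ~ Gamma(shape=prior_strength * parent_rate, rate=prior_strength),
    so prior_strength acts as a count of pseudo-trials lent by the parent.
    """
    return (events + prior_strength * parent_rate) / (trials + prior_strength)


# Hypothetical two-level hierarchy: one advertiser (coarse) with three ads (fine).
# Each value is (clicks, impressions).
ads = {"ad_a": (0, 50), "ad_b": (3, 40_000), "ad_c": (1, 200)}

# Coarse-level rate: pooled over all ads, hence far more stable.
total_clicks = sum(clicks for clicks, _ in ads.values())
total_views = sum(views for _, views in ads.values())
advertiser_rate = total_clicks / total_views

smoothed = {}
for name, (clicks, views) in ads.items():
    raw = clicks / views
    smoothed[name] = shrunk_rate(clicks, views, advertiser_rate, prior_strength=1_000)
    print(f"{name}: raw={raw:.6f} smoothed={smoothed[name]:.6f}")
```

Ads with few impressions (`ad_a`, `ad_c`) are pulled strongly toward the advertiser rate, while the well-observed `ad_b` barely moves. In this toy version `prior_strength` is fixed by hand; the abstract indicates that the paper instead handles multiple hierarchies jointly through a log-linear model.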
