Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining最新文献_第7页

Towards scalable critical alert mining 迈向可扩展的关键警报挖掘

Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining Pub Date : 2014-08-24 DOI: 10.1145/2623330.2623729

Bo Zong, Yinghui Wu, Jie Song, Ambuj K. Singh, H. Çam, Jiawei Han, Xifeng Yan

{"title":"Towards scalable critical alert mining","authors":"Bo Zong, Yinghui Wu, Jie Song, Ambuj K. Singh, H. Çam, Jiawei Han, Xifeng Yan","doi":"10.1145/2623330.2623729","DOIUrl":"https://doi.org/10.1145/2623330.2623729","url":null,"abstract":"Performance monitor software for data centers typically generates a great number of alert sequences. These alert sequences indicate abnormal network events. Given a set of observed alert sequences, it is important to identify the most critical alerts that are potentially the causes of others. While the need for mining critical alerts over large scale alert sequences is evident, most alert analysis techniques stop at modeling and mining the causal relations among the alerts. This paper studies the critical alert mining problem: Given a set of alert sequences, we aim to find a set of k critical alerts such that the number of alerts potentially triggered by them is maximized. We show that the problem is intractable; therefore, we resort to approximation and heuristic algorithms. First, we develop an approximation algorithm that obtains a near-optimal alert set in quadratic time, and propose pruning techniques to improve its runtime performance. Moreover, we show a faster approximation exists, when the alerts follow certain causal structure. Second, we propose two fast heuristic algorithms based on tree sampling techniques. On real-life data, these algorithms identify a critical alert from up to 270,000 mined causal relations in 5 seconds; meanwhile, they preserve more than 80% of solution quality, and are up to 5,000 times faster than their approximation counterparts.","PeriodicalId":20536,"journal":{"name":"Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining","volume":"1 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2014-08-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"79931152","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 13

Leveraging user libraries to bootstrap collaborative filtering 利用用户库引导协同过滤

Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining Pub Date : 2014-08-24 DOI: 10.1145/2623330.2623663

Laurent Charlin, R. Zemel, H. Larochelle

引用次数: 9

A hazard based approach to user return time prediction 基于危险的用户返回时间预测方法

Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining Pub Date : 2014-08-24 DOI: 10.1145/2623330.2623348

Komal Kapoor, Mingxuan Sun, J. Srivastava, Tao Ye

{"title":"A hazard based approach to user return time prediction","authors":"Komal Kapoor, Mingxuan Sun, J. Srivastava, Tao Ye","doi":"10.1145/2623330.2623348","DOIUrl":"https://doi.org/10.1145/2623330.2623348","url":null,"abstract":"In the competitive environment of the internet, retaining and growing one's user base is of major concern to most web services. Furthermore, the economic model of many web services is allowing free access to most content, and generating revenue through advertising. This unique model requires securing user time on a site rather than the purchase of good which makes it crucially important to create new kinds of metrics and solutions for growth and retention efforts for web services. In this work, we address this problem by proposing a new retention metric for web services by concentrating on the rate of user return. We further apply predictive analysis to the proposed retention metric on a service, as a means for characterizing lost customers. Finally, we set up a simple yet effective framework to evaluate a multitude of factors that contribute to user return. Specifically, we define the problem of return time prediction for free web services. Our solution is based on the Cox's proportional hazard model from survival analysis. The hazard based approach offers several benefits including the ability to work with censored data, to model the dynamics in user return rates, and to easily incorporate different types of covariates in the model. We compare the performance of our hazard based model in predicting the user return time and in categorizing users into buckets based on their predicted return time, against several baseline regression and classification methods and find the hazard based approach to be superior.","PeriodicalId":20536,"journal":{"name":"Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining","volume":"30 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2014-08-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"82703924","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 61

A dirichlet multinomial mixture model-based approach for short text clustering 基于dirichlet多项混合模型的短文本聚类方法

Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining Pub Date : 2014-08-24 DOI: 10.1145/2623330.2623715

Jianhua Yin, Jianyong Wang

引用次数: 443

Frontiers in E-commerce personalization 电子商务个性化的前沿

Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining Pub Date : 2014-08-24 DOI: 10.1145/2623330.2630820

S. Subramaniam

引用次数: 6

Improved testing of low rank matrices 改进了低秩矩阵的测试

Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining Pub Date : 2014-08-24 DOI: 10.1145/2623330.2623736

Yi Li, Zhengyu Wang, David P. Woodruff

{"title":"Improved testing of low rank matrices","authors":"Yi Li, Zhengyu Wang, David P. Woodruff","doi":"10.1145/2623330.2623736","DOIUrl":"https://doi.org/10.1145/2623330.2623736","url":null,"abstract":"We study the problem of determining if an input matrix A εRm x n can be well-approximated by a low rank matrix. Specifically, we study the problem of quickly estimating the rank or stable rank of A, the latter often providing a more robust measure of the rank. Since we seek significantly sublinear time algorithms, we cast these problems in the property testing framework. In this framework, A either has low rank or stable rank, or is far from having this property. The algorithm should read only a small number of entries or rows of A and decide which case A is in with high probability. If neither case occurs, the output is allowed to be arbitrary. We consider two notions of being far: (1) A requires changing at least an ε-fraction of its entries, or (2) A requires changing at least an ε-fraction of its rows. We call the former the \"entry model\" and the latter the \"row model\". We show: For testing if a matrix has rank at most d in the entry model, we improve the previous number of entries of A that need to be read from O(d2/ε2) (Krauthgamer and Sasson, SODA 2003) to O(d2/ε). Our algorithm is the first to adaptively query the entries of A, which for constant d we show is necessary to achieve O(1/ε) queries. For the important case of d = 1 we also give a new non-adaptive algorithm, improving the previous O(1/ε2) queries to O(log2(1/ε) / ε). For testing if a matrix has rank at most d in the row model, we prove an Ω(d/ε) lower bound on the number of rows that need to be read, even for adaptive algorithms. Our lower bound matches a non-adaptive upper bound of Krauthgamer and Sasson. For testing if a matrix has stable rank at most d in the row model or requires changing an ε/d-fraction of its rows in order to have stable rank at most d, we prove that reading θ(d/ε2) rows is necessary and sufficient. We also give an empirical evaluation of our rank and stable rank algorithms on real and synthetic datasets.","PeriodicalId":20536,"journal":{"name":"Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining","volume":"78 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2014-08-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"83926533","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 11

Scaling out big data missing value imputations: pythia vs. godzilla 扩展大数据缺失价值估算:皮提亚vs哥斯拉

Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining Pub Date : 2014-08-24 DOI: 10.1145/2623330.2623615

C. Anagnostopoulos, P. Triantafillou

{"title":"Scaling out big data missing value imputations: pythia vs. godzilla","authors":"C. Anagnostopoulos, P. Triantafillou","doi":"10.1145/2623330.2623615","DOIUrl":"https://doi.org/10.1145/2623330.2623615","url":null,"abstract":"Solving the missing-value (MV) problem with small estimation errors in big data environments is a notoriously resource-demanding task. As datasets and their user community continuously grow, the problem can only be exacerbated. Assume that it is possible to have a single machine (`Godzilla'), which can store the massive dataset and support an ever-growing community submitting MV imputation requests. Is it possible to replace Godzilla by employing a large number of cohort machines so that imputations can be performed much faster, engaging cohorts in parallel, each of which accesses much smaller partitions of the original dataset? If so, it would be preferable for obvious performance reasons to access only a subset of all cohorts per imputation. In this case, can we decide swiftly which is the desired subset of cohorts to engage per imputation? But efficiency and scalability is just one key concern! Is it possible to do the above while ensuring comparable or even better than Godzilla's imputation estimation errors? In this paper we derive answers to these fundamentals questions and develop principled methods and a framework which offer large performance speed-ups and better, or comparable, errors to that of Godzilla, independently of which missing-value imputation algorithm is used. Our contributions involve Pythia, a framework and algorithms for providing the answers to the above questions and for engaging the appropriate subset of cohorts per MV imputation request. Pythia functionality rests on two pillars: (i) dataset (partition) signatures, one per cohort, and (ii) similarity notions and algorithms, which can identify the appropriate subset of cohorts to engage. Comprehensive experimentation with real and synthetic datasets showcase our efficiency, scalability, and accuracy claims.","PeriodicalId":20536,"journal":{"name":"Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining","volume":"49 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2014-08-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"85916114","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 21

Identifying tourists from public transport commuters 从公共交通通勤者中识别游客

Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining Pub Date : 2014-08-24 DOI: 10.1145/2623330.2623352

Mingqiang Xue, Huayu Wu, Wei Chen, W. Ng, Gin Howe Goh

{"title":"Identifying tourists from public transport commuters","authors":"Mingqiang Xue, Huayu Wu, Wei Chen, W. Ng, Gin Howe Goh","doi":"10.1145/2623330.2623352","DOIUrl":"https://doi.org/10.1145/2623330.2623352","url":null,"abstract":"Tourism industry has become a key economic driver for Singapore. Understanding the behaviors of tourists is very important for the government and private sectors, e.g., restaurants, hotels and advertising companies, to improve their existing services or create new business opportunities. In this joint work with Singapore's Land Transport Authority (LTA), we innovatively apply machine learning techniques to identity the tourists among public commuters using the public transportation data provided by LTA. On successful identification, the travelling patterns of tourists are then revealed and thus allow further analyses to be carried out such as on their favorite destinations, region of stay, etc. Technically, we model the tourists identification as a classification problem, and design an iterative learning algorithm to perform inference with limited prior knowledge and labeled data. We show the superiority of our algorithm with performance evaluation and comparison with other state-of-the-art learning algorithms. Further, we build an interactive web-based system for answering queries regarding the moving patterns of the tourists, which can be used by stakeholders to gain insight into tourists' travelling behaviors in Singapore.","PeriodicalId":20536,"journal":{"name":"Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining","volume":"63 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2014-08-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"86021061","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 20

Relevant overlapping subspace clusters on categorical data 分类数据的相关重叠子空间聚类

Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining Pub Date : 2014-08-24 DOI: 10.1145/2623330.2623652

Xiao He, Jing Feng, B. Konte, S. T. Mai, C. Plant

{"title":"Relevant overlapping subspace clusters on categorical data","authors":"Xiao He, Jing Feng, B. Konte, S. T. Mai, C. Plant","doi":"10.1145/2623330.2623652","DOIUrl":"https://doi.org/10.1145/2623330.2623652","url":null,"abstract":"Clustering categorical data poses some unique challenges: Due to missing order and spacing among the categories, selecting a suitable similarity measure is a difficult task. Many existing techniques require the user to specify input parameters which are difficult to estimate. Moreover, many techniques are limited to detect clusters in the full-dimensional data space. Only few methods exist for subspace clustering and they produce highly redundant results. Therefore, we propose ROCAT (Relevant Overlapping Subspace Clusters on Categorical Data), a novel technique based on the idea of data compression. Following the Minimum Description Length principle, ROCAT automatically detects the most relevant subspace clusters without any input parameter. The relevance of each cluster is validated by its contribution to compress the data. Optimizing the trade-off between goodness-of-fit and model complexity, ROCAT automatically determines a meaningful number of clusters to represent the data. ROCAT is especially designed to detect subspace clusters on categorical data which may overlap in objects and/or attributes; i.e. objects can be assigned to different clusters in different subspaces and attributes may contribute to different subspaces containing clusters. ROCAT naturally avoids undesired redundancy in clusters and subspaces by allowing overlap only if it improves the compression rate. Extensive experiments demonstrate the effectiveness and efficiency of our approach.","PeriodicalId":20536,"journal":{"name":"Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining","volume":"41 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2014-08-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"85718260","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 9

Jointly modeling aspects, ratings and sentiments for movie recommendation (JMARS) 联合建模面向电影推荐的方面、评级和情感(JMARS)

Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining Pub Date : 2014-08-24 DOI: 10.1145/2623330.2623758

Qiming Diao, Minghui Qiu, Chao-Yuan Wu, Alex Smola, Jing Jiang, Chong Wang

{"title":"Jointly modeling aspects, ratings and sentiments for movie recommendation (JMARS)","authors":"Qiming Diao, Minghui Qiu, Chao-Yuan Wu, Alex Smola, Jing Jiang, Chong Wang","doi":"10.1145/2623330.2623758","DOIUrl":"https://doi.org/10.1145/2623330.2623758","url":null,"abstract":"Recommendation and review sites offer a wealth of information beyond ratings. For instance, on IMDb users leave reviews, commenting on different aspects of a movie (e.g. actors, plot, visual effects), and expressing their sentiments (positive or negative) on these aspects in their reviews. This suggests that uncovering aspects and sentiments will allow us to gain a better understanding of users, movies, and the process involved in generating ratings. The ability to answer questions such as \"Does this user care more about the plot or about the special effects?\" or \"What is the quality of the movie in terms of acting?\" helps us to understand why certain ratings are generated. This can be used to provide more meaningful recommendations. In this work we propose a probabilistic model based on collaborative filtering and topic modeling. It allows us to capture the interest distribution of users and the content distribution for movies; it provides a link between interest and relevance on a per-aspect basis and it allows us to differentiate between positive and negative sentiments on a per-aspect basis. Unlike prior work our approach is entirely unsupervised and does not require knowledge of the aspect specific ratings or genres for inference. We evaluate our model on a live copy crawled from IMDb. Our model offers superior performance by joint modeling. Moreover, we are able to address the cold start problem -- by utilizing the information inherent in reviews our model demonstrates improvement for new users and movies.","PeriodicalId":20536,"journal":{"name":"Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining","volume":"6 16","pages":""},"PeriodicalIF":0.0,"publicationDate":"2014-08-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"91418298","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 425