Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining最新文献_第6页

Why It Happened: Identifying and Modeling the Reasons of the Happening of Social Events 为什么会发生:社会事件发生的原因识别与建模

Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining Pub Date : 2015-08-10 DOI: 10.1145/2783258.2783305

Yu Rong, Hong Cheng, Zhiyu Mo

{"title":"Why It Happened: Identifying and Modeling the Reasons of the Happening of Social Events","authors":"Yu Rong, Hong Cheng, Zhiyu Mo","doi":"10.1145/2783258.2783305","DOIUrl":"https://doi.org/10.1145/2783258.2783305","url":null,"abstract":"In nowadays social networks, a huge volume of content containing rich information, such as reviews, ratings, microblogs, etc., is being generated, consumed and diffused by users all the time. Given the temporal information, we can obtain the event cascade which indicates the time sequence of the arrival of information to users. Many models have been proposed to explain how information diffuses. However, most existing models cannot give a clear explanation why every specific event happens in the event cascade. Such explanation is essential for us to have a deeper understanding of information diffusion as well as a better prediction of future event cascade. In order to uncover the mechanism of the happening of social events, we analyze the rating event data crawled from Douban.com, a Chinese social network, from year 2006 to 2011. We distinguish three factors: social, external and intrinsic influence which can explain the emergence of every specific event. Then we use the mixed Poisson process to model event cascade generated by different factors respectively and integrate different Poisson processes with shared parameters. The proposed model, called Combinational Mixed Poisson Process (CMPP) model, can explain not only how information diffuses in social networks, but also why a specific event happens. This model can help us to understand information diffusion from both macroscopic and microscopic perspectives. We develop an efficient Classification EM algorithm to infer the model parameters. The explanatory and predictive power of the proposed model has been demonstrated by the experiments on large real data sets.","PeriodicalId":243428,"journal":{"name":"Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining","volume":"127 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-08-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122166747","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 12

Reconstructing Textual Documents from n-grams 从n-grams重构文本文档

Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining Pub Date : 2015-08-10 DOI: 10.1145/2783258.2783361

Matthias Gallé, Matías Tealdi

{"title":"Reconstructing Textual Documents from n-grams","authors":"Matthias Gallé, Matías Tealdi","doi":"10.1145/2783258.2783361","DOIUrl":"https://doi.org/10.1145/2783258.2783361","url":null,"abstract":"We analyze the problem of reconstructing documents when we only have access to the n-grams (for n fixed) and their counts from the original documents. Formally, we are interested in recovering the longest contiguous substrings of whose presence in the original documents we are certain. We map this problem on a de Bruijn graph, where the n-grams form the edges and where every Eulerian cycles gives a plausible reconstruction. We define two rules that reduce this graph, preserving all possible reconstructions while at the same time increasing the length of the edge labels. From a theoretical perspective we prove that the iterative application of these rules gives an irreducible graph equivalent to the original one. We then apply this on the data from the Gutenberg project to measure the number and size of the obtained longest substrings. Moreoever, we analyze how the n-gram corpus could be noised to prevent reconstruction, showing empirically that removing low frequent n-grams has little impact. Instead, we propose another method consisting in adding strategically fictitious n-grams and show that a noised corpus like that is much harder to reconstruct while increasing only little the perplexity of a language model obtained through it.","PeriodicalId":243428,"journal":{"name":"Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining","volume":"36 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-08-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125192588","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 3

Scalable Blocking for Privacy Preserving Record Linkage 隐私保护记录链接的可伸缩阻塞

Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining Pub Date : 2015-08-10 DOI: 10.1145/2783258.2783290

Alexandros Karakasidis, Georgia Koloniari, Vassilios S. Verykios

{"title":"Scalable Blocking for Privacy Preserving Record Linkage","authors":"Alexandros Karakasidis, Georgia Koloniari, Vassilios S. Verykios","doi":"10.1145/2783258.2783290","DOIUrl":"https://doi.org/10.1145/2783258.2783290","url":null,"abstract":"When dealing with sensitive and personal user data, the process of record linkage raises privacy issues. Thus, privacy preserving record linkage has emerged with the goal of identifying matching records across multiple data sources while preserving the privacy of the individuals they describe. The task is very resource demanding, considering the abundance of available data, which, in addition, are often dirty. Blocking techniques are deployed prior to matching to prune out unlikely to match candidate records so as to reduce processing time. However, when scaling to large datasets, such methods often result in quality loss. To this end, we propose Multi-Sampling Transitive Closure for Encrypted Fields (MS-TCEF), a novel privacy preserving blocking technique based on the use of reference sets. Our new method effectively prunes records based on redundant assignments to blocks, providing better fault-tolerance and maintaining result quality while scaling linearly with respect to the dataset size. We provide a theoretical analysis on the method's complexity and show how it outperforms state-of-the-art privacy preserving blocking techniques with respect to both recall and processing cost.","PeriodicalId":243428,"journal":{"name":"Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-08-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129486944","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 21

Inferring Air Quality for Station Location Recommendation Based on Urban Big Data 基于城市大数据的站点选址空气质量推断

Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining Pub Date : 2015-08-10 DOI: 10.1145/2783258.2783344

Hsun-Ping Hsieh, Shou-de Lin, Yu Zheng

引用次数: 152

Modeling User Mobility for Location Promotion in Location-based Social Networks 基于位置的社交网络中位置推广的用户移动性建模

Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining Pub Date : 2015-08-10 DOI: 10.1145/2783258.2783331

Wen-Yuan Zhu, Wen-Chih Peng, Ling-Jyh Chen, Kai Zheng, Xiaofang Zhou

{"title":"Modeling User Mobility for Location Promotion in Location-based Social Networks","authors":"Wen-Yuan Zhu, Wen-Chih Peng, Ling-Jyh Chen, Kai Zheng, Xiaofang Zhou","doi":"10.1145/2783258.2783331","DOIUrl":"https://doi.org/10.1145/2783258.2783331","url":null,"abstract":"With the explosion of smartphones and social network services, location-based social networks (LBSNs) are increasingly seen as tools for businesses (e.g., restaurants, hotels) to promote their products and services. In this paper, we investigate the key techniques that can help businesses promote their locations by advertising wisely through the underlying LBSNs. In order to maximize the benefit of location promotion, we formalize it as an influence maximization problem in an LBSN, i.e., given a target location and an LBSN, which a set of k users (called seeds) should be advertised initially such that they can successfully propagate and attract most other users to visit the target location. Existing studies have proposed different ways to calculate the information propagation probability, that is how likely a user may influence another, in the settings of static social network. However, it is more challenging to derive the propagation probability in an LBSN since it is heavily affected by the target location and the user mobility, both of which are dynamic and query dependent. This paper proposes two user mobility models, namely Gaussian-based and distance-based mobility models, to capture the check-in behavior of individual LBSN user, based on which location-aware propagation probabilities can be derived respectively. Extensive experiments based on two real LBSN datasets have demonstrated the superior effectiveness of our proposals than existing static models of propagation probabilities to truly reflect the information propagation in LBSNs.","PeriodicalId":243428,"journal":{"name":"Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-08-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129838151","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 87

Dynamic Poisson Autoregression for Influenza-Like-Illness Case Count Prediction 流感样疾病病例数预测的动态泊松自回归

Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining Pub Date : 2015-08-10 DOI: 10.1145/2783258.2783291

Z. Wang, Prithwish Chakraborty, S. Mekaru, J. Brownstein, Jieping Ye, Naren Ramakrishnan

{"title":"Dynamic Poisson Autoregression for Influenza-Like-Illness Case Count Prediction","authors":"Z. Wang, Prithwish Chakraborty, S. Mekaru, J. Brownstein, Jieping Ye, Naren Ramakrishnan","doi":"10.1145/2783258.2783291","DOIUrl":"https://doi.org/10.1145/2783258.2783291","url":null,"abstract":"Influenza-like-illness (ILI) is among of the most common diseases worldwide, and reliable forecasting of the same can have significant public health benefits. Recently, new forms of disease surveillance based upon digital data sources have been proposed and are continuing to attract attention over traditional surveillance methods. In this paper, we focus on short-term ILI case count prediction and develop a dynamic Poisson autoregressive model with exogenous inputs variables (DPARX) for flu forecasting. In this model, we allow the autoregressive model to change over time. In order to control the variation in the model, we construct a model similarity graph to specify the relationship between pairs of models at two time points and embed prior knowledge in terms of the structure of the graph. We formulate ILI case count forecasting as a convex optimization problem, whose objective balances the autoregressive loss and the model similarity regularization induced by the structure of the similarity graph. We then propose an efficient algorithm to solve this problem by block coordinate descent. We apply our model and the corresponding learning method on historical ILI records for 15 countries around the world using a variety of syndromic surveillance data sources. Our approach provides consistently better forecasting results than state-of-the-art models available for short-term ILI case count forecasting.","PeriodicalId":243428,"journal":{"name":"Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining","volume":"77 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-08-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122611641","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 48

Traffic Measurement and Route Recommendation System for Mass Rapid Transit (MRT) 捷运流量计量与路线推荐系统

Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining Pub Date : 2015-08-10 DOI: 10.1145/2783258.2788590

Thomas Holleczek, D. Anh, Shanyang Yin, Yunye Jin, S. Antonatos, H. Goh, Samantha Low, A. Nash

{"title":"Traffic Measurement and Route Recommendation System for Mass Rapid Transit (MRT)","authors":"Thomas Holleczek, D. Anh, Shanyang Yin, Yunye Jin, S. Antonatos, H. Goh, Samantha Low, A. Nash","doi":"10.1145/2783258.2788590","DOIUrl":"https://doi.org/10.1145/2783258.2788590","url":null,"abstract":"Understanding how people use public transport is important for the operation and future planning of the underlying transport networks. We have therefore developed and deployed a traffic measurement system for a key player in the transportation industry to gain insights into crowd behavior for planning purposes. The system has been in operation for several months and reports, at hourly intervals, (1) the crowdedness of subway stations, (2) the flows of people inside interchange stations, and (3) the expected travel time for each possible route in the subway network of Singapore. The core of our system is an efficient algorithm which detects individual subway trips from anonymized real-time data generated by the location based system of Singtel, the country's largest telecommunications company. To assess the accuracy of our system, we engaged an independent market research company to conduct a field study--a manual count of the number of passengers boarding and disembarking at a selected station on three separate days. A strong correlation between the calculations of our algorithm and the manual counts was found. One of our key findings is that travelers do not always choose the route with the shortest travel time in the subway network of Singapore. We have therefore also been developing a mobile app which allows users to plan their trips based on the average travel time between stations.","PeriodicalId":243428,"journal":{"name":"Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-08-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122311015","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 23

Robust Treecode Approximation for Kernel Machines 核机的鲁棒树码逼近

Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining Pub Date : 2015-08-10 DOI: 10.1145/2783258.2783272

William B. March, Bo Xiao, Sameer Tharakan, Chenhan D. Yu, G. Biros

{"title":"Robust Treecode Approximation for Kernel Machines","authors":"William B. March, Bo Xiao, Sameer Tharakan, Chenhan D. Yu, G. Biros","doi":"10.1145/2783258.2783272","DOIUrl":"https://doi.org/10.1145/2783258.2783272","url":null,"abstract":"Since exact evaluation of a kernel matrix requires O(N2) work, scalable learning algorithms using kernels must approximate the kernel matrix. This approximation must be robust to the kernel parameters, for example the bandwidth for the Gaussian kernel. We consider two approximation methods: Nystrom and an algebraic treecode developed in our group. Nystrom methods construct a global low-rank approximation of the kernel matrix. Treecodes approximate just the off-diagonal blocks, typically using a hierarchical decomposition. We present a theoretical error analysis of our treecode and relate it to the error of Nystrom methods. Our analysis reveals how the block-rank structure of the kernel matrix controls the performance of the treecode. We evaluate our treecode by comparing it to the classical Nystrom method and a state-of-the-art fast approximate Nystrom method. We test the kernel matrix approximation accuracy for several different bandwidths and datasets. On the MNIST2M dataset (2M points in 784 dimensions) for a Gaussian kernel with bandwidth h=1, the Nystrom methods' error is over 90% whereas our treecode delivers error less than 1%. We also test the performance of the three methods on binary classification using two models: a Bayes classifier and kernel ridge regression. Our evaluation reveals the existence of bandwidth values that should be examined in cross-validation but whose corresponding kernel matrices cannot be approximated well by Nystrom methods. In contrast, the treecode scheme performs much better for these values.","PeriodicalId":243428,"journal":{"name":"Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining","volume":"59 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-08-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126764137","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 11

Discovering and Exploiting Deterministic Label Relationships in Multi-Label Learning 发现和利用多标签学习中的确定性标签关系

Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining Pub Date : 2015-08-10 DOI: 10.1145/2783258.2783302

C. Papagiannopoulou, Grigorios Tsoumakas, I. Tsamardinos

引用次数: 19

Building Discriminative User Profiles for Large-scale Content Recommendation 为大规模内容推荐建立判别用户配置文件

Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining Pub Date : 2015-08-10 DOI: 10.1145/2783258.2788610

Erheng Zhong, N. Liu, Yue Shi, Suju Rajan

{"title":"Building Discriminative User Profiles for Large-scale Content Recommendation","authors":"Erheng Zhong, N. Liu, Yue Shi, Suju Rajan","doi":"10.1145/2783258.2788610","DOIUrl":"https://doi.org/10.1145/2783258.2788610","url":null,"abstract":"Content recommendation systems are typically based on one of the following paradigms: user based customization, or recommendations based on either collaborative filtering or low rank matrix factorization methods, or with systems that impute user interest profiles based on content browsing behavior and retrieve items similar to the interest profiles. All of these systems have a distinct disadvantage, namely data sparsity and cold-start on items or users. Furthermore, very few content recommendation solutions explicitly model the wealth of information in implicit negative feedback from the users. In this paper, we propose a hybrid solution that makes use of a latent factor model to infer user interest vectors. The hybrid approach enables us to overcome both the data sparsity and cold-start problems. Our proposed method is learned purely on implicit user feedback, both positive and negative. Exploiting the information in the negative feedback allows the user profiles generated to be discriminative. We also provide a Map/Reduce framework based implementation that enables scaling our solution to real-world recommendation problems. We demonstrate the efficacy of our proposed approach with both offline experiments and A/B tests on live traffic on Yahoo properties.","PeriodicalId":243428,"journal":{"name":"Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining","volume":"18 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-08-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127783661","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 23