Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining最新文献

筛选
英文 中文
Why It Happened: Identifying and Modeling the Reasons of the Happening of Social Events 为什么会发生:社会事件发生的原因识别与建模
Yu Rong, Hong Cheng, Zhiyu Mo
{"title":"Why It Happened: Identifying and Modeling the Reasons of the Happening of Social Events","authors":"Yu Rong, Hong Cheng, Zhiyu Mo","doi":"10.1145/2783258.2783305","DOIUrl":"https://doi.org/10.1145/2783258.2783305","url":null,"abstract":"In nowadays social networks, a huge volume of content containing rich information, such as reviews, ratings, microblogs, etc., is being generated, consumed and diffused by users all the time. Given the temporal information, we can obtain the event cascade which indicates the time sequence of the arrival of information to users. Many models have been proposed to explain how information diffuses. However, most existing models cannot give a clear explanation why every specific event happens in the event cascade. Such explanation is essential for us to have a deeper understanding of information diffusion as well as a better prediction of future event cascade. In order to uncover the mechanism of the happening of social events, we analyze the rating event data crawled from Douban.com, a Chinese social network, from year 2006 to 2011. We distinguish three factors: social, external and intrinsic influence which can explain the emergence of every specific event. Then we use the mixed Poisson process to model event cascade generated by different factors respectively and integrate different Poisson processes with shared parameters. The proposed model, called Combinational Mixed Poisson Process (CMPP) model, can explain not only how information diffuses in social networks, but also why a specific event happens. This model can help us to understand information diffusion from both macroscopic and microscopic perspectives. We develop an efficient Classification EM algorithm to infer the model parameters. The explanatory and predictive power of the proposed model has been demonstrated by the experiments on large real data sets.","PeriodicalId":243428,"journal":{"name":"Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining","volume":"127 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-08-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122166747","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 12
Reconstructing Textual Documents from n-grams 从n-grams重构文本文档
Matthias Gallé, Matías Tealdi
{"title":"Reconstructing Textual Documents from n-grams","authors":"Matthias Gallé, Matías Tealdi","doi":"10.1145/2783258.2783361","DOIUrl":"https://doi.org/10.1145/2783258.2783361","url":null,"abstract":"We analyze the problem of reconstructing documents when we only have access to the n-grams (for n fixed) and their counts from the original documents. Formally, we are interested in recovering the longest contiguous substrings of whose presence in the original documents we are certain. We map this problem on a de Bruijn graph, where the n-grams form the edges and where every Eulerian cycles gives a plausible reconstruction. We define two rules that reduce this graph, preserving all possible reconstructions while at the same time increasing the length of the edge labels. From a theoretical perspective we prove that the iterative application of these rules gives an irreducible graph equivalent to the original one. We then apply this on the data from the Gutenberg project to measure the number and size of the obtained longest substrings. Moreoever, we analyze how the n-gram corpus could be noised to prevent reconstruction, showing empirically that removing low frequent n-grams has little impact. Instead, we propose another method consisting in adding strategically fictitious n-grams and show that a noised corpus like that is much harder to reconstruct while increasing only little the perplexity of a language model obtained through it.","PeriodicalId":243428,"journal":{"name":"Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining","volume":"36 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-08-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125192588","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 3
Scalable Blocking for Privacy Preserving Record Linkage 隐私保护记录链接的可伸缩阻塞
Alexandros Karakasidis, Georgia Koloniari, Vassilios S. Verykios
{"title":"Scalable Blocking for Privacy Preserving Record Linkage","authors":"Alexandros Karakasidis, Georgia Koloniari, Vassilios S. Verykios","doi":"10.1145/2783258.2783290","DOIUrl":"https://doi.org/10.1145/2783258.2783290","url":null,"abstract":"When dealing with sensitive and personal user data, the process of record linkage raises privacy issues. Thus, privacy preserving record linkage has emerged with the goal of identifying matching records across multiple data sources while preserving the privacy of the individuals they describe. The task is very resource demanding, considering the abundance of available data, which, in addition, are often dirty. Blocking techniques are deployed prior to matching to prune out unlikely to match candidate records so as to reduce processing time. However, when scaling to large datasets, such methods often result in quality loss. To this end, we propose Multi-Sampling Transitive Closure for Encrypted Fields (MS-TCEF), a novel privacy preserving blocking technique based on the use of reference sets. Our new method effectively prunes records based on redundant assignments to blocks, providing better fault-tolerance and maintaining result quality while scaling linearly with respect to the dataset size. We provide a theoretical analysis on the method's complexity and show how it outperforms state-of-the-art privacy preserving blocking techniques with respect to both recall and processing cost.","PeriodicalId":243428,"journal":{"name":"Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-08-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129486944","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 21
Inferring Air Quality for Station Location Recommendation Based on Urban Big Data 基于城市大数据的站点选址空气质量推断
Hsun-Ping Hsieh, Shou-de Lin, Yu Zheng
{"title":"Inferring Air Quality for Station Location Recommendation Based on Urban Big Data","authors":"Hsun-Ping Hsieh, Shou-de Lin, Yu Zheng","doi":"10.1145/2783258.2783344","DOIUrl":"https://doi.org/10.1145/2783258.2783344","url":null,"abstract":"This paper tries to answer two questions. First, how to infer real-time air quality of any arbitrary location given environmental data and historical air quality data from very sparse monitoring locations. Second, if one needs to establish few new monitoring stations to improve the inference quality, how to determine the best locations for such purpose? The problems are challenging since for most of the locations (>99%) in a city we do not have any air quality data to train a model from. We design a semi-supervised inference model utilizing existing monitoring data together with heterogeneous city dynamics, including meteorology, human mobility, structure of road networks, and point of interests (POIs). We also propose an entropy-minimization model to suggest the best locations to establish new monitoring stations. We evaluate the proposed approach using Beijing air quality data, resulting in clear advantages over a series of state-of-the-art and commonly used methods.","PeriodicalId":243428,"journal":{"name":"Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining","volume":"49 2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-08-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128527557","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 152
Modeling User Mobility for Location Promotion in Location-based Social Networks 基于位置的社交网络中位置推广的用户移动性建模
Wen-Yuan Zhu, Wen-Chih Peng, Ling-Jyh Chen, Kai Zheng, Xiaofang Zhou
{"title":"Modeling User Mobility for Location Promotion in Location-based Social Networks","authors":"Wen-Yuan Zhu, Wen-Chih Peng, Ling-Jyh Chen, Kai Zheng, Xiaofang Zhou","doi":"10.1145/2783258.2783331","DOIUrl":"https://doi.org/10.1145/2783258.2783331","url":null,"abstract":"With the explosion of smartphones and social network services, location-based social networks (LBSNs) are increasingly seen as tools for businesses (e.g., restaurants, hotels) to promote their products and services. In this paper, we investigate the key techniques that can help businesses promote their locations by advertising wisely through the underlying LBSNs. In order to maximize the benefit of location promotion, we formalize it as an influence maximization problem in an LBSN, i.e., given a target location and an LBSN, which a set of k users (called seeds) should be advertised initially such that they can successfully propagate and attract most other users to visit the target location. Existing studies have proposed different ways to calculate the information propagation probability, that is how likely a user may influence another, in the settings of static social network. However, it is more challenging to derive the propagation probability in an LBSN since it is heavily affected by the target location and the user mobility, both of which are dynamic and query dependent. This paper proposes two user mobility models, namely Gaussian-based and distance-based mobility models, to capture the check-in behavior of individual LBSN user, based on which location-aware propagation probabilities can be derived respectively. Extensive experiments based on two real LBSN datasets have demonstrated the superior effectiveness of our proposals than existing static models of propagation probabilities to truly reflect the information propagation in LBSNs.","PeriodicalId":243428,"journal":{"name":"Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-08-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129838151","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 87
Dynamic Poisson Autoregression for Influenza-Like-Illness Case Count Prediction 流感样疾病病例数预测的动态泊松自回归
Z. Wang, Prithwish Chakraborty, S. Mekaru, J. Brownstein, Jieping Ye, Naren Ramakrishnan
{"title":"Dynamic Poisson Autoregression for Influenza-Like-Illness Case Count Prediction","authors":"Z. Wang, Prithwish Chakraborty, S. Mekaru, J. Brownstein, Jieping Ye, Naren Ramakrishnan","doi":"10.1145/2783258.2783291","DOIUrl":"https://doi.org/10.1145/2783258.2783291","url":null,"abstract":"Influenza-like-illness (ILI) is among of the most common diseases worldwide, and reliable forecasting of the same can have significant public health benefits. Recently, new forms of disease surveillance based upon digital data sources have been proposed and are continuing to attract attention over traditional surveillance methods. In this paper, we focus on short-term ILI case count prediction and develop a dynamic Poisson autoregressive model with exogenous inputs variables (DPARX) for flu forecasting. In this model, we allow the autoregressive model to change over time. In order to control the variation in the model, we construct a model similarity graph to specify the relationship between pairs of models at two time points and embed prior knowledge in terms of the structure of the graph. We formulate ILI case count forecasting as a convex optimization problem, whose objective balances the autoregressive loss and the model similarity regularization induced by the structure of the similarity graph. We then propose an efficient algorithm to solve this problem by block coordinate descent. We apply our model and the corresponding learning method on historical ILI records for 15 countries around the world using a variety of syndromic surveillance data sources. Our approach provides consistently better forecasting results than state-of-the-art models available for short-term ILI case count forecasting.","PeriodicalId":243428,"journal":{"name":"Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining","volume":"77 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-08-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122611641","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 48
Traffic Measurement and Route Recommendation System for Mass Rapid Transit (MRT) 捷运流量计量与路线推荐系统
Thomas Holleczek, D. Anh, Shanyang Yin, Yunye Jin, S. Antonatos, H. Goh, Samantha Low, A. Nash
{"title":"Traffic Measurement and Route Recommendation System for Mass Rapid Transit (MRT)","authors":"Thomas Holleczek, D. Anh, Shanyang Yin, Yunye Jin, S. Antonatos, H. Goh, Samantha Low, A. Nash","doi":"10.1145/2783258.2788590","DOIUrl":"https://doi.org/10.1145/2783258.2788590","url":null,"abstract":"Understanding how people use public transport is important for the operation and future planning of the underlying transport networks. We have therefore developed and deployed a traffic measurement system for a key player in the transportation industry to gain insights into crowd behavior for planning purposes. The system has been in operation for several months and reports, at hourly intervals, (1) the crowdedness of subway stations, (2) the flows of people inside interchange stations, and (3) the expected travel time for each possible route in the subway network of Singapore. The core of our system is an efficient algorithm which detects individual subway trips from anonymized real-time data generated by the location based system of Singtel, the country's largest telecommunications company. To assess the accuracy of our system, we engaged an independent market research company to conduct a field study--a manual count of the number of passengers boarding and disembarking at a selected station on three separate days. A strong correlation between the calculations of our algorithm and the manual counts was found. One of our key findings is that travelers do not always choose the route with the shortest travel time in the subway network of Singapore. We have therefore also been developing a mobile app which allows users to plan their trips based on the average travel time between stations.","PeriodicalId":243428,"journal":{"name":"Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-08-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122311015","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 23
Robust Treecode Approximation for Kernel Machines 核机的鲁棒树码逼近
William B. March, Bo Xiao, Sameer Tharakan, Chenhan D. Yu, G. Biros
{"title":"Robust Treecode Approximation for Kernel Machines","authors":"William B. March, Bo Xiao, Sameer Tharakan, Chenhan D. Yu, G. Biros","doi":"10.1145/2783258.2783272","DOIUrl":"https://doi.org/10.1145/2783258.2783272","url":null,"abstract":"Since exact evaluation of a kernel matrix requires O(N2) work, scalable learning algorithms using kernels must approximate the kernel matrix. This approximation must be robust to the kernel parameters, for example the bandwidth for the Gaussian kernel. We consider two approximation methods: Nystrom and an algebraic treecode developed in our group. Nystrom methods construct a global low-rank approximation of the kernel matrix. Treecodes approximate just the off-diagonal blocks, typically using a hierarchical decomposition. We present a theoretical error analysis of our treecode and relate it to the error of Nystrom methods. Our analysis reveals how the block-rank structure of the kernel matrix controls the performance of the treecode. We evaluate our treecode by comparing it to the classical Nystrom method and a state-of-the-art fast approximate Nystrom method. We test the kernel matrix approximation accuracy for several different bandwidths and datasets. On the MNIST2M dataset (2M points in 784 dimensions) for a Gaussian kernel with bandwidth h=1, the Nystrom methods' error is over 90% whereas our treecode delivers error less than 1%. We also test the performance of the three methods on binary classification using two models: a Bayes classifier and kernel ridge regression. Our evaluation reveals the existence of bandwidth values that should be examined in cross-validation but whose corresponding kernel matrices cannot be approximated well by Nystrom methods. In contrast, the treecode scheme performs much better for these values.","PeriodicalId":243428,"journal":{"name":"Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining","volume":"59 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-08-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126764137","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 11
Discovering and Exploiting Deterministic Label Relationships in Multi-Label Learning 发现和利用多标签学习中的确定性标签关系
C. Papagiannopoulou, Grigorios Tsoumakas, I. Tsamardinos
{"title":"Discovering and Exploiting Deterministic Label Relationships in Multi-Label Learning","authors":"C. Papagiannopoulou, Grigorios Tsoumakas, I. Tsamardinos","doi":"10.1145/2783258.2783302","DOIUrl":"https://doi.org/10.1145/2783258.2783302","url":null,"abstract":"This work presents a probabilistic method for enforcing adherence of the marginal probabilities of a multi-label model to automatically discovered deterministic relationships among labels. In particular we focus on discovering two kinds of relationships among the labels. The first one concerns pairwise positive entailment: pairs of labels, where the presence of one implies the presence of the other in all instances of a dataset. The second concerns exclusion: sets of labels that do not coexist in the same instances of the dataset. These relationships are represented as a deterministic Bayesian network. Marginal probabilities are entered as soft evidence in the network and through probabilistic inference become consistent with the discovered knowledge. Our approach offers robust improvements in mean average precision compared to the standard binary relevance approach across all 12 datasets involved in our experiments. The discovery process helps interesting implicit knowledge to emerge, which could be useful in itself.","PeriodicalId":243428,"journal":{"name":"Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining","volume":"24 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-08-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126361964","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 19
Building Discriminative User Profiles for Large-scale Content Recommendation 为大规模内容推荐建立判别用户配置文件
Erheng Zhong, N. Liu, Yue Shi, Suju Rajan
{"title":"Building Discriminative User Profiles for Large-scale Content Recommendation","authors":"Erheng Zhong, N. Liu, Yue Shi, Suju Rajan","doi":"10.1145/2783258.2788610","DOIUrl":"https://doi.org/10.1145/2783258.2788610","url":null,"abstract":"Content recommendation systems are typically based on one of the following paradigms: user based customization, or recommendations based on either collaborative filtering or low rank matrix factorization methods, or with systems that impute user interest profiles based on content browsing behavior and retrieve items similar to the interest profiles. All of these systems have a distinct disadvantage, namely data sparsity and cold-start on items or users. Furthermore, very few content recommendation solutions explicitly model the wealth of information in implicit negative feedback from the users. In this paper, we propose a hybrid solution that makes use of a latent factor model to infer user interest vectors. The hybrid approach enables us to overcome both the data sparsity and cold-start problems. Our proposed method is learned purely on implicit user feedback, both positive and negative. Exploiting the information in the negative feedback allows the user profiles generated to be discriminative. We also provide a Map/Reduce framework based implementation that enables scaling our solution to real-world recommendation problems. We demonstrate the efficacy of our proposed approach with both offline experiments and A/B tests on live traffic on Yahoo properties.","PeriodicalId":243428,"journal":{"name":"Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining","volume":"18 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-08-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127783661","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 23
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
相关产品
×
本文献相关产品
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:604180095
Book学术官方微信