{"title":"Accelerating Dynamic Time Warping Clustering with a Novel Admissible Pruning Strategy","authors":"Nurjahan Begum, Liudmila Ulanova, Jun Wang, Eamonn J. Keogh","doi":"10.1145/2783258.2783286","DOIUrl":"https://doi.org/10.1145/2783258.2783286","url":null,"abstract":"Clustering time series is a useful operation in its own right, and an important subroutine in many higher-level data mining analyses, including data editing for classifiers, summarization, and outlier detection. While it has been noted that the general superiority of Dynamic Time Warping (DTW) over Euclidean Distance for similarity search diminishes as we consider ever larger datasets, as we shall show, the same is not true for clustering. Thus, clustering time series under DTW remains a computationally challenging task. In this work, we address this lethargy in two ways. We propose a novel pruning strategy that exploits both upper and lower bounds to prune off a large fraction of the expensive distance calculations. This pruning strategy is admissible, giving us provably identical results to the brute-force algorithm while being at least an order of magnitude faster. For datasets where even this level of speedup is inadequate, we show that we can use a simple heuristic to order the unavoidable calculations in a most-useful-first ordering, thus casting the clustering as an anytime algorithm. 
We demonstrate the utility of our ideas with both single and multidimensional case studies in the domains of astronomy, speech physiology, medicine and entomology.","PeriodicalId":243428,"journal":{"name":"Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-08-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128889635","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
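The admissible pruning described above rests on cheap lower bounds for DTW. As a minimal illustration (not the authors' full clustering algorithm, which also exploits upper bounds), here is a sketch of nearest-neighbor search that skips the expensive full DTW computation whenever the classic LB_Keogh lower bound already exceeds the best exact distance seen so far; all function names are our own:

```python
from math import inf

def lb_keogh(q, c, r):
    """LB_Keogh: lower bound on band-constrained DTW(q, c) with radius r."""
    total = 0.0
    for i, qi in enumerate(q):
        window = c[max(0, i - r): i + r + 1]
        upper, lower = max(window), min(window)
        if qi > upper:
            total += (qi - upper) ** 2
        elif qi < lower:
            total += (qi - lower) ** 2
    return total

def dtw(q, c, r):
    """Squared-distance DTW with a Sakoe-Chiba band of radius r."""
    n, m = len(q), len(c)
    D = [[inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(max(1, i - r), min(m, i + r) + 1):
            cost = (q[i - 1] - c[j - 1]) ** 2
            D[i][j] = cost + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m]

def nn_index(query, candidates, r=3):
    """Index of the DTW nearest neighbor, pruning with the lower bound.

    Pruning is admissible: a candidate is skipped only when its lower
    bound already exceeds the best exact distance found so far, so the
    answer is identical to brute force.
    """
    best_idx, best_dist = -1, inf
    # Cheapest-lower-bound-first ordering tightens best_dist sooner.
    order = sorted(range(len(candidates)),
                   key=lambda i: lb_keogh(query, candidates[i], r))
    for i in order:
        if lb_keogh(query, candidates[i], r) >= best_dist:
            continue  # admissible prune: DTW cannot beat best_dist
        d = dtw(query, candidates[i], r)
        if d < best_dist:
            best_idx, best_dist = i, d
    return best_idx
```

The most-useful-first ordering in the abstract plays an analogous role at the clustering level: spend the unavoidable DTW computations where they are most likely to change the result first.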
{"title":"Gas Concentration Reconstruction for Coal-Fired Boilers Using Gaussian Process","authors":"Chao Yuan, Matthias Behmann, B. Meerbeck","doi":"10.1145/2783258.2788617","DOIUrl":"https://doi.org/10.1145/2783258.2788617","url":null,"abstract":"The goal of combustion optimization of a coal-fired boiler is to improve its operating efficiency while reducing emissions at the same time. Being able to take measurements for key combustion ingredients, such as O2, CO, and H2O, is crucial for the feedback loop needed by this task. One state-of-the-art laser technique, namely, Tunable Diode Laser Absorption Spectroscopy (TDLAS), is able to measure the average value of gas concentration along a laser beam path. An active research direction in TDLAS is how to reconstruct gas concentration images based on these path averages. However, in reality the number of such paths is usually very limited, leading to an extremely under-constrained estimation problem. Another overlooked aspect of the problem is how we can arrange paths so that the reconstructed image is more accurate. We propose a Bayesian approach based on a Gaussian process (GP) to address both the image reconstruction and path arrangement problems simultaneously. Specifically, we use the GP posterior mean as the reconstructed image, and average posterior pixel variance as our objective function to optimize the path arrangement. 
Our algorithms have been integrated into the Siemens SPPA-P3000 control system, which provides real-time combustion optimization of boilers around the world.","PeriodicalId":243428,"journal":{"name":"Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-08-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129037506","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
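The estimator described in this abstract has a compact closed form: with a GP prior f ~ N(0, K) over pixel values and path measurements y = A f + noise, where each row of A averages the pixels a beam crosses, the posterior mean K Aᵀ(A K Aᵀ + σ²I)⁻¹y is the reconstructed image, and the mean posterior pixel variance scores a candidate path layout before any measurement is taken. A small numpy sketch under these assumptions (the naming is ours, not from the Siemens system):

```python
import numpy as np

def rbf_kernel(coords, length_scale=2.0):
    """Squared-exponential prior covariance between grid pixels."""
    d2 = ((coords[:, None, :] - coords[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-0.5 * d2 / length_scale**2)

def gp_reconstruct(A, y, K, noise_var=1e-4):
    """Posterior mean and per-pixel variance of f given y = A f + noise."""
    S = A @ K @ A.T + noise_var * np.eye(len(y))
    mean = K @ A.T @ np.linalg.solve(S, y)
    cov = K - K @ A.T @ np.linalg.solve(S, A @ K)
    return mean, np.diag(cov)

def layout_score(A, K, noise_var=1e-4):
    """Average posterior pixel variance of a path layout (lower is better).

    Depends only on A and K, so candidate beam arrangements can be
    compared before any gas concentration is actually measured.
    """
    S = A @ K @ A.T + noise_var * np.eye(A.shape[0])
    cov = K - K @ A.T @ np.linalg.solve(S, A @ K)
    return float(np.mean(np.diag(cov)))
```

Because `layout_score` needs no data, the path-arrangement question reduces to minimizing it over feasible beam geometries.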
{"title":"Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining","authors":"Longbing Cao, Chengqi Zhang, T. Joachims, G. Webb, D. Margineantu, Graham J. Williams","doi":"10.1145/2783258","DOIUrl":"https://doi.org/10.1145/2783258","url":null,"abstract":"It is our great pleasure to welcome you to the 21st ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD). The annual ACM SIGKDD conference is the premier international forum for data science, data mining, knowledge discovery and big data. It brings together researchers and practitioners from academia, industry, and government to share their ideas, research results and experiences.\n\nKDD-2015 features 4 plenary keynote presentations, 12 invited talks, 228 paper presentations, a discussion panel, a poster session, 14 workshops, 12 tutorials, 27 exhibition booths, the KDD Cup competition, and a banquet at the Dockside Pavilion at the Sydney Darling Harbour. As always, KDD-2015 attracted presenters and delegates from around the world. It is with great pleasure that we bring this international conference for the first time to the southern hemisphere.\n\nThis year we again had a strong set of submissions. There were 819 submissions to the Research Track, of which 160 papers were accepted. There were 189 submissions to the Industry & Government Track, of which 68 papers were accepted. All papers submitted to the Research Track and to the Industry & Government Track were subjected to a rigorous review process. They were initially screened by the Chairs of the respective tracks, and a small number of papers that did not comply with the formatting requirements or which violated the dual submission policy were summarily rejected. At least three reviewers and a meta-reviewer were assigned to all remaining papers based on the results of a bidding process. The authors were able to read the reviews and provide a response. 
The meta-reviewers then had an opportunity to consider all reviews and author responses. This then initiated a discussion during which all reviewers of a paper had the opportunity to read each other's reviews and the author responses and to update their reviews as appropriate. In a few cases, the meta-reviewers added another reviewer at this stage to gain expert opinion on specific issues. The meta-reviewers then made recommendations on acceptance or rejection to the track chairs. The track chairs then assessed the meta-reviews, reviews, author responses and discussions to make a final decision. In a few cases, they also solicited further expert reviews and meta-reviews to resolve specific questions. Thus, all papers were assessed by at least four and up to seven discipline experts. All accepted papers were presented both as a 20-minute talk and as a poster.\n\nThe Industry & Government Invited Talk Track features 12 talks from world-renowned experts who have played a significant role in developing and deploying large-scale data mining applications and systems in their respective fields with clearly measurable and meaningful impact. 
We trust that this opportunity for the KDD community to hear directly from senior leaders in industry and government will inspire new advances and broader interdisciplinary collaboration between resea","PeriodicalId":243428,"journal":{"name":"Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining","volume":"87 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-08-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129278832","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Dynamically Modeling Patient's Health State from Electronic Medical Records: A Time Series Approach","authors":"Karla L. Caballero Barajas, R. Akella","doi":"10.1145/2783258.2783289","DOIUrl":"https://doi.org/10.1145/2783258.2783289","url":null,"abstract":"In this paper, we present a method to dynamically estimate the probability of mortality inside the Intensive Care Unit (ICU) by combining heterogeneous data. We propose a method based on Generalized Linear Dynamic Models that models the probability of mortality as a latent state that evolves over time. This framework allows us to combine different types of features (lab results, vital signs readings, doctor and nurse notes, etc.) into a single state, which is updated each time new patient data is observed. In addition, we include the use of text features, based on medical noun phrase extraction and Statistical Topic Models. These features provide context about the patient that cannot be captured when only numerical features are used. We fill in the missing values using a Regularized Expectation Maximization based method that assumes temporally structured data. We test our proposed approach using 15,000 Electronic Medical Records (EMRs) obtained from the MIMIC II public dataset. Experimental results show that the proposed model allows us to detect an increase in the probability of mortality before it occurs. We report an AUC of 0.8657. 
Our proposed model clearly outperforms other methods in the literature, with a sensitivity of 0.7885 compared to 0.6559 for Naive Bayes, and an F-score of 0.5929 compared to 0.4662 for the APACHE III score after 24 hours.","PeriodicalId":243428,"journal":{"name":"Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining","volume":"21 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-08-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125361861","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"From Infrastructure to Culture: A/B Testing Challenges in Large Scale Social Networks","authors":"Ya Xu, Nanyu Chen, A. Fernandez, Omar Sinno, Anmol Bhasin","doi":"10.1145/2783258.2788602","DOIUrl":"https://doi.org/10.1145/2783258.2788602","url":null,"abstract":"A/B testing, also known as bucket testing, split testing, or controlled experiment, is a standard way to evaluate user engagement or satisfaction from a new service, feature, or product. It is widely used among online websites, including social network sites such as Facebook, LinkedIn, and Twitter to make data-driven decisions. At LinkedIn, we have seen tremendous growth of controlled experiments over time, with now over 400 concurrent experiments running per day. General A/B testing frameworks and methodologies, including challenges and pitfalls, have been discussed extensively in several previous KDD work [7, 8, 9, 10]. In this paper, we describe in depth the experimentation platform we have built at LinkedIn and the challenges that arise particularly when running A/B tests at large scale in a social network setting. We start with an introduction of the experimentation platform and how it is built to handle each step of the A/B testing process at LinkedIn, from designing and deploying experiments to analyzing them. It is then followed by discussions on several more sophisticated A/B testing scenarios, such as running offline experiments and addressing the network effect, where one user's action can influence that of another. 
Lastly, we talk about features and processes that are crucial for building a strong experimentation culture.","PeriodicalId":243428,"journal":{"name":"Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining","volume":"50 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-08-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121535198","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
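At its statistical core, deciding such an experiment typically reduces to a two-sample test on the engagement metric; the platform challenges the paper describes sit on top of this. As a deliberately minimal sketch (LinkedIn's system does far more, e.g. variance reduction and network-effect corrections), a pooled two-proportion z-test on conversion counts:

```python
import math

def two_proportion_ztest(conv_a, n_a, conv_b, n_b):
    """z statistic for the difference in conversion rate between two variants.

    conv_a/conv_b are conversion counts, n_a/n_b the users assigned to
    control and treatment. |z| > 1.96 is significant at the 5% level
    under the usual large-sample normal approximation.
    """
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return (p_b - p_a) / se
```

The network effect discussed in the abstract is precisely what breaks the independence assumption behind this test, which is why cluster-based randomization and similar corrections are needed in a social network setting.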
{"title":"Clouded Intelligence","authors":"Joseph Sirosh","doi":"10.1145/2783258.2790454","DOIUrl":"https://doi.org/10.1145/2783258.2790454","url":null,"abstract":"Several exciting trends are driving the birth of the intelligent cloud. The vast majority of the world's data is now connected data resident in the cloud. The majority of the world's new software is now connected software, also resident in or using the cloud. New cloud-based Machine Learning as a Service platforms help transform data into intelligence and build cloud-hosted intelligent APIs for connected software applications. Face analysis, computer vision, text analysis, speech recognition, and more traditional analytics such as churn prediction, recommendations, anomaly detection, forecasting, and clustering are all available now as cloud APIs, and far more are being created at a rapid pace. Cloud-hosted marketplaces for crowdsourcing intelligent APIs have been launched. In this talk I will review what these trends mean for the future of data science and show examples of revolutionary applications that you can build using cloud platforms.","PeriodicalId":243428,"journal":{"name":"Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining","volume":"48 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-08-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126400716","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"FaitCrowd: Fine Grained Truth Discovery for Crowdsourced Data Aggregation","authors":"Fenglong Ma, Yaliang Li, Qi Li, Minghui Qiu, Jing Gao, Shi Zhi, Lu Su, Bo Zhao, Heng Ji, Jiawei Han","doi":"10.1145/2783258.2783314","DOIUrl":"https://doi.org/10.1145/2783258.2783314","url":null,"abstract":"In crowdsourced data aggregation tasks, there exist conflicts in the answers provided by large numbers of sources on the same set of questions. The most important challenge in this setting is to estimate source reliability and select answers that are provided by high-quality sources. Existing work solves this problem by simultaneously estimating sources' reliability and inferring questions' true answers (i.e., the truths). However, these methods assume that a source has the same reliability degree on all questions, ignoring the fact that sources' reliability may vary significantly across topics. To capture various expertise levels on different topics, we propose FaitCrowd, a fine-grained truth discovery model for the task of aggregating conflicting data collected from multiple users/sources. FaitCrowd jointly models the process of generating question content and sources' provided answers in a probabilistic model to estimate both topical expertise and true answers simultaneously. This leads to a more precise estimation of source reliability. Therefore, FaitCrowd demonstrates a better ability to obtain true answers for the questions compared with existing approaches. 
Experimental results on two real-world datasets show that FaitCrowd can significantly reduce the error rate of aggregation compared with the state-of-the-art multi-source aggregation approaches due to its ability to learn topical expertise from question content and collected answers.","PeriodicalId":243428,"journal":{"name":"Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining","volume":"59 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-08-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116589808","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
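FaitCrowd's joint model of question content and answers is more involved than can be sketched here, but the basic truth-discovery loop it refines is simple: alternate between weighted voting for each question's answer and re-weighting each source by its agreement with the current answers. A toy version with a single (not topic-specific) reliability per source, with our own naming throughout:

```python
from collections import defaultdict

def truth_discovery(answers, iters=10):
    """Iterative truth discovery. answers: {(source, question): answer}.

    Returns (truths, weights): the inferred answer per question and a
    smoothed accuracy estimate per source. Weighted voting and source
    re-scoring are alternated until iters rounds have run.
    """
    sources = {s for s, _ in answers}
    questions = {q for _, q in answers}
    weights = {s: 1.0 for s in sources}
    truths = {}
    for _ in range(iters):
        # Step 1: weighted vote per question.
        for q in questions:
            votes = defaultdict(float)
            for (s, qq), a in answers.items():
                if qq == q:
                    votes[a] += weights[s]
            truths[q] = max(votes, key=votes.get)
        # Step 2: re-score each source by agreement with current truths.
        for s in sources:
            hits = [truths[q] == a for (ss, q), a in answers.items() if ss == s]
            weights[s] = (sum(hits) + 1) / (len(hits) + 2)  # Laplace-smoothed
    return truths, weights
```

FaitCrowd replaces the single weight per source with topic-level expertise learned jointly from the question text, which is what enables the finer-grained reliability estimates the abstract describes.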
{"title":"A Deep Hybrid Model for Weather Forecasting","authors":"Aditya Grover, Ashish Kapoor, E. Horvitz","doi":"10.1145/2783258.2783275","DOIUrl":"https://doi.org/10.1145/2783258.2783275","url":null,"abstract":"Weather forecasting is a canonical predictive challenge that has depended primarily on model-based methods. We explore new directions with forecasting weather as a data-intensive challenge that involves inferences across space and time. We study specifically the power of making predictions via a hybrid approach that combines discriminatively trained predictive models with a deep neural network that models the joint statistics of a set of weather-related variables. We show how the base model can be enhanced with spatial interpolation that uses learned long-range spatial dependencies. We also derive an efficient learning and inference procedure that allows for large scale optimization of the model parameters. We evaluate the methods with experiments on real-world meteorological data that highlight the promise of the approach.","PeriodicalId":243428,"journal":{"name":"Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining","volume":"5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-08-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114301873","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Real-Time Bid Prediction using Thompson Sampling-Based Expert Selection","authors":"E. Ikonomovska, Sina Jafarpour, Ali Dasdan","doi":"10.1145/2783258.2788586","DOIUrl":"https://doi.org/10.1145/2783258.2788586","url":null,"abstract":"We study online meta-learners for real-time bid prediction that predict by selecting a single best predictor among several subordinate prediction algorithms, here called \"experts\". These predictors belong to the family of context-dependent past performance estimators that make a prediction only when the instance to be predicted falls within their areas of expertise. Within the advertising ecosystem, it is very common for the contextual information to be incomplete, hence, it is natural for some of the experts to abstain from making predictions on some of the instances. Experts' areas of expertise can overlap, which makes their predictions less suitable for merging; as such, they lend themselves better to the problem of best expert selection. In addition, their performance varies over time, which gives the expert selection problem a non-stochastic, adversarial flavor. In this paper we propose to use probability sampling (via Thompson Sampling) as a meta-learning algorithm that samples from the pool of experts for the purpose of bid prediction. 
We show performance results from the comparison of our approach to multiple state-of-the-art algorithms using exploration scavenging on a log file of over 300 million ad impressions, as well as comparison to a baseline rule-based model using production traffic from a leading DSP platform.","PeriodicalId":243428,"journal":{"name":"Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining","volume":"98 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-08-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122802750","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
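The selection step at the heart of this approach is compact: keep a Beta posterior over each expert's success rate and, per impression, play the eligible expert with the largest posterior draw. A Beta-Bernoulli sketch (the paper's experts carry context-dependent performance estimates and may abstain; ours are deliberately bare, and the class name is our own):

```python
import random

class ThompsonExpertSelector:
    """Pick one expert per round by Thompson Sampling over Beta posteriors.

    Experts may abstain on a given instance; only the eligible experts
    passed to select() compete for that round.
    """

    def __init__(self, n_experts):
        self.wins = [1] * n_experts    # Beta(1, 1) uniform prior
        self.losses = [1] * n_experts

    def select(self, eligible):
        # Draw a plausible success rate for each eligible expert, then
        # play the expert with the highest draw. Uncertain experts get
        # occasional exploration "for free" via their wide posteriors.
        samples = {i: random.betavariate(self.wins[i], self.losses[i])
                   for i in eligible}
        return max(samples, key=samples.get)

    def update(self, expert, success):
        # Binary feedback, e.g. whether the expert's bid prediction
        # was acceptably accurate for this impression.
        if success:
            self.wins[expert] += 1
        else:
            self.losses[expert] += 1
```

Because the posteriors are updated online, the selector also tracks the drifting, adversarial-flavored expert performance the abstract mentions, although a discounting or sliding-window variant would follow drift faster.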
{"title":"Whither Social Networks for Web Search?","authors":"R. Agrawal, Behzad Golshan, E. Papalexakis","doi":"10.1145/2783258.2788571","DOIUrl":"https://doi.org/10.1145/2783258.2788571","url":null,"abstract":"Access to diverse perspectives nurtures an informed citizenry. Google and Bing have emerged as the duopoly that largely arbitrates which English language documents are seen by web searchers. A recent study shows that there is now a large overlap in the top organic search results produced by them. Thus, citizens may no longer be able to gain different perspectives by using different search engines. We present the results of our empirical study that indicates that by mining Twitter data one can obtain search results that are quite distinct from those produced by Google and Bing. Additionally, our user study found that these results were quite informative. The gauntlet is now on search engines to test whether our findings hold in their infrastructure for different social networks and whether enabling diversity has sufficient business imperative for them.","PeriodicalId":243428,"journal":{"name":"Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining","volume":"55 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-08-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122864631","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}