{"title":"Data Science from the Lab to the Field to the Enterprise","authors":"Christopher M. White","doi":"10.1145/2783258.2790461","DOIUrl":"https://doi.org/10.1145/2783258.2790461","url":null,"abstract":"DARPA has been investing in data science and building open source tools for applications ranging from counter threat finance, through radar operations and cancer research, to anti-human trafficking. This presentation will cover previous work at DARPA, experience building real-world applications for defense and law enforcement to analyze data, and the future of computer science as an enabler for content discovery, information extraction, relevance determination, and information visualization. The talk will be a mix of background, detailed examples, and software demonstration. It will cover the importance of anchoring in applications, minimization of design-to-testing time, development with users-in-the-loop, error tolerance of machine learning, design for diverse user populations, and the necessity of open source software integration. It will conclude by covering a few next directions for special projects at Microsoft.","PeriodicalId":243428,"journal":{"name":"Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining","volume":"95 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-08-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114515000","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Himabindu Lakkaraju, Everaldo Aguiar, Carl Shan, David I Miller, Nasir Bhanpuri, R. Ghani, Kecia L. Addison
{"title":"A Machine Learning Framework to Identify Students at Risk of Adverse Academic Outcomes","authors":"Himabindu Lakkaraju, Everaldo Aguiar, Carl Shan, David I Miller, Nasir Bhanpuri, R. Ghani, Kecia L. Addison","doi":"10.1145/2783258.2788620","DOIUrl":"https://doi.org/10.1145/2783258.2788620","url":null,"abstract":"Many school districts have developed successful intervention programs to help students graduate high school on time. However, identifying and prioritizing students who need those interventions the most remains challenging. This paper describes a machine learning framework to identify such students, discusses features that are useful for this task, applies several classification algorithms, and evaluates them using metrics important to school administrators. To help test this framework and make it practically useful, we partnered with two U.S. school districts with a combined enrollment of approximately 200,000 students. We together designed several evaluation metrics to assess the goodness of machine learning algorithms from an educator's perspective. This paper focuses on students at risk of not finishing high school on time, but our framework lays a strong foundation for future work on other adverse academic outcomes.","PeriodicalId":243428,"journal":{"name":"Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining","volume":"179 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-08-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122140459","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
D. Agarwal, Bee-Chung Chen, Qi He, Zhenhao Hua, Guy Lebanon, Yiming Ma, Pannagadatta K. Shivaswamy, Hsiao-Ping Tseng, Jaewon Yang, L. Zhang
{"title":"Personalizing LinkedIn Feed","authors":"D. Agarwal, Bee-Chung Chen, Qi He, Zhenhao Hua, Guy Lebanon, Yiming Ma, Pannagadatta K. Shivaswamy, Hsiao-Ping Tseng, Jaewon Yang, L. Zhang","doi":"10.1145/2783258.2788614","DOIUrl":"https://doi.org/10.1145/2783258.2788614","url":null,"abstract":"LinkedIn dynamically delivers update activities from a user's interpersonal network to more than 300 million members in the personalized feed that ranks activities according their \"relevance\" to the user. This paper discloses the implementation details behind this personalized feed system at LinkedIn which can not be found from related work, and addresses the scalability and data sparsity challenges for deploying the system online. More specifically, we focus on the personalization models by generating three kinds of affinity scores: Viewer-ActivityType Affinity, Viewer-Actor Affinity, and Viewer-Actor-ActivityType Affinity. Extensive experiments based on online bucket tests (A/B experiments) and offline evaluation illustrate the effect of our personalization models in LinkedIn feed.","PeriodicalId":243428,"journal":{"name":"Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining","volume":"48 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-08-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116742429","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Collective Opinion Spam Detection: Bridging Review Networks and Metadata","authors":"Shebuti Rayana, L. Akoglu","doi":"10.1145/2783258.2783370","DOIUrl":"https://doi.org/10.1145/2783258.2783370","url":null,"abstract":"Online reviews capture the testimonials of \"real\" people and help shape the decisions of other consumers. Due to the financial gains associated with positive reviews, however, opinion spam has become a widespread problem, with often paid spam reviewers writing fake reviews to unjustly promote or demote certain products or businesses. Existing approaches to opinion spam have successfully but separately utilized linguistic clues of deception, behavioral footprints, or relational ties between agents in a review system. In this work, we propose a new holistic approach called SPEAGLE that utilizes clues from all metadata (text, timestamp, rating) as well as relational data (network), and harness them collectively under a unified framework to spot suspicious users and reviews, as well as products targeted by spam. Moreover, our method can efficiently and seamlessly integrate semi-supervision, i.e., a (small) set of labels if available, without requiring any training or changes in its underlying algorithm. We demonstrate the effectiveness and scalability of SPEAGLE on three real-world review datasets from Yelp.com with filtered (spam) and recommended (non-spam) reviews, where it significantly outperforms several baselines and state-of-the-art methods. To the best of our knowledge, this is the largest scale quantitative evaluation performed to date for the opinion spam problem.","PeriodicalId":243428,"journal":{"name":"Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining","volume":"14 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-08-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117094712","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"MOOCS: What Have We Learned?","authors":"D. Koller","doi":"10.1145/2783258.2785465","DOIUrl":"https://doi.org/10.1145/2783258.2785465","url":null,"abstract":"It has been nearly four years since the first MOOCs (massive open online courses) were offered by Stanford University. MOOCs are now offered to tens of millions of learners worldwide, by hundreds of top universities. MOOCs are no longer an experiment - the learning, reach, and value they offer are now a reality. I will show how MOOCs provide opportunities for open-ended projects, intercultural learner interactions, and collaborative learning. I will discuss some of data that we are collecting from MOOCs, and what we are learning from these data about both courses and learners. Finally, I will discuss both data and examples of the kind of transformative impact that can be derived from providing millions of people with access to the world's best education.","PeriodicalId":243428,"journal":{"name":"Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining","volume":"310 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-08-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128628075","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Dimensionality Reduction Via Graph Structure Learning","authors":"Qi Mao, Li Wang, S. Goodison, Yijun Sun","doi":"10.1145/2783258.2783309","DOIUrl":"https://doi.org/10.1145/2783258.2783309","url":null,"abstract":"We present a new dimensionality reduction setting for a large family of real-world problems. Unlike traditional methods, the new setting aims to explicitly represent and learn an intrinsic structure from data in a high-dimensional space, which can greatly facilitate data visualization and scientific discovery in downstream analysis. We propose a new dimensionality-reduction framework that involves the learning of a mapping function that projects data points in the original high-dimensional space to latent points in a low-dimensional space that are then used directly to construct a graph. Local geometric information of the projected data is naturally captured by the constructed graph. As a showcase, we develop a new method to obtain a discriminative and compact feature representation for clustering problems. In contrast to assumptions used in traditional clustering methods, we assume that centers of clusters should be close to each other if they are connected in a learned graph, and other cluster centers should be distant. Extensive experiments are performed that demonstrate that the proposed method is able to obtain discriminative feature representations yielding superior clustering performance, and correctly recover the intrinsic structures of various real-world datasets including curves, hierarchies and a cancer progression path.","PeriodicalId":243428,"journal":{"name":"Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-08-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129670773","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
E. Potash, J. Brew, Alexander Loewi, S. Majumdar, Andrew G. Reece, Joe Walsh, Eric Rozier, Emile Jorgenson, Raed Mansour, R. Ghani
{"title":"Predictive Modeling for Public Health: Preventing Childhood Lead Poisoning","authors":"E. Potash, J. Brew, Alexander Loewi, S. Majumdar, Andrew G. Reece, Joe Walsh, Eric Rozier, Emile Jorgenson, Raed Mansour, R. Ghani","doi":"10.1145/2783258.2788629","DOIUrl":"https://doi.org/10.1145/2783258.2788629","url":null,"abstract":"Lead poisoning is a major public health problem that affects hundreds of thousands of children in the United States every year. A common approach to identifying lead hazards is to test all children for elevated blood lead levels and then investigate and remediate the homes of children with elevated tests. This can prevent exposure to lead of future residents, but only after a child has been poisoned. This paper describes joint work with the Chicago Department of Public Health (CDPH) in which we build a model that predicts the risk of a child to being poisoned so that an intervention can take place before that happens. Using two decades of blood lead level tests, home lead inspections, property value assessments, and census data, our model allows inspectors to prioritize houses on an intractably long list of potential hazards and identify children who are at the highest risk. This work has been described by CDPH as pioneering in the use of machine learning and predictive analytics in public health and has the potential to have a significant impact on both health and economic outcomes for communities across the US.","PeriodicalId":243428,"journal":{"name":"Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-08-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130500456","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
L. Amsaleg, Oussama Chelly, T. Furon, S. Girard, M. Houle, K. Kawarabayashi, Michael Nett
{"title":"Estimating Local Intrinsic Dimensionality","authors":"L. Amsaleg, Oussama Chelly, T. Furon, S. Girard, M. Houle, K. Kawarabayashi, Michael Nett","doi":"10.1145/2783258.2783405","DOIUrl":"https://doi.org/10.1145/2783258.2783405","url":null,"abstract":"This paper is concerned with the estimation of a local measure of intrinsic dimensionality (ID) recently proposed by Houle. The local model can be regarded as an extension of Karger and Ruhl's expansion dimension to a statistical setting in which the distribution of distances to a query point is modeled in terms of a continuous random variable. This form of intrinsic dimensionality can be particularly useful in search, classification, outlier detection, and other contexts in machine learning, databases, and data mining, as it has been shown to be equivalent to a measure of the discriminative power of similarity functions. Several estimators of local ID are proposed and analyzed based on extreme value theory, using maximum likelihood estimation (MLE), the method of moments (MoM), probability weighted moments (PWM), and regularly varying functions (RV). An experimental evaluation is also provided, using both real and artificial data.","PeriodicalId":243428,"journal":{"name":"Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-08-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129226159","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"MASCOT: Memory-efficient and Accurate Sampling for Counting Local Triangles in Graph Streams","authors":"Yongsub Lim, U. Kang","doi":"10.1145/2783258.2783285","DOIUrl":"https://doi.org/10.1145/2783258.2783285","url":null,"abstract":"How can we estimate local triangle counts accurately in a graph stream without storing the whole graph? The local triangle counting which counts triangles for each node in a graph is a very important problem with wide applications in social network analysis, anomaly detection, web mining, etc. In this paper, we propose MASCOT, a memory-efficient and accurate method for local triangle estimation in a graph stream based on edge sampling. To develop MASCOT, we first present two naive local triangle counting algorithms in a graph stream: MASCOT-C and MASCOT-A. MASCOT-C is based on constant edge sampling, and MASCOT-A improves its accuracy by utilizing more memory spaces. MASCOT achieves both accuracy and memory-efficiency of the two algorithms by an unconditional triangle counting for a new edge, regardless of whether it is sampled or not. In contrast to the existing algorithm which requires prior knowledge on the target graph and appropriately set parameters, MASCOT requires only one simple parameter, the edge sampling probability. Through extensive experiments, we show that for the same number of edges sampled, MASCOT provides the best accuracy compared to the existing algorithm as well as MASCOT-C and MASCOT-A. Thanks to MASCOT, we also discover interesting anomalous patterns in real graphs, like core-peripheries in the web and ambiguous author names in DBLP.","PeriodicalId":243428,"journal":{"name":"Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining","volume":"58 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-08-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127700187","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"More Constraints, Smaller Coresets: Constrained Matrix Approximation of Sparse Big Data","authors":"Dan Feldman, Tamir Tassa","doi":"10.1145/2783258.2783312","DOIUrl":"https://doi.org/10.1145/2783258.2783312","url":null,"abstract":"We suggest a generic data reduction technique with provable guarantees for computing the low rank approximation of a matrix under some $ellz error, and constrained factorizations, such as the Non-negative Matrix Factorization (NMF). Our main algorithm reduces a given n x d matrix into a small, ε-dependent, weighted subset C of its rows (known as a coreset), whose size is independent of both n and d. We then prove that applying existing algorithms on the resulting coreset can be turned into (1+ε)-approximations for the original (large) input matrix. In particular, we provide the first linear time approximation scheme (LTAS) for the rank-one NMF. The coreset C can be computed in parallel and using only one pass over a possibly unbounded stream of row vectors. In this sense we improve the result in [4] (Best paper of STOC 2013). Moreover, since C is a subset of these rows, its construction time, as well as its sparsity (number of non-zeroes entries) and the sparsity of the resulting low rank approximation depend on the maximum sparsity of an input row, and not on the actual dimension d. In this sense, we improve the result of Libery [21](Best paper of KDD 2013) and answer affirmably, and in a more general setting, his open question of computing such a coreset. Source code is provided for reproducing the experiments and integration with existing and future algorithms.","PeriodicalId":243428,"journal":{"name":"Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining","volume":"54 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-08-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131394927","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}