{"title":"Towards Interactive Construction of Topical Hierarchy: A Recursive Tensor Decomposition Approach","authors":"Chi Wang, Xueqing Liu, Yanglei Song, Jiawei Han","doi":"10.1145/2783258.2783288","DOIUrl":"https://doi.org/10.1145/2783258.2783288","url":null,"abstract":"Automatic construction of user-desired topical hierarchies over large volumes of text data is a highly desirable but challenging task. This study proposes to give users freedom to construct topical hierarchies via interactive operations such as expanding a branch and merging several branches. Existing hierarchical topic modeling techniques are inadequate for this purpose because (1) they cannot consistently preserve the topics when the hierarchy structure is modified; and (2) the slow inference prevents swift response to user requests. In this study, we propose a novel method, called STROD, that allows efficient and consistent modification of topic hierarchies, based on a recursive generative model and a scalable tensor decomposition inference algorithm with theoretical performance guarantee. Empirical evaluation shows that STROD reduces the runtime of construction by several orders of magnitude, while generating consistent and quality hierarchies.","PeriodicalId":243428,"journal":{"name":"Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining","volume":"26 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-08-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115415228","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"User Modeling in Telecommunications and Internet Industry","authors":"Qiang Yang","doi":"10.1145/2783258.2790459","DOIUrl":"https://doi.org/10.1145/2783258.2790459","url":null,"abstract":"It is extremely important in many application domains to have accurate models of user behavior. Data mining allows user models to be constructed based on vast available data automatically. User modeling has found applications in mobile APP recommendations, social networking, financial product marketing and customer service in telecommunications. Successful user modeling should be aware of several critical issues: Who are the target users? How should the solutions be updated when new data come in? How should user feedback be handled? What are the \"pain\" points of users? In this talk, I will discuss my own experience on user modeling with big data. I will draw examples from telecommunications and the Internet industry, contrasting and highlighting some lessons learned in these industries.","PeriodicalId":243428,"journal":{"name":"Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining","volume":"19 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-08-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125216887","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Entity Matching across Heterogeneous Sources","authors":"Yang Yang, Yizhou Sun, Jie Tang, B. Ma, Juan-Zi Li","doi":"10.1145/2783258.2783353","DOIUrl":"https://doi.org/10.1145/2783258.2783353","url":null,"abstract":"Given an entity in a source domain, finding its matched entities from another (target) domain is an important task in many applications. Traditionally, the problem was usually addressed by first extracting major keywords corresponding to the source entity and then querying relevant entities from the target domain using those keywords. However, the method would inevitably fail if the two domains have little or no overlap in content. An extreme case is that the source domain is in English and the target domain is in Chinese. In this paper, we formalize the problem as entity matching across heterogeneous sources and propose a probabilistic topic model to solve the problem. The model integrates the topic extraction and entity matching, two core subtasks for dealing with the problem, into a unified model. Specifically, for handling the text disjointing problem, we use a cross-sampling process in our model to extract topics with terms coming from all the sources, and leverage existing matching relations through latent topic layers instead of at text layers. Benefiting from the proposed model, we can not only find the matched documents for a query entity, but also explain why these documents are related by showing the common topics they share. 
Our experiments in two real-world applications show that the proposed model can extensively improve the matching performance (+19.8% and +7.1% in two applications respectively) compared with several alternative methods.","PeriodicalId":243428,"journal":{"name":"Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining","volume":"58 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-08-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121236352","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Subspace Clustering Using Log-determinant Rank Approximation","authors":"Chong Peng, Zhao Kang, Huiqing Li, Q. Cheng","doi":"10.1145/2783258.2783303","DOIUrl":"https://doi.org/10.1145/2783258.2783303","url":null,"abstract":"A number of machine learning and computer vision problems, such as matrix completion and subspace clustering, require a matrix to be of low-rank. To meet this requirement, most existing methods use the nuclear norm as a convex proxy of the rank function and minimize it. However, the nuclear norm simply adds all nonzero singular values together instead of treating them equally as the rank function does, which may not be a good rank approximation when some singular values are very large. To reduce this undesirable weighting effect, we use a log-determinant function as a non-convex rank approximation which reduces the contributions of large singular values while keeping those of small singular values close to zero. We apply the method of augmented Lagrangian multipliers to optimize this non-convex rank approximation-based objective function and obtain closed-form solutions for all subproblems of minimizing different variables alternately. The log-determinant low-rank optimization method is used to solve the subspace clustering problem, for which we construct an affinity matrix based on the angular information of the low-rank representation to enhance its separability property. 
Extensive experimental results on face clustering and motion segmentation data demonstrate the effectiveness of the proposed method.","PeriodicalId":243428,"journal":{"name":"Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining","volume":"77 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-08-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123597398","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Analyzing Invariants in Cyber-Physical Systems using Latent Factor Regression","authors":"M. Momtazpour, Jinghe Zhang, S. Rahman, Ratnesh K. Sharma, Naren Ramakrishnan","doi":"10.1145/2783258.2788605","DOIUrl":"https://doi.org/10.1145/2783258.2788605","url":null,"abstract":"The analysis of large scale data logged from complex cyber-physical systems, such as microgrids, often entails the discovery of invariants capturing functional as well as operational relationships underlying such large systems. We describe a latent factor approach to infer invariants underlying system variables and how we can leverage these relationships to monitor a cyber-physical system. In particular we illustrate how this approach helps rapidly identify outliers during system operation.","PeriodicalId":243428,"journal":{"name":"Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining","volume":"36 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-08-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125441735","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Decision Tree Framework for Spatiotemporal Sequence Prediction","authors":"Taehwan Kim, Yisong Yue, Sarah L. Taylor, I. Matthews","doi":"10.1145/2783258.2783356","DOIUrl":"https://doi.org/10.1145/2783258.2783356","url":null,"abstract":"We study the problem of learning to predict a spatiotemporal output sequence given an input sequence. In contrast to conventional sequence prediction problems such as part-of-speech tagging (where output sequences are selected using a relatively small set of discrete labels), our goal is to predict sequences that lie within a high-dimensional continuous output space. We present a decision tree framework for learning an accurate non-parametric spatiotemporal sequence predictor. Our approach enjoys several attractive properties, including ease of training, fast performance at test time, and the ability to robustly tolerate corrupted training data using a novel latent variable approach. We evaluate on several datasets, and demonstrate substantial improvements over existing decision tree based sequence learning frameworks such as SEARN and DAgger.","PeriodicalId":243428,"journal":{"name":"Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-08-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125503579","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Optimal Action Extraction for Random Forests and Boosted Trees","authors":"Zhicheng Cui, Wenlin Chen, Yujie He, Yixin Chen","doi":"10.1145/2783258.2783281","DOIUrl":"https://doi.org/10.1145/2783258.2783281","url":null,"abstract":"Additive tree models (ATMs) are widely used for data mining and machine learning. Important examples of ATMs include random forest, adaboost (with decision trees as weak learners), and gradient boosted trees, and they are often referred to as the best off-the-shelf classifiers. Though capable of attaining high accuracy, ATMs are not well interpretable in the sense that they do not provide actionable knowledge for a given instance. This greatly limits the potential of ATMs on many applications such as medical prediction and business intelligence, where practitioners need suggestions on actions that can lead to desirable outcomes with minimum costs. To address this problem, we present a novel framework to post-process any ATM classifier to extract an optimal actionable plan that can change a given input to a desired class with a minimum cost. In particular, we prove the NP-hardness of the optimal action extraction problem for ATMs and formulate this problem in an integer linear programming formulation which can be efficiently solved by existing packages. 
We also empirically demonstrate the effectiveness of the proposed framework by conducting comprehensive experiments on challenging real-world datasets.","PeriodicalId":243428,"journal":{"name":"Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining","volume":"25 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-08-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126659245","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Focusing on the Long-term: It's Good for Users and Business","authors":"Henning Hohnhold, Deirdre O'Brien, Diane Tang","doi":"10.1145/2783258.2788583","DOIUrl":"https://doi.org/10.1145/2783258.2788583","url":null,"abstract":"Over the past 10+ years, online companies large and small have adopted widespread A/B testing as a robust data-based method for evaluating potential product improvements. In online experimentation, it is straightforward to measure the short-term effect, i.e., the impact observed during the experiment. However, the short-term effect is not always predictive of the long-term effect, i.e., the final impact once the product has fully launched and users have changed their behavior in response. Thus, the challenge is how to determine the long-term user impact while still being able to make decisions in a timely manner. We tackle that challenge in this paper by first developing experiment methodology for quantifying long-term user learning. We then apply this methodology to ads shown on Google search, more specifically, to determine and quantify the drivers of ads blindness and sightedness, the phenomenon of users changing their inherent propensity to click on or interact with ads. We use these results to create a model that uses metrics measurable in the short-term to predict the long-term. We learn that user satisfaction is paramount: ads blindness and sightedness are driven by the quality of previously viewed or clicked ads, as measured by both ad relevance and landing page quality. Focusing on user satisfaction both ensures happier users and makes business sense, as our results illustrate. We describe two major applications of our findings: a conceptual change to our search ads auction that further increased the importance of ads quality, and a 50% reduction of the ad load on Google's mobile search interface. The results presented in this paper are generalizable in two major ways. 
First, the methodology may be used to quantify user learning effects and to evaluate online experiments in contexts other than ads. Second, the ads blindness/sightedness results indicate that a focus on user satisfaction could help to reduce the ad load on the internet at large with long-term neutral, or even positive, business impact.","PeriodicalId":243428,"journal":{"name":"Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining","volume":"173 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-08-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114953672","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"ClusType: Effective Entity Recognition and Typing by Relation Phrase-Based Clustering","authors":"Xiang Ren, Ahmed El-Kishky, Chi Wang, Fangbo Tao, Clare R. Voss, Jiawei Han","doi":"10.1145/2783258.2783362","DOIUrl":"https://doi.org/10.1145/2783258.2783362","url":null,"abstract":"Entity recognition is an important but challenging research problem. In reality, many text collections are from specific, dynamic, or emerging domains, which poses significant new challenges for entity recognition with increase in name ambiguity and context sparsity, requiring entity detection without domain restriction. In this paper, we investigate entity recognition (ER) with distant-supervision and propose a novel relation phrase-based ER framework, called ClusType, that runs data-driven phrase mining to generate entity mention candidates and relation phrases, and enforces the principle that relation phrases should be softly clustered when propagating type information between their argument entities. Then we predict the type of each entity mention based on the type signatures of its co-occurring relation phrases and the type indicators of its surface name, as computed over the corpus. Specifically, we formulate a joint optimization problem for two tasks, type propagation with relation phrases and multi-view relation phrase clustering. 
Our experiments on multiple genres (news, Yelp reviews and tweets) demonstrate the effectiveness and robustness of ClusType, with an average of 37% improvement in F1 score over the best compared method.","PeriodicalId":243428,"journal":{"name":"Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining","volume":"3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-08-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116138027","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Instance Weighting for Patient-Specific Risk Stratification Models","authors":"Jen J. Gong, T. Sundt, J. Rawn, J. Guttag","doi":"10.1145/2783258.2783397","DOIUrl":"https://doi.org/10.1145/2783258.2783397","url":null,"abstract":"Accurate risk models for adverse outcomes can provide important input to clinical decision-making. Surprisingly, one of the main challenges when using machine learning to build clinically useful risk models is the small amount of data available. Risk models need to be developed for specific patient populations, specific institutions, specific procedures, and specific outcomes. With each exclusion criterion, the amount of relevant training data decreases, until there is often an insufficient amount to learn an accurate model. This difficulty is compounded by the large class imbalance that is often present in medical applications. In this paper, we present an approach to address the problem of small data using transfer learning methods in the context of developing risk models for cardiac surgeries. We explore ways to build surgery-specific and hospital-specific models (the target task) using information from other kinds of surgeries and other hospitals (source tasks). We propose a novel method to weight examples based on their similarity to the target task training examples to take advantage of the useful examples while discounting less relevant ones. We show that incorporating appropriate source data in training can lead to improved performance over using only target task training data, and that our method of instance weighting can lead to further improvements. 
Applied to a surgical risk stratification task, our method, which used data from two institutions, performed comparably to the risk model published by the Society of Thoracic Surgeons, which was developed and tested on over one hundred thousand surgeries from hundreds of institutions.","PeriodicalId":243428,"journal":{"name":"Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining","volume":"275 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-08-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116555789","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}