Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining最新文献

Towards Interactive Construction of Topical Hierarchy: A Recursive Tensor Decomposition Approach 面向主题层次的交互构建:递归张量分解方法

Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining Pub Date : 2015-08-10 DOI: 10.1145/2783258.2783288

Chi Wang, Xueqing Liu, Yanglei Song, Jiawei Han

引用次数: 10

User Modeling in Telecommunications and Internet Industry 电信和互联网行业的用户建模

Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining Pub Date : 2015-08-10 DOI: 10.1145/2783258.2790459

Qiang Yang

引用次数: 0

Entity Matching across Heterogeneous Sources 跨异构源的实体匹配

Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining Pub Date : 2015-08-10 DOI: 10.1145/2783258.2783353

Yang Yang, Yizhou Sun, Jie Tang, B. Ma, Juan-Zi Li

{"title":"Entity Matching across Heterogeneous Sources","authors":"Yang Yang, Yizhou Sun, Jie Tang, B. Ma, Juan-Zi Li","doi":"10.1145/2783258.2783353","DOIUrl":"https://doi.org/10.1145/2783258.2783353","url":null,"abstract":"Given an entity in a source domain, finding its matched entities from another (target) domain is an important task in many applications. Traditionally, the problem was usually addressed by first extracting major keywords corresponding to the source entity and then query relevant entities from the target domain using those keywords. However, the method would inevitably fails if the two domains have less or no overlapping in the content. An extreme case is that the source domain is in English and the target domain is in Chinese. In this paper, we formalize the problem as entity matching across heterogeneous sources and propose a probabilistic topic model to solve the problem. The model integrates the topic extraction and entity matching, two core subtasks for dealing with the problem, into a unified model. Specifically, for handling the text disjointing problem, we use a cross-sampling process in our model to extract topics with terms coming from all the sources, and leverage existing matching relations through latent topic layers instead of at text layers. Benefit from the proposed model, we can not only find the matched documents for a query entity, but also explain why these documents are related by showing the common topics they share. Our experiments in two real-world applications show that the proposed model can extensively improve the matching performance (+19.8% and +7.1% in two applications respectively) compared with several alternative methods.","PeriodicalId":243428,"journal":{"name":"Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining","volume":"58 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-08-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121236352","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 31

Subspace Clustering Using Log-determinant Rank Approximation 使用对数行列式秩近似的子空间聚类

Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining Pub Date : 2015-08-10 DOI: 10.1145/2783258.2783303

Chong Peng, Zhao Kang, Huiqing Li, Q. Cheng

{"title":"Subspace Clustering Using Log-determinant Rank Approximation","authors":"Chong Peng, Zhao Kang, Huiqing Li, Q. Cheng","doi":"10.1145/2783258.2783303","DOIUrl":"https://doi.org/10.1145/2783258.2783303","url":null,"abstract":"A number of machine learning and computer vision problems, such as matrix completion and subspace clustering, require a matrix to be of low-rank. To meet this requirement, most existing methods use the nuclear norm as a convex proxy of the rank function and minimize it. However, the nuclear norm simply adds all nonzero singular values together instead of treating them equally as the rank function does, which may not be a good rank approximation when some singular values are very large. To reduce this undesirable weighting effect, we use a log-determinant function as a non-convex rank approximation which reduces the contributions of large singular values while keeping those of small singular values close to zero. We apply the method of augmented Lagrangian multipliers to optimize this non-convex rank approximation-based objective function and obtain closed-form solutions for all subproblems of minimizing different variables alternatively. The log-determinant low-rank optimization method is used to solve subspace clustering problem, for which we construct an affinity matrix based on the angular information of the low-rank representation to enhance its separability property. Extensive experimental results on face clustering and motion segmentation data demonstrate the effectiveness of the proposed method.","PeriodicalId":243428,"journal":{"name":"Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining","volume":"77 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-08-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123597398","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 69

Analyzing Invariants in Cyber-Physical Systems using Latent Factor Regression 利用潜在因子回归分析信息物理系统的不变量

Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining Pub Date : 2015-08-10 DOI: 10.1145/2783258.2788605

M. Momtazpour, Jinghe Zhang, S. Rahman, Ratnesh K. Sharma, Naren Ramakrishnan

引用次数: 24

A Decision Tree Framework for Spatiotemporal Sequence Prediction 时空序列预测的决策树框架

Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining Pub Date : 2015-08-10 DOI: 10.1145/2783258.2783356

Taehwan Kim, Yisong Yue, Sarah L. Taylor, I. Matthews

引用次数: 62

Optimal Action Extraction for Random Forests and Boosted Trees

Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining Pub Date : 2015-08-10 DOI: 10.1145/2783258.2783281

Zhicheng Cui, Wenlin Chen, Yujie He, Yixin Chen

{"title":"Optimal Action Extraction for Random Forests and Boosted Trees","authors":"Zhicheng Cui, Wenlin Chen, Yujie He, Yixin Chen","doi":"10.1145/2783258.2783281","DOIUrl":"https://doi.org/10.1145/2783258.2783281","url":null,"abstract":"Additive tree models (ATMs) are widely used for data mining and machine learning. Important examples of ATMs include random forest, adaboost (with decision trees as weak learners), and gradient boosted trees, and they are often referred to as the best off-the-shelf classifiers. Though capable of attaining high accuracy, ATMs are not well interpretable in the sense that they do not provide actionable knowledge for a given instance. This greatly limits the potential of ATMs on many applications such as medical prediction and business intelligence, where practitioners need suggestions on actions that can lead to desirable outcomes with minimum costs. To address this problem, we present a novel framework to post-process any ATM classifier to extract an optimal actionable plan that can change a given input to a desired class with a minimum cost. In particular, we prove the NP-hardness of the optimal action extraction problem for ATMs and formulate this problem in an integer linear programming formulation which can be efficiently solved by existing packages. We also empirically demonstrate the effectiveness of the proposed framework by conducting comprehensive experiments on challenging real-world datasets.","PeriodicalId":243428,"journal":{"name":"Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining","volume":"25 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-08-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126659245","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 100

Focusing on the Long-term: It's Good for Users and Business 着眼长远:这对用户和企业都有好处

Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining Pub Date : 2015-08-10 DOI: 10.1145/2783258.2788583

Henning Hohnhold, Deirdre O'Brien, Diane Tang

{"title":"Focusing on the Long-term: It's Good for Users and Business","authors":"Henning Hohnhold, Deirdre O'Brien, Diane Tang","doi":"10.1145/2783258.2788583","DOIUrl":"https://doi.org/10.1145/2783258.2788583","url":null,"abstract":"Over the past 10+ years, online companies large and small have adopted widespread A/B testing as a robust data-based method for evaluating potential product improvements. In online experimentation, it is straightforward to measure the short-term effect, i.e., the impact observed during the experiment. However, the short-term effect is not always predictive of the long-term effect, i.e., the final impact once the product has fully launched and users have changed their behavior in response. Thus, the challenge is how to determine the long-term user impact while still being able to make decisions in a timely manner. We tackle that challenge in this paper by first developing experiment methodology for quantifying long-term user learning. We then apply this methodology to ads shown on Google search, more specifically, to determine and quantify the drivers of ads blindness and sightedness, the phenomenon of users changing their inherent propensity to click on or interact with ads. We use these results to create a model that uses metrics measurable in the short-term to predict the long-term. We learn that user satisfaction is paramount: ads blindness and sightedness are driven by the quality of previously viewed or clicked ads, as measured by both ad relevance and landing page quality. Focusing on user satisfaction both ensures happier users but also makes business sense, as our results illustrate. We describe two major applications of our findings: a conceptual change to our search ads auction that further increased the importance of ads quality, and a 50% reduction of the ad load on Google's mobile search interface. The results presented in this paper are generalizable in two major ways. First, the methodology may be used to quantify user learning effects and to evaluate online experiments in contexts other than ads. Second, the ads blindness/sighted-ness results indicate that a focus on user satisfaction could help to reduce the ad load on the internet at large with long-term neutral, or even positive, business impact.","PeriodicalId":243428,"journal":{"name":"Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining","volume":"173 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-08-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114953672","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 94

ClusType: Effective Entity Recognition and Typing by Relation Phrase-Based Clustering 基于关系短语聚类的有效实体识别和分类

Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining Pub Date : 2015-08-10 DOI: 10.1145/2783258.2783362

Xiang Ren, Ahmed El-Kishky, Chi Wang, Fangbo Tao, Clare R. Voss, Jiawei Han

{"title":"ClusType: Effective Entity Recognition and Typing by Relation Phrase-Based Clustering","authors":"Xiang Ren, Ahmed El-Kishky, Chi Wang, Fangbo Tao, Clare R. Voss, Jiawei Han","doi":"10.1145/2783258.2783362","DOIUrl":"https://doi.org/10.1145/2783258.2783362","url":null,"abstract":"Entity recognition is an important but challenging research problem. In reality, many text collections are from specific, dynamic, or emerging domains, which poses significant new challenges for entity recognition with increase in name ambiguity and context sparsity, requiring entity detection without domain restriction. In this paper, we investigate entity recognition (ER) with distant-supervision and propose a novel relation phrase-based ER framework, called ClusType, that runs data-driven phrase mining to generate entity mention candidates and relation phrases, and enforces the principle that relation phrases should be softly clustered when propagating type information between their argument entities. Then we predict the type of each entity mention based on the type signatures of its co-occurring relation phrases and the type indicators of its surface name, as computed over the corpus. Specifically, we formulate a joint optimization problem for two tasks, type propagation with relation phrases and multi-view relation phrase clustering. Our experiments on multiple genres---news, Yelp reviews and tweets---demonstrate the effectiveness and robustness of ClusType, with an average of 37% improvement in F1 score over the best compared method.","PeriodicalId":243428,"journal":{"name":"Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining","volume":"3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-08-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116138027","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 107

Instance Weighting for Patient-Specific Risk Stratification Models 特定患者风险分层模型的实例加权

Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining Pub Date : 2015-08-10 DOI: 10.1145/2783258.2783397

Jen J. Gong, T. Sundt, J. Rawn, J. Guttag

{"title":"Instance Weighting for Patient-Specific Risk Stratification Models","authors":"Jen J. Gong, T. Sundt, J. Rawn, J. Guttag","doi":"10.1145/2783258.2783397","DOIUrl":"https://doi.org/10.1145/2783258.2783397","url":null,"abstract":"Accurate risk models for adverse outcomes can provide important input to clinical decision-making. Surprisingly, one of the main challenges when using machine learning to build clinically useful risk models is the small amount of data available. Risk models need to be developed for specific patient populations, specific institutions, specific procedures, and specific outcomes. With each exclusion criterion, the amount of relevant training data decreases, until there is often an insufficient amount to learn an accurate model. This difficulty is compounded by the large class imbalance that is often present in medical applications. In this paper, we present an approach to address the problem of small data using transfer learning methods in the context of developing risk models for cardiac surgeries. We explore ways to build surgery-specific and hospital-specific models (the target task) using information from other kinds of surgeries and other hospitals (source tasks). We propose a novel method to weight examples based on their similarity to the target task training examples to take advantage of the useful examples while discounting less relevant ones. We show that incorporating appropriate source data in training can lead to improved performance over using only target task training data, and that our method of instance weighting can lead to further improvements. Applied to a surgical risk stratification task, our method, which used data from two institutions, performed comparably to the risk model published by the Society for Thoracic Surgeons, which was developed and tested on over one hundred thousand surgeries from hundreds of institutions.","PeriodicalId":243428,"journal":{"name":"Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining","volume":"275 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-08-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116555789","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 14