Semantic Frame-Based Document Representation for Comparable Corpora
Hyungsul Kim, Xiang Ren, Yizhou Sun, Chi Wang, Jiawei Han
2013 IEEE 13th International Conference on Data Mining. DOI: 10.1109/ICDM.2013.99

Document representation is a fundamental problem for text mining. Many efforts have been made to generate concise yet semantic representations, such as bag-of-words, phrase, sentence, and topic-level descriptions. Nevertheless, most existing techniques encounter difficulties in handling a monolingual comparable corpus, i.e., a collection of monolingual documents conveying the same topic. In this paper, we propose the use of the frame, a high-level semantic unit, and construct frame-based representations that semantically describe documents as bags of frames, using an information network approach. One major challenge in this representation is that semantically similar frames may take different forms. For example, "radiation leaked" in one news article can appear as "the level of radiation increased" in another. To tackle this problem, a text-based information network is constructed among frames and words, and a link-based similarity measure called SynRank is proposed to calculate similarity between frames. As a result, different variations of semantically similar frames are merged into a single descriptive frame using clustering, and a document can then be represented as a bag of representative frames. It turns out that frame-based document representation is not only more interpretable but also facilitates other text analysis tasks such as event tracking. We conduct both qualitative and quantitative experiments on three comparable news corpora to study the effectiveness of frame-based document representation and the similarity measure SynRank, and demonstrate the superior performance of frame-based document representation on different real-world applications.
Distributed Column Subset Selection on MapReduce
Ahmed K. Farahat, Ahmed Elgohary, A. Ghodsi, M. Kamel
2013 IEEE 13th International Conference on Data Mining. DOI: 10.1109/ICDM.2013.155

Given a very large data set distributed over a cluster of several nodes, this paper addresses the problem of selecting a few data instances that best represent the entire data set. The solution to this problem is of crucial importance in the big data era, as it enables data analysts to understand the data and explore its hidden structure. The selected instances can also be used for data preprocessing tasks such as learning a low-dimensional embedding of the data points or computing a low-rank approximation of the corresponding matrix. The paper first formulates the problem as the selection of a few representative columns from a matrix whose columns are massively distributed, and then proposes a MapReduce algorithm for selecting those representatives. The algorithm first learns a concise representation of all columns using random projection, and then solves a generalized column subset selection problem at each machine, in which a subset of columns is selected from the sub-matrix on that machine such that the reconstruction error of the concise representation is minimized. The paper then demonstrates the effectiveness and efficiency of the proposed algorithm through an empirical evaluation on benchmark data sets.
{"title":"Nonlinear Causal Discovery for High Dimensional Data: A Kernelized Trace Method","authors":"Zhitang Chen, Kun Zhang, L. Chan","doi":"10.1109/ICDM.2013.103","DOIUrl":"https://doi.org/10.1109/ICDM.2013.103","url":null,"abstract":"Causal discovery for high-dimensional observations is a useful tool in many fields such as climate analysis and financial market analysis. A linear Trace method has been proposed to identify the causal direction between two linearly coupled high-dimensional observations X and Y. However, in reality, the relations between X and Y are usually nonlinear and consequently the linear Trace method may fail. In this paper, we propose a method to infer the nonlinear causal relations for two high-dimensional observations X and Y. The idea is to map the observations to high dimensional Reproducing Kernel Hilbert Space (RKHS) such that the nonlinear relations become simple linear ones. We show that the linear Trace condition holds for the causal direction but it is violated for the anti-causal direction in RKHS. Based on this theoretical result, we develop a simple algorithm to infer the causal direction for nonlinearly coupled causal pairs. Synthetic data and real world data experiments are conducted to show the effectiveness of our proposed method.","PeriodicalId":308676,"journal":{"name":"2013 IEEE 13th International Conference on Data Mining","volume":"13 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121508751","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"From Social User Activities to People Affiliation","authors":"Guangxiang Zeng, Ping Luo, Enhong Chen, Min Wang","doi":"10.1109/ICDM.2013.101","DOIUrl":"https://doi.org/10.1109/ICDM.2013.101","url":null,"abstract":"This study addresses the problem of inferring users' employment affiliation information from social activities. It is motivated by the applications which need to monitoring and analyzing the social activities of the employees from a given company, especially their social tracks related to the work and business. It definitely helps to better understand their needs and opinions towards certain business area, so that the account sales targeting these customers in the given company can adjust the sales strategies accordingly. Specifically, in this task we are given a snapshot of a social network and some labeled social users who are the employees of a given company. Our goal is to identify more users from the same company. We formulate this problem as a task of classifying nodes over a graph, and develop a Supervised Label Propagation model. It naturally incorporates the rich set of features for social activities, models the networking effect by label propagation, and learns the feature weights so that the labels are propagated to the right users. To validate its effectiveness, we show our case studies on identifying the employees of \"China Telecom\" and \"China Unicom\" from Sina Weibo. The experimental results show that our method significantly outperforms the compared baseline ones.","PeriodicalId":308676,"journal":{"name":"2013 IEEE 13th International Conference on Data Mining","volume":"161 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121605554","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Efficient and Scalable Information Geometry Metric Learning","authors":"Wei Wang, Bao-Gang Hu, Zengfu Wang","doi":"10.1109/ICDM.2013.67","DOIUrl":"https://doi.org/10.1109/ICDM.2013.67","url":null,"abstract":"Information Geometry Metric Learning (IGML) is shown to be an effective algorithm for distance metric learning. In this paper, we attempt to alleviate two limitations of IGML: (A) the time complexity of IGML increases rapidly for high dimensional data, (B) IGML has to transform the input low rank kernel into a full-rank one since it is undefined for singular matrices. To this end, two novel algorithms, referred to as Efficient Information Geometry Metric Learning (EIGML) and Scalable Information Geometry Metric Learning (SIGML), are proposed. EIGML scales linearly with the dimensionality, resulting in significantly reduced computational complexity. As for SIGML, it is proven to have a range-space preserving property. Following this property, SIGML is found to be capable of handling both full-rank and low-rank kernels. Additionally, the geometric information from data is further exploited in SIGML. In contrast to most existing metric learning methods, both EIGML and SIGML have closed-form solutions and can be efficiently optimized. Experimental results on various data sets demonstrate that the proposed methods outperform the state-of-the-art metric learning algorithms.","PeriodicalId":308676,"journal":{"name":"2013 IEEE 13th International Conference on Data Mining","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131519210","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Cox Regression with Correlation Based Regularization for Electronic Health Records","authors":"B. Vinzamuri, C. Reddy","doi":"10.1109/ICDM.2013.89","DOIUrl":"https://doi.org/10.1109/ICDM.2013.89","url":null,"abstract":"Survival Regression models play a vital role in analyzing time-to-event data in many practical applications ranging from engineering to economics to healthcare. These models are ideal for prediction in complex data problems where the response is a time-to-event variable. An event is defined as the occurrence of a specific event of interest such as a chronic health condition. Cox regression is one of the most popular survival regression model used in such applications. However, these models have the tendency to over fit the data which is not desirable for healthcare applications because it limits their generalization to other hospital scenarios. In this paper, we address these challenges for the cox regression model. We combine two unique correlation based regularizers with cox regression to handle correlated and grouped features which are commonly seen in many practical problems. The proposed optimization problems are solved efficiently using cyclic coordinate descent and Alternate Direction Method of Multipliers algorithms. We conduct experimental analysis on the performance of these algorithms over several synthetic datasets and electronic health records (EHR) data about heart failure diagnosed patients from a hospital. We demonstrate through our experiments that these regularizers effectively enhance the ability of cox regression to handle correlated features. In addition, we extensively compare our results with other regularized linear and logistic regression algorithms. We validate the goodness of the features selected by these regularized cox regression models using the biomedical literature and different feature selection algorithms.","PeriodicalId":308676,"journal":{"name":"2013 IEEE 13th International Conference on Data Mining","volume":"25 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132543555","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Fast Pairwise Query Selection for Large-Scale Active Learning to Rank
B. Qian, Xiang Wang, Jun Wang, Hongfei Li, Nan Cao, Weifeng Zhi, I. Davidson
2013 IEEE 13th International Conference on Data Mining. DOI: 10.1109/ICDM.2013.54

Pairwise learning-to-rank algorithms (such as RankSVM) teach a machine how to rank objects given a collection of ordered object pairs. However, their accuracy is highly dependent on the abundance of training data. To address this limitation and reduce annotation effort, the framework of active pairwise learning to rank was introduced recently. In such a framework, however, the number of possible query pairs increases quadratically with the number of instances. In this work, we present the first scalable pairwise query selection method, using a layered (two-step) hashing framework. The first step, relevance hashing, aims to retrieve the strongly relevant or highly ranked points; the second step, uncertainty hashing, is used to nominate pairs whose ranking is uncertain. The proposed framework efficiently reduces the search space of pairwise queries and can be used with any pairwise learning-to-rank algorithm with a linear ranking function. We evaluate our approach on large-scale real problems and show that its performance is comparable to exhaustive search. The experimental results demonstrate the effectiveness of our approach and validate the efficiency of hashing in accelerating the search over massive pairwise queries.
{"title":"Discovering Non-redundant Overlapping Biclusters on Gene Expression Data","authors":"Duy Tin Truong, R. Battiti, M. Brunato","doi":"10.1109/ICDM.2013.36","DOIUrl":"https://doi.org/10.1109/ICDM.2013.36","url":null,"abstract":"Given a gene expression data matrix where each cell is the expression level of a gene under a certain condition, biclustering is the problem of searching for a subset of genes that co regulate and co express only under a subset of conditions. As some genes can belong to different functional categories, searching for non-redundant overlapping biclusters is an important problem in biclustering. However, most recent algorithms can only either produce disjoint biclusters or redundant biclusters with significant overlap. In other words, these algorithms do not allow users to specify the maximum overlap between the biclusters. In this paper, we propose a novel algorithm which can generate K overlapping biclusters where the maximum overlap between them is below a predefined threshold. Unlike the other approaches which often generate all biclusters at once, our algorithm produces the biclusters sequentially, where each newly generated bicluster is guaranteed to be different from the previous ones but can still overlap with them. The experiments on real datasets confirm that different meaningful overlapping biclusters are successfully discovered. Besides, under the same constraints, our algorithm returns much larger and higher-quality biclusters compared to those of the other state-of-the art algorithms.","PeriodicalId":308676,"journal":{"name":"2013 IEEE 13th International Conference on Data Mining","volume":"36 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121578243","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Conformal Prediction Using Decision Trees","authors":"U. Johansson, Henrik Boström, Tuwe Löfström","doi":"10.1109/ICDM.2013.85","DOIUrl":"https://doi.org/10.1109/ICDM.2013.85","url":null,"abstract":"Conformal prediction is a relatively new framework in which the predictive models output sets of predictions with a bound on the error rate, i.e., in a classification context, the probability of excluding the correct class label is lower than a predefined significance level. An investigation of the use of decision trees within the conformal prediction framework is presented, with the overall purpose to determine the effect of different algorithmic choices, including split criterion, pruning scheme and way to calculate the probability estimates. Since the error rate is bounded by the framework, the most important property of conformal predictors is efficiency, which concerns minimizing the number of elements in the output prediction sets. Results from one of the largest empirical investigations to date within the conformal prediction framework are presented, showing that in order to optimize efficiency, the decision trees should be induced using no pruning and with smoothed probability estimates. The choice of split criterion to use for the actual induction of the trees did not turn out to have any major impact on the efficiency. Finally, the experimentation also showed that when using decision trees, standard inductive conformal prediction was as efficient as the recently suggested method cross-conformal prediction. This is an encouraging results since cross-conformal prediction uses several decision trees, thus sacrificing the interpretability of a single decision tree.","PeriodicalId":308676,"journal":{"name":"2013 IEEE 13th International Conference on Data Mining","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131365807","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Multi-instance Multi-graph Dual Embedding Learning","authors":"Jia Wu, Xingquan Zhu, Chengqi Zhang, Z. Cai","doi":"10.1109/ICDM.2013.121","DOIUrl":"https://doi.org/10.1109/ICDM.2013.121","url":null,"abstract":"Multi-instance learning concerns about building learning models from a number of labeled instance bags, where each bag consists of instances with unknown labels. A bag is labeled positive if one or more multiple instances inside the bag is positive, and negative otherwise. For all existing multi-instance learning algorithms, they are only applicable to the setting where instances in each bag are represented by a set of well defined feature values. In this paper, we advance the problem to a multi-instance multi-graph setting, where a bag contains a number of instances and graphs in pairs, and the learning objective is to derive classification models from labeled bags, containing both instances and graphs, to predict previously unseen bags with maximum accuracy. To achieve the goal, the main challenge is to properly represent graphs inside each bag and further take advantage of complementary information between instance and graph pairs for learning. In the paper, we propose a Dual Embedding Multi-Instance Multi-Graph Learning (DE-MIMG) algorithm, which employs a dual embedding learning approach to (1) embed instance distributions into the informative sub graphs discovery process, and (2) embed discovered sub graphs into the instance feature selection process. The dual embedding process results in an optimal representation for each bag to provide combined instance and graph information for learning. Experiments and comparisons on real-world multi-instance multi-graph learning tasks demonstrate the algorithm performance.","PeriodicalId":308676,"journal":{"name":"2013 IEEE 13th International Conference on Data Mining","volume":"56 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116457950","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}