{"title":"On Generating All Optimal Monotone Classifications","authors":"Luite Stegeman, A. Feelders","doi":"10.1109/ICDM.2011.111","DOIUrl":"https://doi.org/10.1109/ICDM.2011.111","url":null,"abstract":"In many applications of data mining one knows beforehand that the response variable should be monotone (either increasing or decreasing) in the attributes. In ordinal classification, changing the class labels of a data set (relabeling) so that the data becomes monotone, is useful for at least two reasons. Firstly, models trained on relabeled data tend to have better predictive performance than models trained on the original data. Secondly, relabeling is an important building block for the construction of monotone classifiers. However, optimal monotone relabelings are rarely unique, and so far an efficient algorithm to generate them all has been lacking. The main result of this paper is an efficient algorithm to produce the structure of all optimal monotone relabelings. We also show that counting the solutions is #P-complete and give algorithms for efficiently enumerating all solutions, as well as sampling uniformly from the set of solutions. Experiments show that relabeling non-monotone data can improve the predictive performance of models trained on that data.","PeriodicalId":106216,"journal":{"name":"2011 IEEE 11th International Conference on Data Mining","volume":"101 9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-12-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126275279","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Partitionable Kernels for Mapping Kernels","authors":"Kilho Shin","doi":"10.1109/ICDM.2011.115","DOIUrl":"https://doi.org/10.1109/ICDM.2011.115","url":null,"abstract":"Many of tree kernels in the literature are designed tanking advantage of the mapping kernel framework. The most important advantage of using this framework is that we have a strong theorem to examine positive definiteness of the resulting tree kernels. In the mapping kernel framework, each data object is viewed as a collection of components, and a mapping kernel for a pair of data objects is determined as a sum of kernel values of component pairs over a certain range determined according to the purpose of use of the resulting mapping kernel. For those tree kernels known to belong to the mapping kernel category, the string kernel of the product type is commonly used to compute the kernel values of component pairs. This is because it is known that use of the product-type string kernel together with the mapping kernel framework allows us to have recursive formulas to calculate the resulting tree kernels efficiently. We significantly generalizes this result. In fact, we show that we can use partition able kernels, a new class of string kernels instead of the product-type string kernel to enjoy the same advantage, that is, efficient computation based on recursive formulas. The class of partition able kernels is abundant, and contains the product-type string kernels just as an instance. Also, this result, not limited to tree kernels, can be applied to general mapping kernels after we formalize the decomposition properties of trees as the new notion of pretty decomposability.","PeriodicalId":106216,"journal":{"name":"2011 IEEE 11th International Conference on Data Mining","volume":"70 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-12-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114156919","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Arnaud Soulet, Chedy Raïssi, M. Plantevit, B. Crémilleux
{"title":"Mining Dominant Patterns in the Sky","authors":"Arnaud Soulet, Chedy Raïssi, M. Plantevit, B. Crémilleux","doi":"10.1109/ICDM.2011.100","DOIUrl":"https://doi.org/10.1109/ICDM.2011.100","url":null,"abstract":"Pattern discovery is at the core of numerous data mining tasks. Although many methods focus on efficiency in pattern mining, they still suffer from the problem of choosing a threshold that influences the final extraction result. The goal of our study is to make the results of pattern mining useful from a user-preference point of view. To this end, we integrate into the pattern discovery process the idea of skyline queries in order to mine skyline patterns in a threshold-free manner. Because the skyline patterns satisfy a formal property of dominations, they not only have a global interest but also have semantics that are easily understood by the user. In this work, we first establish theoretical relationships between pattern condensed representations and skyline pattern mining. We also show that it is possible to compute automatically a subset of measures involved in the user query which allows the patterns to be condensed and thus facilitates the computation of the skyline patterns. This forms the basis for a novel approach to mining skyline patterns. We illustrate the efficiency of our approach over several data sets including a use case from chemo informatics and show that small sets of dominant patterns are produced under various measures.","PeriodicalId":106216,"journal":{"name":"2011 IEEE 11th International Conference on Data Mining","volume":"233 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-12-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121142313","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Improving Product Classification Using Images","authors":"A. Kannan, P. Talukdar, Nikhil Rasiwasia, Qifa Ke","doi":"10.1109/ICDM.2011.79","DOIUrl":"https://doi.org/10.1109/ICDM.2011.79","url":null,"abstract":"Product classification in Commerce search (eg{} Google Product Search, Bing Shopping) involves associating categories to offers of products from a large number of merchants. The categorized offers are used in many tasks including product taxonomy browsing and matching merchant offers to products in the catalog. Hence, learning a product classifier with high precision and recall is of fundamental importance in order to provide high quality shopping experience. A product offer typically consists of a short textual description and an image depicting the product. Traditional approaches to this classification task is to learn a classifier using only the textual descriptions of the products. In this paper, we show that the use of images, a weaker signal in our setting, in conjunction with the textual descriptions, a more discriminative signal, can considerably improve the precision of the classification task, irrespective of the type of classifier being used. We present a novel classification approach, Cross Adapt{} (CrossAdaptAcro{}), that is cognizant of the disparity in the discriminative power of different types of signals and hence makes use of the confusion matrix of dominant signal (text in our setting) to prudently leverage the weaker signal (image), for an improved performance. Our evaluation performed on data from a major Commerce search engine's catalog shows a 12% (absolute) improvement in precision at 100% coverage, and a 16% (absolute) improvement in recall at 90% precision compared to classifiers that only use textual description of products. In addition, CrossAdaptAcro{} also provides a more accurate classifier based only on the dominant signal (text) that can be used in situations in which only the dominant signal is available during application time.","PeriodicalId":106216,"journal":{"name":"2011 IEEE 11th International Conference on Data Mining","volume":"33 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-12-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115967844","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Learning from Negative Examples in Set-Expansion","authors":"Prateek Jindal, D. Roth","doi":"10.1109/ICDM.2011.86","DOIUrl":"https://doi.org/10.1109/ICDM.2011.86","url":null,"abstract":"This paper addresses the task of set-expansion on free text. Set-expansion has been viewed as a problem of generating an extensive list of instances of a concept of interest, given a few examples of the concept as input. Our key contribution is that we show that the concept definition can be significantly improved by specifying some negative examples in the input, along with the positive examples. The state-of-the art centroid-based approach to set-expansion doesn't readily admit the negative examples. We develop an inference-based approach to set-expansion which naturally allows for negative examples and show that it performs significantly better than a strong baseline.","PeriodicalId":106216,"journal":{"name":"2011 IEEE 11th International Conference on Data Mining","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-12-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123881243","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Robust Clustering Algorithm Based on Aggregated Heat Kernel Mapping","authors":"Hao Huang, Shinjae Yoo, Hong Qin, Dantong Yu","doi":"10.1109/ICDM.2011.15","DOIUrl":"https://doi.org/10.1109/ICDM.2011.15","url":null,"abstract":"Current spectral clustering algorithms suffer from both sensitivity to scaling parameter selection in similarity matrix construction, and data perturbation. This paper aims to improve robustness in clustering algorithms and combat these two limitations based on heat kernel theory. Heat kernel can statistically depict traces of random walk, so it has an intrinsic connection with diffusion distance, with which we can ensure robustness during any clustering process. By integrating heat distributed along time scale, we propose a novel method called Aggregated Heat Kernel (AHK) to measure the distance between each point pair in their eigen space. Using AHK and Laplace-Beltrami Normalization (LBN) we are able to apply an advanced noise-resisting robust spectral mapping to original dataset. Moreover it offers stability on scaling parameter tuning. Experimental results show that, compared to other popular spectral clustering methods, our algorithm can achieve robust clustering results on both synthetic and UCI real datasets.","PeriodicalId":106216,"journal":{"name":"2011 IEEE 11th International Conference on Data Mining","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-12-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131175383","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Using Frequent Closed Itemsets for Data Dimensionality Reduction","authors":"Petr Krajča, Jan Outrata, Vilém Vychodil","doi":"10.1109/ICDM.2011.154","DOIUrl":"https://doi.org/10.1109/ICDM.2011.154","url":null,"abstract":"We address important issues of dimensionality reduction of transactional data sets where the input data consists of lists of transactions, each of them being a finite set of items. The reduction consists in finding a small set of new items, so-called factor-items, which is considerably smaller than the original set of items while comprising full or nearly full information about the original items. Using this type of reduction, the original data set can be represented by a smaller transactional data set using factor-items instead of the original items, thus reducing its dimensionality. The procedure utilized in this paper is based on approximate Boolean matrix decomposition. In this paper, we focus on the role of frequent closed item sets that can be used to determine factor-items. We present the factorization problem, its reduction to Boolean matrix decompositions, experiments with publicly available data sets, and an algorithm for computing decompositions.","PeriodicalId":106216,"journal":{"name":"2011 IEEE 11th International Conference on Data Mining","volume":"16 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-12-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132750470","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Discovery of Versatile Temporal Subspace Patterns in 3-D Datasets","authors":"Zhen Hu, R. Bhatnagar","doi":"10.1109/ICDM.2011.56","DOIUrl":"https://doi.org/10.1109/ICDM.2011.56","url":null,"abstract":"Most existing methods for clustering temporal data are based on either a strict similarity metric or a precisely defined temporal profile such as a sine, exponential wave etc. Also, these methods compute similarity metric across the entire time-span of the objects. However these types of temporal patterns are more useful in many biological analysis, where it is important to observe gene expression pattens across arbitrary subintervals. These types of temporal patterns are very useful in bioinformatics, where it is important to observe gene expression pattens across arbitrary subintervals. In this paper we present an algorithm for searching for multiple contiguous temporal subintervals within which the selected objects demonstrate existence of clear patterns. We demonstrate the power and advantages of our algorithm by using a synthetic dataset and a pharmacokinetics dataset for which other researchers have recently published their results. We compare and contrast our results with these results to show superiority of our approach.","PeriodicalId":106216,"journal":{"name":"2011 IEEE 11th International Conference on Data Mining","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-12-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134012651","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Semi-supervised Hierarchical Clustering","authors":"Li Zheng, Tao Li","doi":"10.1109/ICDM.2011.130","DOIUrl":"https://doi.org/10.1109/ICDM.2011.130","url":null,"abstract":"Semi-supervised clustering (i.e., clustering with knowledge-based constraints) has emerged as an important variant of the traditional clustering paradigms. However, most existing semi-supervised clustering algorithms are designed for partitional clustering methods and few research efforts have been reported on semi-supervised hierarchical clustering methods. In addition, current semi-supervised clustering methods have been focused on the use of background information in the form of instance level must-link and cannot-link constraints, which are not suitable for hierarchical clustering where data objects are linked over different hierarchy levels. In this paper, we propose a novel semi-supervised hierarchical clustering framework based on ultra-metric dendrogram distance. The proposed framework is able to incorporate triple-wise relative constraints. We establish the connection between hierarchical clustering and ultra-metric transformation of dissimilarity matrix and propose two techniques (the constrained optimization technique and the transitive dissimilarity based technique) for semi-supervised hierarchical clustering. Experimental results demonstrate the effectiveness and the efficiency of our proposed methods.","PeriodicalId":106216,"journal":{"name":"2011 IEEE 11th International Conference on Data Mining","volume":"42 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-12-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131825394","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Cross-Temporal Link Prediction","authors":"S. Oyama, K. Hayashi, H. Kashima","doi":"10.1109/ICDM.2011.45","DOIUrl":"https://doi.org/10.1109/ICDM.2011.45","url":null,"abstract":"The increasing interest in dynamically changing networks has led to growing interest in a more general link prediction problem called temporal link prediction in the data mining and machine learning communities. However, only links in identical time frames are considered in temporal link prediction. We propose a new link prediction problem called cross-temporal link prediction in which the links among nodes in different time frames are inferred. A typical example of cross-temporal link prediction is cross-temporal entity resolution to determine the identity of real entities represented by data objects observed in different time periods. In dynamic environments, the features of data change over time, making it difficult to identify cross-temporal links by directly comparing observed data. Other examples of cross-temporal links are asynchronous communications in social networks such as Face book and Twitter, where a message is posted in reply to a previous message. We adopt a dimension reduction approach to cross-temporal link prediction, that is, data objects in different time frames are mapped into a common low-dimensional latent feature space, and the links are identified on the basis of the distance between the data objects. The proposed method uses different low-dimensional feature projections in different time frames, enabling it to adapt to changes in the latent features over time. Using multi-task learning, it jointly learns a set of feature projection matrices from the training data, given the assumption of temporal smoothness of the projections. The optimal solutions are obtained by solving a single generalized eigenvalue problem. Experiments using a real-world set of bibliographic data for cross-temporal entity resolution showed that introducing time-dependent feature projections improves the accuracy of link prediction.","PeriodicalId":106216,"journal":{"name":"2011 IEEE 11th International Conference on Data Mining","volume":"31 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-12-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131719765","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}