{"title":"Clustering categorical data: A stability analysis framework","authors":"I. Jarman, T. Etchells, P. Lisboa, Charlene Beynon, J. Martín-Guerrero","doi":"10.1109/CIDM.2011.5949452","DOIUrl":"https://doi.org/10.1109/CIDM.2011.5949452","url":null,"abstract":"Clustering to identify inherent structure is an important first step in data exploration. The k-means algorithm is a popular choice, but K-means is not generally appropriate for categorical data. A specific extension of k-means for categorical data is the k-modes algorithm. Both of these partition clustering methods are sensitive to the initialization of prototypes, which creates the difficulty of selecting the best solution for a given problem. In addition, selecting the number of clusters can be an issue. Further, the k-modes method is especially prone to instability when presented with ‘noisy’ data, since the calculation of the mode lacks the smoothing effect inherent in the calculation of the mean. This is often the case with real-world datasets, for instance in the domain of Public Health, resulting in solutions that can be radically different depending on the initialization and therefore lead to different interpretations. This paper presents two methodologies. The first addresses sensitivity to initializations using a generic landscape mapping of k-mode solutions. The second methodology utilizes the landscape map to stabilize the partition clusters for discrete data, by drawing a consensus sample in order to separate signal from noise components. Results are presented for the benchmark soybean disease dataset, an artificially generated dataset and a case study involving Public Health data.","PeriodicalId":211565,"journal":{"name":"2011 IEEE Symposium on Computational Intelligence and Data Mining (CIDM)","volume":"42 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-04-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127129739","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"KB-CB-N classification: Towards unsupervised approach for supervised learning","authors":"Z. Abdallah, M. Gaber","doi":"10.1109/CIDM.2011.5949435","DOIUrl":"https://doi.org/10.1109/CIDM.2011.5949435","url":null,"abstract":"Data classification has attracted considerable research attention in the field of computational statistics and data mining due to its wide range of applications. K Best Cluster Based Neighbour (KB-CB-N) is our novel classification technique based on the integration of three different similarity measures for cluster based classification. The basic principle is to apply unsupervised learning on the instances of each class in the dataset and then use the output as an input for the classification algorithm to find the K best neighbours of clusters from the density, gravity and distance perspectives. Clustering is applied as an initial step within each class to find the inherent in-class grouping in the dataset. Different data clustering techniques use different similarity measures. Each measure has its own strength and weakness. Thus, combining the three measures can benefit from the strength of each one and eliminate encountered problems of using an individual measure. Extensive experimental results using eight real datasets have evidenced that our new technique typically shows improved or equivalent performance over other existing state-of-the-art classification methods.","PeriodicalId":211565,"journal":{"name":"2011 IEEE Symposium on Computational Intelligence and Data Mining (CIDM)","volume":"63 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-04-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133416846","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Online autoregressive prediction in time series with delayed disclosure","authors":"J. Andreoli, Marie-Luise Schneider","doi":"10.1109/CIDM.2011.5949440","DOIUrl":"https://doi.org/10.1109/CIDM.2011.5949440","url":null,"abstract":"We propose a supervised machine learning method to automate the classification of events within time series in a monitoring context. It is based on a generative stochastic model of the time series which combines a probabilistic autoregressive classifier to determine the class label of each event, and a hidden Markov model to capture the production of the events. Events can be described by arbitrary combinations of discrete and continuous features. While at training time (offline), it is assumed that the class labels of all the events are known, at inference time (online), when a prediction is to be made for an event, it is not assumed that the class labels of the preceding events are known. This makes prediction more complex due to the autoregressive nature of the model. Instead, we make and exploit a “delayed disclosure” assumption, namely that the class labels of all the events are eventually revealed, but the occurrence of an event and the revelation of its class are asynchronous. We report experimental results obtained by application of this approach to the monitoring of a fleet of distributed devices.","PeriodicalId":211565,"journal":{"name":"2011 IEEE Symposium on Computational Intelligence and Data Mining (CIDM)","volume":"246 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-04-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121920908","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Partially supervised k-harmonic means clustering","authors":"T. Runkler","doi":"10.1109/CIDM.2011.5949424","DOIUrl":"https://doi.org/10.1109/CIDM.2011.5949424","url":null,"abstract":"A popular algorithm for finding clusters in unlabeled data optimizes the k-means clustering model. This algorithm converges quickly but is sensitive to initialization. Two ways to overcome this drawback are fuzzification and harmonic means. We show that k-harmonic means is a special case of reformulated fuzzy k-means. The main focus of this paper is on partially supervised clustering. Partially supervised clustering finds clusters in data sets that contain both unlabeled and labeled data. We review partially supervised k-means, partially supervised fuzzy k-means, and introduce a partially supervised extension of k-harmonic means. Experiments with four benchmark data sets indicate that partially supervised k-harmonic means inherits the advantages of its completely unsupervised variant: It is significantly less sensitive to initialization than partially supervised k-means.","PeriodicalId":211565,"journal":{"name":"2011 IEEE Symposium on Computational Intelligence and Data Mining (CIDM)","volume":"148 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-04-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121478306","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Increased classification accuracy and speedup through pair-wise feature selection for support vector machines","authors":"K. Kramer, Dmitry Goldgof, L. Hall, A. Remsen","doi":"10.1109/CIDM.2011.5949457","DOIUrl":"https://doi.org/10.1109/CIDM.2011.5949457","url":null,"abstract":"Support vector machines are binary classifiers that can implement multi-class classifiers by creating a classifier for each possible combination of classes or for each class using a one class versus all strategy. Feature selection algorithms often search for a single set of features to be used by each of the binary classifiers. This ignores the fact that features that may be good discriminators for two particular classes might not do well for other class combinations. As a result, the feature selection process may not include these features in the common set to be used by all support vector machines. It is shown that by selecting features for each binary class combination, overall classification accuracy can be improved (as much as 2.1%), feature selection time can be significantly reduced (speed up of 3.2 times), and time required for training a multi-class support vector machine is reduced. Another benefit of this approach is that considerably less time is required for feature selection when additional classes are added to the training data. This is because the features selected for the existing class combinations are still valid, so that feature selection only needs to be run for the new class combinations created.","PeriodicalId":211565,"journal":{"name":"2011 IEEE Symposium on Computational Intelligence and Data Mining (CIDM)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-04-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129496053","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Partial generalized correlation for hyperspectral data","authors":"M. Strickert, B. Labitzke, V. Blanz","doi":"10.1109/CIDM.2011.5949422","DOIUrl":"https://doi.org/10.1109/CIDM.2011.5949422","url":null,"abstract":"A variational approach is proposed for the unsupervised assessment of attribute variability of high-dimensional data given a differentiable similarity measure. The key question addressed is how much each data attribute contributes to an optimum transformation of vectors for reaching maximum similarity. This question is formalized and solved in a mathematically rigorous optimization framework for each data pair of interest. Trivially, for the Euclidean metric minimization to zero distance induces highest vector similarity, but in case of the linear Pearson correlation measure the highest similarity of one is desired. During optimization the not necessarily symmetric trajectories between two vectors are recorded and analyzed in terms of attribute changes and line integral. The proposed formalism allows to assess partial covariance and correlation characteristics of data attributes for vectors being compared by any differentiable similarity measure. Its potential for generating alternative and localized views such as for contrast enhancement is demonstrated for hyperspectral images from the remote sensing domain.","PeriodicalId":211565,"journal":{"name":"2011 IEEE Symposium on Computational Intelligence and Data Mining (CIDM)","volume":"196 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-04-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133717725","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Periodic quick test for classifying long-term activities","authors":"Pekka Siirtola, Heli Koskimäki, J. Röning","doi":"10.1109/CIDM.2011.5949426","DOIUrl":"https://doi.org/10.1109/CIDM.2011.5949426","url":null,"abstract":"A novel method to classify long-term human activities is presented in this study. The method consists of two parts: quick test and periodic classification. The quick test uses temporal information to improve recognition accuracy, while the periodic classification is based on the assumption that recognized activities are long-term. Periodic quick test (PQT) classification was tested using a data set consisting of six long-term sports exercises. The data were collected from six persons wearing a two-dimensional accelerometer on their wrist. The results show that the presented method is not only faster than a normal method, that does not use temporal information and does not assume that activities are long-term, but also more accurate. The results were compared with a normal sliding window technique which divides signal into smaller sequences and classifies each sequence into one of the six classes. The classification accuracy using a normal method was around 84% while using PQT the recognition rate was over 90%. In addition, the number of classified sequences using a normal method was over six times higher than using PQT.","PeriodicalId":211565,"journal":{"name":"2011 IEEE Symposium on Computational Intelligence and Data Mining (CIDM)","volume":"220 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-04-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116384545","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"FGMAC: Frequent subgraph mining with Arc Consistency","authors":"Brahim Douar, M. Liquiere, C. Latiri, Y. Slimani","doi":"10.1109/CIDM.2011.5949436","DOIUrl":"https://doi.org/10.1109/CIDM.2011.5949436","url":null,"abstract":"With the important growth of requirements to analyze large amount of structured data such as chemical compounds, proteins structures, XML documents, to cite but a few, graph mining has become an attractive track and a real challenge in the data mining field. Among the various kinds of graph patterns, frequent subgraphs seem to be relevant in characterizing graphsets, discriminating different groups of sets, and classifying and clustering graphs. Because of the NP-Completeness of subgraph isomorphism test as well as the huge search space, fragment miners are exponential in runtime and/or memory consumption. In this paper we study a new polynomial projection operator named AC-Projection based on a key technique of constraint programming namely Arc Consistency (AC). This is intended to replace the use of the exponential subgraph isomorphism. We study the relevance of frequent AC-reduced graph patterns on classification and we prove that we can achieve an important performance gain without or with non-significant loss of discovered pattern's quality.","PeriodicalId":211565,"journal":{"name":"2011 IEEE Symposium on Computational Intelligence and Data Mining (CIDM)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-04-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131819513","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Multiple query-dependent RankSVM aggregation for document retrieval","authors":"Yang Wang, Min Lu, X. Pang, Maoqiang Xie, Yalou Huang","doi":"10.1109/CIDM.2011.5949420","DOIUrl":"https://doi.org/10.1109/CIDM.2011.5949420","url":null,"abstract":"This paper is concerned with supervised rank aggregation, which aims to improve the ranking performance by combining the outputs from multiple rankers. However, there are two main shortcomings in previous rank aggregation approaches. Firstly, the learned weights for base rankers do not distinguish the differences among queries. This is suboptimal since queries vary significantly in terms of ranking. Besides, most current aggregation functions are unsupervised. A supervised aggregation function could further improve the ranking performance. In this paper, the significant difference existing among queries is taken into consideration, and a supervised rank aggregation approach is proposed. As a case study, we employ RankSVM model to aggregate the base rankers, referred to as Q.D.RSVM, and prove that Q.D.RSVM can set up query-dependent weights for different base rankers. Experimental results based on benchmark datasets show our approach outperforms conventional ranking approaches.","PeriodicalId":211565,"journal":{"name":"2011 IEEE Symposium on Computational Intelligence and Data Mining (CIDM)","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-04-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125509437","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A GPU-based interactive bio-inspired visual clustering","authors":"U. Erra, Bernardino Frola, V. Scarano","doi":"10.1109/CIDM.2011.5949300","DOIUrl":"https://doi.org/10.1109/CIDM.2011.5949300","url":null,"abstract":"In this work, we present an interactive visual clustering approach for the exploration and analysis of vast volumes of data. Our proposed approach is a bio-inspired collective behavioral model to be used in a 3D graphics environment. Our paper illustrates an extension of the behavioral model for clustering and a parallel implementation, using Compute Unified Device Architecture to exploit the computational power of Graphics Processor Units (GPUs). The advantage of our approach is that, as data enters the environment, the user is directly involved in the data mining process. Our experiments illustrate the effectiveness and efficiency provided by our approach when applied to a number of real and synthetic data sets.","PeriodicalId":211565,"journal":{"name":"2011 IEEE Symposium on Computational Intelligence and Data Mining (CIDM)","volume":"511 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-04-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130005845","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}