{"title":"A Randomized Approach for Approximating the Number of Frequent Sets","authors":"Mario Boley, H. Grosskreutz","doi":"10.1109/ICDM.2008.85","DOIUrl":"https://doi.org/10.1109/ICDM.2008.85","url":null,"abstract":"We investigate the problem of counting the number of frequent (item)sets - a problem known to be intractable in terms of an exact polynomial time computation. In this paper, we show that it is in general also hard to approximate. Subsequently, a randomized counting algorithm is developed using the Markov chain Monte Carlo method. While for general inputs an exponential running time is needed in order to guarantee a certain approximation bound, we empirically show that the algorithm still has the desired accuracy on real-world datasets when its running time is capped polynomially.","PeriodicalId":252958,"journal":{"name":"2008 Eighth IEEE International Conference on Data Mining","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-12-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122934017","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"SeqStream: Mining Closed Sequential Patterns over Stream Sliding Windows","authors":"Lei Chang, Tengjiao Wang, Dongqing Yang, Hua Luan","doi":"10.1109/ICDM.2008.36","DOIUrl":"https://doi.org/10.1109/ICDM.2008.36","url":null,"abstract":"Previous studies have shown mining closed patterns provides more benefits than mining the complete set of frequent patterns, since closed pattern mining leads to more compact results and more efficient algorithms. It is quite useful in a data stream environment where memory and computation power are major concerns. This paper studies the problem of mining closed sequential patterns over data stream sliding windows. A synopsis structure IST (Inverse Closed Sequence Tree) is designed to keep inverse closed sequential patterns in current window. An efficient algorithm SeqStream is developed to mine closed sequential patterns in stream windows incrementally, and various novel strategies are adopted in SeqStream to prune search space aggressively. Extensive experiments on both real and synthetic data sets show that SeqStream outperforms PrefixSpan, CloSpan and BIDE by a factor of about one to two orders of magnitude.","PeriodicalId":252958,"journal":{"name":"2008 Eighth IEEE International Conference on Data Mining","volume":"27 10","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-12-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"113931693","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Text Cube: Computing IR Measures for Multidimensional Text Database Analysis","authors":"C. Lin, Bolin Ding, Jiawei Han, Feida Zhu, Bo Zhao","doi":"10.1109/ICDM.2008.135","DOIUrl":"https://doi.org/10.1109/ICDM.2008.135","url":null,"abstract":"Since Jim Gray introduced the concept of “data cube” in 1997, data cube, associated with online analytical processing (OLAP), has become a driving engine in data warehouse industry. Because the boom of Internet has given rise to an ever increasing amount of text data associated with other multidimensional information, it is natural to propose a data cube model that integrates the power of traditional OLAP and IR techniques for text. In this paper, we propose a text-cube model on multidimensional text database and study effective OLAP over such data. Two kinds of hierarchies are distinguishable inside: dimensional hierarchy and term hierarchy. By incorporating these hierarchies, we conduct systematic studies on efficient text-cube implementation, OLAP execution and query processing. Our performance study shows the high promise of our methods.","PeriodicalId":252958,"journal":{"name":"2008 Eighth IEEE International Conference on Data Mining","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-12-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114241065","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Frequent Subgraph Retrieval in Geometric Graph Databases","authors":"Sebastian Nowozin, K. Tsuda","doi":"10.1109/ICDM.2008.38","DOIUrl":"https://doi.org/10.1109/ICDM.2008.38","url":null,"abstract":"Discovery of knowledge from geometric graph databases is of particular importance in chemistry and biology, because chemical compounds and proteins are represented as graphs with 3D geometric coordinates. In such applications, scientists are not interested in the statistics of the whole database. Instead they need information about a novel drug candidate or protein at hand, represented as a query graph. We propose a polynomial-delay algorithm for geometric frequent subgraph retrieval. It enumerates all subgraphs of a single given query graph which are frequent geometric ε-subgraphs under the entire class of rigid geometric transformations in a database. By using geometric ε-subgraphs, we achieve tolerance against variations in geometry. We compare the proposed algorithm to gSpan on chemical compound data, and we show that for a given minimum support the total number of frequent patterns is substantially limited by requiring geometric matching. Although the computation time per pattern is larger than for non-geometric graph mining, the total time is within a reasonable level even for small minimum support.","PeriodicalId":252958,"journal":{"name":"2008 Eighth IEEE International Conference on Data Mining","volume":"21 5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-12-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123697518","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Document-Word Co-regularization for Semi-supervised Sentiment Analysis","authors":"Vikas Sindhwani, Prem Melville","doi":"10.1109/ICDM.2008.113","DOIUrl":"https://doi.org/10.1109/ICDM.2008.113","url":null,"abstract":"The goal of sentiment prediction is to automatically identify whether a given piece of text expresses positive or negative opinion towards a topic of interest. One can pose sentiment prediction as a standard text categorization problem, but gathering labeled data turns out to be a bottleneck. Fortunately, background knowledge is often available in the form of prior information about the sentiment polarity of words in a lexicon. Moreover, in many applications abundant unlabeled data is also available. In this paper, we propose a novel semi-supervised sentiment prediction algorithm that utilizes lexical prior knowledge in conjunction with unlabeled examples. Our method is based on joint sentiment analysis of documents and words based on a bipartite graph representation of the data. We present an empirical study on a diverse collection of sentiment prediction problems which confirms that our semi-supervised lexical models significantly outperform purely supervised and competing semi-supervised techniques.","PeriodicalId":252958,"journal":{"name":"2008 Eighth IEEE International Conference on Data Mining","volume":"36 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-12-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129078938","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Generative Probabilistic Model for Multi-label Classification","authors":"Hongning Wang, Minlie Huang, Xiaoyan Zhu","doi":"10.1109/ICDM.2008.86","DOIUrl":"https://doi.org/10.1109/ICDM.2008.86","url":null,"abstract":"Traditional discriminative classification method makes little attempt to reveal the probabilistic structure and the correlation within both input and output spaces. In the scenario of multi-label classification, most of the classifiers simply assume the predefined classes are independently distributed, which would definitely hinder the classification performance when there are intrinsic correlations between the classes. In this article, we propose a generative probabilistic model, the Correlated Labeling Model (CoL Model), to formulate the correlation between different classes. The CoL model is presented to capture the correlation between classes and the underlying structures via the latent random variables in a supervised manner. We develop a variational procedure to approximate the posterior distribution and employ the EM algorithm for the empirical Bayes parameter estimation. In our evaluations, the proposed model achieved promising results on various data sets.","PeriodicalId":252958,"journal":{"name":"2008 Eighth IEEE International Conference on Data Mining","volume":"30 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-12-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130213689","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Computational Discovery of Motifs Using Hierarchical Clustering Techniques","authors":"Dianhui Wang, Nung Kion Lee","doi":"10.1109/ICDM.2008.21","DOIUrl":"https://doi.org/10.1109/ICDM.2008.21","url":null,"abstract":"Discovery of motifs plays a key role in understanding gene regulation in organisms. Existing tools for motif discovery demonstrate some weaknesses in dealing with reliability and scalability. Therefore, development of advanced algorithms for resolving this problem will be useful. This paper aims to develop data mining techniques for discovering motifs. A mismatch based hierarchical clustering algorithm is proposed in this paper, where three heuristic rules for classifying clusters and a post-processing for ranking and refining the clusters are employed in the algorithm. Our algorithm is evaluated using two sets of DNA sequences with comparisons. Results demonstrate that the proposed techniques in this paper outperform MEME, AlignACE and SOMBRERO for most of the testing datasets.","PeriodicalId":252958,"journal":{"name":"2008 Eighth IEEE International Conference on Data Mining","volume":"38 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-12-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130159659","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Learning by Propagability","authors":"Bingbing Ni, Shuicheng Yan, A. Kassim, L. Cheong","doi":"10.1109/ICDM.2008.53","DOIUrl":"https://doi.org/10.1109/ICDM.2008.53","url":null,"abstract":"In this paper, we present a novel feature extraction framework, called learning by propagability. The whole learning process is driven by the philosophy that the data labels and optimal feature representation can constitute a harmonic system, namely, the data labels are invariant with respect to the propagation on the similarity-graph constructed by the optimal feature representation. Based on this philosophy, a unified formulation for learning by propagability is proposed for both supervised and semi-supervised configurations. Specifically, this formulation offers the semi-supervised learning two characteristics: 1) unlike conventional semi-supervised learning algorithms which mostly include at least two parameters, this formulation is parameter-free; and 2) the formulation unifies the label propagation and optimal representation pursuing, and thus the label propagation is enhanced by benefiting from the graph constructed with the derived optimal representation instead of the original representation. Extensive experiments on UCI toy data, handwritten digit recognition, and face recognition all validate the effectiveness of our proposed learning framework compared with the state-of-the-art methods for feature extraction and semi-supervised learning.","PeriodicalId":252958,"journal":{"name":"2008 Eighth IEEE International Conference on Data Mining","volume":"50 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-12-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117121260","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Cost-Sensitive Parsimonious Linear Regression","authors":"R. Goetschalckx, K. Driessens, S. Sanner","doi":"10.1109/ICDM.2008.76","DOIUrl":"https://doi.org/10.1109/ICDM.2008.76","url":null,"abstract":"We examine linear regression problems where some features may only be observable at a cost (e.g., in medical domains where features may correspond to diagnostic tests that take time and cost money). This can be important in the context of data mining, in order to obtain the best predictions from the data on a limited cost budget. We define a parsimonious linear regression objective criterion that jointly minimizes prediction error and feature cost. We modify least angle regression algorithms commonly used for sparse linear regression to produce the ParLiR algorithm, which not only provides an efficient and parsimonious solution as we demonstrate empirically, but it also provides formal guarantees that we prove theoretically.","PeriodicalId":252958,"journal":{"name":"2008 Eighth IEEE International Conference on Data Mining","volume":"40 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-12-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122761664","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Similarity Learning for Nearest Neighbor Classification","authors":"A. M. Qamar, Éric Gaussier, J. Chevallet, Joo-Hwee Lim","doi":"10.1109/ICDM.2008.81","DOIUrl":"https://doi.org/10.1109/ICDM.2008.81","url":null,"abstract":"In this paper, we propose an algorithm for learning a general class of similarity measures for kNN classification. This class encompasses, among others, the standard cosine measure, as well as the Dice and Jaccard coefficients. The algorithm we propose is an extension of the voted perceptron algorithm and allows one to learn different types of similarity functions (either based on diagonal, symmetric or asymmetric similarity matrices). The results we obtained show that learning similarity measures yields significant improvements on several collections, for two prediction rules: the standard kNN rule, which was our primary goal, and a symmetric version of it.","PeriodicalId":252958,"journal":{"name":"2008 Eighth IEEE International Conference on Data Mining","volume":"24 1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-12-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116001428","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}