{"title":"Depth-Based Novelty Detection and Its Application to Taxonomic Research","authors":"Yixin Chen, H. Bart, Xin Dang, Hanxiang Peng","doi":"10.1109/ICDM.2007.10","DOIUrl":"https://doi.org/10.1109/ICDM.2007.10","url":null,"abstract":"It is estimated that less than 10 percent of the world's species have been described, yet species are being lost daily due to human destruction of natural habitats. The job of describing the earth's remaining species is exacerbated by the shrinking number of practicing taxonomists and the very slow pace of traditional taxonomic research. In this article, we tackle, from a novelty detection perspective, one of the most important and challenging research objectives in taxonomy - new species identification. We propose a unique and efficient novelty detection framework based on statistical depth functions. Statistical depth functions provide from the \"deepest\" point a \"center-outward ordering\" of multidimensional data. In this sense, they can detect observations that appear extreme relative to the rest of the observations, i.e., novelty. Of the various statistical depths, the spatial depth is especially appealing because of its computational efficiency and mathematical tractability. We propose a novel statistical depth, the kernelized spatial depth (KSD) that generalizes the spatial depth via positive definite kernels. By choosing a proper kernel, the KSD can capture the local structure of a data set while the spatial depth fails. Observations with depth values less than a threshold are declared as novel. The proposed algorithm is simple in structure: the threshold is the only one parameter for a given kernel. We give an upper bound on the false alarm probability of a depth-based detector, which can be used to determine the threshold. Experimental study demonstrates its excellent potential in new species discovery.","PeriodicalId":233758,"journal":{"name":"Seventh IEEE International Conference on Data Mining (ICDM 2007)","volume":"140 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2007-10-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134360682","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Co-ranking Authors and Documents in a Heterogeneous Network","authors":"Ding Zhou, Sergey A. Orshanskiy, H. Zha, C. Lee Giles","doi":"10.1109/ICDM.2007.57","DOIUrl":"https://doi.org/10.1109/ICDM.2007.57","url":null,"abstract":"Recent graph-theoretic approaches have demonstrated remarkable successes for ranking networked entities, but most of their applications are limited to homogeneous networks such as the network of citations between publications. This paper proposes a novel method for co-ranking authors and their publications using several networks: the social network connecting the authors, the citation network connecting the publications, as well as the authorship network that ties the previous two together. The new co-ranking framework is based on coupling two random walks, that separately rank authors and documents following the PageRankparadigm. As a result, improved rankings of documents and their authors depend on each other in a mutually reinforcing way, thus taking advantage of the additional information implicit in the heterogeneous network of authors and documents.","PeriodicalId":233758,"journal":{"name":"Seventh IEEE International Conference on Data Mining (ICDM 2007)","volume":"10 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2007-10-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133148616","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Statistical Learning Algorithm for Tree Similarity","authors":"A. Takasu, Daiji Fukagawa, T. Akutsu","doi":"10.1109/ICDM.2007.38","DOIUrl":"https://doi.org/10.1109/ICDM.2007.38","url":null,"abstract":"Tree edit distance is one of the most frequently used distance measures for comparing trees. When using the tree edit distance, we need to determine the cost of each operation, but this is a labor-intensive and highly skilled task. This paper proposes an algorithm for learning the costs of tree edit operations from training data consisting of pairs of similar trees. To formalize the cost learning problem, we define a probabilistic model for tree alignment that is a variant of tree edit distance. Then, the parameters of the model are estimated using the expectation maximization (EM) technique. In this paper, we develop an algorithm for parameter learning that is polynomial in time (O{mn2d6)) and space (O{n2d4)) where n, d, and m represent the size of the trees, the maximum degree of trees, and the number of training pairs of trees, respectively.","PeriodicalId":233758,"journal":{"name":"Seventh IEEE International Conference on Data Mining (ICDM 2007)","volume":"255 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2007-10-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133270547","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Semi-supervised Document Clustering via Active Learning with Pairwise Constraints","authors":"Ruizhang Huang, Wai Lam","doi":"10.1109/ICDM.2007.79","DOIUrl":"https://doi.org/10.1109/ICDM.2007.79","url":null,"abstract":"This paper investigates a framework that discovers pair-wise constraints for semi-supervised text document clustering. An active learning approach is proposed to select informative document pairs for obtaining user feedbacks. A gain directed document pair selection method that measures how much we can learn by revealing the relationships between pairs of documents is designed. Three different models, namely, uncertainty model, generation error model, and objective function model are proposed. Language modeling is investigated for representing clusters in the semi-supervised document clustering approach.","PeriodicalId":233758,"journal":{"name":"Seventh IEEE International Conference on Data Mining (ICDM 2007)","volume":"39 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2007-10-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133349862","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Efficient Discovery of Frequent Approximate Sequential Patterns","authors":"Feida Zhu, Xifeng Yan, Jiawei Han, Philip S. Yu","doi":"10.1109/ICDM.2007.75","DOIUrl":"https://doi.org/10.1109/ICDM.2007.75","url":null,"abstract":"We propose an efficient algorithm for mining frequent approximate sequential patterns under the Hamming distance model. Our algorithm gains its efficiency by adopting a \"break-down-and-build-up\" methodology. The \"breakdown\" is based on the observation that all occurrences of a frequent pattern can be classified into groups, which we call strands. We developed efficient algorithms to quickly mine out all strands by iterative growth. In the \"build-up\" stage, these strands are grouped up to form the support sets from which all approximate patterns would be identified. A salient feature of our algorithm is its ability to grow the frequent patterns by iteratively assembling building blocks of significant sizes in a local search fashion. By avoiding incremental growth and global search, we achieve greater efficiency without losing the completeness of the mining result. Our experimental studies demonstrate that our algorithm is efficient in mining globally repeating approximate sequential patterns that would have been missed by existing methods.","PeriodicalId":233758,"journal":{"name":"Seventh IEEE International Conference on Data Mining (ICDM 2007)","volume":"13 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2007-10-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114069186","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Mining Statistical Information of Frequent Fault-Tolerant Patterns in Transactional Databases","authors":"Ardian Kristanto Poernomo, V. Gopalkrishnan","doi":"10.1109/ICDM.2007.48","DOIUrl":"https://doi.org/10.1109/ICDM.2007.48","url":null,"abstract":"Constraints applied on classic frequent patterns are too strict and may cause interesting patterns to be missed. Hence, researchers have proposed to mine a more relaxed version of frequent patterns, where transactions are allowed to miss some items in the itemset they support. Patterns exhibiting such \"faults\" are called frequent fault-tolerant patterns (FFT-patterns) if they are significant in number. In this paper, the term \"pattern\" is distinguished from \"item- set\" as referring to a pair (tidset times itemset). Unlike classical frequent patterns, the number of FFT- patterns grows exponentially not only with the number of items, but also with the number of transactions. Since the latter may reach millions, mining FFT-patterns by enumerating them becomes infeasible. Hence, the challenge is to represent FFT-patterns concisely without losing any useful information. To address this, we draw on the observation that, in transactional databases, the transactions themselves are not important from the data mining point-of- view; i.e. researchers are interested in finding itemsets contained in lots of transactions, rather than in the transactions per se. Therefore, we propose to mine only the frequent itemsets along with the statistical information of the supporting transaction sets, rather than enumerate entire FFT- patterns. Then we present our approach - the BIAS framework, consisting of Backtracking algorithm, Integer Linear Programming (ILP) constraints, and aggregation statistics to solve this problem. Algorithms under this framework not only increase the efficiency of the FFT-patterns mining process by more than an order of magnitude, but also provide a more comprehensive analysis of FFT-Patterns.","PeriodicalId":233758,"journal":{"name":"Seventh IEEE International Conference on Data Mining (ICDM 2007)","volume":"71 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2007-10-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128659311","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"On Meta-Learning Rule Learning Heuristics","authors":"Frederik Janssen, Johannes Fürnkranz","doi":"10.1109/ICDM.2007.51","DOIUrl":"https://doi.org/10.1109/ICDM.2007.51","url":null,"abstract":"The goal of this paper is to investigate to what extent a rule learning heuristic can be learned from experience. To that end, we let a rule learner learn a large number of rules and record their performance on the test set. Subsequently, we train regression algorithms on predicting the test set performance of a rule from its training set characteristics. We investigate several variations of this basic scenario, including the question whether it is better to predict the performance of the candidate rule itself or of the resulting final rule. Our experiments on a number of independent evaluation sets show that the learned heuristics outperform standard rule learning heuristics. We also analyze their behavior in coverage space.","PeriodicalId":233758,"journal":{"name":"Seventh IEEE International Conference on Data Mining (ICDM 2007)","volume":"60 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2007-10-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128698542","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Bayesian Folding-In with Dirichlet Kernels for PLSI","authors":"Alexander Hinneburg, H. Gabriel, André Gohr","doi":"10.1109/ICDM.2007.15","DOIUrl":"https://doi.org/10.1109/ICDM.2007.15","url":null,"abstract":"Probabilistic latent semantic indexing (PLSI) represents documents of a collection as mixture proportions of latent topics, which are learned from the collection by an expectation maximization (EM) algorithm. New documents or queries need to be folded into the latent topic space by a simplified version of the EM-algorithm. During PLSI- Folding-in of a new document, the topic mixtures of the known documents are ignored. This may lead to a suboptimal model of the extended collection. Our new approach incorporates the topic mixtures of the known documents in a Bayesian way during folding- in. That knowledge is modeled as prior distribution over the topic simplex using a kernel density estimate of Dirichlet kernels. We demonstrate the advantages of the new Bayesian folding-in using real text data.","PeriodicalId":233758,"journal":{"name":"Seventh IEEE International Conference on Data Mining (ICDM 2007)","volume":"93 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2007-10-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116645524","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Language-Independent Set Expansion of Named Entities Using the Web","authors":"Richard C. Wang, William W. Cohen","doi":"10.1109/icdm.2007.104","DOIUrl":"https://doi.org/10.1109/icdm.2007.104","url":null,"abstract":"Set expansion refers to expanding a given partial set of objects into a more complete set. A well-known example system that does set expansion using the web is Google Sets. In this paper, we propose a novel method for expanding sets of named entities. The approach can be applied to semi-structured documents written in any markup language and in any human language. We present experimental results on 36 benchmark sets in three languages, showing that our system is superior to Google Sets in terms of mean average precision.","PeriodicalId":233758,"journal":{"name":"Seventh IEEE International Conference on Data Mining (ICDM 2007)","volume":"53 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2007-10-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117044986","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Detecting Subdimensional Motifs: An Efficient Algorithm for Generalized Multivariate Pattern Discovery","authors":"David C. Minnen, C. Isbell, Irfan Essa, Thad Starner","doi":"10.1109/ICDM.2007.52","DOIUrl":"https://doi.org/10.1109/ICDM.2007.52","url":null,"abstract":"Discovering recurring patterns in time series data is a fundamental problem for temporal data mining. This paper addresses the problem of locating subdimensional motifs in real-valued, multivariate time series, which requires the simultaneous discovery of sets of recurring patterns along with the corresponding relevant dimensions. While many approaches to motif discovery have been developed, most are restricted to categorical data, univariate time series, or multivariate data in which the temporal patterns span all of the dimensions. In this paper, we present an expected linear-time algorithm that addresses a generalization of multivariate pattern discovery in which each motif may span only a subset of the dimensions. To validate our algorithm, we discuss its theoretical properties and empirically evaluate it using several data sets including synthetic data and motion capture data collected by an on-body iner- tial sensor.","PeriodicalId":233758,"journal":{"name":"Seventh IEEE International Conference on Data Mining (ICDM 2007)","volume":"10 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2007-10-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115164791","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}