{"title":"Unsupervised segmentation of categorical time series into episodes","authors":"P. Cohen, Brent Heeringa, N. Adams","doi":"10.1109/ICDM.2002.1183891","DOIUrl":"https://doi.org/10.1109/ICDM.2002.1183891","url":null,"abstract":"This paper describes an unsupervised algorithm for segmenting categorical time series into episodes. The VOTING-EXPERTS algorithm first collects statistics about the frequency and boundary entropy of ngrams, then passes a window over the series and has two \"expert methods\" decide where in the window boundaries should be drawn. The algorithm successfully segments text into words in four languages. The algorithm also segments time series of robot sensor data into subsequences that represent episodes in the life of the robot. We claim that VOTING-EXPERTS finds meaningful episodes in categorical time series because it exploits two statistical characteristics of meaningful episodes.","PeriodicalId":405340,"journal":{"name":"2002 IEEE International Conference on Data Mining, 2002. Proceedings.","volume":"218 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2002-12-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114436951","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Phrase-based document similarity based on an index graph model","authors":"Khaled M. Hammouda, M. Kamel","doi":"10.1109/ICDM.2002.1183904","DOIUrl":"https://doi.org/10.1109/ICDM.2002.1183904","url":null,"abstract":"Document clustering techniques mostly rely on single term analysis of the document data set, such as the vector space model. To better capture the structure of documents, the underlying data model should be able to represent the phrases in the document as well as single terms. We present a novel data model, the document index graph, which indexes web documents based on phrases, rather than single terms only. The semi-structured web documents help in identifying potential phrases that when matched with other documents indicate strong similarity between the documents. The document index graph captures this information, and finding significant matching phrases between documents becomes easy and efficient with such model. The similarity between documents is based on both single term weights and matching phrases weights. The combined similarities are used with standard document clustering techniques to test their effect on the clustering quality. Experimental results show that our phrase-based similarity, combined with single-term similarity measures, enhances web document clustering quality significantly.","PeriodicalId":405340,"journal":{"name":"2002 IEEE International Conference on Data Mining, 2002. Proceedings.","volume":"18 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2002-12-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134274087","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Linear Causal Model discovery using the MML criterion","authors":"Gang Li, H. Dai, Yiqing Tu","doi":"10.1109/ICDM.2002.1183913","DOIUrl":"https://doi.org/10.1109/ICDM.2002.1183913","url":null,"abstract":"Determining the causal structure of a domain is a key task in the area of data mining and knowledge discovery. The algorithm proposed by Wallace et al. (1996) has demonstrated its strong ability in discovering Linear Causal Models from given data sets. However some experiments showed that this algorithm experienced difficulty in discovering linear relations with small deviation, and it occasionally gives a negative message length, which should not be allowed. In this paper a more efficient and precise MML encoding scheme is proposed to describe the model structure and the nodes in a Linear Causal Model. The estimation of different parameters is also derived. Empirical results show that the new algorithm outperformed the previous MML-based algorithm in terms of both speed and precision.","PeriodicalId":405340,"journal":{"name":"2002 IEEE International Conference on Data Mining, 2002. Proceedings.","volume":"30 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2002-12-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133428637","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Mining associations by pattern structure in large relational tables","authors":"Haixun Wang, Chang-Shing Perng, Sheng Ma, Philip S. Yu","doi":"10.1109/ICDM.2002.1183992","DOIUrl":"https://doi.org/10.1109/ICDM.2002.1183992","url":null,"abstract":"Association rule mining aims at discovering patterns whose support is beyond a given threshold. Mining patterns composed of items described by an arbitrary subset of attributes in a large relational table represents a new challenge and has various practical applications, including the event management systems that motivated this work. The attribute combinations that define the items in a pattern provide the structural information of the pattern. Current association algorithms do not make full use of the structural information of the patterns: the information is either lost after it is encoded with attribute values, or is constrained by a given hierarchy or taxonomy. Pattern structures convey important knowledge about the patterns. We present an architecture that organizes the mining space based on pattern structures. By exploiting the interrelationships among pattern structures, execution times for mining can be reduced significantly. This advantage is demonstrated by our experiments using both synthetic and real-life datasets.","PeriodicalId":405340,"journal":{"name":"2002 IEEE International Conference on Data Mining, 2002. Proceedings.","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2002-12-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122216485","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Unsupervised clustering of symbol strings and context recognition","authors":"J. A. Flanagan, Jani Mäntyjärvi, J. Himberg","doi":"10.1109/ICDM.2002.1183900","DOIUrl":"https://doi.org/10.1109/ICDM.2002.1183900","url":null,"abstract":"The representation of information based on symbol strings has been applied to the recognition of context. A framework for approaching the context recognition problem has been described and interpreted in terms of symbol string recognition. The symbol string clustering map (SCM) is introduced as an efficient algorithm for the unsupervised clustering and recognition of symbol string data. The SCM can be implemented in an online manner using a computationally simple similarity measure based on a weighted average. It is shown how measured sensor data can be processed by the SCM algorithm to learn, represent and distinguish different user contexts without any user input.","PeriodicalId":405340,"journal":{"name":"2002 IEEE International Conference on Data Mining, 2002. Proceedings.","volume":"34 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2002-12-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125043579","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A lazy approach to pruning classification rules","authors":"Elena Baralis, P. Garza","doi":"10.1109/ICDM.2002.1183883","DOIUrl":"https://doi.org/10.1109/ICDM.2002.1183883","url":null,"abstract":"Associative classification is a promising technique for the generation of highly precise classifiers. Previous works propose several clever techniques to prune the huge set of generated rules, with the twofold aim of selecting a small set of high quality rules, and reducing the chance of overfitting. In this paper, we argue that pruning should be reduced to a minimum and that the availability of a large rule base may improve the precision of the classifier without affecting its performance. In L³ (Live and Let Live), a new algorithm for associative classification, a lazy pruning technique iteratively discards all rules that only yield wrong case classifications. Classification is performed in two steps. Initially, rules which have already correctly classified at least one training case, sorted by confidence, are considered. If the case is still unclassified, the remaining rules (unused during the training phase) are considered, again sorted by confidence. Extensive experiments on 26 databases from the UCI machine learning database repository show that L³ improves the classification precision with respect to previous approaches.","PeriodicalId":405340,"journal":{"name":"2002 IEEE International Conference on Data Mining, 2002. Proceedings.","volume":"82 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2002-12-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129144607","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Mining motifs in massive time series databases","authors":"P. Patel, Eamonn J. Keogh, Jessica Lin, S. Lonardi","doi":"10.1109/ICDM.2002.1183925","DOIUrl":"https://doi.org/10.1109/ICDM.2002.1183925","url":null,"abstract":"The problem of efficiently locating previously known patterns in a time series database (i.e., query by content) has received much attention and may now largely be regarded as a solved problem. However, from a knowledge discovery viewpoint, a more interesting problem is the enumeration of previously unknown, frequently occurring patterns. We call such patterns \"motifs\", because of their close analogy to their discrete counterparts in computation biology. An efficient motif discovery algorithm for time series would be useful as a tool for summarizing and visualizing massive time series databases. In addition it could be used as a subroutine in various other data mining tasks, including the discovery of association rules, clustering and classification. In this paper we carefully motivate, then introduce, a nontrivial definition of time series motifs. We propose an efficient algorithm to discover them, and we demonstrate the utility and efficiency of our approach on several real world datasets.","PeriodicalId":405340,"journal":{"name":"2002 IEEE International Conference on Data Mining, 2002. Proceedings.","volume":"116 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2002-12-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124299933","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Progressive and interactive analysis of event data using event miner","authors":"Sheng Ma, J. Hellerstein, Chang-Shing Perng, G. Grabarnik","doi":"10.1109/ICDM.2002.1184023","DOIUrl":"https://doi.org/10.1109/ICDM.2002.1184023","url":null,"abstract":"Exploring large data sets typically involves activities that iterate between data selection and data analysis, in which insights obtained from analysis result in new data selection. Further, data analysis needs to use a combination of analysis techniques: data summarization, mining algorithms and visualization. This interweaving of functions arises both from the semantics of what the analyst hopes to achieve and from scalability requirements for dealing with large data volumes. We refer to such a process as a progressive analysis. Herein is described a tool, Event Miner, that integrates data selection, mining and visualization for progressive analysis of temporal, categorical data. We discuss a data model and architecture. We illustrate how our tool can be used for complex mining tasks such as finding patterns not occurring on Monday. Further, we discuss the novel visualization employed, such as visualizing categorical data and the results of data mining. Also, we discuss the extension of the existing mining framework needed to mine temporal events with multiple attributes. Throughout, we illustrate the capabilities of Event Miner by applying it to event data from large computer networks.","PeriodicalId":405340,"journal":{"name":"2002 IEEE International Conference on Data Mining, 2002. Proceedings.","volume":"71 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2002-12-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130955296","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Mining optimal actions for profitable CRM","authors":"C. Ling, Tielin Chen, Qiang Yang, Jie Cheng","doi":"10.1109/ICDM.2002.1184049","DOIUrl":"https://doi.org/10.1109/ICDM.2002.1184049","url":null,"abstract":"Data mining has been applied to CRM (Customer Relationship Management) in many industries with a limited success. Most data mining tools can only discover customer models or profiles (such as customers who are likely attritors and customers who are loyal), but not actions that would improve customer relationship (such as changing attritors to loyal customers). We describe a novel algorithm that suggests actions to change customers from an undesired status (such as attritors) to a desired one (such as loyal). Our algorithm takes into account the cost of actions, and further it attempts to maximize the expected net profit. To our best knowledge, no data mining algorithms or tools today can accomplish this important task in CRM. The algorithm is implemented, with many advanced features, in a specialized and highly effective data mining software called Proactive Solution.","PeriodicalId":405340,"journal":{"name":"2002 IEEE International Conference on Data Mining, 2002. Proceedings.","volume":"5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2002-12-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128737568","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Adapting classification rule induction to subgroup discovery","authors":"N. Lavrač, Peter A. Flach, Branko Kavšek, L. Todorovski","doi":"10.1109/ICDM.2002.1183912","DOIUrl":"https://doi.org/10.1109/ICDM.2002.1183912","url":null,"abstract":"Rule learning is typically used for solving classification and prediction tasks. However learning of classification rules can be adapted also to subgroup discovery. This paper shows how this can be achieved by modifying the covering algorithm and the search heuristic, performing probabilistic classification of instances, and using an appropriate measure for evaluating the results of subgroup discovery. Experimental evaluation of the CN2-SD subgroup discovery algorithm on 17 UCI data sets demonstrates substantial reduction of the number of induced rules, increased rule coverage and rule significance, as well as slight improvements in terms of the area under the ROC curve.","PeriodicalId":405340,"journal":{"name":"2002 IEEE International Conference on Data Mining, 2002. Proceedings.","volume":"95 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2002-12-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121567770","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}