Chang-Shing Perng, Haixun Wang, Sheng Ma, J. Hellerstein
{"title":"User-directed exploration of mining space with multiple attributes","authors":"Chang-Shing Perng, Haixun Wang, Sheng Ma, J. Hellerstein","doi":"10.1109/ICDM.2002.1183931","DOIUrl":"https://doi.org/10.1109/ICDM.2002.1183931","url":null,"abstract":"There has been a growing interest in mining frequent itemsets in relational data with multiple attributes. A key step in this approach is to select a set of attributes that group data into transactions and a separate set of attributes that labels data into items. Unsupervised and unrestricted mining, however, is stymied by the combinatorial complexity and the quantity of patterns as the number of attributes grows. In this paper we focus on leveraging the semantics of the underlying data for mining frequent itemsets. For instance, there are usually taxonomies in the data schema and functional dependencies among the attributes. Domain knowledge and user preferences often have the potential to significantly reduce the exponentially growing mining space. These observations motivate the design of a user-directed data mining framework that allows such domain knowledge to guide the mining process and control the mining strategy. We show examples of tremendous reduction in computation by using domain knowledge in mining relational data with multiple attributes.","PeriodicalId":405340,"journal":{"name":"2002 IEEE International Conference on Data Mining, 2002. Proceedings.","volume":"92 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2002-12-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115292311","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A formal model for user preference","authors":"S. Jung, Jeong-Hee Hong, Taek-Soo Kim","doi":"10.1109/ICDM.2002.1183908","DOIUrl":"https://doi.org/10.1109/ICDM.2002.1183908","url":null,"abstract":"Personalization and recommendation systems require a formalized model for user preference. We present a formal model of preference including positive preference and negative preference. For rare events, we apply the probability of random occurrence in order to reduce noise effects caused by data sparseness. A Pareto distribution is adopted for the random occurrence probability. We also present a method for combining information of joint feature variables of different sizes by dynamic weighting using random occurrence probability.","PeriodicalId":405340,"journal":{"name":"2002 IEEE International Conference on Data Mining, 2002. Proceedings.","volume":"87 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2002-12-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125852801","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An algebraic approach to data mining: some examples","authors":"R. Grossman, R. Larson","doi":"10.1109/ICDM.2002.1184011","DOIUrl":"https://doi.org/10.1109/ICDM.2002.1184011","url":null,"abstract":"We introduce an algebraic approach to the foundations of data mining. Our approach is based upon two algebras of functions defined over a common state space X and a pairing between them. One algebra is an algebra of state space observations, and the other is an algebra of labeled sets of states. We interpret H as the algebraic encoding of the data and the pairing as the misclassification rate when the classifier f is applied to the set of states X. We give a realization theorem giving conditions on formal series of data sets built from D that imply there is a realization involving a state space X, a classifier f ∈ R and a set of labeled states χ ∈ R_0 that yield this series.","PeriodicalId":405340,"journal":{"name":"2002 IEEE International Conference on Data Mining, 2002. Proceedings.","volume":"164 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2002-12-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125960879","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Efficient discovery of common substructures in macromolecules","authors":"S. Parthasarathy, M. Coatney","doi":"10.1109/ICDM.2002.1183924","DOIUrl":"https://doi.org/10.1109/ICDM.2002.1183924","url":null,"abstract":"Biological macromolecules play a fundamental role in disease; therefore, they are of great interest to fields such as pharmacology and chemical genomics. Yet due to macromolecules' complexity, development of effective techniques for elucidating structure-function macromolecular relationships has been ill explored. Previous techniques have either focused on sequence analysis, which only approximates structure-function relationships, or on small coordinate datasets, which does not scale to large datasets or handle noise. We present a novel scalable approach to efficiently discover macromolecule substructures based on three-dimensional coordinate data, without domain-specific knowledge. The approach combines structure-based frequent pattern discovery with search space reduction and coordinate noise handling. We analyze computational performance compared to traditional approaches, validate that our approach can discover meaningful substructures in noisy macromolecule data by automated discovery of primary and secondary protein structures, and show that our technique is superior to sequence-based approaches at determining structural, and thus functional, similarity between proteins.","PeriodicalId":405340,"journal":{"name":"2002 IEEE International Conference on Data Mining, 2002. Proceedings.","volume":"18 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2002-12-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125228030","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Concept tree based clustering visualization with shaded similarity matrices","authors":"Jun Wang, Bei Yu, L. Gasser","doi":"10.1109/ICDM.2002.1184032","DOIUrl":"https://doi.org/10.1109/ICDM.2002.1184032","url":null,"abstract":"One problem with existing clustering methods is that the interpretation of clusters may be difficult. Two different approaches have been used to solve this problem: conceptual clustering in machine learning and clustering visualization in statistics and graphics. The purpose of this paper is to investigate the benefits of combining clustering visualization and conceptual clustering to obtain better cluster interpretations. In our research we have combined concept trees for conceptual clustering with shaded similarity matrices for visualization. Experimentation shows that the two interpretation approaches can complement each other to help us understand data better.","PeriodicalId":405340,"journal":{"name":"2002 IEEE International Conference on Data Mining, 2002. Proceedings.","volume":"2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2002-12-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115067219","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"SmartMiner: a depth first algorithm guided by tail information for mining maximal frequent itemsets","authors":"Q. Zou, W. Chu, Baojing Lu","doi":"10.1109/ICDM.2002.1184003","DOIUrl":"https://doi.org/10.1109/ICDM.2002.1184003","url":null,"abstract":"Maximal frequent itemsets (MFI) are crucial to many tasks in data mining. Since the MaxMiner algorithm first introduced enumeration trees for mining MFI in 1998, several methods have been proposed to use depth first search to improve performance. To further improve the performance of mining MFI, we propose a technique that takes advantage of the information gathered from previous steps to discover new MFI. More specifically, our algorithm, called SmartMiner, gathers and passes tail information and uses a heuristic select function which uses the tail information to select the next node to explore. Compared with Mafia and GenMax, SmartMiner generates a smaller search tree, requires a smaller number of support counts, and does not require superset checking. Using the datasets Mushroom and Connect, our experimental study reveals that SmartMiner generates the same MFI as Mafia and GenMax, but yields an order of magnitude improvement in speed.","PeriodicalId":405340,"journal":{"name":"2002 IEEE International Conference on Data Mining, 2002. Proceedings.","volume":"13 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2002-12-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129999426","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"webSPADE: a parallel sequence mining algorithm to analyze web log data","authors":"A. Demiriz","doi":"10.1109/ICDM.2002.1184046","DOIUrl":"https://doi.org/10.1109/ICDM.2002.1184046","url":null,"abstract":"Enterprise-class web sites receive a large amount of traffic, from both registered and anonymous users. Data warehouses are built to store and help analyze the click streams within this traffic to provide companies with valuable insights into the behavior of their customers. This article proposes a parallel sequence mining algorithm, webSPADE, to analyze the click streams found in site web logs. In this process, raw web logs are first cleaned and inserted into a data warehouse. The click streams are then mined by webSPADE. An innovative web-based front-end is used to visualize and query the sequence mining results. The webSPADE algorithm is currently used by Verizon to analyze the daily traffic of the Verizon.com web site.","PeriodicalId":405340,"journal":{"name":"2002 IEEE International Conference on Data Mining, 2002. Proceedings.","volume":"64 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2002-12-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134507287","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"TreeFinder: a first step towards XML data mining","authors":"A. Termier, M. Rousset, M. Sebag","doi":"10.1109/ICDM.2002.1183987","DOIUrl":"https://doi.org/10.1109/ICDM.2002.1183987","url":null,"abstract":"In this paper we consider the problem of searching for frequent trees in a collection of tree-structured data modeling XML data. The TreeFinder algorithm aims at finding trees such that their exact or perturbed copies are frequent in a collection of labelled trees. To cope with complexity issues, TreeFinder is correct but not complete: it finds a subset of the actually frequent trees. The lack of completeness is experimentally investigated on artificial medium size datasets; it is shown that TreeFinder reaches completeness or falls short for a range of experimental settings.","PeriodicalId":405340,"journal":{"name":"2002 IEEE International Conference on Data Mining, 2002. Proceedings.","volume":"201 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2002-12-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134111062","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Adaptive and resource-aware mining of frequent sets","authors":"S. Orlando, P. Palmerini, R. Perego, F. Silvestri","doi":"10.1109/ICDM.2002.1183921","DOIUrl":"https://doi.org/10.1109/ICDM.2002.1183921","url":null,"abstract":"The performance of an algorithm that mines frequent sets from transactional databases may depend heavily on the specific features of the data being analyzed. Moreover, some architectural characteristics of the computational platform used - e.g. the available main memory - can dramatically change its runtime behavior. In this paper we present DCI (Direct Count & Intersect), an efficient algorithm for discovering frequent sets from large databases. Due to the multiple heuristic strategies adopted, DCI can adapt its behavior not only to the features of the specific computing platform, but also to the features of the dataset being mined, so that it is very effective in mining both short and long patterns from sparse and dense datasets. Finally we also discuss the parallelization strategies adopted in the design of ParDCI, a distributed and multi-threaded implementation of DCI.","PeriodicalId":405340,"journal":{"name":"2002 IEEE International Conference on Data Mining, 2002. Proceedings.","volume":"533 ","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2002-12-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132949989","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Telecommunications strategic marketing - KDD and economic modeling","authors":"Stefano Cazzella, L. Dragone, Stefano Trisolini","doi":"10.1109/ICDM.2002.1184045","DOIUrl":"https://doi.org/10.1109/ICDM.2002.1184045","url":null,"abstract":"The deregulation of the Italian telecommunications market in recent years has had a large economic impact, since it has altered equilibria that had long been established. In this framework, we notice a strong need for adequate tools to analyze the market and its trends and, at the same time, a lack of specific solutions within the scientific literature, due to the new technical challenges posed by the problem. In particular, in the context of building a Decision Support System (DSS) for the strategic marketing unit of TELECOM Italia (TI) we have devised a new methodology to profitably combine the most powerful tools from KDD and Economic Sciences. We have tested our approach by analyzing the residential telecommunications market demand in Italy during the transition from a monopolistic structure to an oligopolistic one. In this paper, we first address the state of the art in DSS design, then we describe the proposed methodology and its application in the case study.","PeriodicalId":405340,"journal":{"name":"2002 IEEE International Conference on Data Mining, 2002. Proceedings.","volume":"116 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2002-12-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132551926","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}