{"title":"Rule-Based Platform for Web User Profiling","authors":"Jianping Zhang, Manu Shukla","doi":"10.1109/ICDM.2006.137","DOIUrl":"https://doi.org/10.1109/ICDM.2006.137","url":null,"abstract":"This paper discusses a research project: rule-based Web user profiling platform. In this platform, usage data are encoded as a sequence of events, each of which represents an action performed by a user on a Web service at a given time. An event template is proposed to define event models for different Web services. The platform is rule-based. Rules define profile metrics and determine how to compute profile metrics from usage events. A prototype of the platform was implemented and was applied to generate profiles from page view events. The major contribution of the work is the rule-based approach to user profiling. It is the rules and the event template that provide the flexibility to allow the platform to be configured for different Web services.","PeriodicalId":356443,"journal":{"name":"Sixth International Conference on Data Mining (ICDM'06)","volume":"4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-12-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116119731","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Improving Grouped-Entity Resolution Using Quasi-Cliques","authors":"Byung-Won On, Ergin Elmacioglu, Dongwon Lee, Jaewoo Kang, J. Pei","doi":"10.1109/ICDM.2006.85","DOIUrl":"https://doi.org/10.1109/ICDM.2006.85","url":null,"abstract":"The entity resolution (ER) problem, which identifies duplicate entities that refer to the same real world entity, is essential in many applications. In this paper, in particular, we focus on resolving entities that contain a group of related elements in them (e.g., an author entity with a list of citations, a singer entity with song list, or an intermediate result by GROUP BY SQL query). Such entities, named as grouped-entities, frequently occur in many applications. The previous approaches toward grouped-entity resolution often rely on textual similarity, and produce a large number of false positives. As a complementing technique, in this paper, we present our experience of applying a recently proposed graph mining technique, Quasi-Clique, atop conventional ER solutions. Our approach exploits contextual information mined from the group of elements per entity in addition to syntactic similarity. Extensive experiments verify that our proposal improves precision and recall up to 83% when used together with a variety of existing ER solutions, but never worsens them.","PeriodicalId":356443,"journal":{"name":"Sixth International Conference on Data Mining (ICDM'06)","volume":"1924 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-12-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127456713","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Regularized Least Absolute Deviations Regression and an Efficient Algorithm for Parameter Tuning","authors":"Li Wang, Michael D. Gordon, Ji Zhu","doi":"10.1109/ICDM.2006.134","DOIUrl":"https://doi.org/10.1109/ICDM.2006.134","url":null,"abstract":"Linear regression is one of the most important and widely used techniques for data analysis. However, sometimes people are not satisfied with it because of the following two limitations: 1) its results are sensitive to outliers, so when the error terms are not normally distributed, especially when they have heavy-tailed distributions, linear regression often works badly; 2) its estimated coefficients tend to have high variance, although their bias is low. To reduce the influence of outliers, robust regression models were developed. Least absolute deviation (LAD) regression is one of them. LAD minimizes the mean absolute errors, instead of mean squared errors, so its results are more robust. To address the second limitation, shrinkage methods were proposed, which add a penalty on the size of the coefficients. The LASSO is one of these methods and it uses the L1-norm penalty, which not only reduces the prediction error and the variance of estimated coefficients, but also provides an automatic feature selection function. In this paper, we propose the regularized least absolute deviation (RLAD) regression model, which combines the nice features of the LAD and the LASSO together. The RLAD is a regularization method, whose objective function has the form of \"loss + penalty.\" The \"loss\" is the sum of the absolute deviations and the \"penalty\" is the L1-norm of the coefficient vector. Furthermore, to facilitate parameter tuning, we develop an efficient algorithm which can solve the entire regularization path in one pass. Simulations with various settings are performed to demonstrate its performance. Finally, we apply the algorithm to solve the image reconstruction problem and find interesting results.","PeriodicalId":356443,"journal":{"name":"Sixth International Conference on Data Mining (ICDM'06)","volume":"35 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-12-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127071951","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Data Mining Approaches to Criminal Career Analysis","authors":"J. D. Bruin, Tim K. Cocx, W. Kosters, J. Laros, J. Kok","doi":"10.1109/ICDM.2006.47","DOIUrl":"https://doi.org/10.1109/ICDM.2006.47","url":null,"abstract":"Narrative reports and criminal records are stored digitally across individual police departments, enabling the collection of this data to compile a nation-wide database of criminals and the crimes they committed. The compilation of this data through the last years presents new possibilities of analyzing criminal activity through time. Augmenting the traditional, more socially oriented, approach of behavioral study of these criminals and traditional statistics, data mining methods like clustering and prediction enable police forces to get a clearer picture of criminal careers. This allows officers to recognize crucial spots in changing criminal behaviour and deploy resources to prevent these careers from unfolding. Four important factors play a role in the analysis of criminal careers: crime nature, frequency, duration and severity. We describe a tool that extracts these from the database and creates digital profiles for all offenders. It compares all individuals on these profiles by a new distance measure and clusters them accordingly. This method yields a visual clustering of these criminal careers and enables the identification of classes of criminals. The proposed method allows for several user-defined parameters.","PeriodicalId":356443,"journal":{"name":"Sixth International Conference on Data Mining (ICDM'06)","volume":"19 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-12-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125147947","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Automatic Single-Organ Segmentation in Computed Tomography Images","authors":"Ruchaneewan Susomboon, D. Raicu, J. Furst, D. Channin","doi":"10.1109/ICDM.2006.24","DOIUrl":"https://doi.org/10.1109/ICDM.2006.24","url":null,"abstract":"In this paper, we propose a hybrid approach for automatic single-organ segmentation in computed tomography (CT) data. The approach consists of three stages: first, a probability image of the organ of interest is obtained by applying a binary classification model obtained using pixel-based texture features; second, an adaptive split-and-merge segmentation algorithm is applied on the organ probability image to remove the noise introduced by the misclassified pixels; and third, the segmented organ's boundaries from the previous stage are iteratively refined using a region growing algorithm. While we applied our approach for liver segmentation in 2-D CT images, a challenging and important task in many medical applications, the proposed approach can be applied for the segmentation of any other organ in CT images. Moreover, the proposed approach can be extended to perform automatic multiple organ segmentation and to build context-sensitive reporting tools for computer-aided diagnosis applications.","PeriodicalId":356443,"journal":{"name":"Sixth International Conference on Data Mining (ICDM'06)","volume":"84 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-12-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131369759","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"The Relationships Among Various Nonnegative Matrix Factorization Methods for Clustering","authors":"Tao Li, C. Ding","doi":"10.1109/ICDM.2006.160","DOIUrl":"https://doi.org/10.1109/ICDM.2006.160","url":null,"abstract":"The nonnegative matrix factorization (NMF) has been shown recently to be useful for clustering and various extensions and variations of NMF have been proposed recently. Despite significant research progress in this area, few attempts have been made to establish the connections between various factorization methods while highlighting their differences. In this paper we aim to provide a comprehensive study on matrix factorization for clustering. In particular, we present an overview and summary on various matrix factorization algorithms and theoretically analyze the relationships among them. Experiments are also conducted to empirically evaluate and compare various factorization methods. In addition, our study also answers several previously unaddressed yet important questions for matrix factorizations including the interpretation and normalization of cluster posterior and the benefits and evaluation of simultaneous clustering. We expect our study would provide good insights on matrix factorization research for clustering.","PeriodicalId":356443,"journal":{"name":"Sixth International Conference on Data Mining (ICDM'06)","volume":"13 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-12-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128206010","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Adaptive Parallel Graph Mining for CMP Architectures","authors":"G. Buehrer, S. Parthasarathy, Yen-kuang Chen","doi":"10.1109/ICDM.2006.15","DOIUrl":"https://doi.org/10.1109/ICDM.2006.15","url":null,"abstract":"Mining graph data is an increasingly popular challenge, which has practical applications in many areas, including molecular substructure discovery, Web link analysis, fraud detection, and social network analysis. The problem statement is to enumerate all subgraphs occurring in at least sigma graphs of a database, where sigma is a user specified parameter. Chip multiprocessors (CMPs) provide true parallel processing, and are expected to become the de facto standard for commodity computing. In this work, building on the state-of-the-art, we propose an efficient approach to parallelize such algorithms for CMPs. We show that an algorithm which adapts its behavior based on the runtime state of the system can improve system utilization and lower execution times. Most notably, we incorporate dynamic state management to allow memory consumption to vary based on availability. We evaluate our techniques on current day shared memory systems (SMPs) and expect similar performance for CMPs. We demonstrate excellent speedup, 27-fold on 32 processors for several real world datasets. Additionally, we show our dynamic techniques afford this scalability while consuming up to 35% less memory than static techniques.","PeriodicalId":356443,"journal":{"name":"Sixth International Conference on Data Mining (ICDM'06)","volume":"28 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-12-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130433070","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Detecting Link Spam Using Temporal Information","authors":"Guoyang Shen, Bin Gao, Tie-Yan Liu, Guang Feng, Shiji Song, Hang Li","doi":"10.1109/ICDM.2006.51","DOIUrl":"https://doi.org/10.1109/ICDM.2006.51","url":null,"abstract":"How to effectively protect against spam on search ranking results is an important issue for contemporary web search engines. This paper addresses the problem of combating one major type of web spam: 'link spam.' Most of the previous work on anti link spam managed to make use of one snapshot of web data to detect spam, and thus it did not take advantage of the fact that link spam tends to result in drastic changes of links in a short time period. To overcome the shortcoming, this paper proposes using temporal information on links in detection of link spam, as well as other information. Specifically, it defines temporal features such as in-link growth rate (IGR) and in-link death rate (IDR) in a spam classification model (i.e., SVM). Experimental results on web domain graph data show that link spam can be successfully detected with the proposed method.","PeriodicalId":356443,"journal":{"name":"Sixth International Conference on Data Mining (ICDM'06)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-12-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130717760","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Boosting for Learning Multiple Classes with Imbalanced Class Distribution","authors":"Yanmin Sun, M. Kamel, Yang Wang","doi":"10.1109/ICDM.2006.29","DOIUrl":"https://doi.org/10.1109/ICDM.2006.29","url":null,"abstract":"Classification of data with imbalanced class distribution has posed a significant drawback of the performance attainable by most standard classifier learning algorithms, which assume a relatively balanced class distribution and equal misclassification costs. This learning difficulty attracts a lot of research interest. Most efforts concentrate on bi-class problems. However, bi-class is not the only scenario where the class imbalance problem prevails. Reported solutions for bi-class applications are not applicable to multi-class problems. In this paper, we develop a cost-sensitive boosting algorithm to improve the classification performance of imbalanced data involving multiple classes. One barrier of applying the cost-sensitive boosting algorithm to the imbalanced data is that the cost matrix is often unavailable for a problem domain. To solve this problem, we apply a Genetic Algorithm to search for the optimum cost setup of each class. Empirical tests show that the proposed cost-sensitive boosting algorithm improves the classification performances of imbalanced data sets significantly.","PeriodicalId":356443,"journal":{"name":"Sixth International Conference on Data Mining (ICDM'06)","volume":"4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-12-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131144657","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Personalization in Context: Does Context Matter When Building Personalized Customer Models?","authors":"M. Gorgoglione, C. Palmisano, A. Tuzhilin","doi":"10.1109/ICDM.2006.125","DOIUrl":"https://doi.org/10.1109/ICDM.2006.125","url":null,"abstract":"The idea that context is important when predicting customer behavior has been maintained by scholars in marketing and data mining. However, no systematic study measuring how much the contextual information really matters in building customer models in personalization applications has been done before. In this paper, we address this problem. To this aim, we collected data containing rich contextual information by developing a special-purpose browser to help users to navigate a well-known e-commerce retail portal and purchase products on its site. The experimental results show that context does matter for the case of modeling behavior of individual customers. The granularity of contextual information also matters, and the effect of contextual information gets diluted during the process of aggregating customers' data.","PeriodicalId":356443,"journal":{"name":"Sixth International Conference on Data Mining (ICDM'06)","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-12-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134112334","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}