{"title":"Temporal Analysis of Semantic Graphs Using ASALSAN","authors":"Brett W. Bader, R. Harshman, T. Kolda","doi":"10.1109/ICDM.2007.54","DOIUrl":"https://doi.org/10.1109/ICDM.2007.54","url":null,"abstract":"ASALSAN is a new algorithm for computing three-way DEDICOM, which is a linear algebra model for analyzing intrinsically asymmetric relationships, such as trade among nations or the exchange of emails among individuals, that incorporates a third mode of the data, such as time. ASALSAN is unique because it enables computing the three-way DEDICOM model on large, sparse data. A nonnegative version of ASALSAN is described as well. When we apply these techniques to adjacency arrays arising from directed graphs with edges labeled by time, we obtain a smaller graph on latent semantic dimensions and gain additional information about their changing relationships over time. We demonstrate these techniques on international trade data and the Enron email corpus to uncover latent components and their transient behavior. The mixture of roles assigned to individuals by ASALSAN showed strong correspondence with known job classifications and revealed the patterns of communication between these roles. Changes in the communication pattern over time, e.g., between top executives and the legal department, were also apparent in the solutions.","PeriodicalId":233758,"journal":{"name":"Seventh IEEE International Conference on Data Mining (ICDM 2007)","volume":"23 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2007-10-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114908219","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Binary Matrix Factorization with Applications","authors":"Zhongyuan Zhang, Tao Li, C. Ding, Xiang-Sun Zhang","doi":"10.1109/ICDM.2007.99","DOIUrl":"https://doi.org/10.1109/ICDM.2007.99","url":null,"abstract":"An interesting problem in nonnegative matrix factorization (NMF) is to factorize the matrix X which is of some specific class, for example, binary matrix. In this paper, we extend the standard NMF to binary matrix factorization (BMF for short): given a binary matrix X, we want to factorize X into two binary matrices W, H (thus conserving the most important integer property of the objective matrix X) satisfying X ap WH. Two algorithms are studied and compared. These methods rely on a fundamental boundedness property of NMF which we propose and prove. This new property also provides a natural normalization scheme that eliminates the bias of factor matrices. Experiments on both synthetic and real world datasets are conducted to show the competency and effectiveness of BMF.","PeriodicalId":233758,"journal":{"name":"Seventh IEEE International Conference on Data Mining (ICDM 2007)","volume":"39 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2007-10-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126965828","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Incremental Subspace Clustering over Multiple Data Streams","authors":"Qi Zhang, Jinze Liu, Wei Wang","doi":"10.1109/ICDM.2007.100","DOIUrl":"https://doi.org/10.1109/ICDM.2007.100","url":null,"abstract":"Data streams are often locally correlated, with a subset of streams exhibiting coherent patterns over a subset of time points. Subspace clustering can discover clusters of objects in different subspaces. However, traditional sub- space clustering algorithms for static data sets are not readily used for incremental clustering, and is very expensive for frequent re-clustering over dynamically changing stream data. In this paper, we present an efficient incremental sub- space clustering algorithm for multiple streams over sliding windows. Our algorithm detects all the delta-CC-Clusters, which capture the coherent changing patterns among a set of streams over a set of time points. delta-CC'-Cluster s are incrementally generated by traversing a directed acyclic graph pDAG. We propose efficient insertion and deletion operations to update the pDAG dynamically. In addition, effective pruning techniques are applied to reduce the search space. Experiments on real data sets demonstrate the performance of our algorithm.","PeriodicalId":233758,"journal":{"name":"Seventh IEEE International Conference on Data Mining (ICDM 2007)","volume":"26 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2007-10-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134184167","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Optimizing Frequency Queries for Data Mining Applications","authors":"Hassan H. Malik, J. Kender","doi":"10.1109/ICDM.2007.34","DOIUrl":"https://doi.org/10.1109/ICDM.2007.34","url":null,"abstract":"Data mining algorithms use various Trie and bitmap-based representations to optimize the support (i.e., frequency) counting performance. In this paper, we compare the memory requirements and support counting performance of FP Tree, and Compressed Patricia Trie against several novel variants of vertical bit vectors. First, borrowing ideas from the VLDB domain, we compress vertical bit vectors using WAH encoding. Second, we evaluate the Gray code rank- based transaction reordering scheme, and show that in practice, simple lexicographic ordering, obtained by applying LSB Radix sort, outperforms this scheme. Led by these results, we propose HDO, a novel Hamming-distance-based greedy transaction reordering scheme, and aHDO, a linear-time approximation to HDO. We present results of experiments performed on 15 common datasets with varying degrees of sparseness, and show that HDO- reordered, WAH encoded bit vectors can take as little as 5% of the uncompressed space, while aHDO achieves similar compression on sparse datasets. Finally, with results from over a billion database and data mining style frequency query executions, we show that bitmap-based approaches result in up to hundreds of times faster support counting, and HDO-WAH encoded bitmaps offer the best space-time tradeoff.","PeriodicalId":233758,"journal":{"name":"Seventh IEEE International Conference on Data Mining (ICDM 2007)","volume":"14 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2007-10-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129608569","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Semantic Kernel for Semi-structured DocumentS","authors":"S. Aseervatham, E. Viennet, Younès Bennani","doi":"10.1109/ICDM.2007.23","DOIUrl":"https://doi.org/10.1109/ICDM.2007.23","url":null,"abstract":"Natural Language Processing has emerged as an active field of research in the machine learning community. Several methods based on statistical information have been proposed. However, with the linguistic complexity of the texts, semantic-based approaches have been investigated. In this paper, we propose a Semantic Kernel for semi- structured biomedical documents. The semantic meanings of words are extracted using the UMLS framework. The kernel, with a SVM classifier, has been applied to a text categorization task on a medical corpus of free text documents. The results have shown that the Semantic Kernel outperforms the Linear Kernel and the Naive Bayes classifier. Moreover, this kernel was ranked in the top ten of the best algorithms among 44 classification methods at the 2007 CMC Medical NLP International Challenge.","PeriodicalId":233758,"journal":{"name":"Seventh IEEE International Conference on Data Mining (ICDM 2007)","volume":"19 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2007-10-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133416136","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Recommendation via Query Centered Random Walk on K-Partite Graph","authors":"H. Cheng, P. Tan, J. Sticklen, W. Punch","doi":"10.1109/ICDM.2007.8","DOIUrl":"https://doi.org/10.1109/ICDM.2007.8","url":null,"abstract":"This paper presents an algorithm for recommending items using a diverse set of features. The items are recommended by performing a random walk on the k-partite graph constructed from the heterogenous features. To support personalized recommendation, the random walk must be initiated separately for each user, which is computationally demanding given the massive size of the graph. To overcome this problem, we apply multi-way clustering to group together the highly correlated nodes. A recommendation is then made by traversing the subgraph induced by clusters associated with a user's interest. Our experimental results on real data sets demonstrate the efficacy of the proposed algorithm.","PeriodicalId":233758,"journal":{"name":"Seventh IEEE International Conference on Data Mining (ICDM 2007)","volume":"89 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2007-10-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115150189","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Cross-Mining Binary and Numerical Attributes","authors":"G. C. Garriga, H. Heikinheimo, J. K. Seppänen","doi":"10.1109/ICDM.2007.32","DOIUrl":"https://doi.org/10.1109/ICDM.2007.32","url":null,"abstract":"We consider the problem of relating itemsets mined on binary attributes of a data set to numerical attributes of the same data. An example is biogeographical data, where the numerical attributes correspond to environmental variables and the binary attributes encode the presence or absence of species in different environments. From the viewpoint of itemset mining, the task is to select a small collection of interesting itemsets using the numerical attributes; from the viewpoint of the numerical attributes, the task is to constrain the search for local patterns (e.g. clusters) using the binary attributes. We give a formal definition of the problem, discuss it theoretically, give a simple constant-factor approximation algorithm, and show by experiments on biogeographical data that the algorithm can capture interesting patterns that would not have been found using either itemset mining or clustering alone.","PeriodicalId":233758,"journal":{"name":"Seventh IEEE International Conference on Data Mining (ICDM 2007)","volume":"26 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2007-10-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116050178","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Generalization of Proximity Functions for K-Means","authors":"Junjie Wu, Hui Xiong, Jing Chen, Wenjun Zhou","doi":"10.1109/ICDM.2007.59","DOIUrl":"https://doi.org/10.1109/ICDM.2007.59","url":null,"abstract":"K-means is a widely used partitional clustering method. A large amount of effort has been made on finding better proximity (distance) functions for k-means. However, the common characteristics of proximity functions remain unknown. To this end, in this paper, we show that all proximity functions that fit k-means clustering can be generalized as k-means distance, which can be derived by a differentiable convex function. A general proof of sufficient and necessary conditions for k-means distance functions is also provided. In addition, we reveal that k-means has a general uniformization effect; that is, k-means tends to produce clusters with relatively balanced cluster sizes. This uniformization effect of k-means exists regardless of proximity functions. Finally, we have conducted extensive experiments on various real-world data sets, and the results show the evidence of the uniformization effect. Also, we observed that external clustering validation measures, such as entropy and variance of information (VI), have difficulty in measuring clustering quality if data have skewed distributions on class sizes.","PeriodicalId":233758,"journal":{"name":"Seventh IEEE International Conference on Data Mining (ICDM 2007)","volume":"210 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2007-10-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116187002","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Web Site Recommendation Using HTTP Traffic","authors":"Ming Jia, Shaozhi Ye, Xing Li, J. Dickerson","doi":"10.1109/ICDM.2007.44","DOIUrl":"https://doi.org/10.1109/ICDM.2007.44","url":null,"abstract":"Collaborative Filtering (CF) is widely used in web recommender systems, while most existing CF applications focus on transactions or page views within a single site. In this paper, we build a recommender system prototype, which suggests web sites to users, by collecting browsing events at routers without neither user nor website effort. 100 million HTTP flows, involving 11, 327 websites, are converted to user-site ratings using access frequency as the implicit rating metric. With this rating dataset, we evaluate six CF algorithms including one proposed algorithm based on IP address locality. Our experiments show that the recommendation from K nearest neighbors (Runn) performs the best by 50% p@10 (precision of top 10) and 53% p@5 (precision of top 5). Although the precision is far from ideal, our preliminary results suggest the potential value of such a centralized web site recommender system.","PeriodicalId":233758,"journal":{"name":"Seventh IEEE International Conference on Data Mining (ICDM 2007)","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2007-10-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134564467","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Locally Constrained Support Vector Clustering","authors":"Dragomir Yankov, Eamonn J. Keogh, K. Kan","doi":"10.1109/ICDM.2007.58","DOIUrl":"https://doi.org/10.1109/ICDM.2007.58","url":null,"abstract":"Support vector clustering transforms the data into a high dimensional feature space, where a decision function is computed. In the original space, the function outlines the boundaries of higher density regions, naturally splitting the data into individual clusters. The method, however, though theoretically sound, has certain drawbacks which make it not so appealing to the practitioner. Namely, it is unstable in the presence of outliers and it is hard to control the number of clusters that it identifies. Parametrizing the algorithm incorrectly in noisy settings, can either disguise some objectively present clusters in the data, or can identify a large number of small and nonintuitive clusters. Here, we explore the properties of the data in small regions building a mixture of factor analyzers. The obtained information is used to regularize the complexity of the outlined cluster boundaries, by assigning suitable weighting to each example. The approach is demonstrated to be less susceptible to noise and to outline better interpretable clusters than support vector clustering alone.","PeriodicalId":233758,"journal":{"name":"Seventh IEEE International Conference on Data Mining (ICDM 2007)","volume":"22 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2007-10-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130189969","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}