{"title":"Community Learning by Graph Approximation","authors":"Bo Long, Xiaoyun Xu, Zhongfei Zhang, Philip S. Yu","doi":"10.1109/ICDM.2007.42","DOIUrl":"https://doi.org/10.1109/ICDM.2007.42","url":null,"abstract":"Learning communities from a graph is an important problem in many domains. Different types of communities can be generalized as link-pattern based communities. In this paper, we propose a general model based on graph approximation to learn link-pattern based community structures from a graph. The model generalizes the traditional graph partitioning approaches and is applicable to learning various community structures. Under this model, we derive a family of algorithms which are flexible to learn various community structures and easy to incorporate the prior knowledge of the community structures. Experimental evaluation and theoretical analysis show the effectiveness and great potential of the proposed model and algorithms.","PeriodicalId":233758,"journal":{"name":"Seventh IEEE International Conference on Data Mining (ICDM 2007)","volume":"56 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2007-10-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133515721","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Solving Consensus and Semi-supervised Clustering Problems Using Nonnegative Matrix Factorization","authors":"Tao Li, C. Ding, Michael I. Jordan","doi":"10.1109/ICDM.2007.98","DOIUrl":"https://doi.org/10.1109/ICDM.2007.98","url":null,"abstract":"Consensus clustering and semi-supervised clustering are important extensions of the standard clustering paradigm. Consensus clustering (also known as aggregation of clustering) can improve clustering robustness, deal with distributed and heterogeneous data sources and make use of multiple clustering criteria. Semi-supervised clustering can integrate various forms of background knowledge into clustering. In this paper, we show how consensus and semi-supervised clustering can be formulated within the framework of nonnegative matrix factorization (NMF). We show that this framework yields NMF-based algorithms that are: (1) extremely simple to implement; (2) provably correct and provably convergent. We conduct a wide range of comparative experiments that demonstrate the effectiveness of this NMF-based approach.","PeriodicalId":233758,"journal":{"name":"Seventh IEEE International Conference on Data Mining (ICDM 2007)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2007-10-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129083987","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Ding Zhou, Isaac G. Councill, H. Zha, C. Lee Giles
{"title":"Discovering Temporal Communities from Social Network Documents","authors":"Ding Zhou, Isaac G. Councill, H. Zha, C. Lee Giles","doi":"10.1109/ICDM.2007.56","DOIUrl":"https://doi.org/10.1109/ICDM.2007.56","url":null,"abstract":"This paper studies the discovery of communities from social network documents produced over time, addressing the discovery of temporal trends in community memberships. We first formulate static community discovery at a single time period as a tripartite graph partitioning problem. Then we propose to discover the temporal communities by threading the statically derived communities in different time periods using a new constrained partitioning algorithm, which partitions graphs based on topology as well as prior information regarding vertex membership. We evaluate the proposed approach on synthetic datasets and a real-world dataset prepared from the CiteSeer.","PeriodicalId":233758,"journal":{"name":"Seventh IEEE International Conference on Data Mining (ICDM 2007)","volume":"5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2007-10-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125873259","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Ronen Feldman, Moshe Fresko, J. Goldenberg, O. Netzer, L. Ungar
{"title":"Extracting Product Comparisons from Discussion Boards","authors":"Ronen Feldman, Moshe Fresko, J. Goldenberg, O. Netzer, L. Ungar","doi":"10.1109/ICDM.2007.27","DOIUrl":"https://doi.org/10.1109/ICDM.2007.27","url":null,"abstract":"In recent years, product discussion forums have become a rich environment in which consumers and potential adopters exchange views and information. Researchers and practitioners are starting to extract user sentiment about products from user product reviews. Users often compare different products, stating which they like better and why. Extracting information about product comparisons offers a number of challenges; recognizing and normalizing entities (products) in the informal language of blogs and discussion groups require different techniques than those used for entity extraction in the more formal text of newspapers and scientific articles. We present a case study in extracting information about comparisons between running shoes and between cars, describe an effective methodology, and show how it produces insight into how consumers view the running shoe and car markets.","PeriodicalId":233758,"journal":{"name":"Seventh IEEE International Conference on Data Mining (ICDM 2007)","volume":"42 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2007-10-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121771608","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"gApprox: Mining Frequent Approximate Patterns from a Massive Network","authors":"Cheng Chen, Xifeng Yan, Feida Zhu, Jiawei Han","doi":"10.1109/ICDM.2007.36","DOIUrl":"https://doi.org/10.1109/ICDM.2007.36","url":null,"abstract":"Recently, there arise a large number of graphs with massive sizes and complex structures in many new applications, such as biological networks, social networks, and the Web, demanding powerful data mining methods. Due to inherent noise or data diversity, it is crucial to address the issue of approximation, if one wants to mine patterns that are potentially interesting with tolerable variations. In this paper, we investigate the problem of mining frequent approximate patterns from a massive network and propose a method called gApprox. gApprox not only finds approximate network patterns, which is the key for many knowledge discovery applications on structural data, but also enriches the library of graph mining methodologies by introducing several novel techniques such as: (1) a complete and redundancy-free strategy to explore the new pattern space faced by gApprox; and (2) transform \"frequent in an approximate sense\" into an anti-monotonic constraint so that it can be pushed deep into the mining process. Systematic empirical studies on both real and synthetic data sets show that frequent approximate patterns mined from the worm protein-protein interaction network are biologically interesting and gApprox is both effective and efficient.","PeriodicalId":233758,"journal":{"name":"Seventh IEEE International Conference on Data Mining (ICDM 2007)","volume":"165 2-3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2007-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123504367","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Clustering Needles in a Haystack: An Information Theoretic Analysis of Minority and Outlier Detection","authors":"S. Ando","doi":"10.1109/ICDM.2007.53","DOIUrl":"https://doi.org/10.1109/ICDM.2007.53","url":null,"abstract":"Identifying atypical objects is one of the traditional topics in machine learning. Recently, novel approaches, e.g., Minority Detection and One-class clustering, have explored further to identify clusters of atypical objects which strongly contrast from the rest of the data in terms of their distribution or density. This paper analyzes such tasks from an information theoretic perspective. Based on Information Bottleneck formalization, these tasks interpret to increasing the averaged atypicalness of the clusters while reducing the complexity of the clustering. This formalization yields a unifying view of the new approaches as well as the classic outlier detection. We also present a scalable minimization algorithm which exploits the localized form of the cost function over individual clusters. The proposed algorithm is evaluated using simulated datasets and a text classification benchmark, in comparison with an existing method.","PeriodicalId":233758,"journal":{"name":"Seventh IEEE International Conference on Data Mining (ICDM 2007)","volume":"95 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2007-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134419770","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pu Wang, Jian Hu, Hua-Jun Zeng, Lijun Chen, Zheng Chen
{"title":"Improving Text Classification by Using Encyclopedia Knowledge","authors":"Pu Wang, Jian Hu, Hua-Jun Zeng, Lijun Chen, Zheng Chen","doi":"10.1109/ICDM.2007.77","DOIUrl":"https://doi.org/10.1109/ICDM.2007.77","url":null,"abstract":"The exponential growth of text documents available on the Internet has created an urgent need for accurate, fast, and general purpose text classification algorithms. However, the \"bag of words\" representation used for these classification methods is often unsatisfactory as it ignores relationships between important terms that do not co-occur literally. In order to deal with this problem, we integrate background knowledge - in our application: Wikipedia - into the process of classifying text documents. The experimental evaluation on Reuters newsfeeds and several other corpus shows that our classification results with encyclopedia knowledge are much better than the baseline \"bag of words \" methods.","PeriodicalId":233758,"journal":{"name":"Seventh IEEE International Conference on Data Mining (ICDM 2007)","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2007-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132184707","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Feng Pan, Adam Roberts, L. McMillan, D. Threadgill, Wei Wang
{"title":"Sample Selection for Maximal Diversity","authors":"Feng Pan, Adam Roberts, L. McMillan, D. Threadgill, Wei Wang","doi":"10.1109/ICDM.2007.16","DOIUrl":"https://doi.org/10.1109/ICDM.2007.16","url":null,"abstract":"The problem of selecting a sample subset sufficient to preserve diversity arises in many applications. One example is in the design of recombinant inbred lines (RIL) for genetic association studies. In this context, genetic diversity is measured by how many alleles are retained in the resulting inbred strains. RIL panels that are derived from more than two parental strains, such as the collaborative cross (Churchill et al., 2004), present a particular challenge with regard to which of the many existing lab mouse strains should be included in the initial breeding funnel in order to maximize allele retention. A similar problem occurs in the study of customer reviews when selecting a subset of products with a maximal diversity in reviews. Diversity in this case implies the presence of a set of products having both positive and negative ranks for each customer. In this paper, we demonstrate that selecting an optimal diversity subset is an NP-complete problem via reduction to set cover. This reduction is sufficiently tight that greedy approximations to the set cover problem directly apply to maximizing diversity. We then suggest a slightly modified subset selection problem in which an initial greedy diversity solution is used to effectively prune an exhaustive search for all diversity subsets bounded from below by a specified coverage threshold. Extensive experiments on real datasets are performed to demonstrate the effectiveness and efficiency of our approach.","PeriodicalId":233758,"journal":{"name":"Seventh IEEE International Conference on Data Mining (ICDM 2007)","volume":"80 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2007-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126332613","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
I. Assent, Ralph Krieger, Emmanuel Müller, T. Seidl
{"title":"DUSC: Dimensionality Unbiased Subspace Clustering","authors":"I. Assent, Ralph Krieger, Emmanuel Müller, T. Seidl","doi":"10.1109/ICDM.2007.49","DOIUrl":"https://doi.org/10.1109/ICDM.2007.49","url":null,"abstract":"To gain insight into today's large data resources, data mining provides automatic aggregation techniques. Clustering aims at grouping data such that objects within groups are similar while objects in different groups are dissimilar. In scenarios with many attributes or with noise, clusters are often hidden in subspaces of the data and do not show up in the full dimensional space. For these applications, subspace clustering methods aim at detecting clusters in any sub- space. Existing subspace clustering approaches fall prey to an effect we call dimensionality bias. As dimensionality of subspaces varies, approaches which do not take this effect into account fail to separate clusters from noise. We give a formal definition of dimensionality bias and analyze consequences for subspace clustering. A dimensionality unbiased subspace clustering (DUSC) definition based on statistical foundations is proposed. In thorough experiments on synthetic and real world data, we show that our approach outperforms existing subspace clustering algorithms.","PeriodicalId":233758,"journal":{"name":"Seventh IEEE International Conference on Data Mining (ICDM 2007)","volume":"16 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2007-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121625162","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Improving Knowledge Discovery in Document Collections through Combining Text Retrieval and Link Analysis Techniques","authors":"Wei Jin, R. Srihari, H. H. Ho, Xin Wu","doi":"10.1109/ICDM.2007.62","DOIUrl":"https://doi.org/10.1109/ICDM.2007.62","url":null,"abstract":"In this paper, we present Concept Chain Queries (CCQ), a special case of text mining in document collections focusing on detecting links between two topics across text documents. We interpret such a query as finding the most meaningful evidence trails across documents that connect these two topics. We propose to use link-analysis techniques over the extracted features provided by Information Extraction Engine for finding new knowledge. A graphical text representation and mining model is proposed which combines information retrieval, association mining and link analysis techniques. We present experiments on different datasets that demonstrate the effectiveness of our algorithm. Specifically, the algorithm generates ranked concept chains and evidence trails where the key terms representing significant relationships between topics are ranked high.","PeriodicalId":233758,"journal":{"name":"Seventh IEEE International Conference on Data Mining (ICDM 2007)","volume":"290 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2007-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123395951","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}