{"title":"An Efficient Spectral Algorithm for Network Community Discovery and Its Applications to Biological and Social Networks","authors":"Jianhua Ruan, Weixiong Zhang","doi":"10.1109/ICDM.2007.72","DOIUrl":"https://doi.org/10.1109/ICDM.2007.72","url":null,"abstract":"Automatic discovery of community structures in complex networks is a fundamental task in many disciplines, including social science, engineering, and biology. A quantitative measure called modularity (Q) has been proposed to effectively assess the quality of community structures. Several community discovery algorithms have since been developed based on the optimization of Q. However, this optimization problem is NP-hard, and the existing algorithms have a low accuracy or are computationally expensive. In this paper, we present an efficient spectral algorithm for modularity optimization. When tested on a large number of synthetic or real-world networks, and compared to the existing algorithms, our method is efficient and has a high accuracy. In addition, we have successfully applied our algorithm to detect interesting and meaningful community structures from real-world networks in different domains, including biology, medicine and social science. Due to space limitations, results of these applications are presented in a complete version of the paper available on our Website (http://cse.wustl.edu/~jruan/).","PeriodicalId":233758,"journal":{"name":"Seventh IEEE International Conference on Data Mining (ICDM 2007)","volume":"39 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2007-10-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122799265","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Optimal Subsequence Bijection","authors":"Longin Jan Latecki, Qiang Wang, Suzan Köknar-Tezel, V. Megalooikonomou","doi":"10.1109/ICDM.2007.47","DOIUrl":"https://doi.org/10.1109/ICDM.2007.47","url":null,"abstract":"We consider the problem of elastic matching of sequences of real numbers. Since both a query and a target sequence may be noisy, i.e., contain some outlier elements, it is desirable to exclude the outlier elements from matching in order to obtain a robust matching performance. Moreover, in many applications like shape alignment or stereo correspondence it is also desirable to have a one-to-one and onto correspondence (bijection) between the remaining elements. We propose an algorithm that determines the optimal subsequence bijection (OSB) of a query and target sequence. The OSB is efficiently computed since we map the problem's solution to a cheapest path in a DAG (directed acyclic graph). We obtained excellent results on standard benchmark time series datasets. We compared OSB to Dynamic Time Warping (DTW) with and without a warping window. We do not claim that OSB is always superior to DTW. However, our results demonstrate that skipping outlier elements as done by OSB can significantly improve matching results for many real datasets. Moreover, OSB is particularly suitable for partial matching. We applied it to the object recognition problem when only parts of contours are given. We obtained sequences representing shapes by representing object contours as sequences of curvatures.","PeriodicalId":233758,"journal":{"name":"Seventh IEEE International Conference on Data Mining (ICDM 2007)","volume":"33 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2007-10-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114291811","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Can the Content of Public News Be Used to Forecast Abnormal Stock Market Behaviour?","authors":"Calum S. Robertson, S. Geva, R. Wolff","doi":"10.1109/ICDM.2007.74","DOIUrl":"https://doi.org/10.1109/ICDM.2007.74","url":null,"abstract":"A popular theory of markets is that they are efficient: all available information is deemed to provide an accurate valuation of an asset at any time. In this paper, we consider how the content of market-related news articles contributes to such information. Specifically, we mine news articles for terms of interest, and quantify this degree of interest. We then incorporate this measure into traditional models for market index volatility with a view to forecasting whether the incidence of interesting news is correlated with a shock in the index, and thus if the information can be captured to value the underlying asset. We illustrate the methodology on stock market indices for the USA, the UK, and Australia.","PeriodicalId":233758,"journal":{"name":"Seventh IEEE International Conference on Data Mining (ICDM 2007)","volume":"41 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2007-10-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124173308","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Local Probabilistic Models for Link Prediction","authors":"Chao Wang, Venu Satuluri, S. Parthasarathy","doi":"10.1109/ICDM.2007.108","DOIUrl":"https://doi.org/10.1109/ICDM.2007.108","url":null,"abstract":"One of the core tasks in social network analysis is to predict the formation of links (i.e. various types of relationships) over time. Previous research has generally represented the social network in the form of a graph and has leveraged topological and semantic measures of similarity between two nodes to evaluate the probability of link formation. Here we introduce a novel local probabilistic graphical model method that can scale to large graphs to estimate the joint co-occurrence probability of two nodes. Such a probability measure captures information that is not captured by either topological measures or measures of semantic similarity, which are the dominant measures used for link prediction. We demonstrate the effectiveness of the co-occurrence probability feature by using it both in isolation and in combination with other topological and semantic features for predicting co-authorship collaborations on real datasets.","PeriodicalId":233758,"journal":{"name":"Seventh IEEE International Conference on Data Mining (ICDM 2007)","volume":"66 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2007-10-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116317507","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Efficient Data Sampling in Heterogeneous Peer-to-Peer Networks","authors":"Benjamin Arai, Song Lin, D. Gunopulos","doi":"10.1109/ICDM.2007.71","DOIUrl":"https://doi.org/10.1109/ICDM.2007.71","url":null,"abstract":"Performing data-mining tasks such as clustering, classification, and prediction on large datasets is an arduous task and, many times, it is an infeasible task given current hardware limitations. The distributed nature of peer-to-peer databases further complicates this issue by introducing an access overhead cost in addition to the cost of sending individual tuples over the network. We propose a two-level sampling approach focusing on peer-to-peer databases for maximizing sample quality given a user-defined communication budget. Given that individual peers may have varying cardinality we propose an algorithm for determining the optimal sample rate (the percentage of tuples to sample from a peer) for each peer. We do this by analyzing the variance of individual peers, ultimately minimizing the total variance of the entire sample. By performing local optimization of individual peer sample rates we maximize approximation accuracy of the samples. We also offer several techniques for sampling in peer-to-peer databases given various amounts of known and unknown information about the network and its peers.","PeriodicalId":233758,"journal":{"name":"Seventh IEEE International Conference on Data Mining (ICDM 2007)","volume":"41 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2007-10-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121713687","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Social Network Extraction of Academic Researchers","authors":"Jie Tang, Duo Zhang, Limin Yao","doi":"10.1109/ICDM.2007.30","DOIUrl":"https://doi.org/10.1109/ICDM.2007.30","url":null,"abstract":"This paper addresses the issue of extraction of an academic researcher social network. By researcher social network extraction, we aim at finding, extracting, and fusing the 'semantic'-based profiling information of a researcher from the Web. Previously, social network extraction was often undertaken separately in an ad-hoc fashion. This paper first gives a formalization of the entire problem. Specifically, it identifies the 'relevant documents' from the Web by a classifier. It then proposes a unified approach to perform the researcher profiling using conditional random fields (CRF). It integrates publications from the existing bibliography datasets. In the integration, it proposes a constraints-based probabilistic model for name disambiguation. Experimental results on an online system show that the unified approach to researcher profiling significantly outperforms the baseline methods of using rule learning or classification. Experimental results also indicate that our method for name disambiguation performs better than the baseline method using unsupervised learning. The methods have been applied to expert finding. Experiments show that the accuracy of expert finding can be significantly improved by using the proposed methods.","PeriodicalId":233758,"journal":{"name":"Seventh IEEE International Conference on Data Mining (ICDM 2007)","volume":"26 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2007-10-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126265327","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"General Averaged Divergence Analysis","authors":"D. Tao, Xuelong Li, Xindong Wu, S. Maybank","doi":"10.1109/ICDM.2007.105","DOIUrl":"https://doi.org/10.1109/ICDM.2007.105","url":null,"abstract":"Subspace selection is a powerful tool in data mining. An important subspace method is the Fisher-Rao linear discriminant analysis (LDA), which has been successfully applied in many fields such as biometrics, bioinformatics, and multimedia retrieval. However, LDA has a critical drawback: the projection to a subspace tends to merge those classes that are close together in the original feature space. If the separated classes are sampled from Gaussian distributions, all with identical covariance matrices, then LDA maximizes the mean value of the Kullback-Leibler (KL) divergences between the different classes. We generalize this point of view to obtain a framework for choosing a subspace by 1) generalizing the KL divergence to the Bregman divergence and 2) generalizing the arithmetic mean to a general mean. The framework is named the general averaged divergence analysis (GADA). Under this GADA framework, a geometric mean divergence analysis (GMDA) method based on the geometric mean is studied. A large number of experiments based on synthetic data show that our method significantly outperforms LDA and several representative LDA extensions.","PeriodicalId":233758,"journal":{"name":"Seventh IEEE International Conference on Data Mining (ICDM 2007)","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2007-10-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131381087","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Using Burstiness to Improve Clustering of Topics in News Streams","authors":"Qi He, Kuiyu Chang, Ee-Peng Lim","doi":"10.1109/ICDM.2007.17","DOIUrl":"https://doi.org/10.1109/ICDM.2007.17","url":null,"abstract":"Specialists who analyze online news have a hard time separating the wheat from the chaff. Moreover, automatic data-mining techniques like clustering of news streams into topical groups can fully recover the underlying true class labels of data if and only if all classes are well separated. In reality, especially for news streams, this is clearly not the case. The question to ask is thus this: if we cannot recover the full C classes by clustering, what is the largest K < C clusters we can find that best resemble the K underlying classes? Using the intuition that bursty topics are more likely to correspond to important events that are of interest to analysts, we propose several new bursty vector space models (B-VSM) for representing a news document. B-VSM takes into account the burstiness (across the full corpus and whole duration) of each constituent word in a document at the time of publication. We benchmarked our B-VSM against the classical TFIDF-VSM on the task of clustering a collection of news stream articles with known topic labels. Experimental results show that B-VSM was able to find the burstiest clusters/topics. Further, it also significantly improved the recall and precision for the top K clusters/topics.","PeriodicalId":233758,"journal":{"name":"Seventh IEEE International Conference on Data Mining (ICDM 2007)","volume":"51 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2007-10-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131433775","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Computing Correlation Anomaly Scores Using Stochastic Nearest Neighbors","authors":"T. Idé, S. Papadimitriou, M. Vlachos","doi":"10.1109/ICDM.2007.12","DOIUrl":"https://doi.org/10.1109/ICDM.2007.12","url":null,"abstract":"This paper addresses the task of change analysis of correlated multi-sensor systems. The goal of change analysis is to compute the anomaly score of each sensor when we know that the system has some potential difference from a reference state. Examples include validating the proper performance of various car sensors in the automobile industry. We solve this problem based on a neighborhood preservation principle: if the system is working normally, the neighborhood graph of each sensor is almost invariant against the fluctuations of experimental conditions. Here a neighborhood graph is defined based on the correlation between sensor signals. With the notion of stochastic neighborhood, our method is capable of robustly computing the anomaly score of each sensor under conditions that are hard to detect by other naive methods.","PeriodicalId":233758,"journal":{"name":"Seventh IEEE International Conference on Data Mining (ICDM 2007)","volume":"72 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2007-10-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124454594","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"High-Speed Function Approximation","authors":"Biswanath Panda, Mirek Riedewald, J. Gehrke, S. Pope","doi":"10.1109/ICDM.2007.107","DOIUrl":"https://doi.org/10.1109/ICDM.2007.107","url":null,"abstract":"We address a new learning problem where the goal is to build a predictive model that minimizes prediction time (the time taken to make a prediction) subject to a constraint on model accuracy. Our solution is a generic framework that leverages existing data mining algorithms without requiring any modifications to these algorithms. We show a first application of our framework to a combustion simulation problem. Our experimental evaluation shows significant improvements over existing methods; prediction time typically is improved by a factor between 2 and 6.","PeriodicalId":233758,"journal":{"name":"Seventh IEEE International Conference on Data Mining (ICDM 2007)","volume":"10 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2007-10-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127311481","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}