2011 IEEE 11th International Conference on Data Mining: Latest Papers, Page 10

Document Clustering via Matrix Representation

2011 IEEE 11th International Conference on Data Mining Pub Date : 2011-12-11 DOI: 10.1109/ICDM.2011.59

Xufei Wang, Jiliang Tang, Huan Liu

Abstract: Vector Space Model (VSM) is widely used to represent documents and web pages. It is simple and easy to deal with computationally, but it also oversimplifies a document into a vector, is susceptible to noise, and cannot explicitly represent the underlying topics of a document. A matrix representation of documents is proposed in this paper: rows represent distinct terms and columns represent cohesive segments. The matrix model views a document as a set of segments, and each segment is a probability distribution over a limited number of latent topics which can be mapped to clustering structures. The latent topic extraction based on the matrix representation of documents is formulated as a constrained optimization problem in which each matrix (i.e., a document) A_i is factorized into a common base determined by non-negative matrices L and R^T, and a non-negative weight matrix M_i, such that the sum of reconstruction errors over all documents is minimized. Empirical evaluation demonstrates that it is feasible to use the matrix model for document clustering: (1) compared with vector representation, using matrix representation improves clustering quality consistently, and the proposed approach achieves a relative accuracy improvement of up to 66% on the studied datasets, and (2) the proposed method outperforms baseline methods such as k-means and NMF, and complements state-of-the-art methods like LDA and PLSI. Furthermore, the proposed matrix model allows more refined information retrieval at a segment level instead of at a document level, which enables the return of more relevant documents in information retrieval tasks.
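The factorization described in this abstract can be written out as an optimization problem. What follows is only a sketch under stated assumptions: the abstract does not name the error measure, so the squared Frobenius norm is assumed; the product order L M_i R^T is assumed from the abstract's notation; and n (the number of documents) is an illustrative symbol that does not appear in the abstract.

```latex
\min_{L,\,R,\,M_1,\dots,M_n}\ \sum_{i=1}^{n} \left\lVert A_i - L\, M_i\, R^{\top} \right\rVert_F^{2}
\quad \text{subject to} \quad L \ge 0,\ R \ge 0,\ M_i \ge 0 \quad (i = 1,\dots,n)
```

Under this reading, L and R are shared by all documents, while each M_i carries the document-specific weights.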

Citations: 21

Privacy Risk in Graph Stream Publishing for Social Network Data

2011 IEEE 11th International Conference on Data Mining Pub Date : 2011-12-11 DOI: 10.1109/ICDM.2011.120

Nigel Medforth, Ke Wang

Abstract: To understand how social networks evolve over time, graphs representing the networks need to be published periodically or on demand. The identity of the participants (nodes) must be anonymized to protect the privacy of the individuals and their relationships (edges) to the other members in the social network. We identify a new form of privacy attack, which we name the degree-trail attack. This attack re-identifies the nodes belonging to a target participant from a sequence of published graphs by comparing the degree of the nodes in the published graphs with the degree evolution of a target. The power of this attack is that the adversary can actively influence the degree of the target individual by interacting with the social network. We show that the adversary can succeed with a high probability even if published graphs are anonymized by the strongest known privacy-preserving techniques in the literature. Moreover, this success does not depend on the distinctiveness of the target nodes, nor does it require the adversary to behave differently from a normal participant. One of our contributions is a formal method to assess the privacy risk of this type of attack and to empirically study its severity on real social network data.
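The core comparison in the degree-trail attack (matching node degrees in each published graph against the target's known degree evolution) can be illustrated with a small Python sketch. This is not the authors' procedure: it assumes, for simplicity, that each node keeps the same pseudonym across releases, and the function name `degree_trail_candidates` and the toy data are hypothetical.

```python
from typing import Dict, List, Set


def degree_trail_candidates(
    releases: List[Dict[str, int]],
    target_trail: List[int],
) -> Set[str]:
    """Return pseudonyms whose degree equals the target's degree in every release.

    Assumption (not from the abstract): pseudonyms are stable across releases,
    so a node's degrees over time form one comparable trail.
    """
    if len(releases) != len(target_trail):
        raise ValueError("need exactly one known target degree per release")
    # Start from every node in the first release, then filter release by release.
    candidates: Set[str] = set(releases[0]) if releases else set()
    for degrees, target_degree in zip(releases, target_trail):
        candidates = {n for n in candidates if degrees.get(n) == target_degree}
    return candidates


# Hypothetical toy data: three releases, each mapping pseudonym -> degree.
releases = [
    {"a": 2, "b": 2, "c": 3, "d": 2},
    {"a": 3, "b": 2, "c": 3, "d": 3},
    {"a": 4, "b": 2, "c": 4, "d": 3},
]
# Degrees the adversary knows (or has caused) for the target at each release.
target_trail = [2, 3, 4]

print(degree_trail_candidates(releases, target_trail))
```

In this toy data three nodes share the target's degree in the first release, but only one node matches the whole trail, which is why a longer sequence of releases makes re-identification easier.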

Citations: 27

Constraint Selection-Based Semi-supervised Feature Selection

2011 IEEE 11th International Conference on Data Mining Pub Date : 2011-12-11 DOI: 10.1109/ICDM.2011.42

Mohammed Hindawi, Kais Allab, K. Benabdeslem

Citations: 22

Class Imbalance, Redux

2011 IEEE 11th International Conference on Data Mining Pub Date : 2011-12-11 DOI: 10.1109/ICDM.2011.33

Byron C. Wallace, Kevin Small, C. Brodley, T. Trikalinos

Citations: 193

Finding Communities in Dynamic Social Networks

2011 IEEE 11th International Conference on Data Mining Pub Date : 2011-12-11 DOI: 10.1109/ICDM.2011.67

Chayant Tantipathananandh, T. Berger-Wolf

Abstract: Communities are natural structures observed in social networks and are usually characterized as "relatively dense" subsets of nodes. Social networks change over time and so do the underlying community structures. Thus, to truly uncover this structure we must take the temporal aspect of networks into consideration. Previously, we have presented a framework for finding dynamic communities using the social cost model and formulated the corresponding optimization problem [33], assuming that partitions of individuals into groups are given in each time step. We have also presented heuristics and approximation algorithms for the problem, with the same assumption [32]. In general, however, dynamic social networks are represented as a sequence of graphs of snapshots of the social network and the assumption that we have partitions of individuals into groups does not hold. In this paper, we extend the social cost model and formulate an optimization problem of finding community structure from the sequence of arbitrary graphs. We propose a semidefinite programming formulation and a heuristic rounding scheme. We show that this method is quite accurate on synthetic data sets and present its results on a real social network.

Citations: 84

A Study of Laplacian Spectra of Graph for Subgraph Queries

2011 IEEE 11th International Conference on Data Mining Pub Date : 2011-12-11 DOI: 10.1109/ICDM.2011.17

Lei Zhu, Qinbao Song

Citations: 2

Multi-instance Metric Learning

2011 IEEE 11th International Conference on Data Mining Pub Date : 2011-12-11 DOI: 10.1109/ICDM.2011.106

Ye Xu, Wei Ping, A. Campbell

Abstract: Multi-instance learning, like other machine learning and data mining tasks, requires distance metrics. Although metric learning methods have been studied for many years, metric learners for multi-instance learning remain almost untouched. In this paper, we propose a framework called Multi-Instance MEtric Learning (MIMEL) to learn an appropriate distance under the multi-instance setting. The distance metric between two bags is defined using the Mahalanobis distance function. The problem is formulated by minimizing the KL divergence between two multivariate Gaussians under the constraints of maximizing the between-class bag distance and minimizing the within-class bag distance. To exploit the mechanism of how instances determine bag labels in multi-instance learning, we design a nonparametric density-estimation-based weighting scheme to assign higher "weights" to the instances that are more likely to be positive in positive bags. The weighting scheme itself has a small workload, which adds little extra computing cost to the proposed framework. Moreover, to further boost the classification accuracy, a kernel version of MIMEL is presented. We evaluate MIMEL using not only several typical multi-instance tasks, but also two activity recognition datasets. The experimental results demonstrate that MIMEL achieves better classification accuracy than many state-of-the-art distance-based algorithms or kernel methods for multi-instance learning.
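To make the idea of a Mahalanobis distance between bags concrete, here is a minimal Python sketch. It is not MIMEL itself: the abstract does not say how a bag is summarized, so representing each bag by a weighted centroid of its instances is an assumption, the matrix M is supplied by hand rather than learned, and all function names are hypothetical.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def weighted_centroid(bag: List[Vector], weights: Sequence[float]) -> List[float]:
    """Summarize a bag as the weighted mean of its instances (an assumption)."""
    total = sum(weights)
    dim = len(bag[0])
    return [sum(w * x[k] for x, w in zip(bag, weights)) / total for k in range(dim)]


def mahalanobis(u: Vector, v: Vector, M: List[List[float]]) -> float:
    """Compute sqrt((u - v)^T M (u - v)) for a positive semidefinite matrix M."""
    diff = [a - b for a, b in zip(u, v)]
    n = len(diff)
    quad = sum(diff[r] * M[r][c] * diff[c] for r in range(n) for c in range(n))
    return math.sqrt(quad)


# Hypothetical toy bags of 2-D instances.
bag_a = [[0.0, 0.0], [2.0, 0.0]]
bag_b = [[4.0, 4.0]]

# With the identity matrix, the Mahalanobis distance reduces to Euclidean distance.
identity = [[1.0, 0.0], [0.0, 1.0]]
centroid_a = weighted_centroid(bag_a, [1.0, 1.0])
centroid_b = weighted_centroid(bag_b, [1.0])
distance = mahalanobis(centroid_a, centroid_b, identity)
print(centroid_a, centroid_b, distance)
```

A metric learner would replace the identity matrix with a learned M, and an instance-weighting scheme would replace the uniform weights passed to `weighted_centroid`.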

Citations: 27

Entropy-Based Graph Clustering: Application to Biological and Social Networks

2011 IEEE 11th International Conference on Data Mining Pub Date : 2011-12-11 DOI: 10.1109/ICDM.2011.64

Edward Casey Kenley, Young-Rae Cho

Abstract: Complex systems have been widely studied to characterize their structural behaviors from a topological perspective. High modularity is one of the recurrent features of real-world complex systems. Various graph clustering algorithms have been applied to identifying communities in social networks or modules in biological networks. However, their applicability to real-world systems has been limited because of the massive scale and complex connectivity of the networks. In this study, we exploit a novel information-theoretic model for graph clustering. The entropy-based clustering approach finds locally optimal clusters by growing a random seed in a manner that minimizes graph entropy. We design and analyze modifications that further improve its performance. Assigning priority in seed selection and seed growth is well suited to scale-free networks characterized by a hub-oriented structure. Computing seed growth in parallel streams also decomposes an extremely large network efficiently. The experimental results with real biological and social networks show that the entropy-based approach has better performance than competing methods in terms of accuracy and efficiency.
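The seed-growth idea in this abstract can be sketched in Python as a greedy local search. The sketch rests on assumptions the abstract does not spell out: a vertex's entropy is taken to be the binary entropy of the fraction of its neighbors inside the cluster, graph entropy is the sum over all vertices, and the cluster starts from the seed plus its neighbors and changes only when a removal or addition lowers graph entropy. Function names and the toy graph are hypothetical.

```python
import math
from typing import Dict, Set

Graph = Dict[int, Set[int]]


def vertex_entropy(adj: Graph, cluster: Set[int], v: int) -> float:
    """Binary entropy of the share of v's neighbors that lie inside the cluster."""
    neighbors = adj[v]
    if not neighbors:
        return 0.0
    p_in = len(neighbors & cluster) / len(neighbors)
    return -sum(p * math.log2(p) for p in (p_in, 1.0 - p_in) if p > 0.0)


def graph_entropy(adj: Graph, cluster: Set[int]) -> float:
    """Sum of vertex entropies over every vertex in the graph."""
    return sum(vertex_entropy(adj, cluster, v) for v in adj)


def grow_cluster(adj: Graph, seed: int) -> Set[int]:
    """Greedily shrink and grow a cluster around the seed while entropy decreases."""
    cluster = {seed} | adj[seed]
    best = graph_entropy(adj, cluster)
    changed = True
    while changed:
        changed = False
        # Removal phase: drop members (never the seed) if that lowers entropy.
        for v in sorted(cluster - {seed}):
            trial = cluster - {v}
            h = graph_entropy(adj, trial)
            if h < best - 1e-12:
                cluster, best, changed = trial, h, True
        # Addition phase: pull in boundary vertices if that lowers entropy.
        boundary = set().union(*(adj[v] for v in cluster)) - cluster
        for v in sorted(boundary):
            trial = cluster | {v}
            h = graph_entropy(adj, trial)
            if h < best - 1e-12:
                cluster, best, changed = trial, h, True
    return cluster


# Hypothetical toy graph: two triangles joined by the bridge edge 3-4.
adj: Graph = {
    1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4},
    4: {3, 5, 6}, 5: {4, 6}, 6: {4, 5},
}
print(grow_cluster(adj, 1), grow_cluster(adj, 4))
```

Each accepted move strictly lowers graph entropy, so the loop stops at a locally optimal cluster; the seed-priority and parallel variants mentioned in the abstract are not modeled here.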

Citations: 31

Finding Novel Diagnostic Gene Patterns Based on Interesting Non-redundant Contrast Sequence Rules

2011 IEEE 11th International Conference on Data Mining Pub Date : 2011-12-11 DOI: 10.1109/ICDM.2011.68

Yuhai Zhao, Guoren Wang, Yuan Li, Zhanghui Wang

Abstract: Diagnostic genes are genes closely related to a specific disease phenotype, which often have high power to distinguish between different classes. Most methods for discovering powerful diagnostic genes are either singleton discriminability-based or combination discriminability-based. However, both ignore the abundant interactions among genes, which widely exist in the real world. In this paper, we tackle the problem from a new point of view and make the following contributions: (1) we propose an EWave model, which profitably exploits the ordered expressions among genes based on the defined equivalent dimension group sequences, taking into account the "noise" universal in real data, (2) we devise a novel sequence rule, namely the interesting non-redundant contrast sequence rule, which is able to capture the difference between different phenotypes with high accuracy using as few genes as possible, and (3) we present an efficient algorithm called NRMINER to find such rules. Unlike the conventional column enumeration and the more recent row enumeration, it performs a novel template-driven enumeration by making use of the special characteristics of microarray data modeled by EWave. Extensive experiments conducted on various synthetic and real datasets show that: (1) NRMINER is significantly faster than the competing algorithm, by up to about one order of magnitude, and (2) it provides a higher accuracy using fewer genes. Many diagnostic genes discovered by NRMINER are shown to be biologically related to some disease.

Citations: 14

Mixture of Softmax sLDA

2011 IEEE 11th International Conference on Data Mining Pub Date : 2011-12-11 DOI: 10.1109/ICDM.2011.103

Xiaoxu Li, Junyu Zeng, Xiaojie Wang, Yixin Zhong

Citations: 1