2010 IEEE International Conference on Data Mining: Latest Papers

Multi-dimensional Mass Estimation and Mass-based Clustering
2010 IEEE International Conference on Data Mining Pub Date: 2010-12-13 DOI: 10.1109/ICDM.2010.49
K. Ting, Jonathan R. Wells
Abstract: Mass estimation, an alternative to density estimation, has recently been shown to be an effective base modelling mechanism for three data mining tasks: regression, information retrieval and anomaly detection. This paper advances this work in two directions. First, we generalise the previously proposed one-dimensional mass estimation to multi-dimensional mass estimation, and significantly reduce the time complexity to O(ψh) from O(ψ^h), making it feasible for a full range of generic problems. Second, we introduce the first clustering method based on mass; it is unique because it does not employ any distance or density measure. The structure of the new mass model enables different parts of a cluster to be identified and merged without expensive evaluations. The characteristics of the new clustering method are: (i) it can identify arbitrary-shape clusters; (ii) it is significantly faster than existing density-based or distance-based methods; and (iii) it is noise-tolerant.
Citations: 25
Accelerating Radius-Margin Parameter Selection for SVMs Using Geometric Bounds
2010 IEEE International Conference on Data Mining Pub Date: 2010-12-13 DOI: 10.1109/ICDM.2010.100
Ben Goodrich, D. Albrecht, P. Tischer
Abstract: By considering the geometric properties of the Support Vector Machine (SVM) and Minimal Enclosing Ball (MEB) optimization problems, we show that upper and lower bounds on the radius-margin ratio of an SVM can be efficiently computed at any point during training. We use these bounds to accelerate radius-margin parameter selection by terminating training routines as early as possible, while still obtaining a guarantee that the parameters minimize the radius-margin ratio. Once an SVM has been partially trained on any set of parameters, we also show that these bounds can be used to evaluate and possibly reject neighboring parameter values with little or no additional training required. Empirical results show that, when selecting two parameter values, this process can reduce the number of training iterations required by a factor of 10 or more, while suffering no loss of precision in minimizing the radius-margin ratio.
Citations: 0
Content-Based Methods for Predicting Web-Site Demographic Attributes
2010 IEEE International Conference on Data Mining Pub Date: 2010-12-13 DOI: 10.1109/ICDM.2010.97
Santosh Kabbur, Eui-Hong Han, G. Karypis
Abstract: Demographic information plays an important role in gaining valuable insights about a web-site's user-base and is used extensively to target online advertisements and promotions. This paper investigates machine-learning approaches for predicting the demographic attributes of web-sites using information derived from their content and their hyperlinked structure, without relying on any information directly or indirectly obtained from the web-site's users. Such methods are important because users are becoming increasingly concerned about sharing their personal and behavioral information on the Internet. Regression-based approaches are developed and studied for predicting demographic attributes that utilize different content-derived features, different ways of building the prediction models, and different ways of aggregating web-page level predictions that take into account the web's hyperlinked structure. In addition, a matrix-approximation based approach is developed for coupling the predictions of individual regression models into a model designed to predict the probability mass function of the attribute. Extensive experiments show that these methods are able to achieve an RMSE of 8-10% and provide insights on how to best train and apply such models.
Citations: 28
Multi-label Feature Selection for Graph Classification
2010 IEEE International Conference on Data Mining Pub Date: 2010-12-13 DOI: 10.1109/ICDM.2010.58
Xiangnan Kong, Philip S. Yu
Abstract: The classification of graph data has become an important and active research topic in the last decade, with a wide variety of real-world applications, e.g. drug activity prediction and kinase inhibitor discovery. Current research on graph classification focuses on single-label settings. However, in many applications, each graph can be assigned a set of multiple labels simultaneously. Extracting good features using the multiple labels of the graphs becomes an important step before graph classification. In this paper, we study the problem of multi-label feature selection for graph classification and propose a novel solution, called gMLC, to efficiently search for optimal subgraph features for graph objects with multiple labels. Unlike existing feature selection methods in vector spaces, which assume the feature set is given, we perform multi-label feature selection for graph data progressively, together with the subgraph feature mining process. We derive an evaluation criterion, named gHSIC, to estimate the dependence between subgraph features and the multiple labels of graphs. A branch-and-bound algorithm is then proposed to efficiently search for optimal subgraph features by judiciously pruning the subgraph search space using multiple labels. Empirical studies on real-world tasks demonstrate that our feature selection approach can effectively boost multi-label graph classification performance and is more efficient because it prunes the subgraph search space using multiple labels.
Citations: 45
Max-Clique: A Top-Down Graph-Based Approach to Frequent Pattern Mining
2010 IEEE International Conference on Data Mining Pub Date: 2010-12-13 DOI: 10.1109/ICDM.2010.73
Yan Xie, Philip S. Yu
Abstract: Frequent pattern mining is a fundamental problem in data mining research. We note that almost all state-of-the-art algorithms may not be able to mine very long patterns in a large database with a huge set of frequent patterns. In this paper, we approach this difficult problem from a different perspective: we focus on mining top-k long maximal frequent patterns, because long patterns are in general the more interesting ones. Unlike traditional level-wise mining or tree-growth strategies, our method works in a top-down manner. We pull large maximal cliques from a pattern graph constructed after some fast initial processing, and directly use these large maximal cliques as promising candidates for long frequent patterns. A separate refinement stage is needed to further transform these candidates into true maximal patterns.
Citations: 20
Network Simplification with Minimal Loss of Connectivity
2010 IEEE International Conference on Data Mining Pub Date: 2010-12-13 DOI: 10.1109/ICDM.2010.133
Fang Zhou, S. Mahler, Hannu (TT) Toivonen
Abstract: We propose a novel problem: simplifying weighted graphs by pruning their least important edges. Simplified graphs can be used to improve visualization of a network, to extract its main structure, or as a pre-processing step for other data mining algorithms. We define a graph connectivity function based on the best paths between all pairs of nodes. Given the number of edges to be pruned, the problem is then to select a subset of edges that best maintains the overall graph connectivity. Our model is applicable to a wide range of settings, including probabilistic graphs, flow graphs and distance graphs, since the path quality function that is used to find best paths can be defined by the user. We analyze the problem, and give lower bounds for the effect of individual edge removal in the case where the path quality function has a natural recursive property. We then propose a range of algorithms and report on experimental results on real networks derived from public biological databases. The results show that a large fraction of edges can be removed quite fast and with minimal effect on the overall graph connectivity. A rough semantic analysis of the removed edges indicates that few important edges were removed, and that the proposed approach could be a valuable tool in aiding users to view or explore weighted graphs.
Citations: 52
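The problem statement in the Network Simplification abstract (prune a given number of edges while best preserving connectivity defined over best paths between all node pairs) can be illustrated with a small sketch. Everything concrete here is an assumption made for illustration and not taken from the paper: the graph is probabilistic, path quality is the product of edge probabilities, connectivity is the average best-path quality over all node pairs, and the pruning loop is a brute-force greedy that is not necessarily any of the algorithms the authors propose. The names `best_path_quality`, `connectivity` and `greedy_prune` are hypothetical.

```python
import heapq


def best_path_quality(edges, nodes, src):
    """Best (max-product) path quality from src to every node.

    Assumes edge weights are probabilities in (0, 1], so a Dijkstra-style
    search is valid: extending a path never increases its quality.
    """
    adj = {n: [] for n in nodes}
    for (u, v), p in edges.items():
        adj[u].append((v, p))
        adj[v].append((u, p))
    best = {n: 0.0 for n in nodes}
    best[src] = 1.0
    heap = [(-1.0, src)]  # max-heap via negated quality
    while heap:
        neg_q, u = heapq.heappop(heap)
        q = -neg_q
        if q < best[u]:
            continue  # stale heap entry
        for v, p in adj[u]:
            if q * p > best[v]:
                best[v] = q * p
                heapq.heappush(heap, (-best[v], v))
    return best


def connectivity(edges, nodes):
    """Average best-path quality over all unordered node pairs (assumed definition)."""
    order = sorted(nodes)
    total, count = 0.0, 0
    for i, u in enumerate(order):
        best = best_path_quality(edges, nodes, u)
        for v in order[i + 1:]:
            total += best[v]
            count += 1
    return total / count if count else 0.0


def greedy_prune(edges, nodes, k):
    """Remove k edges one at a time, each time dropping the edge whose
    removal leaves the highest connectivity (brute force, illustration only)."""
    remaining = dict(edges)
    for _ in range(min(k, len(remaining))):
        def after_removal(e):
            trial = {x: p for x, p in remaining.items() if x != e}
            return connectivity(trial, nodes)
        del remaining[max(remaining, key=after_removal)]
    return remaining


# Triangle where the direct a-c edge is weaker than the detour through b.
nodes = {"a", "b", "c"}
edges = {("a", "b"): 0.9, ("b", "c"): 0.9, ("a", "c"): 0.5}
pruned = greedy_prune(edges, nodes, 1)
```

In this toy triangle the direct a-c edge is never on a best path, so removing it leaves the assumed connectivity measure unchanged, which mirrors the abstract's claim that many edges can be removed with minimal effect.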
Efficient Probabilistic Latent Semantic Analysis with Sparsity Control
2010 IEEE International Conference on Data Mining Pub Date: 2010-12-13 DOI: 10.1109/ICDM.2010.136
Sen Liu, Chaolun Xia, Xiaohong Jiang
Abstract: Probabilistic latent semantic analysis (PLSA) is a topic modeling technique for discovering the hidden structure in binary and count data. As a mixture model, it performs a probabilistic mixture decomposition on the co-occurrence matrix, which produces two matrices with probabilistic interpretations. However, the factorized matrices may be rather smooth, which means we may obtain global feature and topic representations rather than the expected local ones. One solution to this problem is to revise the decomposition process to take sparsity into account. In this paper, we present an approach that provides direct control over sparsity during the expectation maximization process. Furthermore, by using the log penalty function as the sparsity measure instead of the widely used L2 norm, we can approximate the re-estimation of parameters in linear time, the same as the original PLSA, while many other approaches require much more time. Experiments on face databases are reported, showing visual representations of the local features obtained and detailed improvements in clustering tasks compared with the original process.
Citations: 9
Feature Selection for Unsupervised Learning Using Random Cluster Ensembles
2010 IEEE International Conference on Data Mining Pub Date: 2010-12-13 DOI: 10.1109/ICDM.2010.137
H. Elghazel, A. Aussem
Abstract: In this paper, we propose another extension of the Random Forests paradigm to unlabeled data, leading to localized unsupervised feature selection (FS). We show that the way internal estimates are used to measure variable importance in Random Forests is also applicable to FS in unsupervised learning. We first illustrate the clustering performance of the proposed method on various data sets based on widely used external criteria of clustering quality. We then assess the accuracy and the scalability of the FS procedure on UCI and real labeled data sets and compare its effectiveness against other FS methods.
Citations: 27
Rare Category Characterization
2010 IEEE International Conference on Data Mining Pub Date: 2010-12-13 DOI: 10.1109/ICDM.2010.154
Jingrui He, Hanghang Tong, J. Carbonell
Abstract: Rare categories abound and their characterization has heretofore received little attention. Fraudulent banking transactions, network intrusions, and rare diseases are examples of rare classes whose detection and characterization are of high value. However, accurate characterization is challenging due to high skewness and non-separability from majority classes, e.g., fraudulent transactions masquerade as legitimate ones. This paper proposes the RACH algorithm, which exploits the compactness property of the rare categories. It is based on an optimization framework which encloses the rare examples in a minimum-radius hyperball. The framework is then converted into a convex optimization problem, which is in turn effectively solved in its dual form by the projected subgradient method. RACH can be naturally kernelized. Experimental results validate the effectiveness of RACH.
Citations: 37
Viral Marketing for Multiple Products
2010 IEEE International Conference on Data Mining Pub Date: 2010-12-13 DOI: 10.1109/ICDM.2010.52
S. Datta, Anirban Majumder, Nisheeth Shrivastava
Abstract: Viral marketing, the idea of exploiting social interactions of users to propagate awareness for products, has gained considerable focus in recent years. One of the key issues in this area is to select the best seeds that maximize the influence propagated in the social network. In this paper, we define the seed selection problem (called t-Influence Maximization, or t-IM) for multiple products. Specifically, given the social network and t products along with their seed requirements, we want to select seeds for each product that maximize the overall influence. As the seeds are typically sent promotional messages, to avoid spamming users, we put a hard constraint on the number of products for which any single user can be selected as a seed. We design two efficient techniques for the t-IM problem, called Greedy and FairGreedy. The Greedy algorithm uses simple greedy hill climbing, but still results in a 1/3-approximation to the optimum. Our second technique, FairGreedy, not only allocates seeds with high overall influence (close to Greedy in practice), but also ensures fairness across the influence of different products. We also design efficient heuristics for estimating the influence of the selected seeds, which are crucial for running the seed selection on large social network graphs. Finally, using extensive simulations on real-life social graphs, we show the effectiveness and scalability of our techniques compared to existing and naive strategies.
Citations: 56
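The t-IM setting described in the Viral Marketing abstract (t products with seed requirements, a hard cap on how many products a single user may seed, and greedy hill climbing) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' Greedy algorithm: the influence of a seed set is approximated here by a hypothetical one-hop coverage count (seeds plus their direct neighbours) rather than any propagation model from the paper, and the approximation guarantee quoted in the abstract is not claimed for this sketch. The names `coverage` and `greedy_t_im` are invented for the example.

```python
def coverage(graph, seeds):
    """Hypothetical influence stand-in: number of users that are seeds
    or direct neighbours of a seed."""
    covered = set(seeds)
    for s in seeds:
        covered.update(graph[s])
    return len(covered)


def greedy_t_im(graph, requirements, cap):
    """Greedy hill climbing over (user, product) assignments.

    graph:        dict mapping each user to a list of neighbours
    requirements: requirements[i] is the number of seeds product i needs
    cap:          maximum number of products any one user may seed
    """
    seeds = [set() for _ in requirements]
    load = {u: 0 for u in graph}  # products each user already seeds
    while True:
        best = None  # (marginal gain, user, product index)
        for i, need in enumerate(requirements):
            if len(seeds[i]) >= need:
                continue  # product i already has all its seeds
            base = coverage(graph, seeds[i])
            for u in graph:
                if u in seeds[i] or load[u] >= cap:
                    continue  # infeasible under the per-user cap
                gain = coverage(graph, seeds[i] | {u}) - base
                if best is None or gain > best[0]:
                    best = (gain, u, i)
        if best is None:
            break  # every requirement met, or no feasible user left
        _, u, i = best
        seeds[i].add(u)
        load[u] += 1
    return seeds


# A hub "h" with three neighbours, plus a separate pair d-e.
graph = {
    "h": ["a", "b", "c"],
    "a": ["h"], "b": ["h"], "c": ["h"],
    "d": ["e"], "e": ["d"],
}
one_each = greedy_t_im(graph, [1, 1], cap=1)  # hub can seed only one product
shared = greedy_t_im(graph, [1, 1], cap=2)    # hub may seed both products
```

With `cap=1` the hub is taken by the first product and the second product must settle for a weaker seed, which is the kind of imbalance across products that the abstract's FairGreedy variant is said to address.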