2011 IEEE 11th International Conference on Data Mining最新文献_第7页

Learning with Minimum Supervision: A General Framework for Transductive Transfer Learning 最小监督下的学习:转导迁移学习的一般框架

2011 IEEE 11th International Conference on Data Mining Pub Date : 2011-12-11 DOI: 10.1109/ICDM.2011.92

M. T. Bahadori, Yan Liu, Dan Zhang

引用次数: 30

Learning Dirichlet Processes from Partially Observed Groups 从部分观察群中学习狄利克雷过程

2011 IEEE 11th International Conference on Data Mining Pub Date : 2011-12-11 DOI: 10.1109/ICDM.2011.85

Kumar Avinava Dubey, Indrajit Bhattacharya, M. Das, T. Faruquie, C. Bhattacharyya

{"title":"Learning Dirichlet Processes from Partially Observed Groups","authors":"Kumar Avinava Dubey, Indrajit Bhattacharya, M. Das, T. Faruquie, C. Bhattacharyya","doi":"10.1109/ICDM.2011.85","DOIUrl":"https://doi.org/10.1109/ICDM.2011.85","url":null,"abstract":"Motivated by the task of vernacular news analysis using known news topics from national news-papers, we study the task of topic analysis, where given source datasets with observed topics, data items from a target dataset need to be assigned either to observed source topics or to new ones. Using Hierarchical Dirichlet Processes for addressing this task imposes unnecessary and often inappropriate generative assumptions on the observed source topics. In this paper, we explore Dirichlet Processes with partially observed groups (POG-DP). POG-DP avoids modeling the given source topics. Instead, it directly models the conditional distribution of the target data as a mixture of a Dirichlet Process and the posterior distribution of a Hierarchical Dirichlet Process with known groups and topics. This introduces coupling between selection probabilities of all topics within a source, leading to effective identification of source topics. We further improve on this with a Combinatorial Dirichlet Process with partially observed groups (POG-CDP) that captures finer grained coupling between related topics by choosing intersections between sources. We evaluate our models in three different real-world applications. Using extensive experimentation, we compare against several baselines to show that our model performs significantly better in all three applications.","PeriodicalId":106216,"journal":{"name":"2011 IEEE 11th International Conference on Data Mining","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-12-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123766506","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 3

Efficient Mining of Closed Sequential Patterns on Stream Sliding Window 流滑动窗口上封闭序列模式的高效挖掘

2011 IEEE 11th International Conference on Data Mining Pub Date : 2011-12-11 DOI: 10.1109/ICDM.2011.61

Chuancong Gao, Jianyong Wang, Qingyan Yang

引用次数: 11

Personalized Travel Package Recommendation 个性化旅游套餐推荐

2011 IEEE 11th International Conference on Data Mining Pub Date : 2011-12-11 DOI: 10.1109/ICDM.2011.118

Qi Liu, Yong Ge, Zhongmou Li, Enhong Chen, Hui Xiong

{"title":"Personalized Travel Package Recommendation","authors":"Qi Liu, Yong Ge, Zhongmou Li, Enhong Chen, Hui Xiong","doi":"10.1109/ICDM.2011.118","DOIUrl":"https://doi.org/10.1109/ICDM.2011.118","url":null,"abstract":"As the worlds of commerce, entertainment, travel, and Internet technology become more inextricably linked, new types of business data become available for creative use and formal analysis. Indeed, this paper provides a study of exploiting online travel information for personalized travel package recommendation. A critical challenge along this line is to address the unique characteristics of travel data, which distinguish travel packages from traditional items for recommendation. To this end, we first analyze the characteristics of the travel packages and develop a Tourist-Area-Season Topic (TAST) model, which can extract the topics conditioned on both the tourists and the intrinsic features (i.e. locations, travel seasons) of the landscapes. Based on this TAST model, we propose a cocktail approach on personalized travel package recommendation. Finally, we evaluate the TAST model and the cocktail approach on real-world travel package data. The experimental results show that the TAST model can effectively capture the unique characteristics of the travel data and the cocktail approach is thus much more effective than traditional recommendation methods for travel package recommendation.","PeriodicalId":106216,"journal":{"name":"2011 IEEE 11th International Conference on Data Mining","volume":"16 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-12-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125185742","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 217

Simple Multiple Noisy Label Utilization Strategies 简单的多噪声标签利用策略

2011 IEEE 11th International Conference on Data Mining Pub Date : 2011-12-11 DOI: 10.1109/ICDM.2011.133

V. Sheng

{"title":"Simple Multiple Noisy Label Utilization Strategies","authors":"V. Sheng","doi":"10.1109/ICDM.2011.133","DOIUrl":"https://doi.org/10.1109/ICDM.2011.133","url":null,"abstract":"With the outsourcing of small tasks becoming easier, it is possible to obtain non-expert/imperfect labels at low cost. With low-cost imperfect labeling, it is straightforward to collect multiple labels for the same data items. This paper addresses the strategies of utilizing these multiple labels for improving the performance of supervised learning, based on two basic ideas: majority voting and pair wise solutions. We show several interesting results based on our experiments. The soft majority voting strategies can reduce the bias and roughness, and improve the performance of the directed hard majority voting strategy. Pair wise strategies can completely avoid the bias by having both sides (potential correct and incorrect/noisy information) considered (for binary classification). They have very good performance whenever there are a few or many labels available. However, it could also keep the noise. The improved variation that reduces the impact of the noisy information is recommended. All five strategies investigated are labeling quality agnostic strategies, and can be applied to real world applications directly. The experimental results show some of them perform better than or at least very close to the gnostic strategies.","PeriodicalId":106216,"journal":{"name":"2011 IEEE 11th International Conference on Data Mining","volume":"26 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-12-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126979224","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 22

Detecting Community Kernels in Large Social Networks 在大型社会网络中检测社区核

2011 IEEE 11th International Conference on Data Mining Pub Date : 2011-12-11 DOI: 10.1109/ICDM.2011.48

Liaoruo Wang, Tiancheng Lou, Jie Tang, J. Hopcroft

{"title":"Detecting Community Kernels in Large Social Networks","authors":"Liaoruo Wang, Tiancheng Lou, Jie Tang, J. Hopcroft","doi":"10.1109/ICDM.2011.48","DOIUrl":"https://doi.org/10.1109/ICDM.2011.48","url":null,"abstract":"In many social networks, there exist two types of users that exhibit different influence and different behavior. For instance, statistics have shown that less than 1% of the Twitter users (e.g. entertainers, politicians, writers) produce 50% of its content, while the others (e.g. fans, followers, readers) have much less influence and completely different social behavior. In this paper, we define and explore a novel problem called community kernel detection in order to uncover the hidden community structure in large social networks. We discover that influential users pay closer attention to those who are more similar to them, which leads to a natural partition into different community kernels. We propose Greedy and We BA, two efficient algorithms for finding community kernels in large social networks. Greedy is based on maximum cardinality search, while We BA formalizes the problem in an optimization framework. We conduct experiments on three large social networks: Twitter, Wikipedia, and Coauthor, which show that We BA achieves an average 15%-50% performance improvement over the other state-of-the-art algorithms, and We BA is on average 6-2,000 times faster in detecting community kernels.","PeriodicalId":106216,"journal":{"name":"2011 IEEE 11th International Conference on Data Mining","volume":"26 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-12-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127039592","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 75

Modeling High-Level Behavior Patterns for Precise Similarity Analysis of Software 为精确的软件相似度分析建模高级行为模式

2011 IEEE 11th International Conference on Data Mining Pub Date : 2011-12-11 DOI: 10.1109/ICDM.2011.104

Taeho Kwon, Z. Su

{"title":"Modeling High-Level Behavior Patterns for Precise Similarity Analysis of Software","authors":"Taeho Kwon, Z. Su","doi":"10.1109/ICDM.2011.104","DOIUrl":"https://doi.org/10.1109/ICDM.2011.104","url":null,"abstract":"The analysis of software similarity has many applications such as detecting code clones, software plagiarism, code theft, and polymorphic malware. Because often source code is unavailable and code obfuscation is used to avoid detection, there has been much research on developing effective models to capture runtime behavior to aid detection. Existing models focus on low-level information such as dependency or purely occurrence of function calls, and suffer from poor precision, poor scalability, or both. To overcome limitations of existing models, this paper introduces a precise and succinct behavior representation that characterizes high-level object-accessing patterns as regular expressions. We first distill a set of high-level patterns (the alphabet S of the regular language) based on two pieces of information: function call patterns to access objects and type state information of the objects. Then we abstract a runtime trace of a program P into a regular expression e over the pattern alphabet S to produce P's behavior signature. We show that software instances derived from the same code exhibit similar behavior signatures and develop effective algorithms to cluster and match behavior signatures. To evaluate the effectiveness of our behavior model, we have applied it to the similarity analysis of polymorphic malware. Our results on a large malware collection demonstrate that our model is both precise and succinct for effective and scalable matching and detection of polymorphic malware.","PeriodicalId":106216,"journal":{"name":"2011 IEEE 11th International Conference on Data Mining","volume":"03 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-12-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127256686","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 17

BibClus: A Clustering Algorithm of Bibliographic Networks by Message Passing on Center Linkage Structure 基于中心链接结构信息传递的书目网络聚类算法

2011 IEEE 11th International Conference on Data Mining Pub Date : 2011-12-11 DOI: 10.1109/ICDM.2011.27

Xiaoran Xu, Zhihong Deng

{"title":"BibClus: A Clustering Algorithm of Bibliographic Networks by Message Passing on Center Linkage Structure","authors":"Xiaoran Xu, Zhihong Deng","doi":"10.1109/ICDM.2011.27","DOIUrl":"https://doi.org/10.1109/ICDM.2011.27","url":null,"abstract":"Multi-type objects with multi-type relations are ubiquitous in real-world networks, e.g. bibliographic networks. Such networks are also called heterogeneous information networks. However, the research on clustering for heterogeneous information networks is little. A new algorithm, called NetClus, has been proposed in recent two years. Although NetClus is applied on a heterogeneous information network with a star network schema, considering the relations between center objects and all attribute objects linking to them, it ignores the relations between center objects such as citation relations, which also contain rich information. Hence, we think the star network schema cannot be used to characterize all possible relations without integrating the linkage structure among center objects, which we call the Center Linkage Structure, and there has been no practical way good enough to solve it. In this paper, we present a novel algorithm, BibClus, for clustering heterogeneous objects with center linkage structure by taking a bibliographic information network as an example. In BibClus, we build a probabilistic model of pair wise hidden Markov random field (P-HMRF) to characterize the center linkage structure, and convert it to a factor graph. We further combine EM algorithm with factor graph theory, and design an efficient way based on message passing algorithm to inference marginal probabilities and estimate parameters at each iteration of EM. We also study how factor functions affect clustering performance with different function forms and constraints. For evaluating our proposed method, we have conducted thorough experiments on a real dataset that we had crawled from ACM Digital Library. The experimental results show that BibClus is effective and has a much higher quantity than the recently proposed algorithm, NetClus, in both recall and precision.","PeriodicalId":106216,"journal":{"name":"2011 IEEE 11th International Conference on Data Mining","volume":"24 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-12-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131821336","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 5

Context-Aware Multi-instance Learning Based on Hierarchical Sparse Representation 基于层次稀疏表示的上下文感知多实例学习

2011 IEEE 11th International Conference on Data Mining Pub Date : 2011-12-11 DOI: 10.1109/ICDM.2011.43

Bing Li, Weihua Xiong, Weiming Hu

引用次数: 8

Multi-Class L2,1-Norm Support Vector Machine 多类L2,1范数支持向量机

2011 IEEE 11th International Conference on Data Mining Pub Date : 2011-12-11 DOI: 10.1109/ICDM.2011.105

Xiao Cai, F. Nie, Heng Huang, C. Ding

{"title":"Multi-Class L2,1-Norm Support Vector Machine","authors":"Xiao Cai, F. Nie, Heng Huang, C. Ding","doi":"10.1109/ICDM.2011.105","DOIUrl":"https://doi.org/10.1109/ICDM.2011.105","url":null,"abstract":"Feature selection is an essential component of data mining. In many data analysis tasks where the number of data point is much less than the number of features, efficient feature selection approaches are desired to extract meaningful features and to eliminate redundant ones. In the previous study, many data mining techniques have been applied to tackle the above challenging problem. In this paper, we propose a new $ell_{2,1}$-norm SVM, that is, multi-class hinge loss with a structured regularization term for all the classes to naturally select features for multi-class without bothering further heuristic strategy. Rather than directly solving the multi-class hinge loss with $ell_{2,1}$-norm regularization minimization, which has not been solved before due to its optimization difficulty, we are the first to give an efficient algorithm bridging the new problem with a previous solvable optimization problem to do multi-class feature selection. A global convergence proof for our method is also presented. Via the proposed efficient algorithm, we select features across multiple classes with jointly sparsity, emph{i.e.}, each feature has either small or large score over all classes. Comprehensive experiments have been performed on six bioinformatics data sets to show that our method can obtain better or competitive performance compared with exiting state-of-art multi-class feature selection approaches.","PeriodicalId":106216,"journal":{"name":"2011 IEEE 11th International Conference on Data Mining","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-12-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129384803","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 57