Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining: latest papers

Unsupervised transfer classification: application to text categorization

Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining Pub Date : 2010-07-25 DOI: 10.1145/1835804.1835950

Tianbao Yang, Rong Jin, Anil K. Jain, Yang Zhou, Wei Tong

Abstract: We study the problem of building the classification model for a target class in the absence of any labeled training example for that class. To address this difficult learning problem, we extend the idea of transfer learning by assuming that the following side information is available: (i) a collection of labeled examples belonging to other classes in the problem domain, called the auxiliary classes; (ii) the class information including the prior of the target class and the correlation between the target class and the auxiliary classes. Our goal is to construct the classification model for the target class by leveraging the above data and information. We refer to this learning problem as unsupervised transfer classification. Our framework is based on the generalized maximum entropy model that is effective in transferring the label information of the auxiliary classes to the target class. A theoretical analysis shows that under certain assumptions, the classification model obtained by the proposed approach converges to the optimal model when it is learned from the labeled examples for the target class. An empirical study on text categorization over four different data sets verifies the effectiveness of the proposed approach.

Citations: 20

Trust network inference for online rating data using generative models

Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining Pub Date : 2010-07-25 DOI: 10.1145/1835804.1835917

Freddy Chongtat Chua, Ee-Peng Lim

Citations: 35

Multi-label learning by exploiting label dependency

Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining Pub Date : 2010-07-25 DOI: 10.1145/1835804.1835930

Min-Ling Zhang, Kun Zhang

Abstract: In multi-label learning, each training example is associated with a set of labels and the task is to predict the proper label set for the unseen example. Due to the tremendous (exponential) number of possible label sets, the task of learning from multi-label examples is rather challenging. Therefore, the key to successful multi-label learning is how to effectively exploit correlations between different labels to facilitate the learning process. In this paper, we propose to use a Bayesian network structure to efficiently encode the conditional dependencies of the labels as well as the feature set, with the feature set as the common parent of all labels. To make it practical, we give an approximate yet efficient procedure to find such a network structure. With the help of this network, multi-label learning is decomposed into a series of single-label classification problems, where a classifier is constructed for each label by incorporating its parental labels as additional features. Label sets of unseen examples are predicted recursively according to the label ordering given by the network. Extensive experiments on a broad range of data sets validate the effectiveness of our approach against other well-established methods.

Citations: 439
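
The recursive prediction step described in the abstract above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function `predict_label_set`, the `parents` dictionary, and the toy rule-based classifiers are hypothetical, and the label network is assumed to be given rather than learned.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def predict_label_set(features, parents, classifiers):
    """Predict every label, visiting labels so that parents come first.

    parents:     dict mapping each label to a list of its parent labels (a DAG)
    classifiers: dict mapping each label to a callable
                 (features, parent_values) -> 0 or 1 (hypothetical interface)
    """
    # static_order() yields each label only after all of its parents.
    order = TopologicalSorter(parents).static_order()
    predicted = {}
    for label in order:
        # Already-predicted parent labels act as extra input features.
        parent_values = [predicted[p] for p in parents.get(label, [])]
        predicted[label] = classifiers[label](features, parent_values)
    return predicted


# Toy label network: "football" depends on "sports".
toy_parents = {"sports": [], "football": ["sports"]}
toy_classifiers = {
    "sports": lambda x, pv: int(x["ball"] > 0),
    "football": lambda x, pv: int(pv[0] == 1 and x["goal"] > 0),
}
```

Calling `predict_label_set({"ball": 2, "goal": 1}, toy_parents, toy_classifiers)` predicts `sports` first and then passes that prediction to the `football` classifier. In a real system each callable would be a trained single-label classifier.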

Learning through exploration

Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining Pub Date : 2010-07-25 DOI: 10.1145/1835804.1866290

J. Langford, A. Beygelzimer

Citations: 1

Discovering significant relaxed order-preserving submatrices

Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining Pub Date : 2010-07-25 DOI: 10.1145/1835804.1835861

Qiong Fang, Wilfred Ng, Jianlin Feng

Abstract: Mining order-preserving submatrix (OPSM) patterns has received much attention from researchers, since in many scientific applications, such as those involving gene expression data, it is natural to express the data in a matrix and also important to find the order-preserving submatrix patterns. However, most current work assumes the noise-free OPSM model and thus is not practical in many real situations when sample contamination exists. In this paper, we propose a relaxed OPSM model called ROPSM. The ROPSM model supports mining more reasonable noise-corrupted OPSM patterns than another well-known model called AOPC (approximate order-preserving cluster). While OPSM mining is known to be an NP-hard problem, mining ROPSM patterns is an even harder problem. We propose a novel method called ROPSM-Growth to mine ROPSM patterns. Specifically, two pattern-growing strategies, a column-centric strategy and a row-centric strategy, are presented, which are effective in growing the seed OPSMs into significant ROPSMs. An effective median-rank based method is also developed to discover the underlying true order of conditions involved in an ROPSM pattern. Our experiments on a biological dataset show that the ROPSM model better captures the characteristics of noise in gene expression data matrix compared to the AOPC model. Importantly, we find that our approach is able to detect more high-quality, biologically significant patterns with efficiency comparable to that of the AOPC counterparts. Specifically, at least 26.6% (75 out of 282) of the patterns mined by our approach are strongly associated with more than 10 gene categories (high biological significance), which is 3 times better than that obtained from using the AOPC approach.

Citations: 12
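
For readers unfamiliar with the noise-free OPSM model that the abstract above starts from, the sketch below checks whether a chosen set of rows is order-preserving under a chosen column order, meaning every selected row strictly increases along that order. It does not attempt the relaxed ROPSM model or the ROPSM-Growth method, whose definitions are not given in the abstract. The function name `is_order_preserving` and the toy `expression` matrix are hypothetical.

```python
def is_order_preserving(matrix, rows, column_order):
    """Return True if every selected row strictly increases along column_order."""
    for r in rows:
        values = [matrix[r][c] for c in column_order]
        # Any adjacent pair that fails to increase breaks the pattern.
        if any(a >= b for a, b in zip(values, values[1:])):
            return False
    return True


# Toy expression matrix: rows are genes, columns are conditions (made-up values).
expression = [
    [0.2, 0.9, 0.5, 0.7],
    [1.0, 3.0, 1.5, 2.0],
    [0.8, 0.1, 0.4, 0.3],
]
```

Rows 0 and 1 both rise along the column order `[0, 2, 3, 1]`, so they form an order-preserving submatrix. Row 2 falls along the same order, so adding it breaks the pattern.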

Metric forensics: a multi-level approach for mining volatile graphs

Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining Pub Date : 2010-07-25 DOI: 10.1145/1835804.1835828

Keith W. Henderson, Tina Eliassi-Rad, C. Faloutsos, L. Akoglu, Lei Li, Koji Maruhashi, B. Prakash, Hanghang Tong

Abstract: Advances in data collection and storage capacity have made it increasingly possible to collect highly volatile graph data for analysis. Existing graph analysis techniques are not appropriate for such data, especially in cases where streaming or near-real-time results are required. An example that has drawn significant research interest is the cyber-security domain, where internet communication traces are collected and real-time discovery of events, behaviors, patterns, and anomalies is desired. We propose MetricForensics, a scalable framework for analysis of volatile graphs. MetricForensics combines a multi-level "drill down" approach, a collection of user-selected graph metrics, and a collection of analysis techniques. At each successive level, more sophisticated metrics are computed and the graph is viewed at finer temporal resolutions. In this way, MetricForensics scales to highly volatile graphs by only allocating resources for computationally expensive analysis when an interesting event is discovered at a coarser resolution first. We test MetricForensics on three real-world graphs: an enterprise IP trace, a trace of legitimate and malicious network traffic from a research institution, and the MIT Reality Mining proximity sensor data. Our largest graph has 3M vertices and 32M edges, spanning 4.5 days. The results demonstrate the scalability and capability of MetricForensics in analyzing volatile graphs; and highlight four novel phenomena in such graphs: elbows, broken correlations, prolonged spikes, and lightweight stars.

Citations: 71

Mining periodic behaviors for moving objects

Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining Pub Date : 2010-07-25 DOI: 10.1145/1835804.1835942

Z. Li, Bolin Ding, Jiawei Han, R. Kays, P. Nye

Abstract: Periodicity is a frequently happening phenomenon for moving objects. Finding periodic behaviors is essential to understanding object movements. However, periodic behaviors could be complicated, involving multiple interleaving periods, partial time span, and spatiotemporal noises and outliers. In this paper, we address the problem of mining periodic behaviors for moving objects. It involves two sub-problems: how to detect the periods in complex movement, and how to mine periodic movement behaviors. Our main assumption is that the observed movement is generated from multiple interleaved periodic behaviors associated with certain reference locations. Based on this assumption, we propose a two-stage algorithm, Periodica, to solve the problem. At the first stage, the notion of observation spot is proposed to capture the reference locations. Through observation spots, multiple periods in the movement can be retrieved using a method that combines Fourier transform and autocorrelation. At the second stage, a probabilistic model is proposed to characterize the periodic behaviors. For a specific period, periodic behaviors are statistically generalized from partial movement sequences through hierarchical clustering. Empirical studies on both synthetic and real data sets demonstrate the effectiveness of our method.

Citations: 302
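
The period-detection idea in the abstract above can be illustrated with a small sketch. It covers only an autocorrelation step on a binary sequence and omits the Fourier transform stage and everything else in Periodica. The sequence is assumed to record, for each time step, whether the object is inside a single observation spot (1) or outside it (0). The function names `autocorrelation` and `detect_period` are hypothetical.

```python
def autocorrelation(seq, lag):
    """Average overlap between the sequence and itself shifted by `lag`."""
    pairs = len(seq) - lag
    return sum(seq[t] * seq[t + lag] for t in range(pairs)) / pairs


def detect_period(seq, max_lag=None):
    """Return the smallest lag with the highest autocorrelation score."""
    if max_lag is None:
        max_lag = len(seq) // 2
    # max() keeps the first maximum it sees, so a tie between a period and
    # its multiples resolves to the shortest lag.
    return max(range(1, max_lag + 1), key=lambda lag: autocorrelation(seq, lag))


# Toy visit record: the object is at the spot once every 4 time steps.
visits = [1, 0, 0, 0] * 6
```

On this noise-free toy input, `detect_period(visits)` picks lag 4. Real movement data is noisy and may interleave several periods, which is the harder setting the paper addresses.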

Topic dynamics: an alternative model of bursts in streams of topics

Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining Pub Date : 2010-07-25 DOI: 10.1145/1835804.1835862

Dan He, D. S. Parker

Abstract: For some time there has been increasing interest in the problem of monitoring the occurrence of topics in a stream of events, such as a stream of news articles. This has led to different models of bursts in these streams, i.e., periods of elevated occurrence of events. Today there are several burst definitions and detection algorithms, and their differences can produce very different results in topic streams. These definitions also share a fundamental problem: they define bursts in terms of an arrival rate. This approach is limiting; other stream dimensions can matter. We reconsider the idea of bursts from the standpoint of a simple kind of physics. Instead of focusing on arrival rates, we reconstruct bursts as a dynamic phenomenon, using kinetics concepts from physics -- mass and velocity -- and derive momentum, acceleration, and force from these. We refer to the result as topic dynamics, permitting a hierarchical, expressive model of bursts as intervals of increasing momentum. As a sample application, we present a topic dynamics model for the large PubMed/MEDLINE database of biomedical publications, using the MeSH (Medical Subject Heading) topic hierarchy. We show our model is able to detect bursts for MeSH terms accurately as well as efficiently.

Citations: 107
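
One simple reading of "bursts as intervals of increasing momentum" from the abstract above is sketched below. The concrete definitions here are assumptions made for illustration and are not taken from the paper: velocity is the difference between consecutive per-period topic counts, mass is a constant per-topic weight, and momentum is their product. The function names are hypothetical.

```python
def momentum_series(counts, mass=1.0):
    """Momentum per step, assuming velocity = change in count between periods."""
    velocity = [b - a for a, b in zip(counts, counts[1:])]
    return [mass * v for v in velocity]


def burst_intervals(counts, mass=1.0):
    """Maximal (start, end) index ranges of the momentum series that keep rising."""
    p = momentum_series(counts, mass)
    intervals, start = [], None
    for i in range(1, len(p)):
        if p[i] > p[i - 1]:
            if start is None:
                start = i - 1  # a rising run begins at the previous step
        elif start is not None:
            intervals.append((start, i - 1))  # the run ended at the previous step
            start = None
    if start is not None:
        intervals.append((start, len(p) - 1))  # run continues to the end
    return intervals
```

For yearly counts such as `[1, 1, 2, 4, 8, 9, 9, 9]`, the momentum series is `[0, 1, 2, 4, 1, 0, 0]` with the default mass, and the only rising run covers momentum indices 0 through 3. Note that the count keeps growing after that run ends; what stops is the growth of momentum, which is how this reading differs from an arrival-rate definition.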

Automatic malware categorization using cluster ensemble

Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining Pub Date : 2010-07-25 DOI: 10.1145/1835804.1835820

Yanfang Ye, Tao Li, Yong Chen, Q. Jiang

Abstract: In this paper, resting on the analysis of instruction frequency and function-based instruction sequences, we develop an Automatic Malware Categorization System (AMCS) for automatically grouping malware samples into families that share some common characteristics using a cluster ensemble by aggregating the clustering solutions generated by different base clustering algorithms. We propose a principled cluster ensemble framework for combining individual clustering solutions based on the consensus partition. The domain knowledge in the form of sample-level constraints can be naturally incorporated in the ensemble framework. In addition, to account for the characteristics of feature representations, we propose a hybrid hierarchical clustering algorithm which combines the merits of hierarchical clustering and k-medoids algorithms and a weighted subspace K-medoids algorithm to generate base clusterings. The categorization results of our AMCS system can be used to generate signatures for malware families that are useful for malware detection. The case studies on large and real daily malware collection from Kingsoft Anti-Virus Lab demonstrate the effectiveness and efficiency of our AMCS system.

Citations: 117
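
The general idea of aggregating base clusterings into a consensus partition, mentioned in the abstract above, can be illustrated with a co-association sketch. This is a generic textbook-style technique swapped in for illustration, not the AMCS framework: it ignores sample-level constraints and simply links two samples when more than a chosen fraction of base clusterings put them together. The names `co_association`, `consensus_partition`, and `threshold` are hypothetical.

```python
def co_association(partitions):
    """Fraction of base clusterings that place each pair of samples together."""
    n = len(partitions[0])
    return [
        [sum(p[i] == p[j] for p in partitions) / len(partitions) for j in range(n)]
        for i in range(n)
    ]


def consensus_partition(partitions, threshold=0.5):
    """Merge samples co-clustered in more than `threshold` of the clusterings."""
    n = len(partitions[0])
    agree = co_association(partitions)
    labels = list(range(n))  # every sample starts in its own group
    for i in range(n):
        for j in range(i + 1, n):
            if agree[i][j] > threshold and labels[i] != labels[j]:
                old, new = labels[j], labels[i]
                # Relabel the whole group of j so the two groups become one.
                labels = [new if lab == old else lab for lab in labels]
    return labels


# Three made-up base clusterings of five samples (cluster ids are arbitrary).
base_clusterings = [
    [0, 0, 1, 1, 2],
    [1, 1, 0, 0, 0],
    [0, 0, 0, 1, 1],
]
```

With these inputs, samples 0 and 1 always co-occur, while samples 2, 3 and 4 are chained together by pairwise majorities, giving two consensus groups. Chaining by single links is a deliberate simplification; it can merge groups that a more careful consensus method would keep apart.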

BioSnowball: automated population of Wikis

Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining Pub Date : 2010-07-25 DOI: 10.1145/1835804.1835926

Xiaojiang Liu, Zaiqing Nie, Nenghai Yu, Ji-Rong Wen

Abstract: Internet users regularly have the need to find biographies and facts of people of interest. Wikipedia has become the first stop for celebrity biographies and facts. However, Wikipedia can only provide information for celebrities because of its neutral point of view (NPOV) editorial policy. In this paper we propose an integrated bootstrapping framework named BioSnowball to automatically summarize the Web to generate Wikipedia-style pages for any person with a modest web presence. In BioSnowball, biography ranking and fact extraction are performed together in a single integrated training and inference process using Markov Logic Networks (MLNs) as its underlying statistical model. The bootstrapping framework starts with only a small number of seeds and iteratively finds new facts and biographies. As biography paragraphs on the Web are composed of the most important facts, our joint summarization model can improve the accuracy of both fact extraction and biography ranking compared to decoupled methods in the literature. Empirical results on both a small labeled data set and a real Web-scale data set show the effectiveness of BioSnowball. We also empirically show that BioSnowball outperforms the decoupled methods.

Citations: 18