Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining: latest articles, page 8

Mining top-k frequent items in a data stream with flexible sliding windows

Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining Pub Date : 2010-07-25 DOI: 10.1145/1835804.1835842

Hoang Thanh Lam, T. Calders

Abstract: We study the problem of finding the k most frequent items in a stream of items for the recently proposed max-frequency measure. Based on the properties of an item, the max-frequency of an item is counted over a sliding window of which the length changes dynamically. Besides being parameterless, this way of measuring the support of items was shown to have the advantage of a faster detection of bursts in a stream, especially if the set of items is heterogeneous. The algorithm that was proposed for maintaining all frequent items, however, scales poorly when the number of items becomes large. Therefore, in this paper we propose, instead of reporting all frequent items, to only mine the top-k most frequent ones. First we prove that in order to solve this problem exactly, we still need a prohibitive amount of memory (at least linear in the number of items). Yet, under some reasonable conditions, we show both theoretically and empirically that a memory-efficient algorithm exists. A prototype of this algorithm is implemented and we present its performance w.r.t. memory-efficiency on real-life data and in controlled experiments with synthetic data.
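As a rough illustration of the measure this abstract refers to, the sketch below computes a naive max-frequency for every item and reports the top k. It assumes that the max-frequency of an item is the largest relative frequency it reaches in any window ending at the most recent stream position; the abstract itself only says the window length changes dynamically, so this reading, the function names, and the `min_window` parameter are all assumptions for illustration. The sketch keeps the entire stream in memory, which is exactly the cost the paper sets out to avoid, so it is not the authors' algorithm.

```python
from collections import Counter


def max_frequencies(stream, min_window=1):
    """Naive max-frequency of every item at the end of `stream`.

    Assumed definition: for each window made of the last w items
    (w >= min_window), an item's frequency is its count in that window
    divided by w; the max-frequency is the largest such value over all w.
    """
    counts = Counter()
    best = {}
    # Grow the window backwards from the most recent item.
    for w, item in enumerate(reversed(stream), start=1):
        counts[item] += 1
        if w < min_window:
            continue
        for seen_item, count in counts.items():
            freq = count / w
            if freq > best.get(seen_item, 0.0):
                best[seen_item] = freq
    return best


def top_k(stream, k, min_window=1):
    """Return the k items with the highest max-frequency as (item, value) pairs."""
    best = max_frequencies(stream, min_window)
    # Sort by decreasing max-frequency; break ties by item for determinism.
    return sorted(best.items(), key=lambda kv: (-kv[1], kv[0]))[:k]
```

With `min_window=1` the most recent item always scores 1.0, because the window of length one contains only that item; raising `min_window` is one hypothetical way to damp this effect.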

Citations: 34

Fast euclidean minimum spanning tree: algorithm, analysis, and applications

Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining Pub Date : 2010-07-25 DOI: 10.1145/1835804.1835882

William B. March, P. Ram, Alexander G. Gray

Citations: 100

Discovering precursors to aviation safety incidents: from massive data to actionable information

Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining Pub Date : 2010-07-25 DOI: 10.1145/1866814.1866818

A. Srivastava

Citations: 1

Clustering by synchronization

Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining Pub Date : 2010-07-25 DOI: 10.1145/1835804.1835879

Christian Böhm, C. Plant, Junming Shao, Qinli Yang

Abstract: Synchronization is a powerful basic concept in nature regulating a large variety of complex processes ranging from the metabolism in the cell to social behavior in groups of individuals. Therefore, synchronization phenomena have been extensively studied and models robustly capturing the dynamical synchronization process have been proposed, e.g. the Extensive Kuramoto Model. Inspired by the powerful concept of synchronization, we propose Sync, a novel approach to clustering. The basic idea is to view each data object as a phase oscillator and simulate the interaction behavior of the objects over time. As time evolves, similar objects naturally synchronize together and form distinct clusters. Inherited from synchronization, Sync has several desirable properties: The clusters revealed by dynamic synchronization truly reflect the intrinsic structure of the data set, Sync does not rely on any distribution assumption and allows detecting clusters of arbitrary number, shape and size. Moreover, the concept of synchronization allows natural outlier handling, since outliers do not synchronize with cluster objects. For fully automatic clustering, we propose to combine Sync with the Minimum Description Length principle. Extensive experiments on synthetic and real world data demonstrate the effectiveness and efficiency of our approach.
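The idea of treating each data object as a phase oscillator can be sketched roughly as follows. This is a minimal illustration, assuming a Kuramoto-style update in which every coordinate of a point is pulled toward its neighbors within a radius `eps` by the mean of `sin(neighbor - point)`. The update rule, the parameter names `eps`, `steps`, and `tol`, and the final grouping step are assumptions made here; the abstract does not spell out the exact dynamics, and the Minimum Description Length step it mentions is not covered.

```python
import math


def sync_step(points, eps):
    """One synchronous update: each point moves toward its eps-neighbors."""
    new_points = []
    for p in points:
        # The neighborhood includes p itself, which damps the step size.
        neighbors = [q for q in points if math.dist(p, q) <= eps]
        moved = [
            p_d + sum(math.sin(q[d] - p_d) for q in neighbors) / len(neighbors)
            for d, p_d in enumerate(p)
        ]
        new_points.append(moved)
    return new_points


def sync_cluster(points, eps, steps=50, tol=1e-6):
    """Run the dynamics, then label points by the position they converged to."""
    pts = [list(p) for p in points]
    for _ in range(steps):
        pts = sync_step(pts, eps)
    labels, centers = [], []
    for p in pts:
        for i, c in enumerate(centers):
            if math.dist(p, c) <= tol:
                labels.append(i)
                break
        else:
            # No existing center is close enough: start a new cluster.
            centers.append(p)
            labels.append(len(centers) - 1)
    return labels
```

In this sketch a point with no neighbors other than itself never moves and so ends up with its own label, which loosely mirrors the abstract's remark that outliers do not synchronize with cluster objects.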

Citations: 80

Semi-supervised feature selection for graph classification

Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining Pub Date : 2010-07-25 DOI: 10.1145/1835804.1835905

Xiangnan Kong, Philip S. Yu

Abstract: The problem of graph classification has attracted great interest in the last decade. Current research on graph classification assumes the existence of large amounts of labeled training graphs. However, in many applications, the labels of graph data are very expensive or difficult to obtain, while there are often copious amounts of unlabeled graph data available. In this paper, we study the problem of semi-supervised feature selection for graph classification and propose a novel solution, called gSSC, to efficiently search for optimal subgraph features with labeled and unlabeled graphs. Different from existing feature selection methods in vector spaces which assume the feature set is given, we perform semi-supervised feature selection for graph data in a progressive way together with the subgraph feature mining process. We derive a feature evaluation criterion, named gSemi, to estimate the usefulness of subgraph features based upon both labeled and unlabeled graphs. Then we propose a branch-and-bound algorithm to efficiently search for optimal subgraph features by judiciously pruning the subgraph search space. Empirical studies on several real-world tasks demonstrate that our semi-supervised feature selection approach can effectively boost graph classification performances with semi-supervised feature selection and is very efficient by pruning the subgraph search space using both labeled and unlabeled graphs.

Citations: 122

A hierarchical information theoretic technique for the discovery of non linear alternative clusterings

Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining Pub Date : 2010-07-25 DOI: 10.1145/1835804.1835878

Xuan-Hong Dang, J. Bailey

Abstract: Discovery of alternative clusterings is an important method for exploring complex datasets. It provides the capability for the user to view clustering behaviour from different perspectives and thus explore new hypotheses. However, current algorithms for alternative clustering have focused mainly on linear scenarios and may not perform as desired for datasets containing clusters with non linear shapes. Our goal in this paper is to address this challenge of non linearity. In particular, we propose a novel algorithm to uncover an alternative clustering that is distinctively different from an existing, reference clustering. Our technique is information theory based and aims to ensure alternative clustering quality by maximizing the mutual information between clustering labels and data observations, whilst at the same time ensuring alternative clustering distinctiveness by minimizing the information sharing between the two clusterings. We perform experiments to assess our method against a large range of alternative clustering algorithms in the literature. We show our technique's performance is generally better for non-linear scenarios and furthermore, is highly competitive even for simpler, linear scenarios.
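To make the notion of information shared between two clusterings concrete, the sketch below computes the ordinary discrete mutual information (in nats) between two label assignments over the same objects. It is only an illustration of the quantity involved: the abstract does not say which estimator the authors use, and the function name and input format here are assumptions.

```python
import math
from collections import Counter


def mutual_information(labels_a, labels_b):
    """Mutual information (natural log) between two labelings of the same objects."""
    n = len(labels_a)
    if n == 0 or n != len(labels_b):
        raise ValueError("labelings must be non-empty and of equal length")
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))
    mi = 0.0
    for (a, b), c in joint.items():
        # p(a, b) * log(p(a, b) / (p(a) * p(b))), written with raw counts.
        mi += (c / n) * math.log(c * n / (count_a[a] * count_b[b]))
    return mi
```

A value near zero means the alternative clustering tells you almost nothing about the reference clustering, which is the direction the distinctiveness objective in the abstract pushes toward; identical clusterings give the entropy of the labeling instead.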

Citations: 51

Diagnosing memory leaks using graph mining on heap dumps

Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining Pub Date : 2010-07-25 DOI: 10.1145/1835804.1835822

Evan K. Maxwell, Godmar Back, Naren Ramakrishnan

Abstract: Memory leaks are caused by software programs that prevent the reclamation of memory that is no longer in use. They can cause significant slowdowns, exhaustion of available storage space and, eventually, application crashes. Detecting memory leaks is challenging because real-world applications are built on multiple layers of software frameworks, making it difficult for a developer to know whether observed references to objects are legitimate or the cause of a leak. We present a graph mining solution to this problem wherein we analyze heap dumps to automatically identify subgraphs which could represent potential memory leak sources. Although heap dumps are commonly analyzed in existing heap profiling tools, our work is the first to apply a graph grammar mining solution to this problem. Unlike classical graph mining work, we show that it suffices to mine the dominator tree of the heap dump, which is significantly smaller than the underlying graph. Our approach identifies not just leaking candidates and their structure, but also provides aggregate information about the access path to the leaks. We demonstrate several synthetic as well as real-world examples of heap dumps for which our approach provides more insight into the problem than state-of-the-art tools such as Eclipse's MAT.
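The dominator tree mentioned in this abstract can be illustrated with a small sketch. It assumes the heap dump has already been loaded as a dictionary mapping each object to the objects it references, with a single root standing in for the garbage collection roots. That representation and the function name are hypothetical, and the simple set-based iteration below is a textbook method chosen for brevity, not necessarily what the authors or any heap profiler use.

```python
def immediate_dominators(graph, root):
    """Map each node reachable from `root` to its immediate dominator.

    `graph` is a dict: node -> list of referenced nodes. The root maps to None.
    """
    # Collect the nodes reachable from the root.
    order, seen, stack = [], {root}, [root]
    while stack:
        node = stack.pop()
        order.append(node)
        for succ in graph.get(node, []):
            if succ not in seen:
                seen.add(succ)
                stack.append(succ)

    # Predecessor lists restricted to reachable nodes.
    preds = {node: [] for node in order}
    for node in order:
        for succ in graph.get(node, []):
            preds[succ].append(node)

    # Iterate dom(n) = {n} | intersection of dom(p) over predecessors p.
    dom = {node: set(order) for node in order}
    dom[root] = {root}
    changed = True
    while changed:
        changed = False
        for node in order:
            if node == root:
                continue
            new = set.intersection(*(dom[p] for p in preds[node])) | {node}
            if new != dom[node]:
                dom[node] = new
                changed = True

    # Strict dominators form a chain; the one with the largest dominator
    # set is the closest to the node, i.e. its immediate dominator.
    idom = {root: None}
    for node in order:
        if node != root:
            idom[node] = max(dom[node] - {node}, key=lambda d: len(dom[d]))
    return idom
```

An object that is reachable along two independent reference paths is dominated only by the point where those paths diverge, so the resulting tree has one parent per object and can be much smaller than the full reference graph.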

Citations: 50

Unifying dependent clustering and disparate clustering for non-homogeneous data

Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining Pub Date : 2010-07-25 DOI: 10.1145/1835804.1835880

M. S. Hossain, S. Tadepalli, L. Watson, I. Davidson, R. Helm, Naren Ramakrishnan

Citations: 28

Balanced allocation with succinct representation

Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining Pub Date : 2010-07-25 DOI: 10.1145/1835804.1835872

S. Alaei, Ravi Kumar, Azarakhsh Malekian, Erik Vee

Citations: 4

Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining

Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining Pub Date : 2010-07-25 DOI: 10.1145/1835804

Bharat Rao, Balaji Krishnapuram, A. Tomkins, Qiang Yang

Abstract: KDD-2010, the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, is being held in Washington, DC, USA, on July 24-28, 2010. KDD is the leading international forum for the exchange of research results and practical experience in the field of knowledge discovery and data mining. As the quantity of data available to organizations and individuals continues to grow rapidly, and the need to extract useful knowledge from them becomes more intense, scientists, government workers and business people turn to the KDD community for solutions. This volume contains a snapshot of a year of developments in this field; we hope you will find it useful and rewarding.

The KDD-2010 technical program features four parallel research tracks and an industrial / government track. The program also features keynotes from leading creators and consumers of KDD technology, 12 workshops, 12 tutorials and one panel. The 2010 KDD Cup competition focuses on educational data mining to support improvements in the field of computer aided instruction. Dozens of technical demonstrations and exhibits from vendors and other organizations underscore the conference's dual role as the leading industry and academic forum to discuss the advances in this field of research.

The call for papers attracted 578 research papers and 101 industrial and government submissions from around the world. Each paper was independently reviewed by three members of the program committee for originality, significance, technical quality, and clarity of presentation. This year's research track introduced an author-feedback phase in the review process, in which authors were invited to comment on the preliminary reviews that they received. The objective of the feedback phase is to ensure greater transparency and fairness, as the authors' responses are taken into account in a subsequent discussion phase moderated by Senior Program Committee (SPC) members. There was much discussion among the reviewers in the subsequent discussion phase before the final decisions. In the end, the program committee accepted 77 papers for long presentations and 24 papers for short presentations into the research track, representing an aggregated acceptance rate of 17.4%.

This year's Industry and Government track emphasized the successful uses of KDD technology, including deployed applications incorporating KDD technologies and discoveries of valid, novel, understandable, and demonstrably useful patterns from large datasets in industry and government, as well as emerging applications and technology, including challenges and issues arising from attempts to deploy KDD technology to solve specific industry or government problems. The industry and government track of the conference accepted 11 papers for long presentations and 9 papers for short presentations into the program, representing an aggregated acceptance rate of 19.8%.

We are glad to see that the conference remains strongly competitive and o

Citations: 25