Trace complexity of network inference
B. Abrahao, Flavio Chierichetti, Robert D. Kleinberg, A. Panconesi
Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, August 11, 2013. DOI: https://doi.org/10.1145/2487575.2487664

Abstract: The network inference problem consists of reconstructing the edge set of a network given traces representing the chronology of infection times as epidemics spread through the network. This problem is a paradigmatic representative of prediction tasks in machine learning that require deducing a latent structure from observed patterns of activity in a network, and which often require an unrealistically large number of resources (e.g., amount of available data, or computational time). A fundamental question is to understand which properties we can predict with a reasonable degree of accuracy with the available resources, and which we cannot. We define the trace complexity as the number of distinct traces required to achieve high fidelity in reconstructing the topology of the unobserved network or, more generally, some of its properties. We give algorithms that are competitive with, while being simpler and more efficient than, existing network inference approaches. Moreover, we prove that our algorithms are nearly optimal, by proving an information-theoretic lower bound on the number of traces that an optimal inference algorithm requires for performing this task in the general case. Given these strong lower bounds, we turn our attention to special cases, such as trees and bounded-degree graphs, and to property recovery tasks, such as reconstructing the degree distribution without inferring the network. We show that these problems require a much smaller (and more realistic) number of traces, making them potentially solvable in practice.
{"title":"Exploiting user clicks for automatic seed set generation for entity matching","authors":"Xiao Bai, F. Junqueira, Srinivasan H. Sengamedu","doi":"10.1145/2487575.2487662","DOIUrl":"https://doi.org/10.1145/2487575.2487662","url":null,"abstract":"Matching entities from different information sources is a very important problem in data analysis and data integration. It is, however, challenging due to the number and diversity of information sources involved, and the significant editorial efforts required to collect sufficient training data. In this paper, we present an approach that leverages user clicks during Web search to automatically generate training data for entity matching. The key insight of our approach is that Web pages clicked for a given query are likely to be about the same entity. We use random walk with restart to reduce data sparseness, rely on co-clustering to group queries and Web pages, and exploit page similarity to improve matching precision. Experimental results show that: (i) With 360K pages from 6 major travel websites, we obtain 84K matchings (of 179K pages) that refer to the same entities, with an average precision of 0.826; (ii) The quality of matching obtained from a classifier trained on the resulted seed data is promising: the performance matches that of editorial data at small size and improves with size.","PeriodicalId":20472,"journal":{"name":"Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining","volume":"1 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2013-08-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"90063965","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Beyond myopic inference in big data pipelines
Karthik Raman, Adith Swaminathan, J. Gehrke, T. Joachims
Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, August 11, 2013. DOI: https://doi.org/10.1145/2487575.2487588

Abstract: Big Data Pipelines decompose complex analyses of large data sets into a series of simpler tasks, with independently tuned components for each task. This modular setup allows re-use of components across several different pipelines. However, the interaction of independently tuned pipeline components yields poor end-to-end performance as errors introduced by one component cascade through the whole pipeline, affecting overall accuracy. We propose a novel model for reasoning across components of Big Data Pipelines in a probabilistically well-founded manner. Our key idea is to view the interaction of components as dependencies on an underlying graphical model. Different message passing schemes on this graphical model provide various inference algorithms to trade off end-to-end performance and computational cost. We instantiate our framework with an efficient beam search algorithm, and demonstrate its efficiency on two Big Data Pipelines: parsing and relation extraction.

DTW-D: time series semi-supervised learning from a single example
Yanping Chen, Bing Hu, Eamonn J. Keogh, Gustavo E. A. P. A. Batista
Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, August 11, 2013. DOI: https://doi.org/10.1145/2487575.2487633

Abstract: Classification of time series data is an important problem with applications in virtually every scientific endeavor. The large research community working on time series classification has typically used the UCR Archive to test their algorithms. In this work we argue that the availability of this resource has isolated much of the research community from the following reality: labeled time series data is often very difficult to obtain. The obvious solution to this problem is the application of semi-supervised learning; however, as we shall show, direct applications of off-the-shelf semi-supervised learning algorithms do not typically work well for time series. In this work we explain why semi-supervised learning algorithms typically fail for time series problems, and we introduce a simple but very effective fix. We demonstrate our ideas on diverse real-world problems.
{"title":"Robust principal component analysis via capped norms","authors":"Qian Sun, Shuo Xiang, Jieping Ye","doi":"10.1145/2487575.2487604","DOIUrl":"https://doi.org/10.1145/2487575.2487604","url":null,"abstract":"In many applications such as image and video processing, the data matrix often possesses simultaneously a low-rank structure capturing the global information and a sparse component capturing the local information. How to accurately extract the low-rank and sparse components is a major challenge. Robust Principal Component Analysis (RPCA) is a general framework to extract such structures. It is well studied that under certain assumptions, convex optimization using the trace norm and l1-norm can be an effective computation surrogate of the difficult RPCA problem. However, such convex formulation is based on a strong assumption which may not hold in real-world applications, and the approximation error in these convex relaxations often cannot be neglected. In this paper, we present a novel non-convex formulation for the RPCA problem using the capped trace norm and the capped l1-norm. In addition, we present two algorithms to solve the non-convex optimization: one is based on the Difference of Convex functions (DC) framework and the other attempts to solve the sub-problems via a greedy approach. Our empirical evaluations on synthetic and real-world data show that both of the proposed algorithms achieve higher accuracy than existing convex formulations. Furthermore, between the two proposed algorithms, the greedy algorithm is more efficient than the DC programming, while they achieve comparable accuracy.","PeriodicalId":20472,"journal":{"name":"Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining","volume":"1 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2013-08-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"78954851","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Succinct interval-splitting tree for scalable similarity search of compound-protein pairs with property constraints","authors":"Yasuo Tabei, Akihiro Kishimoto, Masaaki Kotera, Yoshihiro Yamanishi","doi":"10.1145/2487575.2487637","DOIUrl":"https://doi.org/10.1145/2487575.2487637","url":null,"abstract":"Analyzing functional interactions between small compounds and proteins is indispensable in genomic drug discovery. Since rich information on various compound-protein inter- actions is available in recent molecular databases, strong demands for making best use of such databases require to in- vent powerful methods to help us find new functional compound-protein pairs on a large scale. We present the succinct interval-splitting tree algorithm (SITA) that efficiently per- forms similarity search in databases for compound-protein pairs with respect to both binary fingerprints and real-valued properties. SITA achieves both time and space efficiency by developing the data structure called interval-splitting trees, which enables to efficiently prune the useless portions of search space, and by incorporating the ideas behind wavelet tree, a succinct data structure to compactly represent trees. We experimentally test SITA on the ability to retrieve similar compound-protein pairs/substrate-product pairs for a query from large databases with over 200 million compound- protein pairs/substrate-product pairs and show that SITA performs better than other possible approaches.","PeriodicalId":20472,"journal":{"name":"Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining","volume":"30 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2013-08-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"79023550","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Estimating sharer reputation via social data calibration","authors":"Jaewon Yang, Bee-Chung Chen, D. Agarwal","doi":"10.1145/2487575.2487685","DOIUrl":"https://doi.org/10.1145/2487575.2487685","url":null,"abstract":"Online social networks have become important channels for users to share content with their connections and diffuse information. Although much work has been done to identify socially influential users, the problem of finding \"reputable\" sharers, who share good content, has received relatively little attention. Availability of such reputation scores can be useful or various applications like recommending people to follow, procuring high quality content in a scalable way, creating a content reputation economy to incentivize high quality sharing, and many more. To estimate sharer reputation, it is intuitive to leverage data that records how recipients respond (through clicking, liking, etc.) to content items shared by a sharer. However, such data is usually biased --- it has a selection bias since the shared items can only be seen and responded to by users connected to the sharer in most social networks, and it has a response bias since the response is usually influenced by the relationship between the sharer and the recipient (which may not indicate whether the shared content is good). To correct for such biases, we propose to utilize an additional data source that provides unbiased goodness estimates for a small set of shared items, and calibrate biased social data through a novel multi-level hierarchical model that describes how the unbiased data and biased data are jointly generated according to sharer reputation scores. The unbiased data also provides the ground truth for quantitative evaluation of different methods. Experiments based on such ground-truth data show that our proposed model significantly outperforms existing methods that estimate social influence using biased social data.","PeriodicalId":20472,"journal":{"name":"Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining","volume":"1 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2013-08-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"81119909","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Denser than the densest subgraph: extracting optimal quasi-cliques with quality guarantees
Charalampos E. Tsourakakis, F. Bonchi, A. Gionis, Francesco Gullo, M. A. Tsiarli
Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, August 11, 2013. DOI: https://doi.org/10.1145/2487575.2487645

Abstract: Finding dense subgraphs is an important graph-mining task with many applications. Given that the direct optimization of edge density is not meaningful, as even a single edge achieves maximum density, research has focused on optimizing alternative density functions. A very popular such function is the average degree, whose maximization leads to the well-known densest-subgraph notion. Surprisingly enough, however, densest subgraphs are typically large graphs, with small edge density and large diameter. In this paper, we define a novel density function, which gives subgraphs of much higher quality than densest subgraphs: the graphs found by our method are compact, dense, and have smaller diameter. We show that the proposed function can be derived from a general framework, which includes other important density functions as subcases and for which we show interesting general theoretical properties. To optimize the proposed function we provide an additive approximation algorithm and a local-search heuristic. Both algorithms are very efficient and scale well to large graphs. We evaluate our algorithms on real and synthetic datasets, and we also devise several application studies as variants of our original problem. When compared with the method that finds the subgraph of the largest average degree, our algorithms return denser subgraphs with smaller diameter. Finally, we discuss new interesting research directions that our problem leaves open.
{"title":"Predicting the present with search engine data","authors":"H. Varian","doi":"10.1145/2487575.2492150","DOIUrl":"https://doi.org/10.1145/2487575.2492150","url":null,"abstract":"Many businesses now have almost real time data available about their operations. This data can be helpful in contemporaneous prediction (\"nowcasting\") of various economic indicators. We illustrate how one can use Google search data to nowcast economic metrics of interest, and discuss some of the ramifications for research and policy. Our approach combines three Bayesian techniques: Kalman filtering, spike-and-slab regression, and model averaging. We use Kalman filtering to whiten the time series in question by removing the trend and seasonal behavior. Spike-and-slab regression is a Bayesian method for variable selection that works even in cases where the number of predictors is far larger than the number of observations. Finally, we use Markov Chain Monte Carlo methods to sample from the posterior distribution for our model; the final forecast is an average over thousands of draws from the posterior. An advantage of the Bayesian approach is that it allows us to specify informative priors that affect the number and type of predictors in a flexible way.","PeriodicalId":20472,"journal":{"name":"Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining","volume":"7 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2013-08-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"86570633","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Repetition-aware content placement in navigational networks
D. Erdös, Vatche Isahagian, Azer Bestavros, Evimaria Terzi
Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, August 11, 2013. DOI: https://doi.org/10.1145/2487575.2487622

Abstract: Arguably, the most effective technique to ensure wide adoption of a concept (or product) is by repeatedly exposing individuals to messages that reinforce the concept (or promote the product). Recognizing the role of repeated exposure to a message, in this paper we propose a novel framework for the effective placement of content: given the navigational patterns of users in a network, e.g., a web graph, hyperlinked corpus, or road network, and given a model of the relationship between content adoption and frequency of exposure, we define the repetition-aware content-placement (RACP) problem as that of identifying the set of B nodes on which content should be placed so that the expected number of users adopting that content is maximized. The key contribution of our work is the introduction of memory into the navigation process, by making a user's conversion dependent on the number of her exposures to that content. This dependency is captured using a conversion model that is general enough to allow arbitrary dependencies. Our solution to this general problem builds upon the notion of absorbing random walks, which we extend appropriately in order to address the technicalities of our definitions. Although we show the RACP problem to be NP-hard, we propose a general and efficient algorithmic solution. Our experimental results demonstrate the efficacy and the efficiency of our methods on multiple real-world datasets obtained from different application domains.