{"title":"Reducing cache misses in hash join probing phase by pre-sorting strategy (abstract only)","authors":"Gihwan Oh, Jae-Myung Kim, Woon-Hak Kang, Sang-Won Lee","doi":"10.1145/2213836.2213971","DOIUrl":"https://doi.org/10.1145/2213836.2213971","url":null,"abstract":"Recently, several studies on multi-core cache-aware hash joins have been carried out [Kim09VLDB, Blanas11SIGMOD]. In particular, the work of Blanas has shown that a rather simple no-partitioning hash join can outperform the work of Kim. Meanwhile, the simple but best-performing hash join of Blanas still experiences severe cache misses in the probing phase. Because the key values of tuples in the outer relation are not sorted or clustered, each outer record has a different hashed key value and thus accesses a different hash bucket. Since the hash table of the inner table is usually much larger than the CPU cache, it is highly probable that each outer record's reference to a hash bucket of the inner table will encounter a cache miss. To reduce cache misses in the hash join probing phase, we propose a new join algorithm, Sorted Probing (SP for short), which pre-sorts the hashed key values of the outer table so that accesses to the hash buckets of the inner table have strong temporal locality, thus minimizing cache misses during the probing phase. As a sorting optimization, we used the cache-aware AlphaSort technique, which extracts the key and a pointer from each record of the data set to be sorted, and then sorts the (key, rec_ptr) pairs. For performance evaluation, we used two hash join algorithms from Blanas' work, no partitioning (NP) and independent partitioning (IP), in a standard C++ program provided by Blanas. We also implemented AlphaSort and added it before the probing phase of NP and IP; we call the resulting algorithms NP+SP and IP+SP. For the synthetic workload, IP+SP outperforms all other algorithms: it is up to 30% faster than the others.","PeriodicalId":212616,"journal":{"name":"Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data","volume":"91 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-05-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116089672","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
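The Sorted Probing idea in the abstract above can be illustrated with a minimal Python sketch. This is not the authors' C++ implementation: the function names `build_hash_table` and `sorted_probe`, the tuple layout `(key, payload)`, and the use of Python's built-in `hash` and `sorted` (standing in for AlphaSort) are all assumptions made for illustration. Python also will not reproduce the cache behaviour; the sketch only shows the access order that the pre-sort produces.

```python
def build_hash_table(inner, num_buckets):
    """Build phase: place each inner (key, payload) tuple into a bucket."""
    buckets = [[] for _ in range(num_buckets)]
    for key, payload in inner:
        buckets[hash(key) % num_buckets].append((key, payload))
    return buckets


def sorted_probe(buckets, outer):
    """Probe phase with pre-sorting (hypothetical sketch of Sorted Probing).

    Instead of probing in arrival order, first extract compact
    (bucket_id, record_index) pairs from the outer tuples and sort them,
    in the spirit of AlphaSort's (key, rec_ptr) pairs. Probing then visits
    hash buckets in non-decreasing order, so consecutive probes tend to
    touch the same or neighbouring buckets.
    """
    num_buckets = len(buckets)
    # Sort small (hashed key, pointer) pairs rather than the full records.
    pairs = sorted(
        (hash(key) % num_buckets, i) for i, (key, _) in enumerate(outer)
    )
    results = []
    for bucket_id, i in pairs:
        outer_key, outer_payload = outer[i]
        for inner_key, inner_payload in buckets[bucket_id]:
            if inner_key == outer_key:
                results.append((outer_key, inner_payload, outer_payload))
    return results


# Small usage example with integer keys (assumed data, not from the paper).
inner = [(1, "a"), (2, "b"), (3, "c"), (2, "d")]
outer = [(2, "x"), (5, "y"), (3, "z"), (2, "w")]
joined = sorted_probe(build_hash_table(inner, 4), outer)
```

The join result is the same set of matches a plain hash join would produce; only the order in which buckets are visited changes, which is where the abstract locates the reduction in cache misses.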
{"title":"Mob data sourcing","authors":"Daniel Deutch, T. Milo","doi":"10.1145/2213836.2213905","DOIUrl":"https://doi.org/10.1145/2213836.2213905","url":null,"abstract":"Crowdsourcing is an emerging paradigm that harnesses a mass of users to perform various types of tasks. We focus in this tutorial on a particular form of crowdsourcing, namely crowd (or mob) datasourcing whose goal is to obtain, aggregate or process data. We overview crowd datasourcing solutions in various contexts, explain the need for a principled solution, describe advances towards achieving such a solution, and highlight remaining gaps.","PeriodicalId":212616,"journal":{"name":"Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-05-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126775230","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Edgar F. Codd Innovations Award Talk","authors":"Bruce E. Lindsay","doi":"10.1145/2213836.2370804","DOIUrl":"https://doi.org/10.1145/2213836.2370804","url":null,"abstract":"","PeriodicalId":212616,"journal":{"name":"Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-05-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129278290","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"CrowdScreen: algorithms for filtering data with humans","authors":"Aditya G. Parameswaran, H. Garcia-Molina, Hyunjung Park, N. Polyzotis, Aditya Ramesh, J. Widom","doi":"10.1145/2213836.2213878","DOIUrl":"https://doi.org/10.1145/2213836.2213878","url":null,"abstract":"Given a large set of data items, we consider the problem of filtering them based on a set of properties that can be verified by humans. This problem is commonplace in crowdsourcing applications, and yet, to our knowledge, no one has considered the formal optimization of this problem. (Typical solutions use heuristics to solve the problem.) We formally state a few different variants of this problem. We develop deterministic and probabilistic algorithms to optimize the expected cost (i.e., number of questions) and expected error. We experimentally show that our algorithms provide definite gains with respect to other strategies. Our algorithms can be applied in a variety of crowdsourcing scenarios and can form an integral part of any query processor that uses human computation.","PeriodicalId":212616,"journal":{"name":"Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data","volume":"34 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-05-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126199411","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"DP-tree: indexing multi-dimensional data under differential privacy (abstract only)","authors":"Shangfu Peng, Y. Yang, Zhenjie Zhang, M. Winslett, Yong Yu","doi":"10.1145/2213836.2213972","DOIUrl":"https://doi.org/10.1145/2213836.2213972","url":null,"abstract":"ε-differential privacy (ε-DP) is a strong and rigorous scheme for protecting individuals' privacy while releasing useful statistical information. The main idea is to inject random noise into the results of statistical queries, such that the existence of any single record has negligible impact on the distributions of query results. The accuracy of such randomized results depends heavily upon the query processing technique, which has been an active research topic in recent years. So far, most existing methods focus on 1-dimensional queries. The only work that handles multi-dimensional query processing under ε-DP is [1], which indexes the sensitive data using variants of the quad-tree and the k-d-tree. As we point out in this paper, these structures are inherently suboptimal for answering queries under ε-DP. Consequently, the solutions in [1] suffer from several serious drawbacks, including limited and unstable query accuracy, as well as bias towards certain types of queries. Motivated by this, we propose the DP-tree, a novel index structure for multi-dimensional query processing under ε-DP that eliminates the problems encountered by the methods in [1]. Further, we show that the effectiveness of the DP-tree can be improved using statistical information about the query workload. Extensive experiments using real and synthetic datasets confirm that the DP-tree achieves significantly higher query accuracy than existing methods. Interestingly, an adaptation of the DP-tree also outperforms previous 1D solutions in their restricted scope, by large margins.","PeriodicalId":212616,"journal":{"name":"Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-05-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114103785","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Authenticating location-based services without compromising location privacy","authors":"Haibo Hu, Jianliang Xu, Qian Chen, Ziwei Yang","doi":"10.1145/2213836.2213871","DOIUrl":"https://doi.org/10.1145/2213836.2213871","url":null,"abstract":"The popularity of mobile social networking services (mSNSs) is propelling more and more businesses, especially those in retailing and marketing, into mobile and location-based forms. To address the trust issue, the service providers are expected to deliver their location-based services in an authenticatable manner, so that the correctness of the service results can be verified by the client. However, existing works on query authentication cannot preserve the privacy of the data being queried, which are sensitive user locations when it comes to location-based services and mSNSs. In this paper, we address this challenging problem by proposing a comprehensive solution that preserves unconditional location privacy when authenticating range queries. Three authentication schemes for R-tree and grid-file index, together with two optimization techniques, are developed. Cost models, security analysis, and experimental results consistently show the effectiveness, reliability and robustness of the proposed schemes under various system settings and query workloads.","PeriodicalId":212616,"journal":{"name":"Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data","volume":"29 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-05-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131137482","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Optimal top-k generation of attribute combinations based on ranked lists","authors":"Jiaheng Lu, P. Senellart, Chunbin Lin, Xiaoyong Du, Shan Wang, Xinxing Chen","doi":"10.1145/2213836.2213883","DOIUrl":"https://doi.org/10.1145/2213836.2213883","url":null,"abstract":"In this work, we study a novel query type, called top-k,m queries. Suppose we are given a set of groups and each group contains a set of attributes, each of which is associated with a ranked list of tuples, with ID and score. All lists are ranked in decreasing order of the scores of tuples. We are interested in finding the best combinations of attributes, each combination involving one attribute from each group. More specifically, we want the top-k combinations of attributes according to the corresponding top-m tuples with matching IDs. This problem has a wide range of applications from databases to search engines on traditional and non-traditional types of data (relational data, XML, text, etc.). We show that a straightforward extension of an optimal top-k algorithm, the Threshold Algorithm (TA), has shortcomings in solving the km problem, as it needs to compute a large number of intermediate results for each combination and reads more inputs than needed. To overcome this weakness, we provide here, for the first time, a provably instance-optimal algorithm and further develop optimizations for efficient query evaluation to reduce computational and memory costs and the number of accesses. We demonstrate experimentally the scalability and efficiency of our algorithms over three real applications.","PeriodicalId":212616,"journal":{"name":"Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data","volume":"135 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-05-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131192636","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"SIGMOD Contributions Award Talk","authors":"M. Winslett","doi":"10.1145/2213836.2370916","DOIUrl":"https://doi.org/10.1145/2213836.2370916","url":null,"abstract":"","PeriodicalId":212616,"journal":{"name":"Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-05-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130880728","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"SigSpot: mining significant anomalous regions from time-evolving networks (abstract only)","authors":"M. Mongiovì, Petko Bogdanov, R. Ranca, Ambuj K. Singh, E. Papalexakis, C. Faloutsos","doi":"10.1145/2213836.2213974","DOIUrl":"https://doi.org/10.1145/2213836.2213974","url":null,"abstract":"Anomaly detection in dynamic networks has a rich gamut of application domains, such as road networks, communication networks and water distribution networks. An anomalous event, such as a traffic accident, denial of service attack or a chemical spill, can cause a local shift from normal behavior in the network state that persists over an interval of time. Detecting such anomalous regions of network and time extent in large real-world networks is a challenging task. Existing anomaly detection techniques focus on either the time series associated with individual network edges or on global anomalies that affect the entire network. In order to detect anomalous regions, one needs to consider both the time and the affected network substructure jointly, which brings forth computational challenges due to the combinatorial nature of possible solutions. We propose the problem of mining all Significant Anomalous Regions (SAR) in time-evolving networks that asks for the discovery of connected temporal subgraphs comprised of edges that significantly deviate from normal in a persistent manner. We propose an optimal Baseline algorithm for the problem and an efficient approximation, called SIGSPOT. Compared to Baseline, SIGSPOT is up to one order of magnitude faster in real data, while achieving less than 10% average relative error rate. In synthetic datasets it is more than 30 times faster than Baseline with 94% accuracy and solves efficiently large instances that are infeasible (more than 10 hours running time) for Baseline. We demonstrate the utility of SIGSPOT for inferring accidents on road networks and study its scalability when detecting anomalies in social, transportation and synthetic evolving networks, spanning up to 1GB.","PeriodicalId":212616,"journal":{"name":"Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data","volume":"22 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-05-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133163709","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"SOFIA SEARCH: a tool for automating related-work search","authors":"Behzad Golshan, Theodoros Lappas, Evimaria Terzi","doi":"10.1145/2213836.2213915","DOIUrl":"https://doi.org/10.1145/2213836.2213915","url":null,"abstract":"When working on a new project, researchers need to devote a significant amount of time and effort to surveying the relevant literature. This is required in order to gain expertise, evaluate the significance of their work and gain useful insights about a particular scientific domain. While necessary, relevant-work search is also a time-consuming and arduous process, requiring the continuous participation of the user. In this work, we introduce Sofia Search, a tool that fully automates the search and retrieval of the literature related to a topic. Given a seed of papers submitted by the user, Sofia Search searches the Web for candidate related papers, evaluates their relevance to the seed and downloads them for the user. The tool also provides modules for the evaluation and ranking of authors and papers, in the context of the retrieved papers. In the demo, we will demonstrate the functionality of our tool, by allowing users to use it via a simple and intuitive interface.","PeriodicalId":212616,"journal":{"name":"Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data","volume":"31 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-05-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133454666","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}