J. Schaffner, Tim Januschowski, Mary H. Kercher, Tim Kraska, H. Plattner, M. Franklin, D. Jacobs
{"title":"RTP: robust tenant placement for elastic in-memory database clusters","authors":"J. Schaffner, Tim Januschowski, Mary H. Kercher, Tim Kraska, H. Plattner, M. Franklin, D. Jacobs","doi":"10.1145/2463676.2465302","DOIUrl":"https://doi.org/10.1145/2463676.2465302","url":null,"abstract":"In the cloud services industry, a key issue for cloud operators is to minimize operational costs. In this paper, we consider algorithms that elastically contract and expand a cluster of in-memory databases depending on tenants' behavior over time while maintaining response time guarantees.\u0000 We evaluate our tenant placement algorithms using traces obtained from one of SAP's production on-demand applications. Our experiments reveal that our approach lowers operating costs for the database cluster of this application by a factor of 2.2 to 10, measured in Amazon EC2 hourly rates, in comparison to the state of the art. In addition, we carefully study the trade-off between cost savings obtained by continuously migrating tenants and the robustness of servers towards load spikes and failures.","PeriodicalId":87344,"journal":{"name":"Proceedings. ACM-SIGMOD International Conference on Management of Data","volume":"44 1","pages":"773-784"},"PeriodicalIF":0.0,"publicationDate":"2013-06-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"91227980","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Don't be SCAREd: use SCalable Automatic REpairing with maximal likelihood and bounded changes","authors":"M. Yakout, Laure Berti-Équille, A. Elmagarmid","doi":"10.1145/2463676.2463706","DOIUrl":"https://doi.org/10.1145/2463676.2463706","url":null,"abstract":"Various computational procedures or constraint-based methods for data repairing have been proposed over the last decades to identify errors and, when possible, correct them. However, these approaches have several limitations including the scalability and quality of the values to be used in replacement of the errors. In this paper, we propose a new data repairing approach that is based on maximizing the likelihood of replacement data given the data distribution, which can be modeled using statistical machine learning techniques. This is a novel approach combining machine learning and likelihood methods for cleaning dirty databases by value modification. We develop a quality measure of the repairing updates based on the likelihood benefit and the amount of changes applied to the database. We propose SCARE (SCalable Automatic REpairing), a systematic scalable framework that follows our approach. SCARE relies on a robust mechanism for horizontal data partitioning and a combination of machine learning techniques to predict the set of possible updates. Due to data partitioning, several updates can be predicted for a single record based on local views on each data partition. Therefore, we propose a mechanism to combine the local predictions and obtain accurate final predictions. Finally, we experimentally demonstrate the effectiveness, efficiency, and scalability of our approach on real-world datasets in comparison to recent data cleaning approaches.","PeriodicalId":87344,"journal":{"name":"Proceedings. ACM-SIGMOD International Conference on Management of Data","volume":"20 1","pages":"553-564"},"PeriodicalIF":0.0,"publicationDate":"2013-06-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"73100375","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Cheng Long, R. C. Wong, Philip S. Yu, Minhao Jiang
{"title":"On optimal worst-case matching","authors":"Cheng Long, R. C. Wong, Philip S. Yu, Minhao Jiang","doi":"10.1145/2463676.2465321","DOIUrl":"https://doi.org/10.1145/2463676.2465321","url":null,"abstract":"Bichromatic reverse nearest neighbor (BRNN) queries have been studied extensively in the literature of spatial databases. Given a set P of service-providers and a set O of customers, a BRNN query is to find which customers in O are \"interested\" in a given service-provider in P. Recently, it has been found that this kind of queries lacks the consideration of the capacities of service-providers and the demands of customers. In order to address this issue, some spatial matching problems have been proposed, which, however, cannot be used for some real-life applications like emergency facility allocation where the maximum matching cost (or distance) should be minimized. In this paper, we propose a new problem called Spatial Matching for Minimizing Maximum matching distance (SPM-MM). Then, we design two algorithms for SPM-MM, Threshold-Adapt and Swap-Chain. Threshold-Adapt is simple and easy to understand but not scalable to large datasets due to its relatively high time/space complexity. Swap-Chain, which follows a fundamentally different idea from Threshold-Adapt, runs faster than Threshold-Adapt by orders of magnitude and uses significantly less memory. We conducted extensive empirical studies which verified the efficiency and scalability of Swap-Chain.","PeriodicalId":87344,"journal":{"name":"Proceedings. ACM-SIGMOD International Conference on Management of Data","volume":"22 1","pages":"845-856"},"PeriodicalIF":0.0,"publicationDate":"2013-06-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"73900038","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
P. Larson, C. Clinciu, Campbell Fraser, E. Hanson, Mostafa Mokhtar, Michal Nowakiewicz, Vassilis Papadimos, Susan Price, Srikumar Rangarajan, Remus Rusanu, Mayukh Saubhasik
{"title":"Enhancements to SQL server column stores","authors":"P. Larson, C. Clinciu, Campbell Fraser, E. Hanson, Mostafa Mokhtar, Michal Nowakiewicz, Vassilis Papadimos, Susan Price, Srikumar Rangarajan, Remus Rusanu, Mayukh Saubhasik","doi":"10.1145/2463676.2463708","DOIUrl":"https://doi.org/10.1145/2463676.2463708","url":null,"abstract":"SQL Server 2012 introduced two innovations targeted for data warehousing workloads: column store indexes and batch (vectorized) processing mode. Together they greatly improve performance of typical data warehouse queries, routinely by 10X and in some cases by a 100X or more. The main limitations of the initial version are addressed in the upcoming release. Column store indexes are updatable and can be used as the base storage for a table. The repertoire of batch mode operators has been expanded, existing operators have been improved, and query optimization has been enhanced. This paper gives an overview of SQL Server's column stores and batch processing, in particular the enhancements introduced in the upcoming release.","PeriodicalId":87344,"journal":{"name":"Proceedings. ACM-SIGMOD International Conference on Management of Data","volume":"78 1","pages":"1159-1168"},"PeriodicalIF":0.0,"publicationDate":"2013-06-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"73004273","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
R. Ananthanarayanan, Venkatesh Basker, Sumit Das, A. Gupta, H. Jiang, Tianhao Qiu, Alexey Reznichenko, D.Yu. Ryabkov, Manpreet Singh, S. Venkataraman
{"title":"Photon: fault-tolerant and scalable joining of continuous data streams","authors":"R. Ananthanarayanan, Venkatesh Basker, Sumit Das, A. Gupta, H. Jiang, Tianhao Qiu, Alexey Reznichenko, D.Yu. Ryabkov, Manpreet Singh, S. Venkataraman","doi":"10.1145/2463676.2465272","DOIUrl":"https://doi.org/10.1145/2463676.2465272","url":null,"abstract":"Web-based enterprises process events generated by millions of users interacting with their websites. Rich statistical data distilled from combining such interactions in near real-time generates enormous business value. In this paper, we describe the architecture of Photon, a geographically distributed system for joining multiple continuously flowing streams of data in real-time with high scalability and low latency, where the streams may be unordered or delayed. The system fully tolerates infrastructure degradation and datacenter-level outages without any manual intervention. Photon guarantees that there will be no duplicates in the joined output (at-most-once semantics) at any point in time, that most joinable events will be present in the output in real-time (near-exact semantics), and exactly-once semantics eventually.\u0000 Photon is deployed within Google Advertising System to join data streams such as web search queries and user clicks on advertisements. It produces joined logs that are used to derive key business metrics, including billing for advertisers. Our production deployment processes millions of events per minute at peak with an average end-to-end latency of less than 10 seconds. We also present challenges and solutions in maintaining large persistent state across geographically distant locations, and highlight the design principles that emerged from our experience.","PeriodicalId":87344,"journal":{"name":"Proceedings. ACM-SIGMOD International Conference on Management of Data","volume":"132 1","pages":"577-588"},"PeriodicalIF":0.0,"publicationDate":"2013-06-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"76421080","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"CS2: a new database synopsis for query estimation","authors":"Feng Yu, W. Hou, Cheng Luo, D. Che, Mengxia Zhu","doi":"10.1145/2463676.2463701","DOIUrl":"https://doi.org/10.1145/2463676.2463701","url":null,"abstract":"Fast and accurate estimations for complex queries are profoundly beneficial for large databases with heavy workloads. In this research, we propose a statistical summary for a database, called CS2 (Correlated Sample Synopsis), to provide rapid and accurate result size estimations for all queries with joins and arbitrary selections. Unlike the state-of-the-art techniques, CS2 does not completely rely on simple random samples, but mainly consists of correlated sample tuples that retain join relationships with less storage. We introduce a statistical technique, called reverse sample, and design a powerful estimator, called reverse estimator, to fully utilize correlated sample tuples for query estimation. We prove both theoretically and empirically that the reverse estimator is unbiased and accurate using CS2. Extensive experiments on multiple datasets show that CS2 is fast to construct and derives more accurate estimations than existing methods with the same space budget.","PeriodicalId":87344,"journal":{"name":"Proceedings. ACM-SIGMOD International Conference on Management of Data","volume":"35 1","pages":"469-480"},"PeriodicalIF":0.0,"publicationDate":"2013-06-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"76733782","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Big data in capital markets","authors":"A. Nazaruk, M. Rauchman","doi":"10.1145/2463676.2486082","DOIUrl":"https://doi.org/10.1145/2463676.2486082","url":null,"abstract":"Over the past decade global securities markets have dramatically changed. Evolution of market structure in combination with advances in computer technologies led to emergence of electronic securities trading. Securities transactions that used to be conducted in person and over the phone are now predominantly executed by automated trading systems. This resulted in significant fragmentation of the markets, vast increase in the exchange volumes and even greater increase in the number of orders.\u0000 In this talk we present and analyze forces behind the wide proliferation of electronic securities trading in US stocks and options markets. We also make a high-level introduction into electronic securities market structure. We discuss trading objectives of different classes of market participants and analyze how their activity affects data volumes. We also present typical securities trading firm data flow and analyze various types of data it uses in its trading operations.\u0000 We close with the implications this \"sea change\" has on DBMS requirements in capital markets.","PeriodicalId":87344,"journal":{"name":"Proceedings. ACM-SIGMOD International Conference on Management of Data","volume":"15 1","pages":"917-918"},"PeriodicalIF":0.0,"publicationDate":"2013-06-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"79619364","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Alekh Jindal, Jorge-Arnulfo Quiané-Ruiz, S. Madden
{"title":"CARTILAGE: adding flexibility to the Hadoop skeleton","authors":"Alekh Jindal, Jorge-Arnulfo Quiané-Ruiz, S. Madden","doi":"10.1145/2463676.2465258","DOIUrl":"https://doi.org/10.1145/2463676.2465258","url":null,"abstract":"Modern enterprises have to deal with a variety of analytical queries over very large datasets. In this respect, Hadoop has gained much popularity since it scales to thousand of nodes and terabytes of data. However, Hadoop suffers from poor performance, especially in I/O performance. Several works have proposed alternate data storage for Hadoop in order to improve the query performance. However, many of these works end up making deep changes in Hadoop or HDFS. As a result, they are (i) difficult to adopt by several users, and (ii) not compatible with future Hadoop releases. In this paper, we present CARTILAGE, a comprehensive data storage framework built on top of HDFS. CARTILAGE allows users full control over their data storage, including data partitioning, data replication, data layouts, and data placement. Furthermore, CARTILAGE can be layered on top of an existing HDFS installation. This means that Hadoop, as well as other query engines, can readily make use of CARTILAGE. We describe several use-cases of CARTILAGE and propose to demonstrate the flexibility and efficiency of CARTILAGE through a set of novel scenarios.","PeriodicalId":87344,"journal":{"name":"Proceedings. ACM-SIGMOD International Conference on Management of Data","volume":"85 1","pages":"1057-1060"},"PeriodicalIF":0.0,"publicationDate":"2013-06-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"79703805","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Vivek R. Narasayya, Sudipto Das, M. Syamala, B. Chandramouli, S. Chaudhuri
{"title":"A demonstration of SQLVM: performance isolation in multi-tenant relational database-as-a-service","authors":"Vivek R. Narasayya, Sudipto Das, M. Syamala, B. Chandramouli, S. Chaudhuri","doi":"10.1145/2463676.2463686","DOIUrl":"https://doi.org/10.1145/2463676.2463686","url":null,"abstract":"Sharing resources of a single database server among multiple tenants is common in multi-tenant Database-as-a-Service providers, such as Microsoft SQL Azure. Multi-tenancy enables cost reduction for the cloud service provider which it can pass on as savings to the tenants. However, resource sharing can adversely affect a tenant's performance due to other tenants' workloads contending for shared resources. Service providers today do not provide any assurances to a tenant in terms of isolating its performance from other co-located tenants. SQLVM, a project at Microsoft Research, is an abstraction for performance isolation which is built on a promise of reserving key database server resources, such as CPU, I/O and memory, for each tenant. The key challenge is in supporting this abstraction within a RDBMS without statically allocating resources to tenants, while ensuring low overheads and scaling to large numbers of tenants. This demonstration will show how SQLVM can effectively isolate a tenant's performance from other tenant workloads co-located at the same database server. Our demonstration will use various scripted scenarios and a data collection and visualization framework to illustrate performance isolation using SQLVM.","PeriodicalId":87344,"journal":{"name":"Proceedings. ACM-SIGMOD International Conference on Management of Data","volume":"12 1","pages":"1077-1080"},"PeriodicalIF":0.0,"publicationDate":"2013-06-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"89322704","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Abhishek Mukherji, Xika Lin, Christopher R. Botaish, Jason Whitehouse, Elke A. Rundensteiner, M. Ward, Carolina Ruiz
{"title":"PARAS: interactive parameter space exploration for association rule mining","authors":"Abhishek Mukherji, Xika Lin, Christopher R. Botaish, Jason Whitehouse, Elke A. Rundensteiner, M. Ward, Carolina Ruiz","doi":"10.1145/2463676.2465245","DOIUrl":"https://doi.org/10.1145/2463676.2465245","url":null,"abstract":"We demonstrate our PARAS technology for supporting interactive association mining at near real-time speeds. Key technical innovations of PARAS, in particular, stable region abstractions and rule redundancy management supporting novel parameter space-centric exploratory queries will be showcased. The audience will be able to interactively explore the parameter space view of rules. They will experience near real-time speeds achieved by PARAS for operations, such as comparing rule sets mined using different parameter values, that would otherwise take hours of computation and much manual investigation. Overall, we will demonstrate that the PARAS system provides a rich experience to data analysts through parameter tuning recommendations while significantly reducing the trial-and-error interactions.","PeriodicalId":87344,"journal":{"name":"Proceedings. ACM-SIGMOD International Conference on Management of Data","volume":"39 1","pages":"1017-1020"},"PeriodicalIF":0.0,"publicationDate":"2013-06-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"88163181","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}