{"title":"Transactional In-Page Logging for multiversion read consistency and recovery","authors":"Sang-Won Lee, Bongki Moon","doi":"10.1109/ICDE.2011.5767889","DOIUrl":"https://doi.org/10.1109/ICDE.2011.5767889","url":null,"abstract":"Recently, a new buffer and storage management strategy called In-Page Logging (IPL) has been proposed for database systems based on flash memory. Its main objective is to overcome the limitations of flash memory such as erase-before-write and asymmetric read/write speeds by storing changes made to a data page in a form of log records without overwriting the data page itself. Since it maintains a series of changes made to a data page separately from the original data page until they are merged, the IPL scheme provides unique opportunities to design light-weight transactional support for database systems. In this paper, we propose the transactional IPL (TIPL) scheme that takes advantage of the IPL log records to support multiversion read consistency and light-weight database recovery. Due to the dual use of IPL log records, namely, for snapshot isolation and fast recovery as well as flash-aware write optimization, TIPL achieves transactional support for flash memory database systems that minimizes the space and time overhead during normal database processing and shortens the database recovery time.","PeriodicalId":332374,"journal":{"name":"2011 IEEE 27th International Conference on Data Engineering","volume":"81 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-04-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122967008","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"NORMS: An automatic tool to perform schema label normalization","authors":"S. Sorrentino, S. Bergamaschi, M. Gawinecki","doi":"10.1109/ICDE.2011.5767952","DOIUrl":"https://doi.org/10.1109/ICDE.2011.5767952","url":null,"abstract":"Schema matching is the problem of finding relationships among concepts across heterogeneous data sources (heterogeneous in format and structure). Schema matching systems usually exploit lexical and semantic information provided by lexical databases/thesauri to discover intra/inter semantic relationships among schema elements. However, most of them obtain poor performance on real world scenarios due to the significant presence of “non-dictionary words”. Non-dictionary words include compound nouns, abbreviations and acronyms. In this paper, we present NORMS (NORMalizer of Schemata), a tool performing schema label normalization to increase the number of comparable labels extracted from schemata1.","PeriodicalId":332374,"journal":{"name":"2011 IEEE 27th International Conference on Data Engineering","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-04-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125689098","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"AMC - A framework for modelling and comparing matching systems as matching processes","authors":"E. Peukert, Julian Eberius, E. Rahm","doi":"10.1109/ICDE.2011.5767940","DOIUrl":"https://doi.org/10.1109/ICDE.2011.5767940","url":null,"abstract":"We present the Auto Mapping Core (AMC), a new framework that supports fast construction and tuning of schema matching approaches for specific domains such as ontology alignment, model matching or database-schema matching. Distinctive features of our framework are new visualisation techniques for modelling matching processes, stepwise tuning of parameters, intermediate result analysis and performance-oriented rewrites. Furthermore, existing matchers can be plugged into the framework to comparatively evaluate them in a common environment. This allows deeper analysis of behaviour and shortcomings in existing complex matching systems.","PeriodicalId":332374,"journal":{"name":"2011 IEEE 27th International Conference on Data Engineering","volume":"48 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-04-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131718528","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"RCFile: A fast and space-efficient data placement structure in MapReduce-based warehouse systems","authors":"Yongqiang He, Rubao Lee, Yin Huai, Zheng Shao, Namit Jain, Xiaodong Zhang, Zhiwei Xu","doi":"10.1109/ICDE.2011.5767933","DOIUrl":"https://doi.org/10.1109/ICDE.2011.5767933","url":null,"abstract":"MapReduce-based data warehouse systems are playing important roles of supporting big data analytics to understand quickly the dynamics of user behavior trends and their needs in typical Web service providers and social network sites (e.g., Facebook). In such a system, the data placement structure is a critical factor that can affect the warehouse performance in a fundamental way. Based on our observations and analysis of Facebook production systems, we have characterized four requirements for the data placement structure: (1) fast data loading, (2) fast query processing, (3) highly efficient storage space utilization, and (4) strong adaptivity to highly dynamic workload patterns. We have examined three commonly accepted data placement structures in conventional databases, namely row-stores, column-stores, and hybrid-stores in the context of large data analysis using MapReduce. We show that they are not very suitable for big data processing in distributed systems. In this paper, we present a big data placement structure called RCFile (Record Columnar File) and its implementation in the Hadoop system. With intensive experiments, we show the effectiveness of RCFile in satisfying the four requirements. RCFile has been chosen in Facebook data warehouse system as the default option. It has also been adopted by Hive and Pig, the two most widely used data analysis systems developed in Facebook and Yahoo!","PeriodicalId":332374,"journal":{"name":"2011 IEEE 27th International Conference on Data Engineering","volume":"72 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-04-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131274920","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Decomposing DAGs into spanning trees: A new way to compress transitive closures","authors":"Yangjun Chen, Yibin Chen","doi":"10.1109/ICDE.2011.5767832","DOIUrl":"https://doi.org/10.1109/ICDE.2011.5767832","url":null,"abstract":"Let G(V, E) be a digraph (directed graph) with n nodes and e edges. Digraph G* = (V, E*) is the reflexive, transitive closure if (v, u) ∈ E* iff there is a path from v to u in G. Efficient storage of G* is important for supporting reachability queries which are not only common on graph databases, but also serve as fundamental operations used in many graph algorithms. A lot of strategies have been suggested based on the graph labeling, by which each node is assigned with certain labels such that the reachability of any two nodes through a path can be determined by their labels. Among them are interval labelling, chain decomposition, and 2-hop labeling. However, due to the very large size of many real world graphs, the computational cost and size of labels using existing methods would prove too expensive to be practical. In this paper, we propose a new approach to decompose a graph into a series of spanning trees which may share common edges, to transform a reachability query over a graph into a set of queries over trees. We demonstrate both analytically and empirically the efficiency and effectiveness of our method.","PeriodicalId":332374,"journal":{"name":"2011 IEEE 27th International Conference on Data Engineering","volume":"14 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-04-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133653922","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"The extensibility framework in Microsoft StreamInsight","authors":"Mohamed H. Ali, B. Chandramouli, J. Goldstein, R. Schindlauer","doi":"10.1109/ICDE.2011.5767878","DOIUrl":"https://doi.org/10.1109/ICDE.2011.5767878","url":null,"abstract":"Microsoft StreamInsight (StreamInsight, for brevity) is a platform for developing and deploying streaming applications, which need to run continuous queries over high-data-rate streams of input events. StreamInsight leverages a well-defined temporal stream model and operator algebra, as the underlying basis for processing long-running continuous queries over event streams. This allows StreamInsight to handle imperfections in event delivery and to provide correctness guarantees on the generated output. StreamInsight natively supports a diverse range of off-the-shelf streaming operators. In order to cater to a much broader range of customer scenarios and applications, StreamInsight has recently introduced a new extensibility infrastructure. With this infrastructure, StreamInsight enables developers to integrate their domain expertise within the query pipeline in the form of user defined modules (functions, operators, and aggregates). This paper describes the extensibility framework in StreamInsight; an ongoing effort at Microsoft SQL Server to support the integration of user-defined modules in a stream processing system. More specifically, the paper addresses the extensibility problem from three perspectives: the query writer's perspective, the user defined module writer's perspective, and the system's internal perspective. The paper introduces and addresses a range of new and subtle challenges that arise when we try to add extensibility to a streaming system, in a manner that is easy to use, powerful, and practical. We summarize our experience and provide future directions for supporting stream-oriented workloads in different business domains.","PeriodicalId":332374,"journal":{"name":"2011 IEEE 27th International Conference on Data Engineering","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-04-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117046612","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"SQPR: Stream query planning with reuse","authors":"Evangelia Kalyvianaki, W. Wiesemann, Q. Vu, D. Kuhn, P. Pietzuch","doi":"10.1109/ICDE.2011.5767851","DOIUrl":"https://doi.org/10.1109/ICDE.2011.5767851","url":null,"abstract":"When users submit new queries to a distributed stream processing system (DSPS), a query planner must allocate physical resources, such as CPU cores, memory and network bandwidth, from a set of hosts to queries. Allocation decisions must provide the correct mix of resources required by queries, while achieving an efficient overall allocation to scale in the number of admitted queries. By exploiting overlap between queries and reusing partial results, a query planner can conserve resources but has to carry out more complex planning decisions. In this paper, we describe SQPR, a query planner that targets DSPSs in data centre environments with heterogeneous resources. SQPR models query admission, allocation and reuse as a single constrained optimisation problem and solves an approximate version to achieve scalability. It prevents individual resources from becoming bottlenecks by re-planning past allocation decisions and supports different allocation objectives. As our experimental evaluation in comparison with a state-of-the-art planner shows SQPR makes efficient resource allocation decisions, even with a high utilisation of resources, with acceptable overheads.","PeriodicalId":332374,"journal":{"name":"2011 IEEE 27th International Conference on Data Engineering","volume":"119 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-04-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116368209","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"High performance database logging using storage class memory","authors":"Ru Fang, Hui-I Hsiao, Bin He, C. Mohan, Yun Wang","doi":"10.1109/ICDE.2011.5767918","DOIUrl":"https://doi.org/10.1109/ICDE.2011.5767918","url":null,"abstract":"Storage class memory (SCM), a new generation of memory technology, offers non-volatility, high-speed, and byte-addressability, which combines the best properties of current hard disk drives (HDD) and main memory. With these extraordinary features, current systems and software stacks need to be redesigned to get significantly improved performance by eliminating disk input/output (I/O) barriers; and simpler system designs by avoiding complicated data format transformations. In current DBMSs, logging and recovery are the most important components to enforce the atomicity and durability of a database. Traditionally, database systems rely on disks for logging transaction actions and log records are forced to disks when a transaction commits. Because of the slow disk I/O speed, logging becomes one of the major bottlenecks for a DBMS. Exploiting SCM as a persistent memory for transaction logging can significantly reduce logging overhead. In this paper, we present the detailed design of an SCM-based approach for DBMSs logging, which achieves high performance by simplified system design and better concurrency support. We also discuss solutions to tackle several major issues arising during system recovery, including hole detection, partial write detection, and any-point failure recovery. This new logging approach is used to replace the traditional disk based logging approach in DBMSs. To analyze the performance characteristics of our SCM-based logging approach, we implement the prototype on IBM SolidDB. In common circumstances, our experimental results show that the new SCM-based logging approach provides as much as 7 times throughput improvement over disk-based logging in the Telecommunication Application Transaction Processing (TATP) benchmark.","PeriodicalId":332374,"journal":{"name":"2011 IEEE 27th International Conference on Data Engineering","volume":"218 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-04-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116430584","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Efficient continuously moving top-k spatial keyword query processing","authors":"Dingming Wu, Man Lung Yiu, Christian S. Jensen, G. Cong","doi":"10.1109/ICDE.2011.5767861","DOIUrl":"https://doi.org/10.1109/ICDE.2011.5767861","url":null,"abstract":"Web users and content are increasingly being geo-positioned. This development gives prominence to spatial keyword queries, which involve both the locations and textual descriptions of content.","PeriodicalId":332374,"journal":{"name":"2011 IEEE 27th International Conference on Data Engineering","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-04-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121816176","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Adapting microsoft SQL server for cloud computing","authors":"P. Bernstein, Istvan Cseri, Nishant Dani, Nigel Ellis, Ajay Kalhan, Gopal Kakivaya, D. Lomet, Ramesh Manne, Lev Novik, Tomas Talius","doi":"10.1109/ICDE.2011.5767935","DOIUrl":"https://doi.org/10.1109/ICDE.2011.5767935","url":null,"abstract":"Cloud SQL Server is a relational database system designed to scale-out to cloud computing workloads. It uses Microsoft SQL Server as its core. To scale out, it uses a partitioned database on a shared-nothing system architecture. Transactions are constrained to execute on one partition, to avoid the need for two-phase commit. The database is replicated for high availability using a custom primary-copy replication scheme. It currently serves as the storage engine for Microsoft's Exchange Hosted Archive and SQL Azure.","PeriodicalId":332374,"journal":{"name":"2011 IEEE 27th International Conference on Data Engineering","volume":"23 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-04-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130780390","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}