"Design, implementation, and evaluation of the Linear Road benchmark on the Stream Processing Core"
Navendu Jain, Lisa Amini, H. Andrade, Richard P. King, Yoonho Park, P. Selo, C. Venkatramani. In Proceedings of the 2006 ACM SIGMOD International Conference on Management of Data. DOI: 10.1145/1142473.1142522

Abstract: Stream processing applications have recently gained significant attention in the networking and database communities. At the core of these applications is a stream processing engine that performs resource allocation and management to support continuous tracking of queries over collections of physically distributed and rapidly updating data streams. While numerous stream processing systems exist, there has been little work on understanding the performance characteristics of these applications in a distributed setting. In this paper, we examine the bottlenecks that streaming applications, in particular the Linear Road stream data management benchmark, face in achieving good performance in large-scale distributed environments, using the Stream Processing Core (SPC), a stream processing middleware we have developed. First, we present the design and implementation of the Linear Road benchmark on the SPC middleware. SPC has been designed to scale to tens of thousands of processing nodes while supporting concurrent applications and multiple simultaneous queries. Second, we identify the main bottlenecks that limit the Linear Road application's scalability and query response latency. Our results show that data locality, buffer capacity, physical allocation of processing elements to infrastructure nodes, and packaging for transporting streamed data are important factors in achieving good application performance. Though we evaluate our system primarily for the Linear Road application, we believe it also provides useful insights into the overall system behavior for supporting other distributed, large-scale continuous streaming applications. Finally, we examine how SPC can be used and tuned to enable a very efficient implementation of the Linear Road application in a distributed environment.
{"title":"Contour map matching for event detection in sensor networks","authors":"Wenwei Xue, Qiong Luo, Lei Chen, Yunhao Liu","doi":"10.1145/1142473.1142491","DOIUrl":"https://doi.org/10.1145/1142473.1142491","url":null,"abstract":"Many sensor network applications, such as object tracking and disaster monitoring, require effective techniques for event detection. In this paper, we propose a novel event detection mechanism based on matching the contour maps of in-network sensory data distribution. Our key observation is that events in sensor networks can be abstracted into spatio-temporal patterns of sensory data and that pattern matching can be done efficiently through contour map matching. Therefore, we propose simple SQL extensions to allow users to specify common types of events as patterns in contour maps and study energy-efficient techniques of contour map construction and maintenance for our pattern-based event detection. Our experiments with synthetic workloads derived from a real-world coal mine surveillance application validate the effectiveness and efficiency of our approach.","PeriodicalId":416090,"journal":{"name":"Proceedings of the 2006 ACM SIGMOD international conference on Management of data","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132109343","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Optimal multi-scale patterns in time series streams","authors":"S. Papadimitriou, Philip S. Yu","doi":"10.1145/1142473.1142545","DOIUrl":"https://doi.org/10.1145/1142473.1142545","url":null,"abstract":"We introduce a method to discover optimal local patterns, which concisely describe the main trends in a time series. Our approach examines the time series at multiple time scales (i.e., window sizes) and efficiently discovers the key patterns in each. We also introduce a criterion to select the best window sizes, which most concisely capture the key oscillatory as well as aperiodic trends. Our key insight lies in learning an optimal orthonormal transform from the data itself, as opposed to using a predetermined basis or approximating function (such as piecewise constant, short-window Fourier or wavelets), which essentially restricts us to a particular family of trends. We go one step further, lifting even that limitation. Furthermore, our method lends itself to fast, incremental estimation in a streaming setting. Experimental evaluation shows that our method can capture meaningful patterns in a variety of settings. Our streaming approach requires order of magnitude less time and space, while still producing concise and informative patterns.","PeriodicalId":416090,"journal":{"name":"Proceedings of the 2006 ACM SIGMOD international conference on Management of data","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133425037","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

"Automatic client-server partitioning of data-driven web applications"
Nicholas Gerner, Fan Yang, A. Demers, J. Gehrke, Mirek Riedewald, J. Shanmugasundaram. In Proceedings of the 2006 ACM SIGMOD International Conference on Management of Data. DOI: 10.1145/1142473.1142580

Abstract: Current application development tools provide completely different programming models for the application server (e.g., Java and J2EE) and the client web browser (e.g., JavaScript and HTML). Consequently, the application developer is forced to partition the application code between the server and client at the time of writing the application. However, the partitioning of the code between the client and server may have to be changed during the evolution of the application for performance reasons (it may be better to push more functionality to the client), for correctness reasons (data that conflicts with multiple clients cannot always be pushed to clients), and for supporting clients with different computing power (browsers on desktops vs. PDAs). Since the client and server use different programming models, moving application code from client to server (and vice versa) reduces programmer productivity and also has the potential to introduce concurrency bugs. In this demonstration, we advocate an alternative solution to this problem: we propose developing applications in a unified declarative high-level language called Hilda, and we show how a Hilda compiler can automatically (and correctly) partition Hilda code between the client and the server, using a real Course Management System application. We illustrate our techniques using two clients: a powerful laptop machine and a less powerful PDA.
{"title":"Provenance management in curated databases","authors":"P. Buneman, Adriane P. Chapman, J. Cheney","doi":"10.1145/1142473.1142534","DOIUrl":"https://doi.org/10.1145/1142473.1142534","url":null,"abstract":"Curated databases in bioinformatics and other disciplines are the result of a great deal of manual annotation, correction and transfer of data from other sources. Provenance information concerning the creation, attribution, or version history of such data is crucial for assessing its integrity and scientific value. General purpose database systems provide little support for tracking provenance, especially when data moves among databases. This paper investigates general-purpose techniques for recording provenance for data that is copied among databases. We describe an approach in which we track the user's actions while browsing source databases and copying data into a curated database, in order to record the user's actions in a convenient, queryable form. We present an implementation of this technique and use it to evaluate the feasibility of database support for provenance management. Our experiments show that although the overhead of a naive approach is fairly high, it can be decreased to an acceptable level using simple optimizations.","PeriodicalId":416090,"journal":{"name":"Proceedings of the 2006 ACM SIGMOD international conference on Management of data","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131583577","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Testing database applications","authors":"Carsten Binnig, Donald Kossmann, Eric Lo","doi":"10.1145/1142473.1142572","DOIUrl":"https://doi.org/10.1145/1142473.1142572","url":null,"abstract":"Testing database application is challenging because most methods and tools developed for application testing do not consider the database state during the test. In this paper we demonstrate three different tools for testing database applications: HTDGen, HTTrace and HTPar. HTDGen generates meaningful test databases for database applications. HTTrace executes database applications testing efficiently and HTPar extends HTTrace to run tests in parallel.","PeriodicalId":416090,"journal":{"name":"Proceedings of the 2006 ACM SIGMOD international conference on Management of data","volume":"617 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115826923","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Record linkage: similarity measures and algorithms","authors":"Nick Koudas, Sunita Sarawagi, D. Srivastava","doi":"10.1145/1142473.1142599","DOIUrl":"https://doi.org/10.1145/1142473.1142599","url":null,"abstract":"This tutorial provides a comprehensive and cohesive overview of the key research results in the area of record linkage methodologies and algorithms for identifying approximate duplicate records, and available tools for this purpose. It encompasses techniques introduced in several communities including databases, information retrieval, statistics and machine learning. It aims to identify similarities and differences across the techniques as well as their merits and limitations.","PeriodicalId":416090,"journal":{"name":"Proceedings of the 2006 ACM SIGMOD international conference on Management of data","volume":"55 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114145887","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"VGM: visual graph mining","authors":"K. Borgwardt, Sebastian Böttger, H. Kriegel","doi":"10.1145/1142473.1142570","DOIUrl":"https://doi.org/10.1145/1142473.1142570","url":null,"abstract":"As more and more graph data become available in various application domains, graph mining is of ever increasing importance in data management.Graph kernels are a novel and successful method for data mining in graphs. Unfortunately, implementing graph kernels is not trivial, and few applied researchers have therefore used graph kernels so far. In this demonstration, we present a Java software package called Visual Graph Mining (VGM). VGM allows the user to classify graphs using graph kernels and Support Vector Machines in a graphical user interface that is easy to learn and use. It is linked to LIBSVM for Support Vector Machine computations, yet can be easily transferred to other Support Vector Machine packages. Furthermore, VGM provides basic data mining features such as Nearest Neighbor search, graph algorithms such as Dijkstra, Floyd-Warshall, and computes and visualizes product graphs and topological indices of graphs.VGM 's homepage can be found at: http://www.cip.ifi.lmu.de/~boettger/sigmod.","PeriodicalId":416090,"journal":{"name":"Proceedings of the 2006 ACM SIGMOD international conference on Management of data","volume":"34 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121040389","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Data delivery in a service-oriented world: the BEA aquaLogic data services platform","authors":"M. Carey","doi":"10.1145/1142473.1142551","DOIUrl":"https://doi.org/10.1145/1142473.1142551","url":null,"abstract":"\"Wow. I fell asleep listening to SOA music, and when I woke up, I couldn't remember where I'd put my data. Now what?\" Has this happened to you? With the new push towards service-oriented architectures (SOA) and process orientation, data seems to have been lost in the shuffle. At the end of the day, however, applications are still about data, and SOA applications are no different. In this paper, we present BEA's approach to serving up data to SOA applications. BEA recently introduced a new middleware product called the AquaLogic Data Services Platform (ALDSP). The purpose of ALDSP is to make it easy to design, develop, deploy, and maintain a data services layer in the world of service-oriented architecture. ALDSP provides a new, declarative foundation for building SOA applications and services that need to access and compose information from a range of enterprise data sources. The paper covers both the foundation and the key features of ALDSP, including its underlying technologies, its overall system architecture, and its most interesting capabilities.","PeriodicalId":416090,"journal":{"name":"Proceedings of the 2006 ACM SIGMOD international conference on Management of data","volume":"20 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117044491","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"MauveDB: supporting model-based user views in database systems","authors":"A. Deshpande, S. Madden","doi":"10.1145/1142473.1142483","DOIUrl":"https://doi.org/10.1145/1142473.1142483","url":null,"abstract":"Real-world data --- especially when generated by distributed measurement infrastructures such as sensor networks --- tends to be incomplete, imprecise, and erroneous, making it impossible to present it to users or feed it directly into applications. The traditional approach to dealing with this problem is to first process the data using statistical or probabilistic models that can provide more robust interpretations of the data. Current database systems, however, do not provide adequate support for applying models to such data, especially when those models need to be frequently updated as new data arrives in the system. Hence, most scientists and engineers who depend on models for managing their data do not use database systems for archival or querying at all; at best, databases serve as a persistent raw data store.In this paper we define a new abstraction called model-based views and present the architecture of MauveDB, the system we are building to support such views. Just as traditional database views provide logical data independence, model-based views provide independence from the details of the underlying data generating mechanism and hide the irregularities of the data by using models to present a consistent view to the users. MauveDB supports a declarative language for defining model-based views, allows declarative querying over such views using SQL, and supports several different materialization strategies and techniques to efficiently maintain them in the face of frequent updates. We have implemented a prototype system that currently supports views based on regression and interpolation, using the Apache Derby open source DBMS, and we present results that show the utility and performance benefits that can be obtained by supporting several different types of model-based views in a database system.","PeriodicalId":416090,"journal":{"name":"Proceedings of the 2006 ACM SIGMOD international conference on Management of data","volume":"35 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127140288","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}