{"title":"Towards scalable summarization and visualization of large text corpora (abstract only)","authors":"Tyler Sliwkanich, Douglas Schneider, Aaron Yong, M. Home, Denilson Barbosa","doi":"10.1145/2213836.2213970","DOIUrl":"https://doi.org/10.1145/2213836.2213970","url":null,"abstract":"Society is awash with problems requiring the analysis of vast quantities of text and data. From detecting flu trends in Twitter conversations to finding scholarly works answering specific questions, we rely more and more on computers to process text for us. Text analytics is the application of computational, mathematical, and statistical models to derive information from large quantities of data coming primarily as text. Our project provides fast and effective text-analytics tools for large document collections, such as the blogosphere. We use natural language processing and database techniques to extract, collect, analyze, visualize, and archive information extracted from text. We focus on discovering relationships between entities (people, places, organizations, etc.) mentioned in one or more sources (blog posts or news articles). We built a custom solution using mostly off-the-shelf, open-source tools to provide a scalable platform for users to search and analyze large text corpora. Currently, we provide two main outlets for users to discover these relations: (1) full-text search over the documents and (2) graph visualizations of the entities and their relationships. This provides the user with succinct and easily digestible information gleaned from the corpus as a whole. For example, we can easily pose queries like \"which companies were bought by Google?\" as entity:google relation:bought. The extracted data is stored in a combination of the NoSQL database CouchDB and Apache's Lucene. This combination is justified as our workflow consists of offline batch insertions with almost no updates. 
Because we support specialized queries, we can forgo the flexibility of traditional SQL solutions and materialize all necessary indices, which are used to quickly query large amounts of de-normalized data using MapReduce. Lucene provides a flexible and powerful query syntax to yield relevant ranked results to the user. Moreover, its indices are synchronized by a process subscribed to the list of database changes published by CouchDB. The graph visualizations rely on CouchDB's ability to export the data in any format: we currently use a customized graph visualization relying on XML data. Finally, we use memcached to further improve the performance, especially for queries involving popular entities.","PeriodicalId":212616,"journal":{"name":"Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-05-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129769380","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Advanced partitioning techniques for massively distributed computation","authors":"Jingren Zhou, Nicolas Bruno, Wei Lin","doi":"10.1145/2213836.2213839","DOIUrl":"https://doi.org/10.1145/2213836.2213839","url":null,"abstract":"An increasing number of companies rely on distributed data storage and processing over large clusters of commodity machines for critical business decisions. Although plain MapReduce systems provide several benefits, they carry certain limitations that impact developer productivity and optimization opportunities. Higher-level programming languages plus conceptual data models have recently emerged to address such limitations. These languages offer a single-machine programming abstraction and are able to perform sophisticated query optimization and apply efficient execution strategies. In massively distributed computation, data shuffling is typically the most expensive operation and can lead to serious performance bottlenecks if not done properly. An important optimization opportunity in this environment is that of judicious placement of repartitioning operators and choice of alternative implementations. In this paper we discuss advanced partitioning strategies, their implementation, and how they are integrated in the Microsoft Scope system. 
We show experimentally that our approach significantly improves performance for a large class of real-world jobs.","PeriodicalId":212616,"journal":{"name":"Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data","volume":"105 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-05-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122881234","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"MAQSA: a system for social analytics on news","authors":"S. Amer-Yahia, Samreen Anjum, Amira Ghenai, Aysha Siddique, Sofiane Abbar, S. Madden, Adam Marcus, Mohammed El-Haddad","doi":"10.1145/2213836.2213924","DOIUrl":"https://doi.org/10.1145/2213836.2213924","url":null,"abstract":"We present MAQSA, a system for social analytics on news. MAQSA provides an interactive topic-centric dashboard that summarizes news articles and social activity (e.g., comments and tweets) around them. MAQSA helps editors and publishers in newsrooms understand user engagement and audience sentiment evolution on various topics of interest. It also helps news consumers explore public reaction on articles relevant to a topic and refine their exploration via related entities, topics, articles and tweets. Given a topic, e.g., \"Gulf Oil Spill,\" or \"The Arab Spring\", MAQSA combines three key dimensions: time, geographic location, and topic to generate a detailed activity dashboard around relevant articles. The dashboard contains an annotated comment timeline and a social graph of comments. It utilizes commenters' locations to build maps of comment sentiment and topics by region of the world. Finally, to facilitate exploration, MAQSA provides listings of related entities, articles, and tweets. It algorithmically processes large collections of articles and tweets, and enables the dynamic specification of topics and dates for exploration. In this demo, participants will be invited to explore the social dynamics around articles on oil spills, the Libyan revolution, and the Arab Spring. 
In addition, participants will be able to define and explore their own topics dynamically.","PeriodicalId":212616,"journal":{"name":"Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data","volume":"3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-05-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117000347","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Locality-sensitive hashing scheme based on dynamic collision counting","authors":"Junhao Gan, Jianlin Feng, Qiong Fang, Wilfred Ng","doi":"10.1145/2213836.2213898","DOIUrl":"https://doi.org/10.1145/2213836.2213898","url":null,"abstract":"Locality-Sensitive Hashing (LSH) and its variants are well-known methods for solving the c-approximate NN Search problem in high-dimensional space. Traditionally, several LSH functions are concatenated to form a \"static\" compound hash function for building a hash table. In this paper, we propose to use a base of m single LSH functions to construct \"dynamic\" compound hash functions, and define a new LSH scheme called Collision Counting LSH (C2LSH). If the number of LSH functions under which a data object o collides with a query object q is greater than a pre-specified collision threshold l, then o can be regarded as a good candidate for the c-approximate NN of q. This is the basic idea of C2LSH. Our theoretical studies show that, by appropriately choosing the size of LSH function base m and the collision threshold l, C2LSH can have a guarantee on query quality. Notably, the parameter m is not affected by dimensionality of data objects, which makes C2LSH especially good for high-dimensional NN search. 
The experimental studies based on synthetic datasets and four real datasets have shown that C2LSH outperforms the state-of-the-art method LSB-forest in high-dimensional space.","PeriodicalId":212616,"journal":{"name":"Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data","volume":"93 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-05-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124583421","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
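The collision-counting idea in the C2LSH abstract above can be illustrated with a minimal Python sketch. It is not the authors' implementation. It covers only the basic idea as the abstract states it: count how many of m single LSH functions map a data object o and the query q to the same value, and keep o as a candidate when that count is greater than the threshold l. The hash family used here (a random projection followed by bucketing), the bucket width `w`, and all function names are assumptions made for illustration.

```python
import math
import random


def make_lsh_functions(m, dim, w=4.0, seed=0):
    """Build m hypothetical single LSH functions h(o) = floor((a . o + b) / w).

    The projection-based family and the bucket width w are assumptions for
    this sketch; the abstract does not say which family the paper uses.
    """
    rng = random.Random(seed)
    funcs = []
    for _ in range(m):
        a = [rng.gauss(0.0, 1.0) for _ in range(dim)]  # random projection vector
        b = rng.uniform(0.0, w)                        # random offset in [0, w]
        funcs.append((a, b, w))
    return funcs


def hash_value(func, point):
    """Apply one single LSH function to a point (a list of floats)."""
    a, b, w = func
    projection = sum(ai * xi for ai, xi in zip(a, point))
    return math.floor((projection + b) / w)


def collision_count(funcs, o, q):
    """Number of LSH functions under which o and q get the same hash value."""
    return sum(1 for f in funcs if hash_value(f, o) == hash_value(f, q))


def candidates(funcs, data, q, l):
    """Indices of data objects whose collision count with q is greater than l."""
    return [i for i, o in enumerate(data) if collision_count(funcs, o, q) > l]
```

A caller would build the function base once with `make_lsh_functions(m, dim)`, then call `candidates(funcs, data, q, l)` for each query. The sketch scans every object for clarity and uses no hash tables or index structure, so it shows the candidate rule only, not the query-quality guarantee or the performance reported in the abstract.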
{"title":"Declarative web application development: encapsulating dynamic JavaScript widgets (abstract only)","authors":"R. Bolton, David Ing, Christopher Rebert, Kristina Lam Thai","doi":"10.1145/2213836.2213969","DOIUrl":"https://doi.org/10.1145/2213836.2213969","url":null,"abstract":"The development of modern, highly interactive AJAX Web applications that enable dynamic visualization of data requires writing a great deal of tedious \"plumbing code\" to interface data between browser-based DOM and AJAX components, the application server, and the SQL database. Worse, each of these layers utilizes a different language. Further, much code is needed to keep the page and application states in sync using an imperative paradigm, which hurts simplicity. These factors result in a frustrating experience for today's Web developer. The FORWARD Project aims to alleviate this frustration by enabling pages that are \"rendered views\", in the SQL sense of \"view\". Our work in the project has led to a highly declarative approach whereby JavaScript/AJAX UI widgets automatically render views over the application state (database + session data + page data) without requiring the developer to tediously code how changes to the application state lead to invocation of the components' update methods. In contrast to conventional Web application development approaches, a FORWARD application involves only two languages, both declarative: an extended version of SQL, and an XML-based language for configuration and orchestration. The framework automatically handles efficient exchange of user input and changes to the underlying data, and updates the application state accordingly. The developer does not need to write any JavaScript or explicit updating code themselves. On the client side, FORWARD \"units\" wrap widgets using JavaScript to collect user input, directly display data, and reflect server-side updates to the data. 
On the server side, units contain Java code necessary to expose their functionality to the FORWARD framework and define their XML configuration representation. Our demo consists of a dynamically rendered webpage which internally uses AJAX to update a Google Maps widget that shows location markers for current Groupon deals in a specified area. It will illustrate that our SQL-driven approach makes this kind of rich dynamic webpage easy to write, with significant improvements in simplicity, brevity, and development time, while still providing the quality experience expected from top AJAX components. The amount of \"plumbing code\" is significantly reduced, enhancing the experience of AJAX Web application developers.","PeriodicalId":212616,"journal":{"name":"Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data","volume":"158 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-05-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134119485","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Managing large dynamic graphs efficiently","authors":"J. Mondal, A. Deshpande","doi":"10.1145/2213836.2213854","DOIUrl":"https://doi.org/10.1145/2213836.2213854","url":null,"abstract":"There is an increasing need to ingest, manage, and query large volumes of graph-structured data arising in applications like social networks, communication networks, biological networks, and so on. Graph databases that can explicitly reason about the graphical nature of the data, that can support flexible schemas and node-centric or edge-centric analysis and querying, are ideal for storing such data. However, although there is much work on single-site graph databases and on efficiently executing different types of queries over large graphs, to date there is little work on understanding the challenges in distributed graph databases, needed to handle the large scale of such data. In this paper, we propose the design of an in-memory, distributed graph data management system aimed at managing a large-scale dynamically changing graph, and supporting low-latency query processing over it. The key challenge in a distributed graph database is that partitioning a graph across a set of machines inherently results in a large number of distributed traversals across partitions to answer even simple queries. We propose aggressive replication of the nodes in the graph for supporting low-latency querying, and investigate three novel techniques to minimize the communication bandwidth and the storage requirements. First, we develop a hybrid replication policy that monitors node read-write frequencies to dynamically decide what data to replicate, and whether to do eager or lazy replication. Second, we propose a clustering-based approach to amortize the costs of making these replication decisions. Finally, we propose using a fairness criterion to dictate how replication decisions should be made. 
We provide both theoretical analysis and efficient algorithms for the optimization problems that arise. We have implemented our framework as a middleware on top of the open-source CouchDB key-value store. We evaluate our system on a social graph, and show that our system is able to handle very large graphs efficiently, and that it reduces the network bandwidth consumption significantly.","PeriodicalId":212616,"journal":{"name":"Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data","volume":"51 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-05-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133309240","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"CloudAlloc: a monitoring and reservation system for compute clusters","authors":"Enrico Iori, A. Simitsis, Themis Palpanas, K. Wilkinson, S. Harizopoulos","doi":"10.1145/2213836.2213942","DOIUrl":"https://doi.org/10.1145/2213836.2213942","url":null,"abstract":"Cloud computing has emerged as a promising environment capable of providing flexibility, scalability, elasticity, fail-over mechanisms, high availability, and other important features to applications. Compute clusters are relatively easy to create and use, but tools to effectively share cluster resources are lacking. CloudAlloc addresses this problem and schedules workloads to cluster resources using allocation algorithms that can be easily changed according to the objectives of the enterprise. It also monitors resource utilization and thus, provides accountability for actual usage. CloudAlloc is a lightweight, flexible, easy-to-use tool for cluster resource allocation that has also proved useful as a research platform. We demonstrate its features and also discuss its allocation algorithms that minimize power usage. CloudAlloc was implemented and is in use at HP Labs.","PeriodicalId":212616,"journal":{"name":"Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data","volume":"371 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-05-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124650744","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Partiqle: an elastic SQL engine over key-value stores","authors":"J. Tatemura, Oliver Po, Wang-Pin Hsiung, Hakan Hacıgümüş","doi":"10.1145/2213836.2213917","DOIUrl":"https://doi.org/10.1145/2213836.2213917","url":null,"abstract":"The demo features Partiqle, a SQL engine over key-value stores as a relational alternative for the recent procedural approaches to support OLTP workloads elastically. Based on our microsharding framework [12], it employs a declarative specification, called transaction classes, of constraints applied on the transactions in a workload. We demonstrate use of a transaction class in design and analysis of OLTP workloads. We then demonstrate live-scaling of our fully functioning system on a server cluster.","PeriodicalId":212616,"journal":{"name":"Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data","volume":"99 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-05-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114597715","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Symbiosis in scale out networking and data management","authors":"Amin Vahdat","doi":"10.1145/2213836.2213903","DOIUrl":"https://doi.org/10.1145/2213836.2213903","url":null,"abstract":"This talk highlights the symbiotic relationship between data management and networking through a study of two seemingly independent trends in the traditionally separate communities: large-scale data processing and software defined networking. First, data processing at scale increasingly runs across hundreds or thousands of servers. We show that balancing network performance with computation and storage is a prerequisite to both efficient and scalable data processing. We illustrate the need for scale out networking in support of data management through a case study of TritonSort, currently the record holder for several sorting benchmarks, including GraySort and JouleSort. Our TritonSort experience shows that disk-bound workloads require 10 Gb/s provisioned bandwidth to keep up with modern processors while emerging flash workloads require 40 Gb/s fabrics at scale. We next argue for the need to apply data management techniques to enable Software Defined Networking (SDN) and Scale Out Networking. SDN promises the abstraction of a single logical network fabric rather than a collection of thousands of individual boxes. In turn, scale out networking allows network capacity (ports, bandwidth) to be expanded incrementally, rather than by wholesale fabric replacement. However, SDN requires an extensible model of both static and dynamic network properties and the ability to deliver dynamic updates to a range of network applications in a fault tolerant and low latency manner. Doing so in networking environments where updates are typically performed by timer-based broadcasts and models are specified as comma-separated text files processed by one-off scripts presents interesting challenges. 
For example, consider an environment where applications from routing to traffic engineering to monitoring to intrusion/anomaly detection all essentially boil down to inserting, triggering and retrieving updates to/from a shared, extensible data store.","PeriodicalId":212616,"journal":{"name":"Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data","volume":"23 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-05-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122847227","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"VizDeck: self-organizing dashboards for visual analytics","authors":"A. Key, Bill Howe, D. Perry, Cecilia R. Aragon","doi":"10.1145/2213836.2213931","DOIUrl":"https://doi.org/10.1145/2213836.2213931","url":null,"abstract":"We present VizDeck, a web-based tool for exploratory visual analytics of unorganized relational data. Motivated by collaborations with domain scientists who search for complex patterns in hundreds of data sources simultaneously, VizDeck automatically recommends appropriate visualizations based on the statistical properties of the data and adopts a card game metaphor to help organize the recommended visualizations into interactive visual dashboard applications in seconds with zero programming. The demonstration allows users to derive, share, and permanently store their own dashboard from hundreds of real science datasets using a production system deployed at the University of Washington.","PeriodicalId":212616,"journal":{"name":"Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data","volume":"40 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-05-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115978011","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}