Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data

Towards scalable summarization and visualization of large text corpora (abstract only)

Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data Pub Date : 2012-05-20 DOI: 10.1145/2213836.2213970

Tyler Sliwkanich, Douglas Schneider, Aaron Yong, M. Home, Denilson Barbosa

Abstract: Society is awash with problems requiring the analysis of vast quantities of text and data. From detecting flu trends in Twitter conversations to finding scholarly works that answer specific questions, we rely more and more on computers to process text for us. Text analytics is the application of computational, mathematical, and statistical models to derive information from large quantities of data coming primarily as text. Our project provides fast and effective text-analytics tools for large document collections, such as the blogosphere. We use natural language processing and database techniques to extract, collect, analyze, visualize, and archive information extracted from text. We focus on discovering relationships between entities (people, places, organizations, etc.) mentioned in one or more sources (blog posts or news articles). We built a custom solution using mostly off-the-shelf, open-source tools to provide a scalable platform for users to search and analyze large text corpora. Currently, we provide two main outlets for users to discover these relations: (1) full-text search over the documents and (2) graph visualizations of the entities and their relationships. This provides the user with succinct and easily digestible information gleaned from the corpus as a whole. For example, we can easily pose a query like "which companies were bought by Google?" as entity:google relation:bought. The extracted data is stored in a combination of the NoSQL database CouchDB and Apache Lucene. This combination is justified because our workflow consists of offline batch insertions with almost no updates. Because we support specialized queries, we can forgo the flexibility of traditional SQL solutions and materialize all necessary indices, which are used to quickly query large amounts of denormalized data using MapReduce. Lucene provides a flexible and powerful query syntax to yield relevant ranked results to the user. Moreover, its indices are synchronized by a process subscribed to the list of database changes published by CouchDB. The graph visualizations rely on CouchDB's ability to export the data in any format: we currently use a customized graph visualization relying on XML data. Finally, we use memcached to further improve performance, especially for queries involving popular entities.
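The fielded query quoted in the abstract above, entity:google relation:bought, can be read as a set of field:value constraints. The following is a minimal sketch of that reading, assuming the extracted relations are held as plain in-memory records rather than in the CouchDB and Lucene storage the authors describe. The function names, the "target" field, and the sample records are hypothetical and are not taken from the paper.

```python
def parse_query(query):
    """Split whitespace-separated 'field:value' terms into a dict."""
    constraints = {}
    for term in query.split():
        field, sep, value = term.partition(":")
        if not sep or not field or not value:
            raise ValueError(f"expected field:value, got {term!r}")
        constraints[field.lower()] = value.lower()
    return constraints


def match(records, constraints):
    """Return the records whose fields equal every constraint."""
    return [r for r in records
            if all(r.get(f) == v for f, v in constraints.items())]


# Hypothetical extracted relations; placeholder names, not data from the paper.
records = [
    {"entity": "google", "relation": "bought", "target": "example_co"},
    {"entity": "google", "relation": "sued", "target": "other_co"},
    {"entity": "acme", "relation": "bought", "target": "example_co"},
]

hits = match(records, parse_query("entity:google relation:bought"))
print([r["target"] for r in hits])
```

A real deployment would hand the query string to the search engine's own parser; the sketch only shows how the two constraints narrow the result set.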

Citations: 1

Advanced partitioning techniques for massively distributed computation

Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data Pub Date : 2012-05-20 DOI: 10.1145/2213836.2213839

Jingren Zhou, Nicolas Bruno, Wei Lin

Abstract: An increasing number of companies rely on distributed data storage and processing over large clusters of commodity machines for critical business decisions. Although plain MapReduce systems provide several benefits, they carry certain limitations that impact developer productivity and optimization opportunities. Higher-level programming languages plus conceptual data models have recently emerged to address such limitations. These languages offer a single-machine programming abstraction and are able to perform sophisticated query optimization and apply efficient execution strategies. In massively distributed computation, data shuffling is typically the most expensive operation and can lead to serious performance bottlenecks if not done properly. An important optimization opportunity in this environment is the judicious placement of repartitioning operators and the choice among alternative implementations. In this paper we discuss advanced partitioning strategies, their implementation, and how they are integrated in the Microsoft SCOPE system. We show experimentally that our approach significantly improves performance for a large class of real-world jobs.
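One general reason the placement of repartitioning operators matters: data already hash-partitioned on a set of columns is also correctly partitioned for grouping on any superset of those columns, so a shuffle can be skipped. The sketch below illustrates that well-known property only; it is not the optimizer described in the paper, and the function names and sample rows are hypothetical.

```python
from collections import defaultdict


def hash_partition(rows, columns, num_partitions):
    """Assign each row to a partition by hashing the given columns."""
    parts = defaultdict(list)
    for row in rows:
        key = tuple(row[c] for c in columns)
        parts[hash(key) % num_partitions].append(row)
    return parts


def needs_repartition(partitioned_on, group_by):
    """True if grouping on `group_by` requires shuffling the input again."""
    current = set(partitioned_on)
    # No shuffle is needed when the current partitioning columns are a
    # non-empty subset of the grouping columns.
    return not current or not current <= set(group_by)


def groups_are_colocated(parts, group_by):
    """Check that every group key appears in exactly one partition."""
    seen = {}
    for pid, rows in parts.items():
        for row in rows:
            key = tuple(row[c] for c in group_by)
            if seen.setdefault(key, pid) != pid:
                return False
    return True


# Hypothetical rows with two integer columns.
rows = [{"a": i % 3, "b": i % 5} for i in range(30)]
parts = hash_partition(rows, ["a"], num_partitions=4)

print(needs_repartition(["a"], ["a", "b"]))
print(groups_are_colocated(parts, ["a", "b"]))
```

The second check confirms empirically what the first predicts: after partitioning on a, every (a, b) group already sits in a single partition.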

Citations: 57

MAQSA: a system for social analytics on news

Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data Pub Date : 2012-05-20 DOI: 10.1145/2213836.2213924

S. Amer-Yahia, Samreen Anjum, Amira Ghenai, Aysha Siddique, Sofiane Abbar, S. Madden, Adam Marcus, Mohammed El-Haddad

Abstract: We present MAQSA, a system for social analytics on news. MAQSA provides an interactive topic-centric dashboard that summarizes news articles and social activity (e.g., comments and tweets) around them. MAQSA helps editors and publishers in newsrooms understand user engagement and the evolution of audience sentiment on various topics of interest. It also helps news consumers explore public reaction to articles relevant to a topic and refine their exploration via related entities, topics, articles, and tweets. Given a topic, e.g., "Gulf Oil Spill" or "The Arab Spring", MAQSA combines three key dimensions (time, geographic location, and topic) to generate a detailed activity dashboard around relevant articles. The dashboard contains an annotated comment timeline and a social graph of comments. It utilizes commenters' locations to build maps of comment sentiment and topics by region of the world. Finally, to facilitate exploration, MAQSA provides listings of related entities, articles, and tweets. It algorithmically processes large collections of articles and tweets, and enables the dynamic specification of topics and dates for exploration. In this demo, participants will be invited to explore the social dynamics around articles on oil spills, the Libyan revolution, and the Arab Spring. In addition, participants will be able to define and explore their own topics dynamically.

Citations: 28

Locality-sensitive hashing scheme based on dynamic collision counting

Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data Pub Date : 2012-05-20 DOI: 10.1145/2213836.2213898

Junhao Gan, Jianlin Feng, Qiong Fang, Wilfred Ng

Abstract: Locality-Sensitive Hashing (LSH) and its variants are well-known methods for solving the c-approximate nearest neighbor (NN) search problem in high-dimensional space. Traditionally, several LSH functions are concatenated to form a "static" compound hash function for building a hash table. In this paper, we propose to use a base of m single LSH functions to construct "dynamic" compound hash functions, and define a new LSH scheme called Collision Counting LSH (C2LSH). If the number of LSH functions under which a data object o collides with a query object q is greater than a pre-specified collision threshold l, then o can be regarded as a good candidate for a c-approximate NN of q. This is the basic idea of C2LSH. Our theoretical studies show that, by appropriately choosing the size m of the LSH function base and the collision threshold l, C2LSH can have a guarantee on query quality. Notably, the parameter m is not affected by the dimensionality of the data objects, which makes C2LSH especially good for high-dimensional NN search. Experimental studies based on synthetic datasets and four real datasets have shown that C2LSH outperforms the state-of-the-art method LSB-forest in high-dimensional space.
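The collision-counting idea in the abstract above can be sketched in a few lines. This is a minimal illustration, assuming a projection-based LSH family of the form h(o) = floor((a . o + b) / w) with Gaussian a. The bucket width w, the base size m, the threshold, and all function names below are illustrative assumptions, not the parameters or code from the paper, and the sketch leaves out everything else the C2LSH scheme does, including how m and l are chosen.

```python
import math
import random


def make_hash_base(m, dim, w, seed=0):
    """Draw m single LSH functions, each a (projection vector, offset) pair."""
    rng = random.Random(seed)
    return [([rng.gauss(0.0, 1.0) for _ in range(dim)], rng.uniform(0.0, w))
            for _ in range(m)]


def hash_point(base, point, w):
    """Apply every function in the base to one point."""
    return [math.floor((sum(a_i * x_i for a_i, x_i in zip(a, point)) + b) / w)
            for a, b in base]


def collision_candidates(data, query, base, w, threshold):
    """Return indices of points whose collision count exceeds the threshold."""
    q_hashes = hash_point(base, query, w)
    candidates = []
    for idx, point in enumerate(data):
        hashes = hash_point(base, point, w)
        # Count the functions under which this point and the query collide.
        count = sum(1 for h, q in zip(hashes, q_hashes) if h == q)
        if count > threshold:
            candidates.append(idx)
    return candidates


# Illustrative values only (assumptions, not from the paper).
dim, m, w, threshold = 8, 20, 4.0, 10
base = make_hash_base(m, dim, w)
query = [0.0] * dim
data = [[0.0] * dim, [1000.0] * dim]  # one identical point, one distant point
print(collision_candidates(data, query, base, w, threshold))
```

A point identical to the query collides under all m functions and is therefore always reported, while a distant point rarely shares enough buckets to pass the threshold.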

Citations: 196

Declarative web application development: encapsulating dynamic JavaScript widgets (abstract only)

Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data Pub Date : 2012-05-20 DOI: 10.1145/2213836.2213969

R. Bolton, David Ing, Christopher Rebert, Kristina Lam Thai

Abstract: The development of modern, highly interactive AJAX Web applications that enable dynamic visualization of data requires writing a great deal of tedious "plumbing code" to interface data between browser-based DOM and AJAX components, the application server, and the SQL database. Worse, each of these layers utilizes a different language. Further, much code is needed to keep the page and application states in sync using an imperative paradigm, which hurts simplicity. These factors result in a frustrating experience for today's Web developer. The FORWARD Project aims to alleviate this frustration by enabling pages that are "rendered views", in the SQL sense of "view". Our work in the project has led to a highly declarative approach whereby JavaScript/AJAX UI widgets automatically render views over the application state (database + session data + page data) without requiring the developer to tediously code how changes to the application state lead to invocation of the components' update methods. In contrast to conventional Web application development approaches, a FORWARD application involves only two languages, both declarative: an extended version of SQL, and an XML-based language for configuration and orchestration. The framework automatically handles efficient exchange of user input and changes to the underlying data, and updates the application state accordingly. The developer does not need to write any JavaScript or explicit updating code themselves. On the client side, FORWARD "units" wrap widgets using JavaScript to collect user input, directly display data, and reflect server-side updates to the data. On the server side, units contain Java code necessary to expose their functionality to the FORWARD framework and define their XML configuration representation. Our demo consists of a dynamically rendered webpage which internally uses AJAX to update a Google Maps widget that shows location markers for current Groupon deals in a specified area. It will illustrate that our SQL-driven approach makes this kind of rich dynamic webpage easy to write, with significant improvements in simplicity, brevity, and development time, while still providing the quality experience expected from top AJAX components. The amount of "plumbing code" is significantly reduced, enhancing the experience of AJAX Web application developers.

Citations: 0

Managing large dynamic graphs efficiently

Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data Pub Date : 2012-05-20 DOI: 10.1145/2213836.2213854

J. Mondal, A. Deshpande

Abstract: There is an increasing need to ingest, manage, and query large volumes of graph-structured data arising in applications like social networks, communication networks, biological networks, and so on. Graph databases that can explicitly reason about the graphical nature of the data, and that can support flexible schemas and node-centric or edge-centric analysis and querying, are ideal for storing such data. However, although there is much work on single-site graph databases and on efficiently executing different types of queries over large graphs, to date there is little work on understanding the challenges in distributed graph databases, which are needed to handle the large scale of such data. In this paper, we propose the design of an in-memory, distributed graph data management system aimed at managing a large-scale, dynamically changing graph and supporting low-latency query processing over it. The key challenge in a distributed graph database is that partitioning a graph across a set of machines inherently results in a large number of distributed traversals across partitions to answer even simple queries. We propose aggressive replication of the nodes in the graph to support low-latency querying, and investigate three novel techniques to minimize the communication bandwidth and the storage requirements. First, we develop a hybrid replication policy that monitors node read-write frequencies to dynamically decide what data to replicate, and whether to do eager or lazy replication. Second, we propose a clustering-based approach to amortize the costs of making these replication decisions. Finally, we propose using a fairness criterion to dictate how replication decisions should be made. We provide both theoretical analysis and efficient algorithms for the optimization problems that arise. We have implemented our framework as middleware on top of the open-source CouchDB key-value store. We evaluate our system on a social graph, and show that it is able to handle very large graphs efficiently and that it reduces network bandwidth consumption significantly.
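The first technique in the abstract above, choosing between eager and lazy replication from observed read and write frequencies, can be illustrated with a toy monitor. The decision rule used here (replicate eagerly only when reads outnumber writes) is an assumption made for illustration, not the policy from the paper, and the class and method names are hypothetical.

```python
from collections import Counter


class ReplicationMonitor:
    """Track per-node reads and writes and suggest a replication mode."""

    def __init__(self):
        self.reads = Counter()
        self.writes = Counter()

    def record_read(self, node):
        self.reads[node] += 1

    def record_write(self, node):
        self.writes[node] += 1

    def decide(self, node):
        # Assumed rule: pushing every update to a replica (eager) pays off
        # only when the node is read more often than it is written;
        # otherwise fetch the latest value on demand (lazy).
        return "eager" if self.reads[node] > self.writes[node] else "lazy"


monitor = ReplicationMonitor()
for _ in range(3):
    monitor.record_read("user_1")      # read-heavy node
monitor.record_write("user_1")
for _ in range(5):
    monitor.record_write("user_2")     # write-heavy node
monitor.record_read("user_2")

print(monitor.decide("user_1"), monitor.decide("user_2"))
```

The intuition behind such a rule is that eager replication costs one message per write while lazy replication costs one round trip per read, so the cheaper mode depends on which operation dominates.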

Citations: 114

CloudAlloc: a monitoring and reservation system for compute clusters

Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data Pub Date : 2012-05-20 DOI: 10.1145/2213836.2213942

Enrico Iori, A. Simitsis, Themis Palpanas, K. Wilkinson, S. Harizopoulos

Citations: 5

Partiqle: an elastic SQL engine over key-value stores

Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data Pub Date : 2012-05-20 DOI: 10.1145/2213836.2213917

J. Tatemura, Oliver Po, Wang-Pin Hsiung, Hakan Hacıgümüş

Citations: 19

Symbiosis in scale out networking and data management

Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data Pub Date : 2012-05-20 DOI: 10.1145/2213836.2213903

Amin Vahdat

Abstract: This talk highlights the symbiotic relationship between data management and networking through a study of two seemingly independent trends in the traditionally separate communities: large-scale data processing and software-defined networking. First, data processing at scale increasingly runs across hundreds or thousands of servers. We show that balancing network performance with computation and storage is a prerequisite to both efficient and scalable data processing. We illustrate the need for scale-out networking in support of data management through a case study of TritonSort, currently the record holder for several sorting benchmarks, including GraySort and JouleSort. Our TritonSort experience shows that disk-bound workloads require 10 Gb/s provisioned bandwidth to keep up with modern processors, while emerging flash workloads require 40 Gb/s fabrics at scale. We next argue for the need to apply data management techniques to enable Software Defined Networking (SDN) and Scale Out Networking. SDN promises the abstraction of a single logical network fabric rather than a collection of thousands of individual boxes. In turn, scale-out networking allows network capacity (ports, bandwidth) to be expanded incrementally, rather than by wholesale fabric replacement. However, SDN requires an extensible model of both static and dynamic network properties and the ability to deliver dynamic updates to a range of network applications in a fault-tolerant, low-latency manner. Doing so in networking environments where updates are typically performed by timer-based broadcasts and models are specified as comma-separated text files processed by one-off scripts presents interesting challenges. For example, consider an environment where applications from routing to traffic engineering to monitoring to intrusion/anomaly detection all essentially boil down to inserting, triggering, and retrieving updates to/from a shared, extensible data store.

Citations: 0

VizDeck: self-organizing dashboards for visual analytics

Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data Pub Date : 2012-05-20 DOI: 10.1145/2213836.2213931

A. Key, Bill Howe, D. Perry, Cecilia R. Aragon

Citations: 146