{"title":"Beyond P vs. NP: Quadratic-Time Hardness for Big Data Problems","authors":"P. Indyk","doi":"10.1145/3087556.3087603","DOIUrl":"https://doi.org/10.1145/3087556.3087603","url":null,"abstract":"The theory of NP-hardness has been very successful in identifying problems that are unlikely to be solvable in polynomial time. However, many other important problems do have polynomial time algorithms, but large exponents in their time bounds can make them run for days, weeks or more. For example, quadratic time algorithms, although practical on moderately sized inputs, can become inefficient on big data problems that involve gigabytes or more of data. Although for many problems no sub-quadratic time algorithms are known, any evidence of quadratic-time hardness has remained elusive. In this talk I will give an overview of recent research that aims to remedy this situation. In particular, I will describe hardness results for problems in string processing (e.g., edit distance computation or regular expression matching) and machine learning (e.g., Support Vector Machines or gradient computation in neural networks). All of them have polynomial time algorithms, but despite extensive amount of research, no near-linear time algorithms have been found for many variants of these problems. I will show that, under a natural complexity-theoretic conjecture, such algorithms do not exist. I will also describe how this framework has led to the development of new algorithms.","PeriodicalId":162994,"journal":{"name":"Proceedings of the 29th ACM Symposium on Parallelism in Algorithms and Architectures","volume":"3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-07-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114997676","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Julienne: A Framework for Parallel Graph Algorithms using Work-efficient Bucketing","authors":"Laxman Dhulipala, G. Blelloch, Julian Shun","doi":"10.1145/3087556.3087580","DOIUrl":"https://doi.org/10.1145/3087556.3087580","url":null,"abstract":"Existing graph-processing frameworks let users develop efficient implementations for many graph problems, but none of them support efficiently bucketing vertices, which is needed for bucketing-based graph algorithms such as Delta-stepping and approximate set-cover. Motivated by the lack of simple, scalable, and efficient implementations of bucketing-based algorithms, we develop the Julienne framework, which extends a recent shared-memory graph processing framework called Ligra with an interface for maintaining a collection of buckets under vertex insertions and bucket deletions. We provide a theoretically efficient parallel implementation of our bucketing interface and study several bucketing-based algorithms that make use of it (either bucketing by remaining degree or by distance) to improve performance: the peeling algorithm for k-core (coreness), Delta-stepping, weighted breadth-first search, and approximate set cover. The implementations are all simple and concise (under 100 lines of code). Using our interface, we develop the first work-efficient parallel algorithm for k-core in the literature with nontrivial parallelism. We experimentally show that our bucketing implementation scales well and achieves high throughput on both synthetic and real-world workloads. Furthermore, the bucketing-based algorithms written in Julienne achieve up to 43x speedup on 72 cores with hyper-threading over well-tuned sequential baselines, significantly outperform existing work-inefficient implementations in Ligra, and either outperform or are competitive with existing special-purpose parallel codes for the same problem. We experimentally study our implementations on the largest publicly available graphs and show that they scale well in practice, processing real-world graphs with billions of edges in seconds, and hundreds of billions of edges in a few minutes. As far as we know, this is the first time that graphs at this scale have been analyzed in the main memory of a single multicore machine.","PeriodicalId":162994,"journal":{"name":"Proceedings of the 29th ACM Symposium on Parallelism in Algorithms and Architectures","volume":"18 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-07-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128729773","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Session details: SESSION 1","authors":"F. Heide","doi":"10.1145/3257324","DOIUrl":"https://doi.org/10.1145/3257324","url":null,"abstract":"","PeriodicalId":162994,"journal":{"name":"Proceedings of the 29th ACM Symposium on Parallelism in Algorithms and Architectures","volume":"88 6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-07-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129051307","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Some Sequential Algorithms are Almost Always Parallel","authors":"G. Blelloch","doi":"10.1145/3087556.3087602","DOIUrl":"https://doi.org/10.1145/3087556.3087602","url":null,"abstract":"Over the years many interesting and efficient parallel algorithms have been developed to solve a wide variety of problems, but not much attention has been paid to studying the inherent parallelism in sequential algorithms---i.e., understanding the depth of their dependence structure, and how shallow dependence structures might beused to develop efficient parallel implementations. In this talk I will describe recent work on analyzing the dependence depth of iterative sequential algorithms---ones that loop over a collection of elements. Many of these algorithms have deep dependence chains in the worst case, but shallow chains (polylog w.h.p.) if the elements are randomly ordered. Examples include many fundamental algorithms: the Knuth shuffle for random permutations, sorting by insertion into a binary search tree, greedy maximal independent set (MIS), greedy maximal matching, greedy graph-coloring, counting cycles in a permutation, incremental k-dimensional linear programming, and incremental 2d Delaunay triangulation. An advantage of the approach is that it can lead to very simple and efficient parallel algorithms. Our MIS algorithm, for example can be coded in a dozen or so lines, and is significantly faster than Luby's algorithm on modern multicore machines. Also the approach encourages snapping the view that sequential and parallel algorithms are distinct, and instead thinking of algorithms, in general, as collections of instructions with dependences among them.","PeriodicalId":162994,"journal":{"name":"Proceedings of the 29th ACM Symposium on Parallelism in Algorithms and Architectures","volume":"114 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-07-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117209886","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Concurrent Data Structures for Near-Memory Computing","authors":"Zhiyu Liu, I. Calciu, M. Herlihy, O. Mutlu","doi":"10.1145/3087556.3087582","DOIUrl":"https://doi.org/10.1145/3087556.3087582","url":null,"abstract":"The performance gap between memory and CPU has grown exponentially. To bridge this gap, hardware architects have proposed near-memory computing (also called processing-in-memory, or PIM), where a lightweight processor (called a PIM core) is located close to memory. Due to its proximity to memory, a memory access from a PIM core is much faster than that from a CPU core. New advances in 3D integration and die-stacked memory make PIM viable in the near future. Prior work has shown significant performance improvements by using PIM for embarrassingly parallel and data-intensive applications, as well as for pointer-chasing traversals in sequential data structures. However, current server machines have hundreds of cores, and algorithms for concurrent data structures exploit these cores to achieve high throughput and scalability, with significant benefits over sequential data structures. Thus, it is important to examine how PIM performs with respect to modern concurrent data structures and understand how concurrent data structures can be developed to take advantage of PIM. This paper is the first to examine the design of concurrent data structures for PIM. We show two main results: (1) naive PIM data structures cannot outperform state-of-the-art concurrent data structures, such as pointer-chasing data structures and FIFO queues, (2) novel designs for PIM data structures, using techniques such as combining, partitioning and pipelining, can outperform traditional concurrent data structures, with a significantly simpler design.","PeriodicalId":162994,"journal":{"name":"Proceedings of the 29th ACM Symposium on Parallelism in Algorithms and Architectures","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-07-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125820969","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Brief Announcement: Approximation Algorithms for Unsplittable Resource Allocation Problems with Diseconomies of Scale","authors":"Antje Bjelde, Max Klimm, Daniel Schmand","doi":"10.1145/3087556.3087597","DOIUrl":"https://doi.org/10.1145/3087556.3087597","url":null,"abstract":"We study general resource allocation problems with a diseconomy of scale. Given a finite set of commodities that request certain resources, the cost of each resource grows superlinearly with the demand for it, and our goal is to minimize the total cost of the resources. In large systems with limited coordination, it is natural to consider local dynamics where in each step a single commodity switches its allocated resources whenever the new solution after the switch has smaller total cost over all commodities. This yields a deterministic and polynomial time algorithm with approximation factor arbitrarily close to the locality gap, i.e., the worst case ratio of the cost of a local optimal and a global optimal solution. For costs that are polynomials with non-negative coefficients and maximal degree d, we provide a locality gap for weighted problems that is tight for all values of d. For unweighted problems, the locality gap asymptotically matches the approximation guarantee of the currently best known centralized algorithm [Makarychev, Srividenko FOCS14] but only requires local knowledge of the commodities.","PeriodicalId":162994,"journal":{"name":"Proceedings of the 29th ACM Symposium on Parallelism in Algorithms and Architectures","volume":"35 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-07-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134327399","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Brief Announcement: Parallel Dynamic Tree Contraction via Self-Adjusting Computation","authors":"Umut A. Acar, V. Aksenov, Sam Westrick","doi":"10.1145/3087556.3087595","DOIUrl":"https://doi.org/10.1145/3087556.3087595","url":null,"abstract":"Dynamic algorithms are used to compute a property of some data while the data undergoes changes over time. Many dynamic algorithms have been proposed but nearly all are sequential. In this paper, we present our ongoing work on designing a parallel algorithm for the dynamic trees problem, which requires computing a property of a forest as the forest undergoes changes. Our algorithm allows insertion and/or deletion of both vertices and edges anywhere in the input and performs updates in parallel. We obtain our algorithm by applying a dynamization technique called self-adjusting computation to the classic algorithm of Miller and Reif for tree contraction.","PeriodicalId":162994,"journal":{"name":"Proceedings of the 29th ACM Symposium on Parallelism in Algorithms and Architectures","volume":"44 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-07-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"120949453","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Bicriteria Distributed Submodular Maximization in a Few Rounds","authors":"Alessandro Epasto, V. Mirrokni, Morteza Zadimoghaddam","doi":"10.1145/3087556.3087574","DOIUrl":"https://doi.org/10.1145/3087556.3087574","url":null,"abstract":"We study the problem of efficiently optimizing submodular functions under cardinality constraints in distributed setting. Recently, several distributed algorithms for this problem have been introduced which either achieve a sub-optimal solution or they run in super-constant number of rounds of computation. Unlike previous work, we aim to design distributed algorithms in multiple rounds with almost optimal approximation guarantees at the cost of outputting a larger number of elements. Toward this goal, we present a distributed algorithm that, for any ε > 0 and any constant r, outputs a set S of O(rk/ε1/r) items in r rounds, and achieves a (1-ε)-approximation of the value of the optimum set with k items. This is the first distributed algorithm that achieves an approximation factor of (1-ε) running in less than log 1/ε number of rounds. We also prove a hardness result showing that the output of any 1-ε approximation distributed algorithm limited to one distributed round should have at least Ω(k/ε) items. In light of this hardness result, our distributed algorithm in one round, r = 1, is asymptotically tight in terms of the output size. We support the theoretical guarantees with an extensive empirical study of our algorithm showing that achieving almost optimum solutions is indeed possible in a few rounds for large-scale real datasets.","PeriodicalId":162994,"journal":{"name":"Proceedings of the 29th ACM Symposium on Parallelism in Algorithms and Architectures","volume":"205 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-07-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123557990","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Provably Efficient Scheduling of Cache-oblivious Wavefront Algorithms","authors":"R. Chowdhury, P. Ganapathi, Yuan Tang, Jesmin Jahan Tithi","doi":"10.1145/3087556.3087586","DOIUrl":"https://doi.org/10.1145/3087556.3087586","url":null,"abstract":"Iterative wavefront algorithms for evaluating dynamic programming recurrences exploit optimal parallelism but show poor cache performance. Tiled-iterative wavefront algorithms achieve optimal cache complexity and high parallelism but are cache-aware and hence are not portable and not cache-adaptive. On the other hand, standard cache-oblivious recursive divide-and-conquer algorithms have optimal serial cache complexity but often have low parallelism due to artificial dependencies among subtasks. Recently, we introduced cache-oblivious recursive wavefront (COW) algorithms, which do not have any artificial dependencies, but they are too complicated to develop, analyze, implement, and generalize. Though COW algorithms are based on fork-join primitives, they extensively use atomic operations for ensuring correctness, and as a result, performance guarantees (i.e., parallel running time and parallel cache complexity) provided by state-of-the-art schedulers (e.g., the randomized work-stealing scheduler) for programs with fork-join primitives do not apply. Also, extensive use of atomic locks may result in high overhead in implementation. In this paper, we show how to systematically transform standard cache-oblivious recursive divide-and-conquer algorithms into recursive wavefront algorithms to achieve optimal parallel cache complexity and high parallelism under state-of-the-art schedulers for fork-join programs. Unlike COW algorithms these new algorithms do not use atomic operations. Instead, they use closed-form formulas to compute the time when each divide-and-conquer function must be launched in order to achieve high parallelism without losing cache performance. The resulting implementations are arguably much simpler than implementations of known COW algorithms. We present theoretical analyses and experimental performance and scalability results showing a superiority of these new algorithms over existing algorithms.","PeriodicalId":162994,"journal":{"name":"Proceedings of the 29th ACM Symposium on Parallelism in Algorithms and Architectures","volume":"81 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-07-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123171657","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Session details: KEYNOTE LECTURE 2","authors":"M. Hajiaghayi","doi":"10.1145/3257327","DOIUrl":"https://doi.org/10.1145/3257327","url":null,"abstract":"","PeriodicalId":162994,"journal":{"name":"Proceedings of the 29th ACM Symposium on Parallelism in Algorithms and Architectures","volume":"42 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-07-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128248722","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}