{"title":"Beyond P vs. NP: Quadratic-Time Hardness for Big Data Problems","authors":"P. Indyk","doi":"10.1145/3087556.3087603","DOIUrl":"https://doi.org/10.1145/3087556.3087603","url":null,"abstract":"The theory of NP-hardness has been very successful in identifying problems that are unlikely to be solvable in polynomial time. However, many other important problems do have polynomial time algorithms, but large exponents in their time bounds can make them run for days, weeks or more. For example, quadratic time algorithms, although practical on moderately sized inputs, can become inefficient on big data problems that involve gigabytes or more of data. Although for many problems no sub-quadratic time algorithms are known, any evidence of quadratic-time hardness has remained elusive. In this talk I will give an overview of recent research that aims to remedy this situation. In particular, I will describe hardness results for problems in string processing (e.g., edit distance computation or regular expression matching) and machine learning (e.g., Support Vector Machines or gradient computation in neural networks). All of them have polynomial time algorithms, but despite extensive amount of research, no near-linear time algorithms have been found for many variants of these problems. I will show that, under a natural complexity-theoretic conjecture, such algorithms do not exist. I will also describe how this framework has led to the development of new algorithms.","PeriodicalId":162994,"journal":{"name":"Proceedings of the 29th ACM Symposium on Parallelism in Algorithms and Architectures","volume":"3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-07-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114997676","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Julienne: A Framework for Parallel Graph Algorithms using Work-efficient Bucketing","authors":"Laxman Dhulipala, G. Blelloch, Julian Shun","doi":"10.1145/3087556.3087580","DOIUrl":"https://doi.org/10.1145/3087556.3087580","url":null,"abstract":"Existing graph-processing frameworks let users develop efficient implementations for many graph problems, but none of them support efficiently bucketing vertices, which is needed for bucketing-based graph algorithms such as Delta-stepping and approximate set-cover. Motivated by the lack of simple, scalable, and efficient implementations of bucketing-based algorithms, we develop the Julienne framework, which extends a recent shared-memory graph processing framework called Ligra with an interface for maintaining a collection of buckets under vertex insertions and bucket deletions. We provide a theoretically efficient parallel implementation of our bucketing interface and study several bucketing-based algorithms that make use of it (either bucketing by remaining degree or by distance) to improve performance: the peeling algorithm for k-core (coreness), Delta-stepping, weighted breadth-first search, and approximate set cover. The implementations are all simple and concise (under 100 lines of code). Using our interface, we develop the first work-efficient parallel algorithm for k-core in the literature with nontrivial parallelism. We experimentally show that our bucketing implementation scales well and achieves high throughput on both synthetic and real-world workloads. Furthermore, the bucketing-based algorithms written in Julienne achieve up to 43x speedup on 72 cores with hyper-threading over well-tuned sequential baselines, significantly outperform existing work-inefficient implementations in Ligra, and either outperform or are competitive with existing special-purpose parallel codes for the same problem. We experimentally study our implementations on the largest publicly available graphs and show that they scale well in practice, processing real-world graphs with billions of edges in seconds, and hundreds of billions of edges in a few minutes. As far as we know, this is the first time that graphs at this scale have been analyzed in the main memory of a single multicore machine.","PeriodicalId":162994,"journal":{"name":"Proceedings of the 29th ACM Symposium on Parallelism in Algorithms and Architectures","volume":"18 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-07-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128729773","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Session details: SESSION 1","authors":"F. Heide","doi":"10.1145/3257324","DOIUrl":"https://doi.org/10.1145/3257324","url":null,"abstract":"","PeriodicalId":162994,"journal":{"name":"Proceedings of the 29th ACM Symposium on Parallelism in Algorithms and Architectures","volume":"88 6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-07-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129051307","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Some Sequential Algorithms are Almost Always Parallel","authors":"G. Blelloch","doi":"10.1145/3087556.3087602","DOIUrl":"https://doi.org/10.1145/3087556.3087602","url":null,"abstract":"Over the years many interesting and efficient parallel algorithms have been developed to solve a wide variety of problems, but not much attention has been paid to studying the inherent parallelism in sequential algorithms---i.e., understanding the depth of their dependence structure, and how shallow dependence structures might beused to develop efficient parallel implementations. In this talk I will describe recent work on analyzing the dependence depth of iterative sequential algorithms---ones that loop over a collection of elements. Many of these algorithms have deep dependence chains in the worst case, but shallow chains (polylog w.h.p.) if the elements are randomly ordered. Examples include many fundamental algorithms: the Knuth shuffle for random permutations, sorting by insertion into a binary search tree, greedy maximal independent set (MIS), greedy maximal matching, greedy graph-coloring, counting cycles in a permutation, incremental k-dimensional linear programming, and incremental 2d Delaunay triangulation. An advantage of the approach is that it can lead to very simple and efficient parallel algorithms. Our MIS algorithm, for example can be coded in a dozen or so lines, and is significantly faster than Luby's algorithm on modern multicore machines. Also the approach encourages snapping the view that sequential and parallel algorithms are distinct, and instead thinking of algorithms, in general, as collections of instructions with dependences among them.","PeriodicalId":162994,"journal":{"name":"Proceedings of the 29th ACM Symposium on Parallelism in Algorithms and Architectures","volume":"114 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-07-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117209886","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Concurrent Data Structures for Near-Memory Computing","authors":"Zhiyu Liu, I. Calciu, M. Herlihy, O. Mutlu","doi":"10.1145/3087556.3087582","DOIUrl":"https://doi.org/10.1145/3087556.3087582","url":null,"abstract":"The performance gap between memory and CPU has grown exponentially. To bridge this gap, hardware architects have proposed near-memory computing (also called processing-in-memory, or PIM), where a lightweight processor (called a PIM core) is located close to memory. Due to its proximity to memory, a memory access from a PIM core is much faster than that from a CPU core. New advances in 3D integration and die-stacked memory make PIM viable in the near future. Prior work has shown significant performance improvements by using PIM for embarrassingly parallel and data-intensive applications, as well as for pointer-chasing traversals in sequential data structures. However, current server machines have hundreds of cores, and algorithms for concurrent data structures exploit these cores to achieve high throughput and scalability, with significant benefits over sequential data structures. Thus, it is important to examine how PIM performs with respect to modern concurrent data structures and understand how concurrent data structures can be developed to take advantage of PIM. This paper is the first to examine the design of concurrent data structures for PIM. We show two main results: (1) naive PIM data structures cannot outperform state-of-the-art concurrent data structures, such as pointer-chasing data structures and FIFO queues, (2) novel designs for PIM data structures, using techniques such as combining, partitioning and pipelining, can outperform traditional concurrent data structures, with a significantly simpler design.","PeriodicalId":162994,"journal":{"name":"Proceedings of the 29th ACM Symposium on Parallelism in Algorithms and Architectures","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-07-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125820969","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Brief Announcement: Approximation Algorithms for Unsplittable Resource Allocation Problems with Diseconomies of Scale","authors":"Antje Bjelde, Max Klimm, Daniel Schmand","doi":"10.1145/3087556.3087597","DOIUrl":"https://doi.org/10.1145/3087556.3087597","url":null,"abstract":"We study general resource allocation problems with a diseconomy of scale. Given a finite set of commodities that request certain resources, the cost of each resource grows superlinearly with the demand for it, and our goal is to minimize the total cost of the resources. In large systems with limited coordination, it is natural to consider local dynamics where in each step a single commodity switches its allocated resources whenever the new solution after the switch has smaller total cost over all commodities. This yields a deterministic and polynomial time algorithm with approximation factor arbitrarily close to the locality gap, i.e., the worst case ratio of the cost of a local optimal and a global optimal solution. For costs that are polynomials with non-negative coefficients and maximal degree d, we provide a locality gap for weighted problems that is tight for all values of d. For unweighted problems, the locality gap asymptotically matches the approximation guarantee of the currently best known centralized algorithm [Makarychev, Srividenko FOCS14] but only requires local knowledge of the commodities.","PeriodicalId":162994,"journal":{"name":"Proceedings of the 29th ACM Symposium on Parallelism in Algorithms and Architectures","volume":"35 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-07-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134327399","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Brief Announcement: Parallel Dynamic Tree Contraction via Self-Adjusting Computation","authors":"Umut A. Acar, V. Aksenov, Sam Westrick","doi":"10.1145/3087556.3087595","DOIUrl":"https://doi.org/10.1145/3087556.3087595","url":null,"abstract":"Dynamic algorithms are used to compute a property of some data while the data undergoes changes over time. Many dynamic algorithms have been proposed but nearly all are sequential. In this paper, we present our ongoing work on designing a parallel algorithm for the dynamic trees problem, which requires computing a property of a forest as the forest undergoes changes. Our algorithm allows insertion and/or deletion of both vertices and edges anywhere in the input and performs updates in parallel. We obtain our algorithm by applying a dynamization technique called self-adjusting computation to the classic algorithm of Miller and Reif for tree contraction.","PeriodicalId":162994,"journal":{"name":"Proceedings of the 29th ACM Symposium on Parallelism in Algorithms and Architectures","volume":"44 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-07-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"120949453","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Bicriteria Distributed Submodular Maximization in a Few Rounds","authors":"Alessandro Epasto, V. Mirrokni, Morteza Zadimoghaddam","doi":"10.1145/3087556.3087574","DOIUrl":"https://doi.org/10.1145/3087556.3087574","url":null,"abstract":"We study the problem of efficiently optimizing submodular functions under cardinality constraints in distributed setting. Recently, several distributed algorithms for this problem have been introduced which either achieve a sub-optimal solution or they run in super-constant number of rounds of computation. Unlike previous work, we aim to design distributed algorithms in multiple rounds with almost optimal approximation guarantees at the cost of outputting a larger number of elements. Toward this goal, we present a distributed algorithm that, for any ε > 0 and any constant r, outputs a set S of O(rk/ε1/r) items in r rounds, and achieves a (1-ε)-approximation of the value of the optimum set with k items. This is the first distributed algorithm that achieves an approximation factor of (1-ε) running in less than log 1/ε number of rounds. We also prove a hardness result showing that the output of any 1-ε approximation distributed algorithm limited to one distributed round should have at least Ω(k/ε) items. In light of this hardness result, our distributed algorithm in one round, r = 1, is asymptotically tight in terms of the output size. We support the theoretical guarantees with an extensive empirical study of our algorithm showing that achieving almost optimum solutions is indeed possible in a few rounds for large-scale real datasets.","PeriodicalId":162994,"journal":{"name":"Proceedings of the 29th ACM Symposium on Parallelism in Algorithms and Architectures","volume":"205 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-07-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123557990","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Provably Efficient Scheduling of Cache-oblivious Wavefront Algorithms","authors":"R. Chowdhury, P. Ganapathi, Yuan Tang, Jesmin Jahan Tithi","doi":"10.1145/3087556.3087586","DOIUrl":"https://doi.org/10.1145/3087556.3087586","url":null,"abstract":"Iterative wavefront algorithms for evaluating dynamic programming recurrences exploit optimal parallelism but show poor cache performance. Tiled-iterative wavefront algorithms achieve optimal cache complexity and high parallelism but are cache-aware and hence are not portable and not cache-adaptive. On the other hand, standard cache-oblivious recursive divide-and-conquer algorithms have optimal serial cache complexity but often have low parallelism due to artificial dependencies among subtasks. Recently, we introduced cache-oblivious recursive wavefront (COW) algorithms, which do not have any artificial dependencies, but they are too complicated to develop, analyze, implement, and generalize. Though COW algorithms are based on fork-join primitives, they extensively use atomic operations for ensuring correctness, and as a result, performance guarantees (i.e., parallel running time and parallel cache complexity) provided by state-of-the-art schedulers (e.g., the randomized work-stealing scheduler) for programs with fork-join primitives do not apply. Also, extensive use of atomic locks may result in high overhead in implementation. In this paper, we show how to systematically transform standard cache-oblivious recursive divide-and-conquer algorithms into recursive wavefront algorithms to achieve optimal parallel cache complexity and high parallelism under state-of-the-art schedulers for fork-join programs. Unlike COW algorithms these new algorithms do not use atomic operations. Instead, they use closed-form formulas to compute the time when each divide-and-conquer function must be launched in order to achieve high parallelism without losing cache performance. The resulting implementations are arguably much simpler than implementations of known COW algorithms. We present theoretical analyses and experimental performance and scalability results showing a superiority of these new algorithms over existing algorithms.","PeriodicalId":162994,"journal":{"name":"Proceedings of the 29th ACM Symposium on Parallelism in Algorithms and Architectures","volume":"81 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-07-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123171657","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Session details: KEYNOTE LECTURE 2","authors":"M. Hajiaghayi","doi":"10.1145/3257327","DOIUrl":"https://doi.org/10.1145/3257327","url":null,"abstract":"","PeriodicalId":162994,"journal":{"name":"Proceedings of the 29th ACM Symposium on Parallelism in Algorithms and Architectures","volume":"42 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-07-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128248722","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}