Communication optimal parallel multiplication of sparse random matrices
Grey Ballard, A. Buluç, J. Demmel, L. Grigori, Benjamin Lipshitz, O. Schwartz, Sivan Toledo
DOI: https://doi.org/10.1145/2486159.2486196
Proceedings of the twenty-fifth annual ACM symposium on Parallelism in algorithms and architectures, July 23, 2013
Abstract: Parallel algorithms for sparse matrix-matrix multiplication typically spend most of their time on inter-processor communication rather than on computation, and hardware trends predict the relative cost of communication will only increase. Thus, sparse matrix multiplication algorithms must minimize communication costs in order to scale to large processor counts. In this paper, we consider multiplying sparse matrices corresponding to Erdős-Rényi random graphs on distributed-memory parallel machines. We prove a new lower bound on the expected communication cost for a wide class of algorithms. Our analysis of existing algorithms shows that, while some are optimal for a limited range of matrix density and number of processors, none is optimal in general. We obtain two new parallel algorithms and prove that they match the expected communication cost lower bound, and hence they are optimal.

{"title":"Session details: Session 8","authors":"M. Bender","doi":"10.1145/3250645","DOIUrl":"https://doi.org/10.1145/3250645","url":null,"abstract":"","PeriodicalId":353007,"journal":{"name":"Proceedings of the twenty-fifth annual ACM symposium on Parallelism in algorithms and architectures","volume":"57 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-07-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124656449","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Communication efficient Gaussian elimination with partial pivoting using a shape morphing data layout
Grey Ballard, J. Demmel, Benjamin Lipshitz, O. Schwartz, Sivan Toledo
DOI: https://doi.org/10.1145/2486159.2486198
Proceedings of the twenty-fifth annual ACM symposium on Parallelism in algorithms and architectures, July 23, 2013
Abstract: High performance for numerical linear algebra often comes at the expense of stability. Computing the LU decomposition of a matrix via Gaussian elimination can be organized so that the computation involves regular and efficient data access. However, maintaining numerical stability via partial pivoting involves row interchanges that lead to inefficient data access patterns. To optimize communication efficiency throughout the memory hierarchy we confront two seemingly contradictory requirements: partial pivoting is efficient with column-major layout, whereas a block-recursive layout is optimal for the rest of the computation. We resolve this by introducing a shape morphing procedure that dynamically matches the layout to the computation throughout the algorithm, and show that Gaussian elimination with partial pivoting can be performed in a communication-efficient and cache-oblivious way. Our technique extends to QR decomposition, where computing Householder vectors prefers a different data layout than the rest of the computation.

Expected sum and maximum of displacement of random sensors for coverage of a domain: extended abstract
E. Kranakis, D. Krizanc, Oscar Morales-Ponce, L. Narayanan, J. Opatrny, S. Shende
DOI: https://doi.org/10.1145/2486159.2486171
Proceedings of the twenty-fifth annual ACM symposium on Parallelism in algorithms and architectures, July 23, 2013
Abstract: Assume that n sensors with identical range r = f(n)/(2n), for some f(n) ≥ 1 for all n, are thrown randomly and independently with the uniform distribution in the unit interval [0, 1]. They are required to move to new positions so as to cover the entire unit interval, in the sense that every point in the interval is within the range of a sensor. We obtain tradeoffs between the expected sum and maximum of displacements of the sensors and the range required to accomplish this task. In particular, when f(n) = 1 the expected total displacement is shown to be Θ(√n). For sensors with larger ranges we present two algorithms showing that the upper bound on the sum drops sharply as f(n) increases. The first holds for f(n) ≥ 6 and shows the total movement of the sensors is O(√(ln n)/f(n)), while the second holds for 12 ≤ f(n) ≤ ln n − 2 ln ln n and gives an upper bound of O(ln n/(f(n) e^(f(n)/2))). Note that the second algorithm improves upon the first for f(n) > ln ln n − ln ln ln n. Further, we show a lower bound, for any 1 < f(n) < √n, of Ω(ε f(n) e^(−(1+ε)f(n))), ε > 0. For the expected maximum displacement of a sensor when f(n) = 1, our bounds are Ω(n^(−1/2)) and, for any ε > 0, O(n^(−1/2+ε)). For larger sensor ranges (up to (1 − ε) ln n/n, ε > 0) the expected maximum displacement is shown to be Θ(ln n/n). We also obtain similar sum and maximum displacement and range tradeoffs for area coverage by sensors thrown at random in a unit square. In this case, our bounds for the expected maximum displacement are tight, and those for the expected sum are within a factor of √(ln n). Finally, we investigate the related problem of the expected total and maximum displacement for perimeter coverage (whereby only the perimeter of the region need be covered) of a unit square. For example, when n sensors of radius > 2/n are thrown randomly and independently with the uniform distribution in the interior of a unit square, we can show the total expected displacement required to cover the perimeter is n/12 + o(n).

{"title":"Session details: Session 2","authors":"M. Halldórsson","doi":"10.1145/3250639","DOIUrl":"https://doi.org/10.1145/3250639","url":null,"abstract":"","PeriodicalId":353007,"journal":{"name":"Proceedings of the twenty-fifth annual ACM symposium on Parallelism in algorithms and architectures","volume":"22 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-07-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134376922","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Recursive design of hardware priority queues","authors":"Y. Afek, A. Bremler-Barr, Liron Schiff","doi":"10.1145/2486159.2486194","DOIUrl":"https://doi.org/10.1145/2486159.2486194","url":null,"abstract":"A recursive and fast construction of an n elements priority queue from exponentially smaller hardware priority queues and size n RAM is presented. All priority queue implementations to date either require O (log n) instructions per operation or exponential (with key size) space or expensive special hardware whose cost and latency dramatically increases with the priority queue size. Hence constructing a priority queue (PQ) from considerably smaller hardware priority queues (which are also much faster) while maintaining the O(1) steps per PQ operation is critical. Here we present such an acceleration technique called the Power Priority Queue (PPQ) technique. Specifically, an n elements PPQ is constructed from 2k-1 primitive priority queues of size k√n (k=2,3,...) and a RAM of size n, where the throughput of the construct beats that of a single, size n primitive hardware priority queue. For example an n elements PQ can be constructed from either three √n or five 3√n primitive H/W priority queues. Applying our technique to a TCAM based priority queue, results in TCAM-PPQ, a scalable perfect line rate fair queuing of millions of concurrent connections at speeds of 100 Gbps. This demonstrates the benefits of our scheme when used with hardware TCAM, we expect similar results with systolic arrays, shift-registers and similar technologies. As a by product of our technique we present an O(n) time sorting algorithm in a system equipped with a O(w√n) entries TCAM, where here n is the number of items, and w is the maximum number of bits required to represent an item, improving on a previous result that used an Ω(n) entries TCAM. Finally, we provide a lower bound on the time complexity of sorting n elements with TCAM of size O(n) that matches our TCAM based sorting algorithm.","PeriodicalId":353007,"journal":{"name":"Proceedings of the twenty-fifth annual ACM symposium on Parallelism in algorithms and architectures","volume":"33 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-07-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116306149","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Session details: Session 1","authors":"Philipp Woelfel","doi":"10.1145/3250638","DOIUrl":"https://doi.org/10.1145/3250638","url":null,"abstract":"","PeriodicalId":353007,"journal":{"name":"Proceedings of the twenty-fifth annual ACM symposium on Parallelism in algorithms and architectures","volume":"22 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-07-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125942678","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Broadcasting in logarithmic time for ad hoc network nodes on a line using mimo","authors":"T. Janson, C. Schindelhauer","doi":"10.1145/2486159.2486190","DOIUrl":"https://doi.org/10.1145/2486159.2486190","url":null,"abstract":"We consider n wireless ad hoc network nodes with one antenna each and equidistantly placed on a line. The transmission power of each node is just large enough to reach its next neighbor. For this setting we show that a message can be broadcasted to all nodes in time O(log n) without increasing each node's transmission power. Our algorithm needs O(log n) messages and consumes a total energy which is only a constant factor larger than the standard approach where nodes sequentially transmit the broadcast message to their next neighbors. We obtain this by synchronizing the nodes on the fly and using MIMO (multiple input multiple output) techniques. To achieve this goal we analyze the communication capacity of multiple antennas positioned on a line and use a communication model which is based on electromagnetic fields in free space. We extend existing communication models which either reflect only the sender power or neglect the locations by concentrating only on the channel matrix. Here, we compute the scalar channel matrix from the locations of the antennas and thereby only consider line-of-sight-communication without obstacles, reflections, diffractions or scattering. First, we show that this communication model reduces to the SINR power model if the antennas are uncoordinated. We show that n coordinated antennas can send a signal which is n times more powerful than the sum of their transmission powers. Alternatively, the power can be reduced to an arbitrarily small polynomial with respect to the distance. For coordinated antennas we show how the well-known power gain for MISO (multiple input single output) and SIMO (single input multiple output) can be described in this model. Furthermore, we analyze the channel matrix and prove that in the free space model no diversity gain can be expected for MIMO. Finally, we present the logarithmic time broadcast algorithm which takes advantage of the MISO power gain by self-coordinating wireless nodes.","PeriodicalId":353007,"journal":{"name":"Proceedings of the twenty-fifth annual ACM symposium on Parallelism in algorithms and architectures","volume":"43 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-07-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134013718","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Reducing contention through priority updates
Julian Shun, G. Blelloch, Jeremy T. Fineman, Phillip B. Gibbons
DOI: https://doi.org/10.1145/2486159.2486189
Proceedings of the twenty-fifth annual ACM symposium on Parallelism in algorithms and architectures, July 23, 2013
Abstract: Memory contention can be a serious performance bottleneck in concurrent programs on shared-memory multicore architectures. Having all threads write to a small set of shared locations, for example, can lead to orders of magnitude loss in performance relative to all threads writing to distinct locations, or even relative to a single thread doing all the writes. Shared write access, however, can be very useful in parallel algorithms, concurrent data structures, and protocols for communicating among threads. We study the "priority update" operation as a useful primitive for limiting write contention in parallel and concurrent programs. A priority update takes as arguments a memory location, a new value, and a comparison function >_p that enforces a partial order over values. The operation atomically compares the new value with the current value in the memory location, and writes the new value only if it has higher priority according to >_p. On the implementation side, we show that if implemented appropriately, priority updates greatly reduce memory contention over standard writes or other atomic operations when locations have a high degree of sharing. This is shown both experimentally and theoretically. On the application side, we describe several uses of priority updates for implementing parallel algorithms and concurrent data structures, often in a way that is deterministic, guarantees progress, and avoids serial bottlenecks. We present experiments showing that a variety of such algorithms and data structures perform well under high degrees of sharing. Given the results, we believe that the priority update operation serves as a useful parallel primitive and good programming abstraction as (1) the user largely need not worry about the degree of sharing, (2) it can be used to avoid non-determinism since, in the common case when >_p is a total order, priority updates commute, and (3) it has many applications to programs using shared data.

Between all and nothing - versatile aborts in hardware transactional memory
S. Diestelhorst, Martin Nowack, Michael F. Spear, C. Fetzer
DOI: https://doi.org/10.1145/2486159.2486165
Proceedings of the twenty-fifth annual ACM symposium on Parallelism in algorithms and architectures, July 23, 2013
Abstract: Hardware Transactional Memory (HTM) implementations are becoming available in commercial, off-the-shelf components. While generally comparable, some implementations deviate from the strict all-or-nothing property of pure Transactional Memory. We analyse these deviations and find that with small modifications, they can be used to accelerate and simplify both transactional and non-transactional programming constructs. At the heart of our extensions, we enable access to the transaction's full register state in the abort handler of an existing HTM without extending the architectural register state. Access to the full register state enables applications in both transactional and non-transactional parallel programming: hybrid transactional memory, transactional escape actions, transactional suspend/resume, and alert-on-update.