Communication optimal parallel multiplication of sparse random matrices
Grey Ballard, A. Buluç, J. Demmel, L. Grigori, Benjamin Lipshitz, O. Schwartz, Sivan Toledo
DOI: https://doi.org/10.1145/2486159.2486196
Proceedings of the twenty-fifth annual ACM symposium on Parallelism in algorithms and architectures, July 23, 2013
Abstract: Parallel algorithms for sparse matrix-matrix multiplication typically spend most of their time on inter-processor communication rather than on computation, and hardware trends predict the relative cost of communication will only increase. Thus, sparse matrix multiplication algorithms must minimize communication costs in order to scale to large processor counts. In this paper, we consider multiplying sparse matrices corresponding to Erdős-Rényi random graphs on distributed-memory parallel machines. We prove a new lower bound on the expected communication cost for a wide class of algorithms. Our analysis of existing algorithms shows that, while some are optimal for a limited range of matrix density and number of processors, none is optimal in general. We obtain two new parallel algorithms and prove that they match the expected communication cost lower bound, and hence they are optimal.

{"title":"Session details: Session 8","authors":"M. Bender","doi":"10.1145/3250645","DOIUrl":"https://doi.org/10.1145/3250645","url":null,"abstract":"","PeriodicalId":353007,"journal":{"name":"Proceedings of the twenty-fifth annual ACM symposium on Parallelism in algorithms and architectures","volume":"57 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-07-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124656449","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Communication efficient Gaussian elimination with partial pivoting using a shape morphing data layout
Grey Ballard, J. Demmel, Benjamin Lipshitz, O. Schwartz, Sivan Toledo
DOI: https://doi.org/10.1145/2486159.2486198
Proceedings of the twenty-fifth annual ACM symposium on Parallelism in algorithms and architectures, July 23, 2013
Abstract: High performance for numerical linear algebra often comes at the expense of stability. Computing the LU decomposition of a matrix via Gaussian elimination can be organized so that the computation involves regular and efficient data access. However, maintaining numerical stability via partial pivoting involves row interchanges that lead to inefficient data access patterns. To optimize communication efficiency throughout the memory hierarchy we confront two seemingly contradictory requirements: partial pivoting is efficient with column-major layout, whereas a block-recursive layout is optimal for the rest of the computation. We resolve this by introducing a shape morphing procedure that dynamically matches the layout to the computation throughout the algorithm, and show that Gaussian elimination with partial pivoting can be performed in a communication-efficient and cache-oblivious way. Our technique extends to QR decomposition, where computing Householder vectors prefers a different data layout than the rest of the computation.

Expected sum and maximum of displacement of random sensors for coverage of a domain: extended abstract
E. Kranakis, D. Krizanc, Oscar Morales-Ponce, L. Narayanan, J. Opatrny, S. Shende
DOI: https://doi.org/10.1145/2486159.2486171
Proceedings of the twenty-fifth annual ACM symposium on Parallelism in algorithms and architectures, July 23, 2013
Abstract: Assume that n sensors with identical range r = f(n)/(2n), for some f(n) ≥ 1 for all n, are thrown randomly and independently with the uniform distribution in the unit interval [0, 1]. They are required to move to new positions so as to cover the entire unit interval, in the sense that every point in the interval is within the range of a sensor. We obtain tradeoffs between the expected sum and maximum of displacements of the sensors and the range required to accomplish this task. In particular, when f(n) = 1 the expected total displacement is shown to be Θ(√n). For sensors with larger ranges we present two algorithms showing that the upper bound on the sum drops sharply as f(n) increases. The first holds for f(n) ≥ 6 and shows the total movement of the sensors is O(√(ln n)/f(n)), while the second holds for 12 ≤ f(n) ≤ ln n − 2 ln ln n and gives an upper bound of O(ln n/(f(n) e^(f(n)/2))). Note that the second algorithm improves upon the first for f(n) > ln ln n − ln ln ln n. Further, we show a lower bound, for any 1 < f(n) < √n, of Ω(ε f(n) e^(−(1+ε)f(n))), ε > 0. For the expected maximum displacement of a sensor when f(n) = 1, our bounds are Ω(n^(−1/2)) and, for any ε > 0, O(n^(−1/2+ε)). For larger sensor ranges (up to (1 − ε) ln n/n, ε > 0) the expected maximum displacement is shown to be Θ(ln n/n). We also obtain similar sum and maximum displacement and range tradeoffs for area coverage by sensors thrown at random in a unit square. In this case, our bounds for the expected maximum displacement are tight, and those for the expected sum are within a factor of √(ln n). Finally, we investigate the related problem of the expected total and maximum displacement for perimeter coverage (whereby only the perimeter of the region need be covered) of a unit square. For example, when n sensors of radius > 2/n are thrown randomly and independently with the uniform distribution in the interior of a unit square, we can show the total expected displacement required to cover the perimeter is n/12 + o(n).

{"title":"Session details: Session 2","authors":"M. Halldórsson","doi":"10.1145/3250639","DOIUrl":"https://doi.org/10.1145/3250639","url":null,"abstract":"","PeriodicalId":353007,"journal":{"name":"Proceedings of the twenty-fifth annual ACM symposium on Parallelism in algorithms and architectures","volume":"22 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-07-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134376922","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Recursive design of hardware priority queues","authors":"Y. Afek, A. Bremler-Barr, Liron Schiff","doi":"10.1145/2486159.2486194","DOIUrl":"https://doi.org/10.1145/2486159.2486194","url":null,"abstract":"A recursive and fast construction of an n elements priority queue from exponentially smaller hardware priority queues and size n RAM is presented. All priority queue implementations to date either require O (log n) instructions per operation or exponential (with key size) space or expensive special hardware whose cost and latency dramatically increases with the priority queue size. Hence constructing a priority queue (PQ) from considerably smaller hardware priority queues (which are also much faster) while maintaining the O(1) steps per PQ operation is critical. Here we present such an acceleration technique called the Power Priority Queue (PPQ) technique. Specifically, an n elements PPQ is constructed from 2k-1 primitive priority queues of size k√n (k=2,3,...) and a RAM of size n, where the throughput of the construct beats that of a single, size n primitive hardware priority queue. For example an n elements PQ can be constructed from either three √n or five 3√n primitive H/W priority queues. Applying our technique to a TCAM based priority queue, results in TCAM-PPQ, a scalable perfect line rate fair queuing of millions of concurrent connections at speeds of 100 Gbps. This demonstrates the benefits of our scheme when used with hardware TCAM, we expect similar results with systolic arrays, shift-registers and similar technologies. As a by product of our technique we present an O(n) time sorting algorithm in a system equipped with a O(w√n) entries TCAM, where here n is the number of items, and w is the maximum number of bits required to represent an item, improving on a previous result that used an Ω(n) entries TCAM. Finally, we provide a lower bound on the time complexity of sorting n elements with TCAM of size O(n) that matches our TCAM based sorting algorithm.","PeriodicalId":353007,"journal":{"name":"Proceedings of the twenty-fifth annual ACM symposium on Parallelism in algorithms and architectures","volume":"33 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-07-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116306149","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Session details: Session 1","authors":"Philipp Woelfel","doi":"10.1145/3250638","DOIUrl":"https://doi.org/10.1145/3250638","url":null,"abstract":"","PeriodicalId":353007,"journal":{"name":"Proceedings of the twenty-fifth annual ACM symposium on Parallelism in algorithms and architectures","volume":"22 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-07-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125942678","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Broadcasting in logarithmic time for ad hoc network nodes on a line using mimo","authors":"T. Janson, C. Schindelhauer","doi":"10.1145/2486159.2486190","DOIUrl":"https://doi.org/10.1145/2486159.2486190","url":null,"abstract":"We consider n wireless ad hoc network nodes with one antenna each and equidistantly placed on a line. The transmission power of each node is just large enough to reach its next neighbor. For this setting we show that a message can be broadcasted to all nodes in time O(log n) without increasing each node's transmission power. Our algorithm needs O(log n) messages and consumes a total energy which is only a constant factor larger than the standard approach where nodes sequentially transmit the broadcast message to their next neighbors. We obtain this by synchronizing the nodes on the fly and using MIMO (multiple input multiple output) techniques. To achieve this goal we analyze the communication capacity of multiple antennas positioned on a line and use a communication model which is based on electromagnetic fields in free space. We extend existing communication models which either reflect only the sender power or neglect the locations by concentrating only on the channel matrix. Here, we compute the scalar channel matrix from the locations of the antennas and thereby only consider line-of-sight-communication without obstacles, reflections, diffractions or scattering. First, we show that this communication model reduces to the SINR power model if the antennas are uncoordinated. We show that n coordinated antennas can send a signal which is n times more powerful than the sum of their transmission powers. Alternatively, the power can be reduced to an arbitrarily small polynomial with respect to the distance. For coordinated antennas we show how the well-known power gain for MISO (multiple input single output) and SIMO (single input multiple output) can be described in this model. Furthermore, we analyze the channel matrix and prove that in the free space model no diversity gain can be expected for MIMO. Finally, we present the logarithmic time broadcast algorithm which takes advantage of the MISO power gain by self-coordinating wireless nodes.","PeriodicalId":353007,"journal":{"name":"Proceedings of the twenty-fifth annual ACM symposium on Parallelism in algorithms and architectures","volume":"43 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-07-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134013718","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Reducing contention through priority updates
Julian Shun, G. Blelloch, Jeremy T. Fineman, Phillip B. Gibbons
DOI: https://doi.org/10.1145/2486159.2486189
Proceedings of the twenty-fifth annual ACM symposium on Parallelism in algorithms and architectures, July 23, 2013
Abstract: Memory contention can be a serious performance bottleneck in concurrent programs on shared-memory multicore architectures. Having all threads write to a small set of shared locations, for example, can lead to orders of magnitude loss in performance relative to all threads writing to distinct locations, or even relative to a single thread doing all the writes. Shared write access, however, can be very useful in parallel algorithms, concurrent data structures, and protocols for communicating among threads. We study the "priority update" operation as a useful primitive for limiting write contention in parallel and concurrent programs. A priority update takes as arguments a memory location, a new value, and a comparison function >_p that enforces a partial order over values. The operation atomically compares the new value with the current value in the memory location, and writes the new value only if it has higher priority according to >_p. On the implementation side, we show that if implemented appropriately, priority updates greatly reduce memory contention over standard writes or other atomic operations when locations have a high degree of sharing. This is shown both experimentally and theoretically. On the application side, we describe several uses of priority updates for implementing parallel algorithms and concurrent data structures, often in a way that is deterministic, guarantees progress, and avoids serial bottlenecks. We present experiments showing that a variety of such algorithms and data structures perform well under high degrees of sharing. Given the results, we believe that the priority update operation serves as a useful parallel primitive and good programming abstraction as (1) the user largely need not worry about the degree of sharing, (2) it can be used to avoid non-determinism since, in the common case when >_p is a total order, priority updates commute, and (3) it has many applications to programs using shared data.

Between all and nothing - versatile aborts in hardware transactional memory
S. Diestelhorst, Martin Nowack, Michael F. Spear, C. Fetzer
DOI: https://doi.org/10.1145/2486159.2486165
Proceedings of the twenty-fifth annual ACM symposium on Parallelism in algorithms and architectures, July 23, 2013
Abstract: Hardware Transactional Memory (HTM) implementations are becoming available in commercial, off-the-shelf components. While generally comparable, some implementations deviate from the strict all-or-nothing property of pure Transactional Memory. We analyse these deviations and find that with small modifications, they can be used to accelerate and simplify both transactional and non-transactional programming constructs. At the heart of our extensions, we enable access to the transaction's full register state in the abort handler of an existing HTM without extending the architectural register state. Access to the full register state enables applications in both transactional and non-transactional parallel programming: hybrid transactional memory, transactional escape actions, transactional suspend/resume, and alert-on-update.