Proceedings of the twenty-fifth annual ACM symposium on Parallelism in algorithms and architectures: latest papers

Communication optimal parallel multiplication of sparse random matrices
Grey Ballard, A. Buluç, J. Demmel, L. Grigori, Benjamin Lipshitz, O. Schwartz, Sivan Toledo
{"title":"Communication optimal parallel multiplication of sparse random matrices","authors":"Grey Ballard, A. Buluç, J. Demmel, L. Grigori, Benjamin Lipshitz, O. Schwartz, Sivan Toledo","doi":"10.1145/2486159.2486196","DOIUrl":"https://doi.org/10.1145/2486159.2486196","url":null,"abstract":"Parallel algorithms for sparse matrix-matrix multiplication typically spend most of their time on inter-processor communication rather than on computation, and hardware trends predict the relative cost of communication will only increase. Thus, sparse matrix multiplication algorithms must minimize communication costs in order to scale to large processor counts. In this paper, we consider multiplying sparse matrices corresponding to Erdős-Rényi random graphs on distributed-memory parallel machines. We prove a new lower bound on the expected communication cost for a wide class of algorithms. Our analysis of existing algorithms shows that, while some are optimal for a limited range of matrix density and number of processors, none is optimal in general. We obtain two new parallel algorithms and prove that they match the expected communication cost lower bound, and hence they are optimal.","PeriodicalId":353007,"journal":{"name":"Proceedings of the twenty-fifth annual ACM symposium on Parallelism in algorithms and architectures","volume":"40 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-07-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128145154","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 103
Session details: Session 8
M. Bender
{"title":"Session details: Session 8","authors":"M. Bender","doi":"10.1145/3250645","DOIUrl":"https://doi.org/10.1145/3250645","url":null,"abstract":"","PeriodicalId":353007,"journal":{"name":"Proceedings of the twenty-fifth annual ACM symposium on Parallelism in algorithms and architectures","volume":"57 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-07-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124656449","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Communication efficient gaussian elimination with partial pivoting using a shape morphing data layout
Grey Ballard, J. Demmel, Benjamin Lipshitz, O. Schwartz, Sivan Toledo
{"title":"Communication efficient gaussian elimination with partial pivoting using a shape morphing data layout","authors":"Grey Ballard, J. Demmel, Benjamin Lipshitz, O. Schwartz, Sivan Toledo","doi":"10.1145/2486159.2486198","DOIUrl":"https://doi.org/10.1145/2486159.2486198","url":null,"abstract":"High performance for numerical linear algebra often comes at the expense of stability. Computing the LU decomposition of a matrix via Gaussian Elimination can be organized so that the computation involves regular and efficient data access. However, maintaining numerical stability via partial pivoting involves row interchanges that lead to inefficient data access patterns. To optimize communication efficiency throughout the memory hierarchy we confront two seemingly contradictory requirements: partial pivoting is efficient with column-major layout, whereas a block-recursive layout is optimal for the rest of the computation. We resolve this by introducing a shape morphing procedure that dynamically matches the layout to the computation throughout the algorithm, and show that Gaussian Elimination with partial pivoting can be performed in a communication efficient and cache-oblivious way. Our technique extends to QR decomposition, where computing Householder vectors prefers a different data layout than the rest of the computation.","PeriodicalId":353007,"journal":{"name":"Proceedings of the twenty-fifth annual ACM symposium on Parallelism in algorithms and architectures","volume":"31 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-07-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125143511","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 19
Expected sum and maximum of displacement of random sensors for coverage of a domain: extended abstract
E. Kranakis, D. Krizanc, Oscar Morales-Ponce, L. Narayanan, J. Opatrny, S. Shende
{"title":"Expected sum and maximum of displacement of random sensors for coverage of a domain: extended abstract","authors":"E. Kranakis, D. Krizanc, Oscar Morales-Ponce, L. Narayanan, J. Opatrny, S. Shende","doi":"10.1145/2486159.2486171","DOIUrl":"https://doi.org/10.1145/2486159.2486171","url":null,"abstract":"Assume that n sensors with identical range r = f(n)/(2n), for some f(n) ≥ 1 for all n, are thrown randomly and independently with the uniform distribution in the unit interval [0, 1]. They are required to move to new positions so as to cover the entire unit interval in the sense that every point in the interval is within the range of a sensor. We obtain tradeoffs between the expected sum and maximum of displacements of the sensors and their range required to accomplish this task. In particular, when f(n) = 1 the expected total displacement is shown to be Θ(√n). For sensors with larger ranges we present two algorithms that prove the upper bound for the sum drops sharply as f(n) increases. The first of these holds for f(n) ≥ 6 and shows the total movement of the sensors is O(√(ln n / f(n))) while the second holds for 12 ≤ f(n) ≤ ln n - 2 ln ln n and gives an upper bound of O(ln n / (f(n) e^(f(n)/2))). Note that the second algorithm improves upon the first for f(n) > ln ln n - ln ln ln n. Further we show a lower bound, for any 1 < f(n) < √n of Ω(ε f(n) e^(-(1+ε) f(n))), ε > 0. For the case of the expected maximum displacement of a sensor when f(n) = 1 our bounds are Ω(n^(-1/2)) and for any ε > 0, O(n^(-1/2+ε)). For larger sensor ranges (up to (1 - ε) ln n / n, ε > 0) the expected maximum displacement is shown to be Θ(ln n / n). We also obtain similar sum and maximum displacement and range tradeoffs for area coverage for sensors thrown at random in a unit square. In this case, for the expected maximum displacement our bounds are tight and for the expected sum they are within a factor of √(ln n). 
Finally, we investigate the related problem of the expected total and maximum displacement for perimeter coverage (whereby only the perimeter of the region need be covered) of a unit square. For example, when n sensors of radius > 2/n are thrown randomly and independently with the uniform distribution in the interior of a unit square, we can show the total expected displacement required to cover the perimeter is n/12 + o(n).","PeriodicalId":353007,"journal":{"name":"Proceedings of the twenty-fifth annual ACM symposium on Parallelism in algorithms and architectures","volume":"204 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-07-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124569260","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 26
Session details: Session 2
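The f(n) = 1 case in the abstract above can be explored numerically. The following is a minimal sketch, not the authors' method: it assumes range r = 1/(2n), so that n sensors can cover [0, 1] only from the anchor points (2i - 1)/(2n), and it assumes the standard fact that on a line, matching sensors to anchors in sorted order minimizes the total displacement. The function names `total_displacement` and `mean_total_displacement` are hypothetical.

```python
import random


def total_displacement(sensors):
    """Sum of moves when the i-th leftmost sensor goes to anchor (2i - 1) / (2n)."""
    n = len(sensors)
    ordered = sorted(sensors)
    # Assumption: with range 1/(2n), full coverage forces exactly these anchors.
    anchors = [(2 * i - 1) / (2 * n) for i in range(1, n + 1)]
    return sum(abs(x - a) for x, a in zip(ordered, anchors))


def mean_total_displacement(n, trials=200, seed=0):
    """Monte Carlo estimate of the expected total displacement for n sensors."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        total += total_displacement([rng.random() for _ in range(n)])
    return total / trials
```

If the Θ(√n) bound holds, the ratio `mean_total_displacement(n) / n ** 0.5` should stay roughly constant as n grows; the sketch only estimates that ratio and proves nothing.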
M. Halldórsson
{"title":"Session details: Session 2","authors":"M. Halldórsson","doi":"10.1145/3250639","DOIUrl":"https://doi.org/10.1145/3250639","url":null,"abstract":"","PeriodicalId":353007,"journal":{"name":"Proceedings of the twenty-fifth annual ACM symposium on Parallelism in algorithms and architectures","volume":"22 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-07-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134376922","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Recursive design of hardware priority queues
Y. Afek, A. Bremler-Barr, Liron Schiff
{"title":"Recursive design of hardware priority queues","authors":"Y. Afek, A. Bremler-Barr, Liron Schiff","doi":"10.1145/2486159.2486194","DOIUrl":"https://doi.org/10.1145/2486159.2486194","url":null,"abstract":"A recursive and fast construction of an n elements priority queue from exponentially smaller hardware priority queues and size n RAM is presented. All priority queue implementations to date either require O(log n) instructions per operation or exponential (with key size) space or expensive special hardware whose cost and latency dramatically increases with the priority queue size. Hence constructing a priority queue (PQ) from considerably smaller hardware priority queues (which are also much faster) while maintaining the O(1) steps per PQ operation is critical. Here we present such an acceleration technique called the Power Priority Queue (PPQ) technique. Specifically, an n elements PPQ is constructed from 2k-1 primitive priority queues of size n^(1/k) (k=2,3,...) and a RAM of size n, where the throughput of the construct beats that of a single, size n primitive hardware priority queue. For example an n elements PQ can be constructed from either three √n or five n^(1/3) primitive H/W priority queues. Applying our technique to a TCAM based priority queue, results in TCAM-PPQ, a scalable perfect line rate fair queuing of millions of concurrent connections at speeds of 100 Gbps. This demonstrates the benefits of our scheme when used with hardware TCAM, we expect similar results with systolic arrays, shift-registers and similar technologies. As a by product of our technique we present an O(n) time sorting algorithm in a system equipped with a O(w√n) entries TCAM, where here n is the number of items, and w is the maximum number of bits required to represent an item, improving on a previous result that used an Ω(n) entries TCAM. 
Finally, we provide a lower bound on the time complexity of sorting n elements with TCAM of size O(n) that matches our TCAM based sorting algorithm.","PeriodicalId":353007,"journal":{"name":"Proceedings of the twenty-fifth annual ACM symposium on Parallelism in algorithms and architectures","volume":"33 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-07-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116306149","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 7
Session details: Session 1
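To make the hardware savings in the abstract above concrete, here is a small sizing sketch. It reads the abstract's queue sizes as the k-th root of n (consistent with its example of three √n queues or five cube-root-of-n queues) and assumes sizes are rounded up when n is not a perfect k-th power. The names `ceil_kth_root` and `ppq_components` are hypothetical, and the sketch covers only the component count, not the PPQ construction itself.

```python
def ceil_kth_root(n, k):
    """Smallest integer s with s**k >= n (integer arithmetic, no float rounding)."""
    s = 1
    while s ** k < n:
        s += 1
    return s


def ppq_components(n, k):
    """Return (number of primitive queues, size of each, RAM size) for n elements."""
    if n < 1 or k < 2:
        raise ValueError("expects n >= 1 and k >= 2")
    # 2k - 1 primitive queues of size n^(1/k) plus a RAM of size n, per the abstract.
    return 2 * k - 1, ceil_kth_root(n, k), n
```

For n = 1,000,000 this gives three queues of 1,000 entries for k = 2 and five queues of 100 entries for k = 3, compared with a single primitive queue of 1,000,000 entries.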
Philipp Woelfel
{"title":"Session details: Session 1","authors":"Philipp Woelfel","doi":"10.1145/3250638","DOIUrl":"https://doi.org/10.1145/3250638","url":null,"abstract":"","PeriodicalId":353007,"journal":{"name":"Proceedings of the twenty-fifth annual ACM symposium on Parallelism in algorithms and architectures","volume":"22 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-07-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125942678","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Broadcasting in logarithmic time for ad hoc network nodes on a line using mimo
T. Janson, C. Schindelhauer
{"title":"Broadcasting in logarithmic time for ad hoc network nodes on a line using mimo","authors":"T. Janson, C. Schindelhauer","doi":"10.1145/2486159.2486190","DOIUrl":"https://doi.org/10.1145/2486159.2486190","url":null,"abstract":"We consider n wireless ad hoc network nodes with one antenna each and equidistantly placed on a line. The transmission power of each node is just large enough to reach its next neighbor. For this setting we show that a message can be broadcasted to all nodes in time O(log n) without increasing each node's transmission power. Our algorithm needs O(log n) messages and consumes a total energy which is only a constant factor larger than the standard approach where nodes sequentially transmit the broadcast message to their next neighbors. We obtain this by synchronizing the nodes on the fly and using MIMO (multiple input multiple output) techniques. To achieve this goal we analyze the communication capacity of multiple antennas positioned on a line and use a communication model which is based on electromagnetic fields in free space. We extend existing communication models which either reflect only the sender power or neglect the locations by concentrating only on the channel matrix. Here, we compute the scalar channel matrix from the locations of the antennas and thereby only consider line-of-sight-communication without obstacles, reflections, diffractions or scattering. First, we show that this communication model reduces to the SINR power model if the antennas are uncoordinated. We show that n coordinated antennas can send a signal which is n times more powerful than the sum of their transmission powers. Alternatively, the power can be reduced to an arbitrarily small polynomial with respect to the distance. For coordinated antennas we show how the well-known power gain for MISO (multiple input single output) and SIMO (single input multiple output) can be described in this model. 
Furthermore, we analyze the channel matrix and prove that in the free space model no diversity gain can be expected for MIMO. Finally, we present the logarithmic time broadcast algorithm which takes advantage of the MISO power gain by self-coordinating wireless nodes.","PeriodicalId":353007,"journal":{"name":"Proceedings of the twenty-fifth annual ACM symposium on Parallelism in algorithms and architectures","volume":"43 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-07-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134013718","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 8
Reducing contention through priority updates
Julian Shun, G. Blelloch, Jeremy T. Fineman, Phillip B. Gibbons
{"title":"Reducing contention through priority updates","authors":"Julian Shun, G. Blelloch, Jeremy T. Fineman, Phillip B. Gibbons","doi":"10.1145/2486159.2486189","DOIUrl":"https://doi.org/10.1145/2486159.2486189","url":null,"abstract":"Memory contention can be a serious performance bottleneck in concurrent programs on shared-memory multicore architectures. Having all threads write to a small set of shared locations, for example, can lead to orders of magnitude loss in performance relative to all threads writing to distinct locations, or even relative to a single thread doing all the writes. Shared write access, however, can be very useful in parallel algorithms, concurrent data structures, and protocols for communicating among threads. We study the \"priority update\" operation as a useful primitive for limiting write contention in parallel and concurrent programs. A priority update takes as arguments a memory location, a new value, and a comparison function >p that enforces a partial order over values. The operation atomically compares the new value with the current value in the memory location, and writes the new value only if it has higher priority according to >p. On the implementation side, we show that if implemented appropriately, priority updates greatly reduce memory contention over standard writes or other atomic operations when locations have a high degree of sharing. This is shown both experimentally and theoretically. On the application side, we describe several uses of priority updates for implementing parallel algorithms and concurrent data structures, often in a way that is deterministic, guarantees progress, and avoids serial bottlenecks. We present experiments showing that a variety of such algorithms and data structures perform well under high degrees of sharing. 
Given the results, we believe that the priority update operation serves as a useful parallel primitive and good programming abstraction as (1) the user largely need not worry about the degree of sharing, (2) it can be used to avoid non-determinism since, in the common case when >p is a total order, priority updates commute, and (3) it has many applications to programs using shared data.","PeriodicalId":353007,"journal":{"name":"Proceedings of the twenty-fifth annual ACM symposium on Parallelism in algorithms and architectures","volume":"2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-07-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130445059","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 47
Between all and nothing - versatile aborts in hardware transactional memory
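The semantics of the priority update described in the abstract above can be sketched in a few lines. This is an emulation under stated assumptions, not the paper's implementation: a `threading.Lock` stands in for a hardware compare-and-swap, the unlocked early check is one possible reading of "implemented appropriately", and the names `PriorityCell`, `final_value`, and `all_orders_agree` are hypothetical.

```python
import itertools
import threading


class PriorityCell:
    """A shared location that only ever moves to a higher-priority value."""

    def __init__(self, value):
        self.value = value
        self._lock = threading.Lock()  # stand-in for an atomic compare-and-swap

    def priority_update(self, new, higher):
        """Write `new` only if higher(new, current); return True if written."""
        # Early read: a losing update returns without touching the lock.
        if not higher(new, self.value):
            return False
        with self._lock:
            # Re-check under the lock, since the value may have risen meanwhile.
            if higher(new, self.value):
                self.value = new
                return True
            return False


def final_value(initial, updates, higher):
    """Apply the updates in the given order and return the stored value."""
    cell = PriorityCell(initial)
    for v in updates:
        cell.priority_update(v, higher)
    return cell.value


def all_orders_agree(initial, updates, higher):
    """Check that every ordering of the updates leaves the same final value."""
    results = {final_value(initial, p, higher)
               for p in itertools.permutations(updates)}
    return len(results) == 1
```

With a total order such as `lambda a, b: a > b`, `all_orders_agree` illustrates the commutativity the abstract mentions: the final value does not depend on the order in which the updates arrive.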
S. Diestelhorst, Martin Nowack, Michael F. Spear, C. Fetzer
{"title":"Between all and nothing - versatile aborts in hardware transactional memory","authors":"S. Diestelhorst, Martin Nowack, Michael F. Spear, C. Fetzer","doi":"10.1145/2486159.2486165","DOIUrl":"https://doi.org/10.1145/2486159.2486165","url":null,"abstract":"Hardware Transactional Memory (HTM) implementations are becoming available in commercial, off-the-shelf components. While generally comparable, some implementations deviate from the strict all-or-nothing property of pure Transactional Memory. We analyse these deviations and find that with small modifications, they can be used to accelerate and simplify both transactional and non-transactional programming constructs. At the heart of our extensions we enable access to the transaction's full register state in the abort handler in an existing HTM without extending the architectural register state. Access to the full register state enables applications in both transactional and non-transactional parallel programming: hybrid transactional memory; transactional escape actions; transactional suspend/resume; and alert-on-update.","PeriodicalId":353007,"journal":{"name":"Proceedings of the twenty-fifth annual ACM symposium on Parallelism in algorithms and architectures","volume":"88 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-07-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122772740","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 3