2008 37th International Conference on Parallel Processing: Latest Papers

Maotai: View-Oriented Parallel Programming on CMT Processors
2008 37th International Conference on Parallel Processing Pub Date : 2008-09-09 DOI: 10.1109/ICPP.2008.15
Jiaqi Zhang, Zhiyi Huang, Wenguang Chen, Qihang Huang, Weimin Zheng
{"title":"Maotai: View-Oriented Parallel Programming on CMT Processors","authors":"Jiaqi Zhang, Zhiyi Huang, Wenguang Chen, Qihang Huang, Weimin Zheng","doi":"10.1109/ICPP.2008.15","DOIUrl":"https://doi.org/10.1109/ICPP.2008.15","url":null,"abstract":"View-oriented parallel programming (VOPP) is a novel parallel programming model which uses views for communication between multiple processes. With the introduction of views, mutual exclusion and shared data access are bundled together, which offers both convenience and high performance to parallel programming. This paper presents the implementation of VOPP on chip multithreading (CMT) processors, e.g. UltraSPARC T1. We demonstrate that our implementation of VOPP on multi-core platforms (namely Maotai) shows significantly better performance than directly applying the original DSM implementation of VOPP (namely VODCA) on our platform. We also compare the performance of VOPP with MPI and OpenMP. The experimental results demonstrate that VOPP has better scalability than both MPI and OpenMP on our platform.","PeriodicalId":388408,"journal":{"name":"2008 37th International Conference on Parallel Processing","volume":"49 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121679558","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 21
Challenges and Advances in Parallel Sparse Matrix-Matrix Multiplication
2008 37th International Conference on Parallel Processing Pub Date : 2008-09-09 DOI: 10.1109/ICPP.2008.45
A. Buluç, J. Gilbert
{"title":"Challenges and Advances in Parallel Sparse Matrix-Matrix Multiplication","authors":"A. Buluç, J. Gilbert","doi":"10.1109/ICPP.2008.45","DOIUrl":"https://doi.org/10.1109/ICPP.2008.45","url":null,"abstract":"We identify the challenges that are special to parallel sparse matrix-matrix multiplication (PSpGEMM). We show that sparse algorithms are not as scalable as their dense counterparts, because in general, there are not enough non-trivial arithmetic operations to hide the communication costs as well as the sparsity overheads. We analyze the scalability of 1D and 2D algorithms for PSpGEMM. While the 1D algorithm is a variant of existing implementations, 2D algorithms presented are completely novel. Most of these algorithms are based on the previous research on parallel dense matrix multiplication. We also provide results from preliminary experiments with 2D algorithms.","PeriodicalId":388408,"journal":{"name":"2008 37th International Conference on Parallel Processing","volume":"24 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116542279","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 113
A Distributed Context-Free Language Constrained Shortest Path Algorithm
2008 37th International Conference on Parallel Processing Pub Date : 2008-09-09 DOI: 10.1109/ICPP.2008.67
Charles B. Ward, N. Wiegand, P. Bradford
{"title":"A Distributed Context-Free Language Constrained Shortest Path Algorithm","authors":"Charles B. Ward, N. Wiegand, P. Bradford","doi":"10.1109/ICPP.2008.67","DOIUrl":"https://doi.org/10.1109/ICPP.2008.67","url":null,"abstract":"Formal language constrained shortest path problems are concerned with finding shortest paths in labeled graphs. These labeled paths have the constraint that the concatenation of labels along a path constitutes a valid string in some formal language Lambda over alphabet Sigma. These problems are well studied where the formal language is regular or context-free, and have been used in a variety of applications ranging from databases, to transportation planning, to programming languages. Barrett, Jacob, and Marathe's best algorithm for the context-free language constrained path problem runs in O(|V|^3|N||P|) time, where N is the set of non-terminals for the input grammar and P is the set of productions (expressed in Chomsky Normal Form). We present a work and time efficient distributed version of this algorithm that may be distributed on up to O(|V||N|) nodes.","PeriodicalId":388408,"journal":{"name":"2008 37th International Conference on Parallel Processing","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126011997","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 4
ParColl: Partitioned Collective I/O on the Cray XT
2008 37th International Conference on Parallel Processing Pub Date : 2008-09-09 DOI: 10.1109/ICPP.2008.76
Weikuan Yu, J. Vetter
{"title":"ParColl: Partitioned Collective I/O on the Cray XT","authors":"Weikuan Yu, J. Vetter","doi":"10.1109/ICPP.2008.76","DOIUrl":"https://doi.org/10.1109/ICPP.2008.76","url":null,"abstract":"Collective I/O orchestrates I/O from parallel processes by aggregating fine-grained requests into large ones. However, its performance is typically a fraction of the potential I/O bandwidth on large scale platforms such as Cray XT. Based on our analysis, the time spent in global process synchronization dominates the actual time in file reads/writes, which imposes a 'collective wall' on the performance of collective I/O. In this paper, we introduce a novel technique called partitioned collective I/O (ParColl). ParColl augments the original two-phase collective I/O protocol with new mechanisms for file area partitioning, I/O aggregator distribution and intermediate file views. Through these mechanisms, a group of processes and their targeted file are consistently divided into a collection of small subgroups, each performing I/O aggregation in a disjoint manner. File consistency is maintained through intermediate file views when necessary. Together, these mechanisms greatly reduce the cost of global synchronization. Our experimental results demonstrate that ParColl significantly improves the performance and the scalability of collective I/O. In one case, we show a 416% improvement on 1024 processes for a visualization I/O benchmark. We also show that the I/O patterns in scientific applications can benefit significantly from this technique, e.g. BT-I/O and Flash I/O.","PeriodicalId":388408,"journal":{"name":"2008 37th International Conference on Parallel Processing","volume":"79 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116018765","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 35
A Replication Overlay Assisted Resource Discovery Service for Federated Systems
2008 37th International Conference on Parallel Processing Pub Date : 2008-09-09 DOI: 10.1109/ICPP.2008.61
Hao Yang, Fan Ye, Zhen Liu
{"title":"A Replication Overlay Assisted Resource Discovery Service for Federated Systems","authors":"Hao Yang, Fan Ye, Zhen Liu","doi":"10.1109/ICPP.2008.61","DOIUrl":"https://doi.org/10.1109/ICPP.2008.61","url":null,"abstract":"Federated systems have recently attracted much attention because they allow loosely coupled organizations to share resources for common benefits. However, discovering resources across administrative boundaries is challenging. Despite their willingness to share resources, many organizations prefer not to export their internal resource description to unfamiliar parties. While it is highly desirable to facilitate such voluntary sharing, the system also needs to resolve resource queries in an efficient manner. Unfortunately, none of the existing resource discovery designs, either hierarchical or DHT-based, can address these two challenges at the same time. In this paper, we present the design and evaluation of ROADS, a Replication Overlay Assisted resource Discovery Service for federated systems. In ROADS, the resource owners only export summaries, which are condensed representations of their resource records. These summaries are aggregated along a hierarchy and used to direct queries to appropriate resource owners. To improve its efficiency and resiliency, ROADS replicates the summaries using server overlays that enable \"shortcuts\" in query forwarding. We have implemented ROADS and evaluated its performance through extensive analysis and experiments. The results show that ROADS outperforms a DHT-based design with 1-2 orders of magnitude less overhead in update messages and 50% less query forwarding time.","PeriodicalId":388408,"journal":{"name":"2008 37th International Conference on Parallel Processing","volume":"10 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125134272","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Non-Blocking Concurrent FIFO Queues with Single Word Synchronization Primitives
2008 37th International Conference on Parallel Processing Pub Date : 2008-09-09 DOI: 10.1109/ICPP.2008.82
C. Évéquoz
{"title":"Non-Blocking Concurrent FIFO Queues with Single Word Synchronization Primitives","authors":"C. Évéquoz","doi":"10.1109/ICPP.2008.82","DOIUrl":"https://doi.org/10.1109/ICPP.2008.82","url":null,"abstract":"We present two efficient and practical non-blocking implementations of a concurrent array-based FIFO queue that are suitable for both multiprocessor as well as preemptive multithreaded systems. It is well known that concurrent FIFO queues relying on mutual exclusion cause blocking, which has several drawbacks and degrades overall system performance. Link-based non-blocking queue algorithms have a memory management problem whereby a removed node from the queue can neither be freed nor reused because other threads may still be accessing the node. Existing solutions to this problem introduce a fair amount of overhead and, when the number of threads that can access the FIFO queue is moderate to high, are shown to be less efficient compared to array-based algorithms, which inherently do not suffer from this problem. In addition to being independent of advance knowledge of the number of threads that can access the queue, our new algorithms improve on previously proposed algorithms in that they do not require any special instruction other than a load-linked/store-conditional or a compare-and-swap atomic instruction, both operating on a pointer-wide number of bits. Our new algorithms are thus portable to a broader range of architectures.","PeriodicalId":388408,"journal":{"name":"2008 37th International Conference on Parallel Processing","volume":"300 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131649371","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 5
Optimizing JPEG2000 Still Image Encoding on the Cell Broadband Engine
2008 37th International Conference on Parallel Processing Pub Date : 2008-09-09 DOI: 10.1109/ICPP.2008.39
Seunghwa Kang, David A. Bader
{"title":"Optimizing JPEG2000 Still Image Encoding on the Cell Broadband Engine","authors":"Seunghwa Kang, David A. Bader","doi":"10.1109/ICPP.2008.39","DOIUrl":"https://doi.org/10.1109/ICPP.2008.39","url":null,"abstract":"JPEG2000 is the latest still image coding standard from the JPEG committee, which adopts new algorithms such as embedded block coding with optimized truncation (EBCOT) and discrete wavelet transform (DWT). These algorithms enable superior coding performance over JPEG and support various new features at the cost of increased computational complexity. The Sony-Toshiba-IBM cell broadband engine (or the Cell/B.E.) is a heterogeneous multicore architecture with SIMD accelerators. In this work, we optimize the computationally intensive algorithmic kernels of JPEG2000 for the Cell/B.E. and also introduce a novel data decomposition scheme to achieve high performance with low programming complexity. We compare the Cell/B.E.'s performance to the performance of the Intel Pentium IV 3.2 GHz processor. The Cell/B.E. demonstrates 3.2 times higher performance for lossless encoding and 2.7 times higher performance for lossy encoding. For the DWT, the Cell/B.E. outperforms the Pentium IV processor by 9.1 times for the lossless case and 15 times for the lossy case. We also provide the experimental results on one IBM QS20 blade with two Cell/B.E. chips and the performance comparison with the existing JPEG2000 encoder for the Cell/B.E.","PeriodicalId":388408,"journal":{"name":"2008 37th International Conference on Parallel Processing","volume":"150 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116440618","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 6
A Multiway Partitioning Algorithm for Parallel Gate Level Verilog Simulation
2008 37th International Conference on Parallel Processing Pub Date : 2008-09-09 DOI: 10.1109/ICPP.2008.89
Lijun Li, C. Tropper
{"title":"A Multiway Partitioning Algorithm for Parallel Gate Level Verilog Simulation","authors":"Lijun Li, C. Tropper","doi":"10.1109/ICPP.2008.89","DOIUrl":"https://doi.org/10.1109/ICPP.2008.89","url":null,"abstract":"We describe, in this paper, a multiway partitioning algorithm for parallel gate level Verilog simulation. The algorithm is an extension of a multi-level algorithm which only creates two partitions. Like its predecessor, it takes advantage of the design hierarchy present in a Verilog circuit design. The information it makes use of is contained in the modules and their instances. The algorithm makes use of a hypergraph model of the Verilog design in which a vertex in the hypergraph represents a module instance. Our new algorithm relies upon a metric whose function is to balance the load and the communications between the modules of the Verilog design. Pre-simulation is used to evaluate the partitioning metric. When compared to hMetis, a well known multilevel partitioning algorithm, our algorithm produces a superior speedup and a reduced cut-size.","PeriodicalId":388408,"journal":{"name":"2008 37th International Conference on Parallel Processing","volume":"248 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123259018","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 1
Enabling Streaming Remoting on Embedded Dual-Core Processors
2008 37th International Conference on Parallel Processing Pub Date : 2008-09-09 DOI: 10.1109/ICPP.2008.32
Kun-Yuan Hsieh, Yen-Chih Liu, Po-Wen Wu, Shou-Wei Chang, Jenq-Kuen Lee
{"title":"Enabling Streaming Remoting on Embedded Dual-Core Processors","authors":"Kun-Yuan Hsieh, Yen-Chih Liu, Po-Wen Wu, Shou-Wei Chang, Jenq-Kuen Lee","doi":"10.1109/ICPP.2008.32","DOIUrl":"https://doi.org/10.1109/ICPP.2008.32","url":null,"abstract":"Dual-core processors (and, to an extent, multicore processors) have been adopted in recent years to provide platforms that satisfy the performance requirements of popular multimedia applications. This architecture comprises groups of processing units connected by various interprocess communication mechanisms such as shared memory, memory mapping interrupts, mailboxes, and channel-based protocols. The associated challenges include how to provide programming models and environments for developing streaming applications for such platforms. In this paper, we present middleware called streaming RPC for supporting a streaming-function remoting mechanism on asymmetric dual-core architectures. This middleware has been implemented both on an experimental platform known as the PAC dual-core platform and in TI OMAP dual-core environments. We also present an analytic model of streaming equations to optimize the internal handshaking for our proposed streaming RPC. The usage and efficiency of the proposed methodology are demonstrated in a JPEG decoder, MP3 decoder, and QCIF H.264 decoder. The experimental results show that our approach improves the performance of the decoders of JPEG, MP3, and H.264 by 24%, 38%, and 32% on PAC, respectively. The communication load of internal handshaking has also been reduced compared to the naive use of RPC over embedded dual-core systems. The experiments also show that the performance improvement can be achieved on OMAP dual-core platforms.","PeriodicalId":388408,"journal":{"name":"2008 37th International Conference on Parallel Processing","volume":"18 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123162677","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 16
Thermal Management for 3D Processors via Task Scheduling
2008 37th International Conference on Parallel Processing Pub Date : 2008-09-09 DOI: 10.1109/ICPP.2008.51
Xiuyi Zhou, Yi Xu, Yu Du, Youtao Zhang, Jun Yang
{"title":"Thermal Management for 3D Processors via Task Scheduling","authors":"Xiuyi Zhou, Yi Xu, Yu Du, Youtao Zhang, Jun Yang","doi":"10.1109/ICPP.2008.51","DOIUrl":"https://doi.org/10.1109/ICPP.2008.51","url":null,"abstract":"A rising horizon in chip fabrication is the 3D integration technology. It stacks two or more dies vertically with a dense, high-speed interface to increase the device density and reduce the delay of interconnects across the dies. However, a major challenge in 3D technology is the increased power density which brings the concern of heat dissipation within the processor. High temperatures trigger voltage and frequency throttling in hardware, which degrades the chip performance. Moreover, high temperatures impair the processor's reliability and reduce its lifetime. To alleviate this problem, we propose in this paper an OS-level scheduling algorithm that performs thermal-aware task scheduling on a 3D chip. Our algorithm leverages the inherent thermal variations within and across different tasks, and schedules them to keep the chip temperature low. We observed that vertically adjacent dies have strong thermal correlations, and the scheduler should consider them jointly. Our proposed algorithm can remove on average 54% of hardware DTMs and result in 7.2% performance improvement over the base case.","PeriodicalId":388408,"journal":{"name":"2008 37th International Conference on Parallel Processing","volume":"18 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125435222","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 71