2008 37th International Conference on Parallel Processing: Latest Papers, Page 5

Maotai: View-Oriented Parallel Programming on CMT Processors

2008 37th International Conference on Parallel Processing Pub Date: 2008-09-09 DOI: 10.1109/ICPP.2008.15

Jiaqi Zhang, Zhiyi Huang, Wenguang Chen, Qihang Huang, Weimin Zheng

Citations: 21

Challenges and Advances in Parallel Sparse Matrix-Matrix Multiplication

2008 37th International Conference on Parallel Processing Pub Date: 2008-09-09 DOI: 10.1109/ICPP.2008.45

A. Buluç, J. Gilbert

Citations: 113

A Distributed Context-Free Language Constrained Shortest Path Algorithm

2008 37th International Conference on Parallel Processing Pub Date: 2008-09-09 DOI: 10.1109/ICPP.2008.67

Charles B. Ward, N. Wiegand, P. Bradford

Citations: 4

ParColl: Partitioned Collective I/O on the Cray XT

2008 37th International Conference on Parallel Processing Pub Date: 2008-09-09 DOI: 10.1109/ICPP.2008.76

Weikuan Yu, J. Vetter

Abstract: Collective I/O orchestrates I/O from parallel processes by aggregating fine-grained requests into large ones. However, its performance is typically a fraction of the potential I/O bandwidth on large-scale platforms such as the Cray XT. Based on our analysis, the time spent in global process synchronization dominates the actual time in file reads/writes, which imposes a 'collective wall' on the performance of collective I/O. In this paper, we introduce a novel technique called partitioned collective I/O (ParColl). ParColl augments the original two-phase collective I/O protocol with new mechanisms for file area partitioning, I/O aggregator distribution and intermediate file views. Through these mechanisms, a group of processes and their targeted file are consistently divided into a collection of small subgroups, each performing I/O aggregation in a disjoint manner. File consistency is maintained through intermediate file views when necessary. Together, these mechanisms greatly reduce the cost of global synchronization. Our experimental results demonstrate that ParColl significantly improves the performance and the scalability of collective I/O. In one case, we show a 416% improvement on 1024 processes for a visualization I/O benchmark. We also show that the I/O patterns in scientific applications, e.g., BT-I/O and Flash I/O, can benefit significantly from this technique.

Citations: 35
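
The subgroup idea described in the ParColl abstract can be illustrated with a minimal sketch. It assumes an even split of process ranks and of one contiguous byte range. The function name `partition_collective`, its parameters, and the dictionary layout are hypothetical and are not taken from the paper or from any MPI-IO implementation. The sketch covers only the division into disjoint subgroups; it does not model I/O aggregator distribution or intermediate file views.

```python
def partition_collective(num_procs, file_size, num_groups):
    """Divide ranks 0..num_procs-1 and bytes [0, file_size) into disjoint subgroups.

    Hypothetical illustration: each subgroup gets a contiguous block of ranks
    and a contiguous, half-open file area (start, end) that no other subgroup touches.
    """
    if not 1 <= num_groups <= num_procs:
        raise ValueError("num_groups must be between 1 and num_procs")
    groups = []
    for g in range(num_groups):
        # Integer arithmetic keeps the blocks contiguous and non-overlapping,
        # even when the sizes do not divide evenly.
        rank_lo = g * num_procs // num_groups
        rank_hi = (g + 1) * num_procs // num_groups
        start = g * file_size // num_groups
        end = (g + 1) * file_size // num_groups
        groups.append({"ranks": list(range(rank_lo, rank_hi)),
                       "file_area": (start, end)})
    return groups


if __name__ == "__main__":
    for group in partition_collective(num_procs=8, file_size=1000, num_groups=4):
        print(group)
```

In this sketch each subgroup would synchronize only among its own ranks, which is the property the abstract credits with reducing the cost of global synchronization.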

A Replication Overlay Assisted Resource Discovery Service for Federated Systems

2008 37th International Conference on Parallel Processing Pub Date: 2008-09-09 DOI: 10.1109/ICPP.2008.61

Hao Yang, Fan Ye, Zhen Liu

Abstract: Federated systems have recently attracted much attention because they allow loosely coupled organizations to share resources for common benefits. However, discovering resources across administrative boundaries is challenging. Despite their willingness to share resources, many organizations prefer not to export their internal resource description to unfamiliar parties. While it is highly desirable to facilitate such voluntary sharing, the system also needs to resolve resource queries in an efficient manner. Unfortunately, none of the existing resource discovery designs, either hierarchical or DHT-based, can address these two challenges at the same time. In this paper, we present the design and evaluation of ROADS, a Replication Overlay Assisted resource Discovery Service for federated systems. In ROADS, the resource owners only export summaries, which are condensed representations of their resource records. These summaries are aggregated along a hierarchy and used to direct queries to appropriate resource owners. To improve its efficiency and resiliency, ROADS replicates the summaries using server overlays that enable "shortcuts" in query forwarding. We have implemented ROADS and evaluated its performance through extensive analysis and experiments. The results show that ROADS outperforms a DHT-based design with 1-2 orders of magnitude less overhead in update messages and 50% less query forwarding time.

Citations: 0

Non-Blocking Concurrent FIFO Queues with Single Word Synchronization Primitives

2008 37th International Conference on Parallel Processing Pub Date: 2008-09-09 DOI: 10.1109/ICPP.2008.82

C. Évéquoz

Abstract: We present two efficient and practical non-blocking implementations of a concurrent array-based FIFO queue that are suitable for both multiprocessor and preemptive multithreaded systems. It is well known that concurrent FIFO queues relying on mutual exclusion cause blocking, which has several drawbacks and degrades overall system performance. Link-based non-blocking queue algorithms have a memory management problem whereby a node removed from the queue can neither be freed nor reused because other threads may still be accessing the node. Existing solutions to this problem introduce a fair amount of overhead and, when the number of threads that can access the FIFO queue is moderate to high, are shown to be less efficient compared to array-based algorithms, which inherently do not suffer from this problem. In addition to being independent of advance knowledge of the number of threads that can access the queue, our new algorithms improve on previously proposed algorithms in that they do not require any special instruction other than a load-linked/store-conditional or a compare-and-swap atomic instruction, both operating on a pointer-wide number of bits. Our new algorithms are thus portable to a broader range of architectures.

Citations: 5

Optimizing JPEG2000 Still Image Encoding on the Cell Broadband Engine

2008 37th International Conference on Parallel Processing Pub Date: 2008-09-09 DOI: 10.1109/ICPP.2008.39

Seunghwa Kang, David A. Bader

Abstract: JPEG2000 is the latest still image coding standard from the JPEG committee, which adopts new algorithms such as embedded block coding with optimized truncation (EBCOT) and discrete wavelet transform (DWT). These algorithms enable superior coding performance over JPEG and support various new features at the cost of increased computational complexity. The Sony-Toshiba-IBM Cell Broadband Engine (or the Cell/B.E.) is a heterogeneous multicore architecture with SIMD accelerators. In this work, we optimize the computationally intensive algorithmic kernels of JPEG2000 for the Cell/B.E. and also introduce a novel data decomposition scheme to achieve high performance with low programming complexity. We compare the Cell/B.E.'s performance to the performance of the Intel Pentium IV 3.2 GHz processor. The Cell/B.E. demonstrates 3.2 times higher performance for lossless encoding and 2.7 times higher performance for lossy encoding. For the DWT, the Cell/B.E. outperforms the Pentium IV processor by 9.1 times for the lossless case and 15 times for the lossy case. We also provide experimental results on one IBM QS20 blade with two Cell/B.E. chips and a performance comparison with the existing JPEG2000 encoder for the Cell/B.E.

Citations: 6
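
For readers unfamiliar with the lossless DWT mentioned in the abstract above, the following is a minimal one-dimensional sketch of the reversible 5/3 lifting transform that JPEG2000 uses for lossless coding. It is a plain Python illustration, not the authors' Cell/B.E. code: the function names `forward_53` and `inverse_53` are hypothetical, the input is assumed to be an even-length list of integers, and the two-dimensional, multi-level, SIMD, and data-decomposition aspects of the paper are not shown.

```python
def forward_53(x):
    """One level of the reversible 5/3 lifting transform on an even-length integer list.

    Returns (s, d): low-pass (approximation) and high-pass (detail) coefficients.
    Boundaries use symmetric extension, implemented by clamping neighbor indices.
    """
    if len(x) < 2 or len(x) % 2 != 0:
        raise ValueError("this sketch assumes an even-length input of at least 2 samples")
    even, odd = x[0::2], x[1::2]
    m = len(odd)
    # Predict step: each odd sample minus the floor-average of its even neighbors.
    d = [odd[n] - ((even[n] + even[min(n + 1, m - 1)]) >> 1) for n in range(m)]
    # Update step: each even sample plus a rounded quarter of the neighboring details.
    s = [even[n] + ((d[max(n - 1, 0)] + d[n] + 2) >> 2) for n in range(m)]
    return s, d


def inverse_53(s, d):
    """Undo forward_53 exactly by reversing the lifting steps in opposite order."""
    m = len(s)
    even = [s[n] - ((d[max(n - 1, 0)] + d[n] + 2) >> 2) for n in range(m)]
    odd = [d[n] + ((even[n] + even[min(n + 1, m - 1)]) >> 1) for n in range(m)]
    x = [0] * (2 * m)
    x[0::2] = even
    x[1::2] = odd
    return x


if __name__ == "__main__":
    samples = [10, 12, 14, 13, 9, 7, 8, 20]
    low, high = forward_53(samples)
    print(low, high)
    print(inverse_53(low, high) == samples)
```

Because every lifting step adds or subtracts an integer computed from values the inverse also has, the round trip is exact, which is what makes this filter suitable for lossless encoding.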

A Multiway Partitioning Algorithm for Parallel Gate Level Verilog Simulation

2008 37th International Conference on Parallel Processing Pub Date: 2008-09-09 DOI: 10.1109/ICPP.2008.89

Lijun Li, C. Tropper

Citations: 1

Enabling Streaming Remoting on Embedded Dual-Core Processors

2008 37th International Conference on Parallel Processing Pub Date: 2008-09-09 DOI: 10.1109/ICPP.2008.32

Kun-Yuan Hsieh, Yen-Chih Liu, Po-Wen Wu, Shou-Wei Chang, Jenq-Kuen Lee

Abstract: Dual-core processors (and, to an extent, multicore processors) have been adopted in recent years to provide platforms that satisfy the performance requirements of popular multimedia applications. This architecture comprises groups of processing units connected by various interprocess communication mechanisms such as shared memory, memory mapping interrupts, mailboxes, and channel-based protocols. The associated challenges include how to provide programming models and environments for developing streaming applications for such platforms. In this paper, we present middleware called streaming RPC for supporting a streaming-function remoting mechanism on asymmetric dual-core architectures. This middleware has been implemented both on an experimental platform known as the PAC dual-core platform and in TI OMAP dual-core environments. We also present an analytic model of streaming equations to optimize the internal handshaking for our proposed streaming RPC. The usage and efficiency of the proposed methodology are demonstrated in a JPEG decoder, MP3 decoder, and QCIF H.264 decoder. The experimental results show that our approach improves the performance of the JPEG, MP3, and H.264 decoders by 24%, 38%, and 32% on PAC, respectively. The communication load of internal handshaking has also been reduced compared to the naive use of RPC over embedded dual-core systems. The experiments also show that the performance improvement can be achieved on OMAP dual-core platforms as well.

Citations: 16

Thermal Management for 3D Processors via Task Scheduling

2008 37th International Conference on Parallel Processing Pub Date: 2008-09-09 DOI: 10.1109/ICPP.2008.51

Xiuyi Zhou, Yi Xu, Yu Du, Youtao Zhang, Jun Yang

Citations: 71