{"title":"Fast Source Switching for Gossip-Based Peer-to-Peer Streaming","authors":"Zhenhua Li, Jiannong Cao, Guihai Chen, Yan Liu","doi":"10.1109/ICPP.2008.21","DOIUrl":"https://doi.org/10.1109/ICPP.2008.21","url":null,"abstract":"In this paper we consider gossip-based peer-to-peer streaming applications where multiple sources exist and they work serially. More specifically, we tackle the problem of fast source switching to minimize the startup delay of the new source. We model the source switch process and formulate it into an optimization problem. Then we propose a practical greedy algorithm that can approximate the optimal solution by properly interleaving the data delivery of the old source and the new source. We perform simulations on various real-trace overlay topologies to demonstrate the effectiveness of our algorithm. The simulation results show that our proposed algorithm outperforms the normal source switch algorithm by reducing the source switch time by 20%-30% without bringing extra communication overhead, and the reduction ratio tends to increase when the network scale expands.","PeriodicalId":388408,"journal":{"name":"2008 37th International Conference on Parallel Processing","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124044442","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Deadlock-Free Fully Adaptive Routing in Tori Based on a New Virtual Network Partitioning Scheme","authors":"D. Xiang, Qi Wang, Yi Pan","doi":"10.1109/ICPP.2008.81","DOIUrl":"https://doi.org/10.1109/ICPP.2008.81","url":null,"abstract":"A new, deadlock-free, fully adaptive routing algorithm is proposed for wormhole-switched 3-dimensional tori with only two virtual channels. The deadlock avoidance technique is presented based on a new virtual network partitioning scheme. Unlike the previous virtual network partitioning schemes, the new method allows all virtual networks to share some common virtual channels. A new virtual channel assignment scheme is proposed for the 3-dimensional mesh subnetwork by using a channel overlap scheme. A combination of the virtual network partitioning scheme and the channel overlap scheme provides a deadlock-free fully adaptive routing for 3-dimensional tori. Sufficient theoretical analysis on the proposed virtual network partitioning scheme is presented. Simulation results are presented to demonstrate the effectiveness of the proposed algorithm by comparing with several important previous methods.","PeriodicalId":388408,"journal":{"name":"2008 37th International Conference on Parallel Processing","volume":"2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124362563","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Address Compression and Heterogeneous Interconnects for Energy-Efficient High-Performance in Tiled CMPs","authors":"A. Flores, M. Acacio, Juan L. Aragón","doi":"10.1109/ICPP.2008.33","DOIUrl":"https://doi.org/10.1109/ICPP.2008.33","url":null,"abstract":"Previous studies have shown that the interconnection network of a chip-multiprocessor (CMP) has significant impact on both overall performance and energy consumption. Moreover, wires used in such interconnect can be designed with varying latency, bandwidth and power characteristics. In this work, we present a proposal for performance- and energy-efficient message management in tiled CMPs that combines address compression with a heterogeneous interconnect. Our proposal consists of applying an address compression scheme that dynamically compresses the addresses within coherence messages allowing for a significant area slack. The arising area can be exploited for wire latency improvement by using a heterogeneous interconnection network comprised of a small set of very-low-latency wires for critical short-messages in addition to baseline wires. Detailed simulations of a 16-core CMP show that our proposal obtains average improvements of 10% in execution time and 38% in the Energy-Delay² Product of the interconnect.","PeriodicalId":388408,"journal":{"name":"2008 37th International Conference on Parallel Processing","volume":"2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121058527","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Realistic Models and Efficient Algorithms for Fault Tolerant Scheduling on Heterogeneous Platforms","authors":"A. Benoit, M. Hakem, Y. Robert","doi":"10.1109/ICPP.2008.22","DOIUrl":"https://doi.org/10.1109/ICPP.2008.22","url":null,"abstract":"Most list scheduling heuristics rely on a simple platform model where communication contention is not taken into account. In addition, it is generally assumed that processors in the systems are completely safe. To schedule precedence graphs in a more realistic framework, we introduce an efficient fault tolerant scheduling algorithm that is both contention-aware and capable of supporting ε arbitrary fail-silent/fail-stop processor failures. We focus on a bi-criteria approach, where we aim at minimizing the total execution time, or latency, given a fixed number of failures supported in the system. Our algorithm has a low time complexity, and drastically reduces the number of additional communications induced by the replication mechanism. Experimental results fully demonstrate the usefulness of the proposed algorithm, which leads to efficient execution schemes while guaranteeing a prescribed level of fault tolerance.","PeriodicalId":388408,"journal":{"name":"2008 37th International Conference on Parallel Processing","volume":"48 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123008989","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Application of Automatic Parallelization to Modern Challenges of Scientific Computing Industries","authors":"Brian Armstrong, R. Eigenmann","doi":"10.1109/ICPP.2008.65","DOIUrl":"https://doi.org/10.1109/ICPP.2008.65","url":null,"abstract":"Characteristics of full applications found in scientific computing industries today lead to challenges that are not addressed by state-of-the-art approaches to automatic parallelization. These characteristics are not present in CPU kernel codes nor linear algebra libraries, requiring a fresh look at how to make automatic parallelization apply to today's computational industries using full applications. The challenges to automatic parallelization result from software engineering patterns that implement multifunctionality, reusable execution frameworks, data structures shared across abstract programming interfaces, a multilingual code base for a single application, and the observation that full applications demand more from compile-time analysis than CPU kernel codes do. Each of these challenges has a detrimental impact on compile-time analysis required for automatic parallelization. Then, focusing on a set of target loops that are parallelizable by hand and that result in speedups on par with the distributed parallel version of the full applications, we determine the prevalence of a number of issues that hinder automatic parallelization. These issues point to enabling techniques that are missing from the state-of-the-art. In order for automatic parallelization to become utilized in today's scientific computing industries, the challenges described in this paper must be addressed.","PeriodicalId":388408,"journal":{"name":"2008 37th International Conference on Parallel Processing","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115535484","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"VELO: A Novel Communication Engine for Ultra-Low Latency Message Transfers","authors":"Heiner Litz, H. Fröning, M. Nüssle, U. Brüning","doi":"10.1109/ICPP.2008.85","DOIUrl":"https://doi.org/10.1109/ICPP.2008.85","url":null,"abstract":"This paper presents a novel stateless, virtualized communication engine for sub-microsecond latency. Using a field-programmable-gate-array (FPGA) based prototype we show a latency of 970 ns between two machines with our virtualized engine for low overhead (VELO). The FPGA device is directly connected to the CPUs by a hypertransport link. The described hardware architecture is optimized for small messages and avoids the overhead typically found with direct-memory access (DMA) controlled transfers. The stateless approach allows the hardware unit to be used directly from many threads and processes simultaneously. It provides a secure user level communication with an extremely optimized start-up phase. Microbenchmark results are reported for both a proprietary API and OpenMPI.","PeriodicalId":388408,"journal":{"name":"2008 37th International Conference on Parallel Processing","volume":"29 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126561028","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Solving Large, Irregular Graph Problems Using Adaptive Work-Stealing","authors":"Guojing Cong, Sreedhar B. Kodali, S. Krishnamoorthy, D. Lea, V. Saraswat, Tong Wen","doi":"10.1109/ICPP.2008.88","DOIUrl":"https://doi.org/10.1109/ICPP.2008.88","url":null,"abstract":"Solving large, irregular graph problems efficiently is challenging. Current software systems and commodity multiprocessors do not support fine-grained, irregular parallelism well. We present XWS, the X10 work stealing framework, an open-source runtime for the parallel programming language X10 and a library to be used directly by application writers. XWS extends the Cilk work-stealing framework with several features necessary to efficiently implement graph algorithms, viz., support for improperly nested procedures, global termination detection, and phased computation. We also present a strategy to adaptively control the granularity of parallel tasks in the work-stealing scheme, depending on the instantaneous size of the work queue. We compare the performance of the XWS implementations of spanning tree algorithms with that of the hand-written C and Cilk implementations using various graph inputs. We show that XWS programs (written in Java) scale and exhibit comparable or better performance.","PeriodicalId":388408,"journal":{"name":"2008 37th International Conference on Parallel Processing","volume":"103 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123355907","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Detecting Patterns in MPI Communication Traces","authors":"Robert Preissl, Thomas Köckerbauer, M. Schulz, D. Kranzlmüller, B. Supinski, D. Quinlan","doi":"10.1109/ICPP.2008.71","DOIUrl":"https://doi.org/10.1109/ICPP.2008.71","url":null,"abstract":"Since processor counts in supercomputers are increasing dramatically, efficient interprocessor communication is becoming even more important for the applications that run on them. A high level, abstract understanding of an application's communication behavior would not only simplify debugging of that communication but would also support more directed performance optimization. We explore automated identification of communication patterns to provide that high level abstraction. We introduce an algorithm to extract communication patterns from MPI traces automatically. Our algorithm first finds locally repeating sequences and then iteratively grows them into global patterns. We demonstrate our technique on three realistic codes using traces from up to 128 processors. Our results show that our approach detects the underlying communication pattern within reasonable time and memory constraints, even for large trace sizes.","PeriodicalId":388408,"journal":{"name":"2008 37th International Conference on Parallel Processing","volume":"124 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131526549","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Duplication Based Algorithm for Optimizing Latency Under Throughput Constraints for Streaming Workflows","authors":"N. Vydyanathan, Ümit V. Çatalyürek, T. Kurç, P. Sadayappan, J. Saltz","doi":"10.1109/ICPP.2008.68","DOIUrl":"https://doi.org/10.1109/ICPP.2008.68","url":null,"abstract":"Scheduling, in many application domains, involves the optimization of multiple performance metrics. For example, application workflows with real-time constraints have strict throughput requirements and also desire a low latency or response time. In this paper, we present a novel algorithm for the scheduling of workflows that act on a stream of input data. Our algorithm focuses on the two performance metrics: latency and throughput, and minimizes the latency of workflows while satisfying strict throughput requirements. We leverage pipelined, task and data parallelism in a coordinated manner to meet these objectives and investigate the benefit of task duplication in alleviating communication overheads in the pipelined schedule for different workflow characteristics. The proposed algorithm is designed for a realistic k-port communication model, where each processor can simultaneously communicate with at most k distinct processors. Evaluation using synthetic and application benchmarks shows that our algorithm consistently produces lower-latency schedules and meets throughput requirements, even when previously proposed schemes fail.","PeriodicalId":388408,"journal":{"name":"2008 37th International Conference on Parallel Processing","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126418591","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Adaptive Software Speculation for Enhancing the Cost-Efficiency of Behavior-Oriented Parallelization","authors":"Yunlian Jiang, Xipeng Shen","doi":"10.1109/ICPP.2008.50","DOIUrl":"https://doi.org/10.1109/ICPP.2008.50","url":null,"abstract":"Recently, software speculation has shown promising results in parallelizing complex sequential programs by exploiting dynamic high-level parallelism. The speculation however is cost-inefficient. Failed speculations may cause unnecessary shared resource contention, power consumption, and interference to co-running applications. In this work, we propose adaptive speculation and design two algorithms to predict the profitability of a speculation and dynamically disable and enable the speculation of a region. Experimental results demonstrate significant improvement of computation efficiency without performance degradation. The adaptive speculation can also enhance the usability of behavior-oriented parallelization by allowing more flexibility in labeling possibly parallel regions.","PeriodicalId":388408,"journal":{"name":"2008 37th International Conference on Parallel Processing","volume":"22 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132157400","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}