{"title":"An Incentive-Compatible Mechanism for Scheduling Non-Malleable Parallel Jobs with Individual Deadlines","authors":"T. E. Carroll, Daniel Grosu","doi":"10.1109/ICPP.2008.27","DOIUrl":"https://doi.org/10.1109/ICPP.2008.27","url":null,"abstract":"We design an incentive-compatible mechanism for schedulingn non-malleable parallel jobs on a parallel system comprising m identical processors. Each job is owned by a selfish user who is rational: she performs actions that maximize her welfare even though doing so may cause system-wide suboptimal performance. Each job is characterized by four parameters: value, deadline, number of processors, and execution time. The user's welfare increases by the amount indicated by the value if her job can be completed by the deadline. The user declares theparameters to the mechanism which uses them to compute the schedule and the payments. The user can misreport the parameters, but since the mechanism is incentive-compatible, she chooses to truthfully declare them. We prove the properties of the mechanism and perform a study by simulation.","PeriodicalId":388408,"journal":{"name":"2008 37th International Conference on Parallel Processing","volume":"22 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117016670","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Network Reconfiguration Suitability for Scientific Applications","authors":"Héctor Montaner, F. Silla, V. Santonja, J. Duato","doi":"10.1109/ICPP.2008.58","DOIUrl":"https://doi.org/10.1109/ICPP.2008.58","url":null,"abstract":"This paper analyzes the communication pattern of several scientific applications and how they can make profit of network reconfiguration in order to adapt network topology to the communication needs so that total execution time is reduced. By using an analysis methodology based on real application executions, we study the variation of the required communication bandwidth with time and also the global interprocedural communication patterns. Results show that required bandwidth between each pair of processes does not significantly fluctuates, leading to a constant use of the links and therefore discouraging dynamic reconfigurations of the network during execution time. Nevertheless, the group of busy links changes with each application showing a different communication graph for each of them. Thus, execution time may be accelerated by using an ad-hoc topology, that is, reconfiguring the network before the execution of the application in order to adapt it to the application needs.","PeriodicalId":388408,"journal":{"name":"2008 37th International Conference on Parallel Processing","volume":"42 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133332346","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Exploring Parallel I/O Concurrency with Speculative Prefetching","authors":"Yong Chen, S. Byna, Xian-He Sun, R. Thakur, W. Gropp","doi":"10.1109/ICPP.2008.54","DOIUrl":"https://doi.org/10.1109/ICPP.2008.54","url":null,"abstract":"Parallel applications can benefit greatly from massive computational capability, but their performance usually suffers due to large latency in I/O accesses. Conventional I/O prefetching techniques are conservative and are limited by low accuracy and coverage. As the processor performance has been increasing rapidly and the computing power is virtually free, we introduce a novel speculative approach for comprehensive and aggressive parallel I/O prefetching in this study. We present the design of our approach as well as challenges, solutions, and our prototype implementation. The experiments have shown promising results in reducing I/O access latency.","PeriodicalId":388408,"journal":{"name":"2008 37th International Conference on Parallel Processing","volume":"21 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132380990","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Tracking Nanostructural Evolution in Alloys: Large-Scale Analysis of Atom Probe Tomography Data on Blue Gene/L","authors":"S. Seal, M. Moody, A. Ceguerra, S. Ringer, K. Rajan, S. Aluru","doi":"10.1109/ICPP.2008.73","DOIUrl":"https://doi.org/10.1109/ICPP.2008.73","url":null,"abstract":"The advent of Local Electrode Atom Probe (LEAP) tomography is revolutionizing materials science by enabling near atomic scale imaging of materials. Analysis of three-dimensional atom probe tomography (APT) data holds the promise of relating combinatorial arrangement of atoms to material properties and enable better design and synthesis of complex materials. Existing techniques, which are serial and require O(n2) work for n atoms, do not scale to the hundred million large data sets produced by current generation atom probe microscopes. In this paper, we present an O(n) work autocorrelation based technique that reveals clustering of constituent atoms and spatial associations between them. We present an efficient parallelization of this method and show scaling on a 1,024 node Blue Gene/L. To our knowledge, this is the first parallel algorithm for the analysis of APT data, and together with our linear work autocorrelation technique, is demonstrated to easily scale to billion atom data sets expected in the very near future.","PeriodicalId":388408,"journal":{"name":"2008 37th International Conference on Parallel Processing","volume":"56 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126042390","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Mapping Algorithms for Multiprocessor Tasks on Multi-Core Clusters","authors":"Jörg Dümmler, T. Rauber, G. Rünger","doi":"10.1109/ICPP.2008.42","DOIUrl":"https://doi.org/10.1109/ICPP.2008.42","url":null,"abstract":"In this paper, we explore the use of hierarchically structured multiprocessor tasks (M-tasks) for programming multi-core cluster systems.These systems often have hierarchically structured interconnection networks combining different computing resources, starting with the interconnect within multi-core processors up to the interconnection network combining nodes of the cluster or supercomputer. M-task programs can support the effective use of the computing resources by adapting the task structure of the program to the hierarchical organization of the cluster system and by exploiting the available data parallelism within the M-tasks. In particular, we consider different mapping algorithms for M-tasks and investigate the resulting efficiency and scalability. We present experimental results for different application programs and different multi-core systems.","PeriodicalId":388408,"journal":{"name":"2008 37th International Conference on Parallel Processing","volume":"45 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128714167","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Parallelization and Characterization of Probabilistic Latent Semantic Analysis","authors":"Chuntao Hong, Wenguang Chen, Weimin Zheng, Jiulong Shan, Yurong Chen, Yimin Zhang","doi":"10.1109/ICPP.2008.8","DOIUrl":"https://doi.org/10.1109/ICPP.2008.8","url":null,"abstract":"Probabilistic Latent Semantic Analysis (PLSA) is one of the most popular statistical techniques for the analysis of two-model and co-occurrence data. It has applications in information retrieval and filtering, nature language processing, machine learning from text, and other related areas. However, PLSA is rarely applied to large datasets due to its high computational complexity.This paper presents an optimized and parallelized implementation of PLSA which is capable of processing datasets with 10000 documents in seconds. Compared to the baseline program, our parallelized program can achieve speedup of more than six on an eight-processor machine. The characterization of the parallel program is also presented. The performance analysis of the parallel program indicates that this program is memory intensive and the limited memory bandwidth is the bottleneck for better speedup.","PeriodicalId":388408,"journal":{"name":"2008 37th International Conference on Parallel Processing","volume":"70 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125501855","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Bridging the Gap Between Parallel File Systems and Local File Systems: A Case Study with PVFS","authors":"Peng Gu, Jun Wang, R. Ross","doi":"10.1109/ICPP.2008.43","DOIUrl":"https://doi.org/10.1109/ICPP.2008.43","url":null,"abstract":"Parallel I/O plays an increasingly important role in today's data intensive computing applications. While much attention has been paid to parallel read performance, most of this work has focused on the parallel file system, middleware, or application layers, ignoring the potential for improvement through more effective use of local storage. In this paper, we present the design and implementation of segment-structured on-disk data grouping and prefetching (SOGP), a technique that leverages additional local storage to boost the local data read performance for parallel file systems, especially for those applications with partially overlapped access patterns. Parallel virtual file system (PVFS) is chosen as an example. Our experiments show that an SOGP-enhanced PVFS prototype system can outperforma traditional Linux-Ext3-based PVFS for many applications and benchmarks, in some tests by as much as 230% in terms of I/O bandwidth.","PeriodicalId":388408,"journal":{"name":"2008 37th International Conference on Parallel Processing","volume":"82 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123009885","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Accommodation of the Bandwidth of Large Cache Blocks Using Cache/Memory Link Compression","authors":"Martin Thuresson, P. Stenström","doi":"10.1109/ICPP.2008.47","DOIUrl":"https://doi.org/10.1109/ICPP.2008.47","url":null,"abstract":"The mismatch between processor and memory speed continues to make design issues for memory hierarchies important. While larger cache blocks can exploit more spatial locality, they increase the off-chip memory bandwidth; a scarce resource in future microprocessor designs. We show that it is possible to use larger block sizes without increasing the off-chip memory bandwidth by applying compression techniques to cache/memory block transfers. Since bandwidth is reduced by up to a factor of three, we propose to use larger blocks. While compression/decompression ends up on the critical memory access path, we find that its negative impact on the memory access latency time is often dwarfed by the performance gains from larger block sizes. Our proposed scheme uses a previous mechanism for dynamically choosing a larger cache block when advantageous given the spatial locality in combination with compression. This combined scheme consistently improves performance on average by 19%.","PeriodicalId":388408,"journal":{"name":"2008 37th International Conference on Parallel Processing","volume":"43 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117215561","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"XMT-GPU: A PRAM Architecture for Graphics Computation","authors":"Thomas M. DuBois, Bryant C. Lee, Yi Wang, M. Olano, U. Vishkin","doi":"10.1109/ICPP.2008.35","DOIUrl":"https://doi.org/10.1109/ICPP.2008.35","url":null,"abstract":"The shading processors in graphics hardware are becoming increasingly general-purpose. We test, through simulation and benchmarking, the potential performance impact of replacing these processors with a fully general-purpose parallel processor, without the fixed-function graphics hardware legacy of current graphics processing units (GPUs). The representative general-purpose processor we test against is XMT (for explicit multi-threading), a PRAM-like single-chip parallel architecture. Performance is compared for two characteristic shaders running in a fragment-limited GPU benchmark harness and on a cycle-accurate XMT simulator. The general-purpose processor is found to be significantly faster at a compute-only shader, but slower on a memory bound texture shader. Finally we analyze the design tradeoffs that would allow combining the best of both worlds: (i) a competitive XMT texture shader, with (ii) a general-purpose easy-to-program XMT many-core approach that scales up or down to the amount of parallelism provided by the application and is even compatible with serial code.","PeriodicalId":388408,"journal":{"name":"2008 37th International Conference on Parallel Processing","volume":"116 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132768107","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"DiSTM: A Software Transactional Memory Framework for Clusters","authors":"Christos Kotselidis, Mohammad Ansari, Kim Jarvis, M. Luján, C. Kirkham, I. Watson","doi":"10.1109/ICPP.2008.59","DOIUrl":"https://doi.org/10.1109/ICPP.2008.59","url":null,"abstract":"While transactional memory (TM) research on shared-memory chip multiprocessors has been flourishing over the last years,limited research has been conducted in the cluster domain. In this paper,we introduce a research platform for exploiting software TMon clusters. The distributed software transactional memory (DiSTM) system has been designed for easy prototyping of TM coherence protocols and it does not rely on a software or hardware implementation of distributed shared memory. Three TM coherence protocols have been implemented and evaluated with established TM benchmarks. The decentralized transactional coherence and consistency protocol has been compared against two centralized protocols that utilize leases. Results indicate that depending on network congestion and amount of contention different protocols perform better.","PeriodicalId":388408,"journal":{"name":"2008 37th International Conference on Parallel Processing","volume":"13 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125063747","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}