{"title":"Investigating Scenario-Conscious Asynchronous Rendezvous over RDMA","authors":"Judicael A. Zounmevo, A. Afsahi","doi":"10.1109/CLUSTER.2011.65","DOIUrl":"https://doi.org/10.1109/CLUSTER.2011.65","url":null,"abstract":"In this paper, we propose a light-weight asynchronous message progression mechanism for large message transfers in Message Passing Interface (MPI) Rendezvous protocol that is scenario-conscious and consequently overhead-free in cases where independent message progression naturally happens. Without requiring a dedicated thread, we take advantage of small bursts of CPU to poll for message transfer conditions. The existing application thread is parasitized for the purpose of getting those small bursts of CPU. Our proposed approach is only triggered when the message transfer would otherwise be deferred to the MPI wait call, and it allows for full message progression, achieving 100% overlap. It does not add to the memory footprint of the applications, and is effective in improving the communication performance of most of the applications studied in this paper.","PeriodicalId":200830,"journal":{"name":"2011 IEEE International Conference on Cluster Computing","volume":"27 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-09-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126588883","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Yuan Tian, S. Klasky, H. Abbasi, J. Lofstead, R. Grout, N. Podhorszki, Qing Liu, Yandong Wang, Weikuan Yu
{"title":"EDO: Improving Read Performance for Scientific Applications through Elastic Data Organization","authors":"Yuan Tian, S. Klasky, H. Abbasi, J. Lofstead, R. Grout, N. Podhorszki, Qing Liu, Yandong Wang, Weikuan Yu","doi":"10.1109/CLUSTER.2011.18","DOIUrl":"https://doi.org/10.1109/CLUSTER.2011.18","url":null,"abstract":"Large scale scientific applications are often bottlenecked due to the writing of checkpoint-restart data. Much work has been focused on improving their write performance. With the mounting needs of scientific discovery from these datasets, it is also important to provide good read performance for many common access patterns, which requires effective data organization. To address this issue, we introduce Elastic Data Organization (EDO), which can transparently enable different data organization strategies for scientific applications. Through its flexible data ordering algorithms, EDO harmonizes different access patterns with the underlying file system. Two levels of data ordering are introduced in EDO. One works at the level of data groups (a.k.a process groups). It uses Hilbert Space Filling Curves (SFC) to balance the distribution of data groups across storage targets. Another governs the ordering of data elements within a data group. It divides a data group into sub chunks and strikes a good balance between the size of sub chunks and the number of seek operations. Our experimental results demonstrate that EDO is able to achieve balanced data distribution across all dimensions and improve the read performance of multidimensional datasets in scientific applications.","PeriodicalId":200830,"journal":{"name":"2011 IEEE International Conference on Cluster Computing","volume":"16 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-09-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116640214","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Automatic Hybrid OpenMP + MPI Program Generation for Dynamic Programming Problems","authors":"Dennis Vandenberg, Q. Stout","doi":"10.1109/CLUSTER.2011.28","DOIUrl":"https://doi.org/10.1109/CLUSTER.2011.28","url":null,"abstract":"We describe a program that automatically generates a hybrid OpenMP + MPI program for a class of recursive calculations with template dependencies. Many useful generalized dynamic programming problems fit this category, such as Multiple String Alignment and multi-arm Bernoulli Bandit problems. Solving problems like these, especially those involving several dimensions, can use a significant amount of memory and time. Our generator addresses these issues by dividing the problem into many tiles that can be solved in parallel. Programs generated using this program generator are capable of solving large problems and achieve good scalability when run on many cores. The input supplied to the generator is a high level description of the problem, the output is a fully functioning parallel program for a cluster of shared memory nodes. This high level approach to parallel computation allows the generator to have a large amount of control over memory allocation, load balancing and calculation ordering.","PeriodicalId":200830,"journal":{"name":"2011 IEEE International Conference on Cluster Computing","volume":"51 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-09-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123030280","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Yinjin Fu, Hong Jiang, Nong Xiao, Lei Tian, F. Liu
{"title":"AA-Dedupe: An Application-Aware Source Deduplication Approach for Cloud Backup Services in the Personal Computing Environment","authors":"Yinjin Fu, Hong Jiang, Nong Xiao, Lei Tian, F. Liu","doi":"10.1109/CLUSTER.2011.20","DOIUrl":"https://doi.org/10.1109/CLUSTER.2011.20","url":null,"abstract":"The market for cloud backup services in the personal computing environment is growing due to large volumes of valuable personal and corporate data being stored on desktops, laptops and smart phones. Source deduplication has become a mainstay of cloud backup that saves network bandwidth and reduces storage space. However, there are two challenges facing deduplication for cloud backup service clients: (1) low deduplication efficiency due to a combination of the resource-intensive nature of deduplication and the limited system resources on the PC-based client site, and (2) low data transfer efficiency since post-deduplication data transfers from source to backup servers are typically very small but must often cross a WAN. In this paper, we present AA-Dedupe, an application-aware source deduplication scheme, to significantly reduce the computational overhead, increase the deduplication throughput and improve the data transfer efficiency. The AA-Dedupe approach is motivated by our key observations of the substantial differences among applications in data redundancy and deduplication characteristics, and thus is based on an application-aware index structure that effectively exploits this application awareness. Our experimental evaluations, based on an AA-Dedupe prototype implementation, show that our scheme can improve deduplication efficiency over the state-of-art source-deduplication methods by a factor of 2-7, resulting in shortened backup window, increased power-efficiency and reduced cost for cloud backup services.","PeriodicalId":200830,"journal":{"name":"2011 IEEE International Conference on Cluster Computing","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-09-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126978214","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Zheng Cao, Hongwei Tang, Qiang Li, Bo-Zhang Li, Fei Chen, Kai Wang, Xuejun An, Ninghui Sun
{"title":"Design of HPC Node with Heterogeneous Processors","authors":"Zheng Cao, Hongwei Tang, Qiang Li, Bo-Zhang Li, Fei Chen, Kai Wang, Xuejun An, Ninghui Sun","doi":"10.1109/CLUSTER.2011.23","DOIUrl":"https://doi.org/10.1109/CLUSTER.2011.23","url":null,"abstract":"Heterogeneous Computing is becoming an important technology trend in HPC, where more and more heterogeneous processors are used. However, in traditional node architecture, heterogeneous processors are always used as coprocessors. Such usage increases the communication latency between heterogeneous processors and prevents the node from achieving high density. With the purpose of improving communication efficiency between heterogeneous processors, this paper proposed a new node architecture named HeteNode. In HeteNode, general purpose processors and heterogeneous processors are interconnected by a system controller directly and play the same role in both process of communication and process of computation. The prototype of HeteNode which contains nine processors in 1U chassis is built. Evaluation carried out on the prototype shows that 580ns minimum intra-node latency and 1.78us minimum inter-node latency between heterogeneous processors are achieved. Besides, NPB benchmarks show good scalability in HeteNode.","PeriodicalId":200830,"journal":{"name":"2011 IEEE International Conference on Cluster Computing","volume":"10 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-09-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116526485","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A. Singh, S. Potluri, Hao Wang, K. Kandalla, S. Sur, D. Panda
{"title":"MPI Alltoall Personalized Exchange on GPGPU Clusters: Design Alternatives and Benefit","authors":"A. Singh, S. Potluri, Hao Wang, K. Kandalla, S. Sur, D. Panda","doi":"10.1109/CLUSTER.2011.67","DOIUrl":"https://doi.org/10.1109/CLUSTER.2011.67","url":null,"abstract":"General Purpose Graphics Processing Units (GPGPUs) are rapidly becoming an integral part of high performance system architectures. The Tianhe-1A and Tsubame systems received significant attention for their architectures that leverage GPGPUs. Increasingly many scientific applications that were originally written for CPUs using MPI for parallelism are being ported to these hybrid CPU-GPU clusters. In the traditional sense, CPUs perform computation while the MPI library takes care of communication. When computation is performed on GPGPUs, the data has to be moved from device memory to main memory before it can be used in communication. Though GPGPUs provide huge compute potential, the data movement to and from GPGPUs is both a performance and productivity bottleneck. Recently, the MVAPICH2 MPI library has been modified to directly support point-to-point MPI communication from the GPU memory [1]. Using this support, programmers do not need to explicitly move data to main memory before using MPI. This feature also enables performance improvement due to tight integration of GPU data movement and MPI internal protocols. Typically, scientific applications spend a significant portion of their execution time in collective communication. Hence, optimizing performance of collectives has a significant impact on their performance. MPI_Alltoall is a heavily used collective that has O(N2) communication, for N processes. In this paper, we outline the major design alternatives for MPI_Alltoall collective communication operation on GPGPU clusters. We propose three design alternatives and provide a corresponding performance analysis. Using our dynamic staging techniques, the latency of MPI_Alltoall on GPU clusters can be improved by 44% over a user level implementation and 31% over a send-recv based implementation for 256 KByte messages on 8 processes.","PeriodicalId":200830,"journal":{"name":"2011 IEEE International Conference on Cluster Computing","volume":"38 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-09-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115163344","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Sergio Herrero-Lopez, John R. Williams, Abel Sanchez
{"title":"Large-Scale Simulator for Global Data Infrastructure Optimization","authors":"Sergio Herrero-Lopez, John R. Williams, Abel Sanchez","doi":"10.1109/CLUSTER.2011.15","DOIUrl":"https://doi.org/10.1109/CLUSTER.2011.15","url":null,"abstract":"IT infrastructures in global corporations are appropriately compared with nervous systems, in which body parts (interconnected datacenters) exchange signals (request responses) in order to coordinate actions (data visualization and manipulation). A priori inoffensive perturbations in the operation of the system or the elements composing the infrastructure can lead to catastrophic consequences. Downtime disables the capability of clients reaching the latest versions of the data and/or propagating their individual contributions to other clients, potentially costing millions of dollars to the organization affected. The imperative need of guaranteeing the proper functioning of the system not only forces to pay particular attention to network outages, hot-objects or application defects, but also slows down the deployment of new capabilities, features and equipment upgrades. Under these circumstances, decision cycles for these modifications can be extremely conservative, and be prolonged for years, involving multiple authorities across departments of the organization. Frequently, the solutions adopted are years behind state-of-the art technologies or phased out compared to leading research on the IT infrastructure field. In this paper, the utilization of a large-scale data infrastructure simulator is proposed, in order to evaluate the impact of \" what if\" scenarios on the performance, availability and reliability of the system. The goal is to provide data center operators a tool that allows understanding and predicting the consequences of the deployment of new network topologies, hardware configurations or software applications in a global data infrastructure, without affecting the service. The simulator was constructed using a multi-layered approach, providing a granularity down to the individual server component and client action, and was validated against a downscaled version of the data infrastructure of a Fortune 500 company.","PeriodicalId":200830,"journal":{"name":"2011 IEEE International Conference on Cluster Computing","volume":"50 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-09-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133935657","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Achieving Scalable Parallelization for the Hessenberg Factorization","authors":"A. Castaldo, R. Clint Whaley","doi":"10.1109/CLUSTER.2011.16","DOIUrl":"https://doi.org/10.1109/CLUSTER.2011.16","url":null,"abstract":"Much of dense linear algebra has been successfully blocked to concentrate the majority of its time in the Level~3 BLAS, which are not only efficient for serial computation, but also scale well for parallelism. For the Hessenberg factorization, which is a critical step in computing the eigenvalues and vectors, however, performance of the best known algorithm is still strongly limited by the memory speed, which does not tend to scale well at all. In this paper we present an adaptation of our Parallel Cache Assignment (PCA) technique to the Hessenberg factorization, and show that it achieves super linear speedup over the corresponding serial algorithm and a more than four-fold speedup over the best known algorithm for small and medium sized problems.","PeriodicalId":200830,"journal":{"name":"2011 IEEE International Conference on Cluster Computing","volume":"77 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-09-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134472299","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Scheduling Workflows in Opportunistic Environments","authors":"María M. López, E. Heymann, M. A. Senar","doi":"10.1109/CLUSTER.2011.72","DOIUrl":"https://doi.org/10.1109/CLUSTER.2011.72","url":null,"abstract":"Workflow applications exhibit both high computation times and data transfer rates. For this reason, the completion time of the workflow is high. To reduce completion time, the tasks of a workflow ought to run on different machines interconnected by a network. Correct assignment of tasks to machines within the runtime environment is an important aspect in the completion time or make span. The manager making the assignment is the scheduler. The main problem of a static scheduler is that it ignores the changes that occur in the execution environment during DAG execution. To solve this problem, we developed a new dynamic scheduler. This dynamic scheduler monitors the behavior of the tasks executed as well as the execution environment, and it reacts to the changes detected by adapting the scheduling of the rest of pending tasks. The objective is to reduce the overhead incurred by excessive self-adaptations, without affecting the make span. To reduce overhead, the algorithm self-adapts only when an improvement in make span is expected. The proposed policies have been simulated and then executed in a real environment. These executions have achieved a reduction of the overhead of greater than 20%.","PeriodicalId":200830,"journal":{"name":"2011 IEEE International Conference on Cluster Computing","volume":"132 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-09-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121426112","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Quartile and Outlier Detection on Heterogeneous Clusters Using Distributed Radix Sort","authors":"Kyle Spafford, J. Meredith, J. Vetter","doi":"10.1109/CLUSTER.2011.53","DOIUrl":"https://doi.org/10.1109/CLUSTER.2011.53","url":null,"abstract":"In the past few years, performance improvements in CPUs and memory technologies have outpaced those of storage systems. When extrapolated to the exascale, this trend places strict limits on the amount of data that can be written to disk for full analysis, resulting in an increased reliance on characterizing in-memory data. Many of these characterizations are simple, but require sorted data. This paper explores an example of this type of characterization -- the identification of quartiles and statistical outliers -- and presents a performance analysis of a distributed heterogeneous radix sort as well as an assessment of current architectural bottlenecks.","PeriodicalId":200830,"journal":{"name":"2011 IEEE International Conference on Cluster Computing","volume":"59 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-09-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129679950","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}