{"title":"GRAPH/Z: A Key-Value Store Based Scalable Graph Processing System","authors":"Tonglin Li, Chaoqi Ma, Jiabao Li, Xiaobing Zhou, Ke Wang, Dongfang Zhao, Iman Sadooghi, I. Raicu","doi":"10.1109/CLUSTER.2015.90","DOIUrl":"https://doi.org/10.1109/CLUSTER.2015.90","url":null,"abstract":"The emerging applications in big data and social networks issue rapidly increasing demands on graph processing. Graph query operations that involve a large number of vertices and edges can be tremendously slow on traditional databases. The state-of-the-art graph processing systems and databases usually adopt master/slave architecture that potentially impairs their The contributions of this paper are as follows: scalability. This work describes the design and implementation of a new graph processing system based on Bulk Synchronous Parallel model. Our system is built on top of ZHT, a scalable distributed key-value store, which benefits the graph processing in terms of scalability, performance and persistency. The experiment results imply excellent scalability.","PeriodicalId":187042,"journal":{"name":"2015 IEEE International Conference on Cluster Computing","volume":"31 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-09-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124791454","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Highly Scalable Parallel Search-Tree Algorithms: The Virtual Topology Approach","authors":"F. Abu-Khzam, A. E. Mouawad, Karim A. Jahed","doi":"10.1109/CLUSTER.2015.91","DOIUrl":"https://doi.org/10.1109/CLUSTER.2015.91","url":null,"abstract":"Summary form only given. We introduce the notion of a virtual topology and explore the use of search-tree indexing to achieve highly scalable parallel search-tree algorithms for NP-hard problems. Vertex Cover and Cluster Editing are used as case studies.","PeriodicalId":187042,"journal":{"name":"2015 IEEE International Conference on Cluster Computing","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-09-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122216477","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Collective I/O Tuning Using Analytical and Machine Learning Models","authors":"Florin Isaila, Prasanna Balaprakash, Stefan M. Wild, D. Kimpe, R. Latham, R. Ross, P. Hovland","doi":"10.1109/CLUSTER.2015.29","DOIUrl":"https://doi.org/10.1109/CLUSTER.2015.29","url":null,"abstract":"The optimization of parallel I/O has become challenging because of the increasing storage hierarchy, performance variability of shared storage systems, and the number of factors in the hardware and software stacks that impact performance. In this paper, we perform an in-depth study of the complexity involved in I/O autotuning and performance modeling, including the architecture, software stack, and noise. We propose a novel hybrid model combining analytical models for communication and storage operations and black-box models for the performance of the individual operations. The experimental results show that the hybrid approach performs significantly better and shows a higher robustness to noise than state-of-the-art machine learning approaches, at the cost of a higher modeling complexity.","PeriodicalId":187042,"journal":{"name":"2015 IEEE International Conference on Cluster Computing","volume":"50 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-09-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116477926","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Accelerating Laue Depth Reconstruction Algorithm with CUDA","authors":"Ke Yue, N. Schwarz, J. Tischler","doi":"10.1109/CLUSTER.2015.78","DOIUrl":"https://doi.org/10.1109/CLUSTER.2015.78","url":null,"abstract":"The Laue diffraction microscopy experiment uses the polychromatic Laue micro-diffraction technique to examine the structure of materials with sub-micron spatial resolution in all three dimensions. During this experiment, local crystallographic orientations, orientation gradients and strains are measured as properties which will be recorded in HDF5 image format. The recorded images will be processed with a depth reconstruction algorithm for future data analysis. But the current depth reconstruction algorithm consumes considerable processing time and might take up to 2 weeks for reconstructing data collected from one single experiment. To improve the depth reconstruction computation speed, we propose a scalable GPU program solution on the depth reconstruction problem in this paper. The test result shows that the running time would be 10 to 20 times faster than the prior CPU design for various size of input data.","PeriodicalId":187042,"journal":{"name":"2015 IEEE International Conference on Cluster Computing","volume":"16 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-09-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133236039","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Push Me Pull You: Integrating Opposing Data Transport Modes for Efficient HPC Application Monitoring","authors":"O. Aaziz, J. Cook, Hadi Sharifi","doi":"10.1109/CLUSTER.2015.118","DOIUrl":"https://doi.org/10.1109/CLUSTER.2015.118","url":null,"abstract":"While HPC system monitoring is a necessary and accepted practice, applications are still basically opaque in the production environment. For better HPC platform management and utilization, especially as platforms push towards exascale size, HPC applications need to be more transparent in their execution in the production environment. PROMON is a framework for application monitoring in the production environment, but its design concentrated on the front end issues of offering easy to use application instrumentation. This paper presents the integration of PROMON with LDMS, a proven efficient HPC system monitoring framework. PROMON and LDMS offer a case study in integrating two disparate instrumentation and monitoring models, and the lessons are applicable to other HPC monitoring issues.","PeriodicalId":187042,"journal":{"name":"2015 IEEE International Conference on Cluster Computing","volume":"13 3","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-09-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114017297","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Energy-Aware Job Management Approaches for Workflow in Cloud","authors":"M. Khaleel, Mengxia Zhu","doi":"10.1109/CLUSTER.2015.85","DOIUrl":"https://doi.org/10.1109/CLUSTER.2015.85","url":null,"abstract":"The energy consumption of cloud servers has dramatically increased. In order to meet the growing demands of users and reduce the skyrocketing cost of electricity, it is critical to have performance guaranteed and cost-effective job schedulers for clouds. In recent years, there has been a growing body of research which focus on improving resource utilization to improve energy efficiency, system throughput and at the same time meet the Quality of Service (QoS) requirements specified in the Service Level Agreements (SLA). This paper propose a multiple procedure scheduling algorithm which aims to maximize the resource utilization for cloud resources for reduced energy consumption as well as guarantee the execution deadline for cloud jobs modeled as scientific workflows. Our simulation results demonstrate better performance compared with other similar algorithms.","PeriodicalId":187042,"journal":{"name":"2015 IEEE International Conference on Cluster Computing","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-09-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116217187","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"OpenSHMEM as a Portable Communication Layer for PGAS Models: A Case Study with Coarray Fortran","authors":"N. Namashivayam, Deepak Eachempati, Dounia Khaldi, B. Chapman","doi":"10.1109/CLUSTER.2015.66","DOIUrl":"https://doi.org/10.1109/CLUSTER.2015.66","url":null,"abstract":"Languages and libraries based on the Partitioned Global Address Space (PGAS) programming model have emerged in recent years with a focus on addressing the programming challenges for scalable parallel systems. Among these, Coarray Fortran (CAF) is unique in that as it has been incorporated into an existing standard (Fortran 2008), and therefore it is of particular importance that implementations supporting it are both portable and deliver sufficient levels of performance. OpenSHMEM is a library which is the culmination of a standardization effort among many implementers and users of SHMEM, and it provides a means to develop light-weight, portable, scalable applications based on the PGAS programming model. As such, we propose here that OpenSHMEM is well situated to serve as a runtime substrate for CAF implementations. In this paper, we demonstrate how OpenSHMEM can be exploited as a runtime layer upon which CAF may be implemented. Specifically, we re-targeted the CAF implementation provided in the OpenUH compiler to OpenSHMEM, and show how parallel language features provided by CAF may be directly mapped to OpenSHMEM, including allocation of remotely accessible objects, one-sided communication, and various types of synchronization. Moreover, we present and evaluate various algorithms we developed for implementing remote access of non-contiguous array sections and acquisition and release of remote locks using the OpenSHMEM interface.","PeriodicalId":187042,"journal":{"name":"2015 IEEE International Conference on Cluster Computing","volume":"120 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-09-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116709787","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Design and Development of Domain Specific Active Libraries with Proxy Applications","authors":"I. Reguly, G. Mudalige, M. Giles","doi":"10.1109/CLUSTER.2015.128","DOIUrl":"https://doi.org/10.1109/CLUSTER.2015.128","url":null,"abstract":"Representative applications are versatile tools to evaluate new programming approaches, techniques and optimisations as a way to ensure continued high performance on future computing architectures. They make experimentation much easier before adopting changes/insights into the large scientific codes. In this paper we demonstrate the important role played by representative/proxy applications in designing and developing two high-level programming approaches: namely the OP2 and OPS domain specific (active) libraries. OP2 and OPS utilizes code generation techniques to produce automatic parallelisations from a high-level abstract problem declaration. The strategy delivers significant developer productivity to the domain scientist, while at the same time allowing computational experts to adopt the latest programming models and hardware-specific optimisations into the library and code generation tools to achieve near optimal performance. We show how representative applications have been a cornerstone in the development of OP2 and OPS and chart our experiences. In particular, we demonstrate how the range of hand-tuned optimized parallelisations of the CloverLeaf hydrodynamics mini-app allowed us to gain clear evidence that the OPS based code generated parallelisations were indeed as optimal as the hand-tuned versions. Additionally, with the use of a representative application from the CFD domain we demonstrate how the optimisations discovered and applied to proxy apps are indeed directly transferable to a large-scale industrial application at Rolls Royce plc. These results provide significant evidence into the utility of representative applications to improve productivity, enable performance portability and ultimately future-proof scientific applications.","PeriodicalId":187042,"journal":{"name":"2015 IEEE International Conference on Cluster Computing","volume":"40 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-09-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128304267","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"DINO: Divergent Node Cloning for Sustained Redundancy in HPC","authors":"Arash Rezaei, F. Mueller, Paul H. Hargrove, Eric Roman","doi":"10.1109/CLUSTER.2015.36","DOIUrl":"https://doi.org/10.1109/CLUSTER.2015.36","url":null,"abstract":"Soft faults like silent data corruption and hard faults like hardware failures may cause a high performance computing (HPC) job of thousands of processes to nearly cease to make progress due to recovery overheads. Redundant computing has been proposed as a solution at extreme scale by allocating two or more processes to perform the same task. However, current redundant computing approaches do not repair failed replicas. Thus, SDC-free execution is not guaranteed after a replica failure and the job may finish with incorrect results. Replicas are logically equivalent, yet may have divergent runtime states during job execution, which complicates on-the-fly repairs for forward recovery. In this work, we present a redundant execution environment that quickly repairs hard failures via Divergent Node cloning (DINO) at the MPI task level. DINO contributes a novel task cloning service integrated into the MPI runtime system that solves the problem of consolidating divergent states among replicas on-the-fly. Experimental results indicate that DINO can recover from failures nearly instantaneously, thus retaining the redundancy level throughout job execution. The cloning overhead, depending on the process image size and its transfer rate, ranges from 5.60 to 90.48 seconds. To the best of our knowledge, the design and implementation for repairing failed replicas in redundant MPI computing is unprecedented.","PeriodicalId":187042,"journal":{"name":"2015 IEEE International Conference on Cluster Computing","volume":"20 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-09-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128410210","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"VecMeter: Measuring Vectorization on the Xeon Phi","authors":"Joshua Peraza, Ananta Tiwari, W. A. Ward, R. Campbell, L. Carrington","doi":"10.1109/CLUSTER.2015.73","DOIUrl":"https://doi.org/10.1109/CLUSTER.2015.73","url":null,"abstract":"Wide vector units in Intel's Xeon Phi accelerator cards can significantly boost application performance when used effectively. However, there is a lack of performance tools that provide programmers accurate information about the level of vectorization in their codes. This paper presents VecMeter, an easy-to-use tool to measure vectorization on the Xeon Phi. VecMeter utilizes binary instrumentation and therefore does not require source code modifications. This paper describes the design of VecMeter, demonstrates its accuracy, defines a metric for quantifying vectorization, and provides an example where the tool can guide code optimization to improve performance by up to 33%.","PeriodicalId":187042,"journal":{"name":"2015 IEEE International Conference on Cluster Computing","volume":"25 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-09-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128102653","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}