{"title":"Adjustable Credit Scheduling for High Performance Network Virtualization","authors":"Zhibo Chang, Jian Li, Ruhui Ma, Zhi-Jian Huang, Haibing Guan","doi":"10.1109/CLUSTER.2012.27","DOIUrl":"https://doi.org/10.1109/CLUSTER.2012.27","url":null,"abstract":"Virtualization technology is now widely adopted in cloud computing to support heterogeneous and dynamic workload. The scheduler in a virtual machine monitor (VMM) plays an important role in allocating resources. However, the type of applications in virtual machines (VM) is unknown to the scheduler, and I/O-intensive and CPU-intensive applications are treated the same. This makes virtual systems unable to take full advantage of high performance networks such as 10-Gigabit Ethernet. In this paper, we review the SR-IOV networking solution and show by experiment that the current credit scheduler in Xen does not utilize high performance networks efficiently. For this reason, we propose a novel scheduling model with two optimizations to eliminate the bottleneck caused by scheduler. In this model, guest domains are divided into I/O-intensive domains and CPU-intensive domains according to their monitored behaviour. I/O-intensive domains can obtain extra credits that CPU-intensive domains are willing to share. Besides, the total available credits is adjusted agilely to accelerate the I/O responsiveness. Our experimental evaluation with benchmarks shows that the new scheduling model improves bandwidth even when the system's load is very high.","PeriodicalId":143579,"journal":{"name":"2012 IEEE International Conference on Cluster Computing","volume":"78 3","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-09-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132432885","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"HAaaS: Towards Highly Available Distributed Systems","authors":"Yaoguang Wang, Weiming Lu, Bin-bin Yu, Baogang Wei","doi":"10.1109/CLUSTER.2012.59","DOIUrl":"https://doi.org/10.1109/CLUSTER.2012.59","url":null,"abstract":"High availability is a valuable property in distributed systems. The master-slave model is used wildly in data management systems for high performance. However, many master-slave systems still have SPOF (Single Point of Failure) for the single master node. We exploit a generalized solution to meet several common use cases for different master-slave systems. The solution makes the high availability as a service (HAaaS), which uses a shared storage infrastructure to make the master stateless and provides an automatic fail over of high-availability service. We deploy the HAaaS in many master-slave subsystems in our unstructured data management system (UDMS) to make the UDMS highly available. The experiments demonstrate the feasibility and efficiency of our solution.","PeriodicalId":143579,"journal":{"name":"2012 IEEE International Conference on Cluster Computing","volume":"27 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-09-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133761598","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A New End-to-End Flow-Control Mechanism for High Performance Computing Clusters","authors":"Javier Prades, F. Silla, J. Duato, H. Fröning, M. Nüssle","doi":"10.1109/CLUSTER.2012.15","DOIUrl":"https://doi.org/10.1109/CLUSTER.2012.15","url":null,"abstract":"High Performance Computing usually leverages messaging libraries such as MPI or GASNet in order to exchange data among processes in large-scale clusters. Furthermore, these libraries make use of specialized low-level networking layers in order to retrieve as much performance as possible from hardware interconnects such as Infini Band or Myrinet, for example. EXTOLL is another emerging technology targeted for high performance clusters. These specialized low-level networking layers require some kind of flow control in order to prevent buffer overflows at the received side. In this paper we present a new flow control mechanism that is able to adapt the buffering resources used by a process according to the parallel application communication pattern and the varying activity among communicating peers. The tests carried out in a 64-node 1024-core EXTOLL cluster show that our new dynamic flow-control mechanism provides extraordinarily high buffer efficiency along with very low overhead, which is reduced between 8 and 10 times.","PeriodicalId":143579,"journal":{"name":"2012 IEEE International Conference on Cluster Computing","volume":"20 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-09-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133856276","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Evaluating Power-Monitoring Capabilities on IBM Blue Gene/P and Blue Gene/Q","authors":"Kazutomo Yoshii, K. Iskra, Rinku Gupta, P. Beckman, V. Vishwanath, Chenjie Yu, S. Coghlan","doi":"10.1109/CLUSTER.2012.62","DOIUrl":"https://doi.org/10.1109/CLUSTER.2012.62","url":null,"abstract":"Power consumption is becoming a critical factor as we continue our quest toward exascale computing. Yet, actual power utilization of a complete system is an insufficiently studied research area. Estimating the power consumption of a large scale system is a nontrivial task because a large number of components are involved and because power requirements are affected by the (unpredictable) workloads. Clearly needed is a power-monitoring infrastructure that can provide timely and accurate feedback to system developers and application writers so that they can optimize the use of this precious resource. Many existing large-scale installations do feature power-monitoring sensors, however, those are part of environmental- and health monitoring sub systems and were not designed with application level power consumption measurements in mind. In this paper, we evaluate the existing power monitoring of IBM Blue Gene systems, with the goal of understanding what capabilities are available and how they fare with respect to spatial and temporal resolution, accuracy, latency, and other characteristics. We find that with a careful choice of dedicated micro benchmarks, we can obtain meaningful power consumption data even on Blue Gene/P, where the interval between available data points is measured in minutes. We next evaluate the monitoring subsystem on Blue Gene/Q, and are able to study the power characteristics of FPU and memory subsystems of Blue Gene/Q. We find the monitoring subsystem capable of providing second-scale resolution of power data conveniently separated between node components with seven seconds latency. This represents a significant improvement in power monitoring infrastructure, and hope future systems will enable real-time power measurement in order to better understand application behavior at a finer granularity.","PeriodicalId":143579,"journal":{"name":"2012 IEEE International Conference on Cluster Computing","volume":"21 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-09-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114348977","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Minimizing Network Contention in InfiniBand Clusters with a QoS-Aware Data-Staging Framework","authors":"R. Rajachandrasekar, Jai Jaswani, H. Subramoni, D. Panda","doi":"10.1109/CLUSTER.2012.90","DOIUrl":"https://doi.org/10.1109/CLUSTER.2012.90","url":null,"abstract":"The rapid growth of supercomputing systems, both in scale and complexity, has been accompanied by degradation in system efficiencies. The sheer abundance of resources including millions of cores, vast amounts of physical memory and high-bandwidth networks are heavily under-utilized. This happens when the resources are time-shared amongst parallel applications that are scheduled to run on a subset of compute nodes in an exclusive manner. Several space-sharing techniques that have been proposed in the literature allow parallel applications to be co-located on compute nodes and share resources with each other. Although this leads to better system efficiencies, it also causes contention for system resources. In this work, we specifically address the problem of network contention, caused due to the sharing of network resources by parallel applications and file systems simultaneously. We leverage the Quality-of-Service (QoS) capabilities of the widely used Infini Band interconnect to enhance our data-staging file system, making it QoS-aware. This is a user-level framework that is agnostic of the file system and MPI implementation. Using this file system, we demonstrate the isolation of file system traffic from MPI communication traffic, thereby reducing the network contention. Experimental results show that MPI point-to-point latency can be reduced by up to 320 microseconds, and the bandwidth improved by up to 674MB/s in the presence of contention with I/O traffic. Furthermore, we were able to reduce the runtime of the AWP-ODC MPI application by about 9.89% in the presence of network contention, and also reduce the time spent in communication by the NAS CG kernel by 23.46%.","PeriodicalId":143579,"journal":{"name":"2012 IEEE International Conference on Cluster Computing","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-09-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116282742","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Transactional Multi-row Access Guarantee in the Key-Value Store","authors":"Yaoguang Wang, Weiming Lu, Baogang Wei","doi":"10.1109/CLUSTER.2012.57","DOIUrl":"https://doi.org/10.1109/CLUSTER.2012.57","url":null,"abstract":"The emergence of Cloud Computing and Big Data drives the development of novel data stores named NoSQL. A mass of data stores are developed and the most are key-value stores, where the stores are partitioned with keys and a key can identify a row uniquely. However, the requirement for efficiency and scalability makes them only provide the single-row atomic access. But in the Big Data era, more and more applications built on the key-value stores need transactional functionality across multiple rows. So, it is natural to implement a multi-row transaction management for key-value stores. In this paper, we implement a transaction processing system (TrasPS) which guarantees the transactional multi-row access from the application client to the key-value store in our unstructured data management system (UDMS). We also provide fault tolerance and recovery for the transactions. The implementation and experiments in our UDMS show that TrasPS can provide scalable multi-row access functionality at a very low overhead.","PeriodicalId":143579,"journal":{"name":"2012 IEEE International Conference on Cluster Computing","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-09-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123725240","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Enabling Fast, Noncontiguous GPU Data Movement in Hybrid MPI+GPU Environments","authors":"John Jenkins, James Dinan, P. Balaji, N. Samatova, R. Thakur","doi":"10.1109/CLUSTER.2012.72","DOIUrl":"https://doi.org/10.1109/CLUSTER.2012.72","url":null,"abstract":"Lack of efficient and transparent interaction with GPU data in hybrid MPI+GPU environments challenges GPU acceleration of large-scale scientific computations. A particular challenge is the transfer of noncontiguous data to and from GPU memory. MPI implementations currently do not provide an efficient means of utilizing data types for noncontiguous communication of data in GPU memory. To address this gap, we present an MPI data type-processing system capable of efficiently processing arbitrary data types directly on the GPU. We present a means for converting conventional data type representations into a GPU-amenable format. Fine-grained, element-level parallelism is then utilized by a GPU kernel to perform in-device packing and unpacking of noncontiguous elements. We demonstrate a several-fold performance improvement for noncontiguous column vectors, 3D array slices, and 4D array sub volumes over CUDA-based alternatives. Compared with optimized, layout-specific implementations, our approach incurs low overhead, while enabling the packing of data types that do not have a direct CUDA equivalent. These improvements are demonstrated to translate to significant improvements in end-to-end, GPU-to-GPU communication time. In addition, we identify and evaluate communication patterns that may cause resource contention with packing operations, providing a baseline for adaptively selecting data-processing strategies.","PeriodicalId":143579,"journal":{"name":"2012 IEEE International Conference on Cluster Computing","volume":"62 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-09-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128525362","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Mastiff: A MapReduce-based System for Time-Based Big Data Analytics","authors":"Sijie Guo, Jin Xiong, Weiping Wang, Rubao Lee","doi":"10.1109/CLUSTER.2012.10","DOIUrl":"https://doi.org/10.1109/CLUSTER.2012.10","url":null,"abstract":"Existing MapReduce-based warehousing systems are not specially optimized for time-based big data analysis applications. Such applications have two characteristics: 1) data are continuously generated and are required to be stored persistently for a long period of time, 2) applications usually process data in some time period so that typical queries use time-related predicates. Time-based big data analytics requires both high data loading speed and high query execution performance. However, existing systems including current MapReduce-based solutions do not solve this problem well because the two requirements are contradictory. We have implemented a MapReduce-based system, called Mastiff, which provides a solution to achieve both high data loading speed and high query performance. Mastiff exploits a systematic combination of a column group store structure and a lightweight helper structure. Furthermore, Mastiff uses an optimized table scan method and a column-based query execution engine to boost query performance. Based on extensive experiments results with diverse workloads, we will show that Mastiff can significantly outperform existing systems including Hive, HadoopDB, and GridSQL.","PeriodicalId":143579,"journal":{"name":"2012 IEEE International Conference on Cluster Computing","volume":"5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-09-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129198827","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"On the Effects of CPU Caches on MPI Point-to-Point Communications","authors":"Simone Pellegrini, T. Hoefler, T. Fahringer","doi":"10.1109/CLUSTER.2012.22","DOIUrl":"https://doi.org/10.1109/CLUSTER.2012.22","url":null,"abstract":"Several researchers investigated the placing of communication calls in message-passing parallel codes. The current rule of thumb it to maximize communication/computation overlap with early binding. In this work, we demonstrate that this is not the only design constraint because CPU caches can have a significant impact on communications. We conduct an empirical study of the interaction between CPU caching and communications for several different communication scenarios. We use the gained insight to formulate a set of intuitive rules for communication call placement and show how our rules can be applied to practical codes. Our optimized codes show an improvement of up to 40% for a simple stencil code. Our work is a first step towards communication optimizations by moving communication calls. We expect that future communication-aware compilers will use our insights as a standard technique to move communication calls in order to optimize performance.","PeriodicalId":143579,"journal":{"name":"2012 IEEE International Conference on Cluster Computing","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-09-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121565823","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Autotuning Stencil-Based Computations on GPUs","authors":"A. Mametjanov, Daniel Lowell, Ching-Chen Ma, B. Norris","doi":"10.1109/CLUSTER.2012.46","DOIUrl":"https://doi.org/10.1109/CLUSTER.2012.46","url":null,"abstract":"Finite-difference, stencil-based discretization approaches are widely used in the solution of partial differential equations describing physical phenomena. Newton-Krylov iterative methods commonly used in stencil-based solutions generate matrices that exhibit diagonal sparsity patterns. To exploit these structures on modern GPUs, we extend the standard diagonal sparse matrix representation and define new matrix and vector data types in the PETSc parallel numerical toolkit. We create tunable CUDA implementations of the operations associated with these types after identifying a number of GPU-specific optimizations and tuning parameters for these operations. We discuss our implementation of GPU auto tuning capabilities in the Orio framework and present performance results for several kernels, comparing them with vendor-tuned library implementations.","PeriodicalId":143579,"journal":{"name":"2012 IEEE International Conference on Cluster Computing","volume":"10 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-09-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126368561","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}