2011 IEEE International Conference on Cluster Computing: Latest Publications

Heterogeneous Cloud Computing
2011 IEEE International Conference on Cluster Computing | Pub Date: 2011-09-26 | DOI: 10.1109/CLUSTER.2011.49
S. Crago, K. Dunn, Patrick Eads, L. Hochstein, D. Kang, Mikyung Kang, Devendra Modium, Karandeep Singh, Jinwoo Suh, J. Walters
Abstract: Current cloud computing infrastructure typically assumes a homogeneous collection of commodity hardware, with details about hardware variation intentionally hidden from users. In this paper, we present our approach for extending the traditional notions of cloud computing to provide a cloud-based access model for clusters that contain heterogeneous architectures and accelerators. We describe our ongoing work extending the OpenStack cloud computing stack to support heterogeneous architectures and accelerators, and our experiences running OpenStack on our local heterogeneous cluster test bed.
Citations: 115
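The placement problem the abstract describes is making accelerator-equipped hosts schedulable through a cloud stack. As a rough illustration only, here is a hypothetical capability-matching scheduler filter; the class and function names are our own invention and are neither the authors' code nor the real OpenStack scheduler API.

```python
# Hypothetical sketch: match instance requests that need an accelerator
# type (e.g. "gpu") against per-host capability tags. Not OpenStack code.

from dataclasses import dataclass, field

@dataclass
class HostState:
    name: str
    capabilities: set = field(default_factory=set)  # e.g. {"x86_64", "gpu"}

@dataclass
class InstanceRequest:
    required: set  # architecture/accelerator tags the flavor asks for

def accelerator_filter(hosts, request):
    """Keep only hosts that advertise every capability the request needs."""
    return [h for h in hosts if request.required <= h.capabilities]

if __name__ == "__main__":
    hosts = [HostState("node1", {"x86_64"}),
             HostState("node2", {"x86_64", "gpu"})]
    req = InstanceRequest(required={"gpu"})
    print([h.name for h in accelerator_filter(hosts, req)])  # ['node2']
```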
Scalability of Semi-implicit Time Integrators for Nonhydrostatic Galerkin-Based Atmospheric Models on Large Scale Clusters
2011 IEEE International Conference on Cluster Computing | Pub Date: 2011-09-26 | DOI: 10.1109/CLUSTER.2011.70
J. Kelly, F. Giraldo, G. Jost
Abstract: We describe our ongoing effort to optimize a next-generation atmospheric model based on the continuous Galerkin, or spectral element, method: the Nonhydrostatic Unified Model of the Atmosphere (NUMA), which is targeted towards large cluster architectures. The linear solver within a distributed memory paradigm is critical for overall model efficiency. The goal of this work in progress is to investigate the scalability of the model for different solvers and to determine a set of optimal solvers for different situations. We describe our novel approach and demonstrate its scalability on a variety of multicore-node clusters. We also present performance statistics to explain the scalability behavior.
Citations: 4
Data Partitioning on Heterogeneous Multicore Platforms
2011 IEEE International Conference on Cluster Computing | Pub Date: 2011-09-26 | DOI: 10.1109/CLUSTER.2011.64
Ziming Zhong, V. Rychkov, Alexey L. Lastovetsky
Abstract: In this paper, we present two techniques for inter- and intra-node data partitioning aimed at load balancing MPI applications on heterogeneous multicore platforms. For load balancing between the multicore nodes of a heterogeneous multicore cluster, we propose how to define a functional performance model of an individual multicore node as a single computing unit, and we use these models for data partitioning between the nodes. For load balancing within a heterogeneous multicore node, we propose a data partitioning technique between cores. Since parallel processes interfere with each other through shared memory, the speed of individual cores cannot be measured independently, and independent performance models cannot be defined for cores. Therefore, for a given problem size, we dynamically evaluate the performance of the cores while they are executing only the computational kernel of the parallel application, and we partition data proportionally to the observed speeds.
Citations: 25
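The intra-node technique above reduces to: time the kernel on each core at the given problem size, then split the data in proportion to the observed speeds. A minimal sketch of that idea, under our own simplifying assumptions (a toy kernel, cores timed one at a time rather than concurrently):

```python
# Sketch of dynamic, speed-proportional data partitioning; the kernel and
# timing harness are illustrative stand-ins, not the authors' code.

import time

def kernel(chunk):
    # Stand-in for the application's computational kernel.
    return sum(x * x for x in chunk)

def measure_speed(core_data):
    """Observed speed = work items processed per second on this core."""
    start = time.perf_counter()
    kernel(core_data)
    elapsed = time.perf_counter() - start
    return len(core_data) / elapsed

def proportional_partition(total_items, speeds):
    """Assign each core a share of total_items proportional to its speed."""
    total_speed = sum(speeds)
    counts = [int(total_items * s / total_speed) for s in speeds]
    counts[-1] += total_items - sum(counts)  # hand any remainder to one core
    return counts

if __name__ == "__main__":
    speeds = [120.0, 80.0, 200.0]                   # items/s observed per core
    print(proportional_partition(10_000, speeds))   # [3000, 2000, 5000]
```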
Incorporating Network RAM and Flash into Fast Backing Store for Clusters
2011 IEEE International Conference on Cluster Computing | Pub Date: 2011-09-26 | DOI: 10.1109/CLUSTER.2011.22
T. Newhall, Doug Woos
Abstract: We present Nswap2L, a fast backing storage system for general purpose clusters. Nswap2L implements a single device interface on top of multiple heterogeneous physical storage devices, particularly targeting fast random access devices such as Network RAM and flash SSDs. A key design feature of Nswap2L is the separation of the interface from the underlying physical storage: data read from and written to our "device" are managed by our underlying system and may be stored in local RAM, remote RAM, flash, local disk, or any other cluster-wide storage. Nswap2L chooses which physical device will store data based on cluster resource usage and the characteristics of the various storage media. In addition, it migrates data from one physical device to another in response to changes in capacity and to take advantage of the strengths of different types of physical media, such as fast writes over the network and fast reads from flash. Performance results for our prototype implementation of Nswap2L, added as a swap device on a 12-node Linux cluster, show speed-ups of over 30 times versus swapping to disk and over 1.7 times versus swapping to flash. In addition, we show that for parallel benchmarks, Nswap2L using Network RAM and a flash device that is slower than Network RAM can perform better than Network RAM alone.
Citations: 1
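Nswap2L's device abstraction places data across heterogeneous backends behind one interface. The sketch below is our own schematic of such a tiered backing store, not Nswap2L's implementation: writes go to the fastest tier with free capacity and spill to slower tiers; the migration logic the abstract mentions is omitted for brevity.

```python
# Schematic tiered backing store: one write/read interface over several
# backends ordered fastest-first. Illustrative only.

class Tier:
    def __init__(self, name, capacity):
        self.name, self.capacity = name, capacity
        self.pages = {}

    def has_room(self):
        return len(self.pages) < self.capacity

class TieredStore:
    def __init__(self, tiers):
        self.tiers = tiers  # e.g. network RAM, then flash, then disk

    def write(self, page_id, data):
        for tier in self.tiers:              # fastest tier with room wins
            if tier.has_room():
                tier.pages[page_id] = data
                return tier.name
        raise IOError("all tiers full")

    def read(self, page_id):
        for tier in self.tiers:
            if page_id in tier.pages:
                return tier.pages[page_id]
        raise KeyError(page_id)

if __name__ == "__main__":
    store = TieredStore([Tier("network-ram", 2), Tier("flash", 4)])
    for i in range(3):
        print(i, "->", store.write(i, b"page"))  # third write spills to flash
```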
CULZSS: LZSS Lossless Data Compression on CUDA
2011 IEEE International Conference on Cluster Computing | Pub Date: 2011-09-26 | DOI: 10.1109/CLUSTER.2011.52
Adnan Ozsoy, D. M. Swany
Abstract: Increasing needs for efficient storage management and better utilization of network bandwidth with less data transfer have led the computing community to consider data compression as a solution. However, compression introduces extra overhead, and performance can suffer. The key elements in the decision to use compression are execution time and compression ratio; due to the negative performance impact, compression is often neglected. General purpose computing on graphics processing units (GPUs) introduces new opportunities where parallelism is available. Our work targets these opportunities in GPU-based systems by exploiting parallelism in compression algorithms. In this paper we present an implementation of the Lempel-Ziv-Storer-Szymanski (LZSS) lossless data compression algorithm using NVIDIA's Compute Unified Device Architecture (CUDA) framework. Our implementation of the LZSS algorithm on GPUs significantly improves the performance of the compression process compared to a CPU-based implementation, without any loss in compression ratio, and can help GPU-based clusters solve application bandwidth problems. Our system outperforms the serial CPU LZSS implementation by up to 18x, the parallel threaded version by up to 3x, and the BZIP2 program by up to 6x in terms of compression time, showing the promise of CUDA systems for lossless data compression. To give programmers an easy-to-use tool, our work also provides an API for in-memory compression without the need for reading from and writing to files, in addition to the version involving I/O.
Citations: 60
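For reference, this is what a toy, serial LZSS-style codec looks like: greedy longest-match search in a sliding window, emitting either literals or (distance, length) back-references. This is our own illustration (tokens instead of packed bits); the paper's contribution, the CUDA parallelization, is not reproduced here.

```python
# Toy serial LZSS-style compressor/decompressor for illustration only.

WINDOW = 4096     # how far back matches may reach
MIN_MATCH = 3     # shorter matches are emitted as literals
MAX_MATCH = 18

def lzss_compress(data: bytes):
    i, tokens = 0, []
    while i < len(data):
        best_len, best_dist = 0, 0
        for j in range(max(0, i - WINDOW), i):   # greedy window scan
            length = 0
            while (length < MAX_MATCH and i + length < len(data)
                   and data[j + length] == data[i + length]):
                length += 1
            if length > best_len:
                best_len, best_dist = length, i - j
        if best_len >= MIN_MATCH:
            tokens.append(("ref", best_dist, best_len))
            i += best_len
        else:
            tokens.append(("lit", data[i]))
            i += 1
    return tokens

def lzss_decompress(tokens) -> bytes:
    out = bytearray()
    for tok in tokens:
        if tok[0] == "lit":
            out.append(tok[1])
        else:                         # copy byte-by-byte: handles overlap
            _, dist, length = tok
            for _ in range(length):
                out.append(out[-dist])
    return bytes(out)

if __name__ == "__main__":
    msg = b"abcabcabcabcxyz"
    assert lzss_decompress(lzss_compress(msg)) == msg
```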
BMF: Bitmapped Mass Fingerprinting for Fast Protein Identification
2011 IEEE International Conference on Cluster Computing | Pub Date: 2011-09-26 | DOI: 10.1109/CLUSTER.2011.11
Weikuan Yu, K. Wu, Wei-Shinn Ku, Cong Xu, Juan Gao
Abstract: Protein identification is an important objective for the proteomic and medical sciences as well as for the pharmaceutical industry. With the recent large-scale automation of genome sequencing and the explosion of protein databases, it is important to exploit the latest data processing technologies and to design highly scalable algorithms to speed up protein identification. In this study, we design, implement, and evaluate a new software tool, Bitmapped Mass Fingerprinting (BMF), that can efficiently construct a bitmap index for short peptides and quickly identify candidate proteins from leading protein databases. BMF is developed by integrating the FastBit indexing technology with the popular Message Passing Interface (MPI) for parallelization. By exploiting FastBit for peptide mass fingerprinting across protein boundaries, we are able to accomplish parallel computation and I/O for a scalable implementation of protein identification. Our experimental results show that BMF brings dramatic performance improvements for protein identification across various protein databases. In particular, we demonstrate that BMF can effectively scale up to 8,192 cores on the Jaguar Supercomputer at Oak Ridge National Laboratory, achieving superb performance in identifying proteins from the National Center for Biotechnology Information (NCBI) non-redundant (NR) protein database.
Citations: 4
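The bitmap-index idea can be pictured as follows: for each discretized peptide mass bin, keep a bitmap over protein IDs, then match a spectrum of observed masses by probing the bitmaps and ranking proteins by hit count. The sketch below is our own illustration under those assumptions; plain Python integers serve as bitmaps here, whereas BMF uses the FastBit library and MPI.

```python
# Illustrative bitmap index over peptide mass bins; not BMF's code.

from collections import defaultdict

def build_index(proteins, bin_width=1.0):
    """proteins: {protein_id: [peptide masses]} -> {mass bin: bitmap int}"""
    index = defaultdict(int)
    for pid, masses in proteins.items():
        for m in masses:
            index[int(m / bin_width)] |= 1 << pid   # set this protein's bit
    return index

def candidates(index, observed, bin_width=1.0):
    """Count, per protein, how many observed masses hit its peptides."""
    hits = defaultdict(int)
    for m in observed:
        bitmap = index.get(int(m / bin_width), 0)
        pid = 0
        while bitmap:                               # walk the set bits
            if bitmap & 1:
                hits[pid] += 1
            bitmap >>= 1
            pid += 1
    return sorted(hits.items(), key=lambda kv: -kv[1])

if __name__ == "__main__":
    db = {0: [500.2, 732.4], 1: [500.3, 961.1, 732.9]}
    print(candidates(build_index(db), [500.25, 732.5]))  # both proteins score 2
```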
Implementing High Performance Remote Method Invocation in CCA
2011 IEEE International Conference on Cluster Computing | Pub Date: 2011-09-26 | DOI: 10.1109/CLUSTER.2011.78
Jian Yin, Khushbu Agarwal, M. Krishnan, D. Chavarría-Miranda, I. Gorton, T. Epperly
Abstract: We report our effort in engineering a high performance remote method invocation (RMI) mechanism for the Common Component Architecture (CCA). This mechanism provides a highly efficient and easy-to-use facility for distributed computing in CCA, enabling CCA applications to effectively leverage parallel systems to accelerate computations. This work builds on the previous work of Babel RMI. Babel is a high performance language interoperability tool used in CCA so that scientific application writers can share, reuse, and compose applications from software components written in different programming languages. Babel provides a transparent and flexible RMI framework for distributed computing. However, the existing Babel RMI implementation is built on top of TCP and does not provide the level of performance required to distribute fine-grained tasks. We observed that the main reason the TCP-based RMI does not perform well is that it does not efficiently utilize the high performance interconnect hardware on a cluster. We have implemented a high performance RMI protocol, HPCRMI. HPCRMI achieves low latency by building on top of a low-level portable communication library, the Aggregate Remote Memory Copy Interface (ARMCI), and by minimizing communication for each RMI call. Our design allows an RMI operation to be completed with only two RDMA operations. We also aggressively optimize our system to reduce copying. In this paper, we discuss the design and our experimental evaluation of this protocol. Our experimental results show that our protocol can improve RMI performance by an order of magnitude.
Citations: 4
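The two-RDMA-op call pattern can be pictured as follows: the caller writes the marshaled request directly into a preregistered server-side buffer (op 1), and the server writes the result back into a caller-side buffer (op 2). Everything below is a hypothetical, in-process simulation; rdma_put and RegisteredBuffer stand in for one-sided writes into pinned memory and are not the ARMCI API.

```python
# Conceptual simulation of an RMI completed with two one-sided writes.

import pickle

class RegisteredBuffer:
    """Stands in for a pinned, remotely writable memory region."""
    def __init__(self):
        self.data = None

def rdma_put(remote_buf, payload):        # hypothetical one-sided write
    remote_buf.data = payload

def rmi_call(server, request_buf, reply_buf, method, *args):
    rdma_put(request_buf, pickle.dumps((method, args)))  # RDMA op 1
    server.poll(request_buf, reply_buf)                  # server-side progress
    return pickle.loads(reply_buf.data)

class Server:
    def __init__(self, obj):
        self.obj = obj

    def poll(self, request_buf, reply_buf):
        method, args = pickle.loads(request_buf.data)
        result = getattr(self.obj, method)(*args)
        rdma_put(reply_buf, pickle.dumps(result))        # RDMA op 2

class Calculator:
    def add(self, a, b):
        return a + b

if __name__ == "__main__":
    req, rep = RegisteredBuffer(), RegisteredBuffer()
    print(rmi_call(Server(Calculator()), req, rep, "add", 2, 3))  # 5
```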
High Performance Dense Linear System Solver with Soft Error Resilience
2011 IEEE International Conference on Cluster Computing | Pub Date: 2011-09-26 | DOI: 10.1109/CLUSTER.2011.38
Peng Du, P. Luszczek, J. Dongarra
Abstract: As the scale of modern high-end computing systems continues to grow rapidly, system failure has become an issue that requires a better solution than the commonly used scheme of checkpoint and restart (C/R). While hard errors have been studied extensively over the years, soft errors are still under-studied, especially for modern HPC systems, and in some scientific applications C/R is not applicable to soft errors at all due to error propagation and lack of error awareness. In this work, we propose algorithm-based fault tolerance (ABFT) for a high performance dense linear system solver with soft error resilience. By adopting a mathematical model that treats a soft error during LU factorization as a rank-one perturbation, the solution of Ax=b can be recovered with the Sherman-Morrison formula. Our contributions include extending the error model from Gaussian elimination and pairwise pivoting to LU with partial pivoting, a practical numerical bound for error detection, and a scalable checkpointing algorithm to protect the left factor, which is needed for recovering x from a soft error. Experimental results on cluster systems with ScaLAPACK show that the fault tolerance functionality adds little overhead to the linear system solve and scales well on such systems.
Citations: 36
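The recovery identity the abstract relies on is standard Sherman-Morrison: if the computed factorization actually represents A' = A + uv^T rather than A, then A^{-1} = A'^{-1} + A'^{-1}u v^T A'^{-1} / (1 - v^T A'^{-1} u), so x solving Ax = b is recoverable from two solves against the erroneous factors. A worked NumPy sketch of that identity (our illustration, not the paper's ScaLAPACK code):

```python
# Recover the solution of A x = b from a rank-one-perturbed factorization
# A_err = A + u v^T using the Sherman-Morrison formula.

import numpy as np

rng = np.random.default_rng(0)
n = 5
A = rng.standard_normal((n, n)) + n * np.eye(n)  # well-conditioned test matrix
b = rng.standard_normal(n)
u = rng.standard_normal(n)                       # rank-one soft-error model
v = rng.standard_normal(n)
A_err = A + np.outer(u, v)                       # what was actually factored

# A = A_err - u v^T, so
# A^{-1} b = y + z (v.y) / (1 - v.z), with y = A_err^{-1} b, z = A_err^{-1} u
y = np.linalg.solve(A_err, b)                    # solve with erroneous factors
z = np.linalg.solve(A_err, u)
x = y + z * (v @ y) / (1.0 - v @ z)

assert np.allclose(A @ x, b)                     # x solves the *original* system
```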
Energy Templates: Exploiting Application Information to Save Energy
2011 IEEE International Conference on Cluster Computing | Pub Date: 2011-09-26 | DOI: 10.1109/CLUSTER.2011.33
D. Kerbyson, Abhinav Vishnu, K. Barker
Abstract: In this work we consider a novel application-centric approach for saving energy on large-scale parallel systems. By using a priori information on expected application behavior, we identify points at which processor cores will wait for incoming data and thus may be placed in a low-power state to save energy. The approach is general and complements many existing approaches that rely on saving energy at points of global synchronization. We capture the expected application behavior in an Energy Template, whose purpose is to identify when cores are expected to be idle and to allow the runtime to use the template information to change the power state of the core. We prototype an Energy Template for a wave front algorithm that contains a complex processing pattern in which cores wait for incoming data before processing local data and whose wait time varies from phase to phase. The implementation uses PMPI and requires minimal changes to the application code. Using a power-instrumented cluster, we demonstrate that using an Energy Template for the wave front application lowers the power requirements by 8% when using 216 cores, out of a system maximum of 23%, and the energy requirements by 4%. We also show that the wave front's inherent parallel activity will lead to increased savings on larger systems.
Citations: 25
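The mechanism can be sketched as a per-phase lookup: the template records whether this core is expected to idle waiting for upstream data in a given phase, and the runtime drops the core into a low-power state before blocking on the receive. The code below is a hypothetical schematic of that idea; set_power_state and the phase loop are our own stand-ins, not the paper's PMPI interposition layer.

```python
# Schematic Energy Template runtime: hypothetical power hook, toy phases.

import time

def set_power_state(core, state):
    # Hypothetical hook; a real runtime would use DVFS / C-state control.
    print(f"core {core}: -> {state}")

def blocking_receive(phase):
    time.sleep(0.01)               # stands in for MPI_Recv of upstream data
    return f"data for phase {phase}"

def process(data):
    pass                           # the wave front kernel would run here

def run_phases(core, template, num_phases):
    """template[phase] is True when this core is expected to wait."""
    for phase in range(num_phases):
        if template.get(phase, False):
            set_power_state(core, "low")    # expected idle: save energy
            data = blocking_receive(phase)
            set_power_state(core, "high")   # data arrived: full speed
        else:
            data = f"local data {phase}"
        process(data)

if __name__ == "__main__":
    run_phases(core=0, template={0: True, 2: True}, num_phases=4)
```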
Experience on Comparison of Operating Systems Scalability on the Multi-core Architecture
2011 IEEE International Conference on Cluster Computing | Pub Date: 2011-09-26 | DOI: 10.1109/CLUSTER.2011.31
Yan Cui, Yingxin Wang, Yu Chen, Yuanchun Shi
Abstract: Multi-core processor architectures have become ubiquitous in today's computing platforms, especially in parallel computing installations, owing to their power and cost advantages. While the technology trend continues towards having hundreds of cores on a chip in the foreseeable future, an urgent question posed to system designers as well as application users is whether applications can receive sufficient support from today's operating systems to scale to many cores. To this end, people need to understand the strengths and weaknesses of each system's scalability support and to identify the major bottlenecks limiting scalability, if any. As open-source operating systems are of particular interest to the research and industry communities, in this paper we choose three operating systems (Linux, Solaris, and FreeBSD) and systematically evaluate and compare their scalability using a set of highly focused microbenchmarks on an AMD 32-core system, to gain a broad and detailed understanding of their scalability. We use system profiling tools and analyze kernel source code to find the root cause of each observed scalability bottleneck. Our results reveal that no single operating system among the three stands out on all system aspects, though some systems prevail on some aspects. For example, Linux outperforms Solaris and FreeBSD significantly for file-descriptor- and process-intensive operations, while for applications with intensive socket creation and deletion operations, Solaris leads FreeBSD, which in turn scales better than Linux. With the help of performance tools and source code instrumentation and analysis, we find that synchronization primitives protecting shared data structures in the kernels are the major bottleneck limiting system scalability. Empowered by the knowledge obtained through targeted experiments and analysis on a small-scale system, we are able to project the scalability of an application on any of the investigated operating systems running on a system with a larger number of cores.
Citations: 8
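A minimal sketch of the kind of highly focused microbenchmark the abstract describes (our illustration, not the paper's benchmark suite): drive one kernel-intensive operation, here task completion across worker processes, at increasing worker counts and watch how throughput scales with cores.

```python
# Toy scalability microbenchmark: throughput of CPU-bound tasks spread
# over a growing pool of worker processes.

import multiprocessing as mp
import time

def spin(n=10_000):
    """A small CPU-bound unit of work."""
    s = 0
    for i in range(n):
        s += i
    return s

def throughput(num_workers, tasks_per_worker=50):
    """Tasks completed per second with num_workers processes."""
    start = time.perf_counter()
    with mp.Pool(num_workers) as pool:
        pool.map(spin, [10_000] * (num_workers * tasks_per_worker))
    elapsed = time.perf_counter() - start
    return num_workers * tasks_per_worker / elapsed

if __name__ == "__main__":
    for n in (1, 2, 4, 8):
        print(f"{n:2d} workers: {throughput(n):8.1f} tasks/s")
```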