2005 IEEE International Conference on Cluster Computing: Latest Papers, Page 3

Exploiting NIC Memory for Improving Cluster-Based Webserver Performance

2005 IEEE International Conference on Cluster Computing Pub Date : 2005-09-01 DOI: 10.1109/CLUSTR.2005.347067

G. S. Choi, Jin-Ha Kim, D. Ersoz, Mazin S. Yousif, C. Das

Citations: 2

The SMASH Impacts to Cluster Computing

2005 IEEE International Conference on Cluster Computing Pub Date : 2005-09-01 DOI: 10.1109/CLUSTR.2005.347081

Yung-Chin Fang, J. Hsieh

Abstract: Summary form only given. As high-performance computing clusters scale out, manageability will become more important than ever. Over time, a computer center tends to accumulate multiple management frameworks from different vendors in order to remotely manage successive generations of heterogeneous HPC clusters, even to complete a single task. This heterogeneous, scaled-out computing infrastructure makes HPCC/grid administration more challenging and time-consuming than before, and management interoperability is usually compromised or absent. To solve this problem in the long run and further reduce the total cost of ownership, industry is defining the Systems Management Architecture for Server Hardware (SMASH) initiative. SMASH is a suite of specifications that standardizes management interfaces and remote management architecture for heterogeneous computing environments. The suite includes a unified command-line protocol, resource discovery, resource addressing, and data model profiles. SMASH not only addresses complicated administration challenges but also enables hardware-independent remote manageability and job scheduling schemes that are aware of infrastructure status and performance; as a result, it will bring HPC cluster/grid utilization rates to an even higher level. This poster uses figures to illustrate the challenges and the corresponding SMASH specifications, and points out potential research directions in the supercomputing space that build on SMASH implementations.

Citations: 3

Minimizing the Network Overhead of Checkpointing in Cycle-harvesting Cluster Environments

2005 IEEE International Conference on Cluster Computing Pub Date : 2005-09-01 DOI: 10.1109/CLUSTR.2005.347074

Daniel Nurmi, J. Brevik, R. Wolski

Abstract: Cycle-harvesting systems such as Condor have been developed to make desktop machines in a local area (which are often similar to clusters in hardware configuration) available as a compute platform. To provide a dual-use capability, opportunistic jobs harvesting cycles from the desktop must be checkpointed before the desktop resources are reclaimed by their owners and the job is evacuated. In this paper, we investigate a new system for computing efficient checkpoint schedules in cycle-harvesting environments. Our system records the historical availability of each resource and fits a statistical model to the observations. Because checkpoints must often traverse the network (i.e., the desktop hosts do not provide sufficient persistent storage for checkpoints), we combine this model with predictions of network performance to the storage site to compute a checkpoint schedule. When an application is initiated on a particular resource, the system uses the computed distribution to parameterize a Markov state-transition model for the application's execution, evaluates the expected time and network overhead as a function of the checkpoint interval, and numerically optimizes with respect to time. We report on the implementation and performance of this system using the Condor cycle-harvesting environment at the University of Wisconsin. We also evaluate the efficiencies we achieve for a variety of network overheads using trace-based simulation. Finally, we validate our simulations against the observed performance with Condor. Our results indicate that while the choice of model distribution has a relatively small but positive effect on time efficiency, it has a substantial impact on network utilization.
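The idea of evaluating expected time and network overhead as a function of the checkpoint interval, then optimizing numerically, can be illustrated with a much simpler model than the one in the paper. The sketch below is not the authors' Markov model or their fitted distributions: it assumes exponentially distributed failures, a fixed checkpoint transfer time, and a fixed restart cost. All function and parameter names are hypothetical.

```python
import math


def expected_runtime(tau, work, ckpt_cost, restart_cost, fail_rate):
    """Expected wall-clock seconds to finish `work` seconds of computation
    when a checkpoint is written after every `tau` seconds of progress.

    Assumption: failures are exponential with rate `fail_rate` per second;
    a failure loses the current segment and costs `restart_cost` seconds.
    """
    segment = tau + ckpt_cost  # compute phase plus checkpoint transfer
    per_segment = (1.0 / fail_rate + restart_cost) * math.expm1(fail_rate * segment)
    return (work / tau) * per_segment


def network_bytes(tau, work, ckpt_bytes):
    """Bytes sent to the storage site, counting one checkpoint per completed segment."""
    return (work / tau) * ckpt_bytes


def best_interval(work, ckpt_cost, restart_cost, fail_rate, iterations=200):
    """Ternary search for the interval that minimizes expected runtime.

    Under the assumptions above the objective is unimodal in tau and its
    minimizer lies below 1/fail_rate, so the search range is bounded by
    min(work, 1/fail_rate).
    """
    lo, hi = 1e-6, min(work, 1.0 / fail_rate)
    for _ in range(iterations):
        a = lo + (hi - lo) / 3.0
        b = hi - (hi - lo) / 3.0
        fa = expected_runtime(a, work, ckpt_cost, restart_cost, fail_rate)
        fb = expected_runtime(b, work, ckpt_cost, restart_cost, fail_rate)
        if fa < fb:
            hi = b
        else:
            lo = a
    return (lo + hi) / 2.0


if __name__ == "__main__":
    # Hypothetical job: 24 h of work, 60 s per checkpoint, 10 h mean time between failures.
    tau = best_interval(work=86400.0, ckpt_cost=60.0, restart_cost=120.0, fail_rate=1.0 / 36000.0)
    print(f"interval: {tau:.0f} s")
    print(f"expected runtime: {expected_runtime(tau, 86400.0, 60.0, 120.0, 1.0 / 36000.0):.0f} s")
    print(f"network traffic: {network_bytes(tau, 86400.0, 500e6) / 1e9:.1f} GB")
```

In this simplified setting, lengthening the interval beyond the time-optimal value reduces network traffic in proportion to `1 / tau` while the expected runtime rises only slowly near the optimum, which is one way to see why time efficiency and network utilization can respond very differently to the same choice.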

Citations: 15

Modeling Protocol Offload for Message-oriented Communication

2005 IEEE International Conference on Cluster Computing Pub Date : 2005-09-01 DOI: 10.1109/CLUSTR.2005.347069

Patricia Gilfeather, A. Maccabe

Citations: 17

Efficient and Robust Computation of Resource Clusters in the Internet

2005 IEEE International Conference on Cluster Computing Pub Date : 2005-09-01 DOI: 10.1109/CLUSTR.2005.347046

Chuang Liu, Ian T Foster

Citations: 1

An integrated Retrieval and Pre-fetching algorithms for Segmented Streaming in Mobile Peer-to-Peer Networks

2005 IEEE International Conference on Cluster Computing Pub Date : 2005-09-01 DOI: 10.1109/CLUSTR.2005.347094

Zhou Su, J. Katto, Y. Yasuda

Citations: 2

Reliability-aware Checkpoint/Restart Scheme: A Performability Trade-off

2005 IEEE International Conference on Cluster Computing Pub Date : 2005-09-01 DOI: 10.1109/CLUSTR.2005.347058

Yudan Liu, C. Leangsuksun, Hertong Song, S. Scott

Abstract: In previous years, large-scale clusters have been commonly deployed to solve important grand-challenge scientific problems. To reduce computational time, system sizes have been steadily expanded. Unfortunately, the reliability of such cluster systems moves in the opposite direction as system scale grows. Since the failure of a single node can result in a system outage, it is essential to deal effectively with faults in the grand-challenge problem-solving environment. Checkpointing is one of the common fault tolerance techniques. However, checkpointing poses many challenges, such as overhead, latency, and consistency, as well as recovery. In this paper, a reliability-aware checkpoint/restart method is introduced. It is a novel technique that places checkpoints based on system reliability. We constructed a cost model and derived an optimal checkpoint placement function based on failure rates; a trade-off between performance and reliability (i.e., performability) was a key consideration. We also implemented a proof-of-concept and demonstrated improvements resulting from our techniques for fault-tolerant MPI applications on an HA-OSCAR cluster.
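The link between system scale, failure rate, and checkpoint placement can be made concrete with a back-of-the-envelope calculation. The sketch below does not reproduce the paper's cost model or its placement function. It assumes independent node failures at a constant rate, assumes that any single node failure stops the job, and substitutes Young's well-known first-order approximation for the checkpoint interval. Function names and the example numbers are hypothetical.

```python
import math


def system_mtbf(node_mtbf_hours, nodes):
    """Mean time between failures of the whole system, assuming independent
    nodes with constant failure rates where any node failure is a system failure."""
    return node_mtbf_hours / nodes


def young_interval(ckpt_cost_hours, mtbf_hours):
    """Young's first-order approximation of the checkpoint interval:
    sqrt(2 * checkpoint cost * MTBF). A stand-in, not the paper's function."""
    return math.sqrt(2.0 * ckpt_cost_hours * mtbf_hours)


if __name__ == "__main__":
    node_mtbf = 10000.0  # hypothetical per-node MTBF in hours
    ckpt_cost = 0.05     # hypothetical checkpoint cost in hours (3 minutes)
    for nodes in (1, 100, 1000):
        mtbf = system_mtbf(node_mtbf, nodes)
        interval = young_interval(ckpt_cost, mtbf)
        print(f"{nodes:5d} nodes: system MTBF {mtbf:8.1f} h, checkpoint every {interval:5.2f} h")
```

Under these assumptions the system MTBF shrinks in proportion to the node count, so the suggested checkpoint interval shrinks with the square root of the node count and a growing share of run time goes to checkpointing.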

Citations: 18

Implementation and Performance of Portals 3.3 on the Cray XT3

2005 IEEE International Conference on Cluster Computing Pub Date : 2005-09-01 DOI: 10.1109/CLUSTR.2005.347061

R. Brightwell, Trammell Hudson, K. Pedretti, R. Riesen, K. Underwood

Citations: 35

Meaningful Automated Statistical Analysis of Large Computational Clusters

2005 IEEE International Conference on Cluster Computing Pub Date : 2005-09-01 DOI: 10.1109/CLUSTR.2005.347090

J. Brandt, A. Gentile, Y. Marzouk, P. Pébay

Abstract: As clusters utilizing commercial off-the-shelf technology have grown from tens to thousands of nodes and typical job sizes have likewise increased, much effort has been devoted to improving the scalability of message-passing fabrics, schedulers, and storage. Largely ignored, however, has been the issue of predicting node failure, which also has a large impact on scalability. In fact, more than ten years into cluster computing, we are still managing this issue on a node-by-node basis even though available diagnostic data has grown immensely. We have built a tool that uses the statistical similarity of the large number of nodes in a cluster to infer the health of each individual node. In the poster, we first present real data and statistical calculations as foundational material and justification for our claims of similarity. Next we present our methodology and its implications for early notification of deviation from normal behavior, problem diagnosis, automatic code restart via interaction with scheduler, and airflow distribution monitoring in the machine room. A framework addressing scalability is discussed briefly. Lastly, we present case studies showing how our methodology has been used to detect aberrant nodes whose deviations are still far below the detection level of traditional methods. A summary of the results of the case studies appears below.
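Using the similarity of many nodes to judge the health of each one can be illustrated with a standard robust outlier test. The abstract does not say which statistics the authors' tool computes, so the sketch below is only one plausible reading: it compares each node's reading against the cluster median and flags nodes whose deviation is large relative to the median absolute deviation (MAD). The node names, the temperature readings, and the 3.5 threshold are hypothetical.

```python
from statistics import median


def aberrant_nodes(readings, threshold=3.5):
    """Return the names of nodes whose reading deviates strongly from the rest.

    `readings` maps node name -> one measurement of the same quantity
    (for example a CPU temperature) taken on every node.
    """
    values = list(readings.values())
    center = median(values)
    mad = median([abs(v - center) for v in values])
    if mad == 0:
        return []  # no spread to compare against, so flag nothing
    scale = 1.4826 * mad  # makes MAD comparable to a standard deviation for normal data
    return sorted(name for name, v in readings.items()
                  if abs(v - center) / scale > threshold)


if __name__ == "__main__":
    temps = {"n01": 41.0, "n02": 42.0, "n03": 41.5,
             "n04": 40.5, "n05": 42.5, "n06": 55.0}
    print(aberrant_nodes(temps))  # prints ['n06']
```

The median and MAD are used instead of the mean and standard deviation so that the aberrant node itself does not distort the baseline it is compared against. A fixed absolute alarm threshold, by contrast, would fire only once a reading is far outside the normal range.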

Citations: 7

Swapping to Remote Memory over InfiniBand: An Approach using a High Performance Network Block Device

2005 IEEE International Conference on Cluster Computing Pub Date : 2005-09-01 DOI: 10.1109/CLUSTR.2005.347050

Shuang Liang, R. Noronha, D. Panda

Abstract: Traditionally, operations on memory located on other nodes (remote memory) in cluster environments interconnected with technologies like Gigabit Ethernet have been expensive, with latencies several orders of magnitude higher than local memory accesses. Modern RDMA-capable networks such as InfiniBand and Quadrics provide low latency of a few microseconds and high bandwidth of up to 10 Gbps. This has significantly reduced the latency gap between access to local memory and remote memory in modern clusters. Remote idle memory can be exploited to reduce the memory pressure on individual nodes. This is akin to adding an additional level in the memory hierarchy between local memory and the disk, with potentially dramatic performance improvements, especially for memory-intensive applications. In this paper, we take on the challenge of designing a remote paging system for remote memory utilization in InfiniBand clusters. We present the design and implementation of a high performance networking block device (HPBD) over InfiniBand fabric, which serves as a swap device of the kernel virtual memory (VM) system for efficient page transfer to/from remote memory servers. Our experiments show that with HPBD, quick sort performs only 1.45 times slower than with the local memory system, and up to 21 times faster than with local disk. Our design is also completely transparent to user applications. To the best of our knowledge, this is the first remote pager design that uses InfiniBand for remote memory utilization.

Citations: 116