2015 IEEE International Conference on Cluster Computing最新文献_第3页

A Workload-Aware Energy Model for Virtual Machine Migration 虚拟机迁移的工作负载感知能量模型

2015 IEEE International Conference on Cluster Computing Pub Date : 2015-09-08 DOI: 10.1109/CLUSTER.2015.47

Vincenzo De Maio, G. Kecskeméti, R. Prodan

引用次数: 21

Development of MapReduce and MPI Programs for Motif Search Motif搜索MapReduce和MPI程序的开发

2015 IEEE International Conference on Cluster Computing Pub Date : 2015-09-08 DOI: 10.1109/CLUSTER.2015.82

Mejdl S. Safran, Saad Al-qahtani, Michelle Zhu, D. Che

引用次数: 0

I/O-Aware Batch Scheduling for Petascale Computing Systems 千兆级计算系统的I/ o感知批调度

2015 IEEE International Conference on Cluster Computing Pub Date : 2015-09-08 DOI: 10.1109/CLUSTER.2015.45

Zhou Zhou, Xu Yang, Dongfang Zhao, P. Rich, Wei Tang, Jia Wang, Z. Lan

{"title":"I/O-Aware Batch Scheduling for Petascale Computing Systems","authors":"Zhou Zhou, Xu Yang, Dongfang Zhao, P. Rich, Wei Tang, Jia Wang, Z. Lan","doi":"10.1109/CLUSTER.2015.45","DOIUrl":"https://doi.org/10.1109/CLUSTER.2015.45","url":null,"abstract":"In the Big Data era, the gap between the storage performance and an application's I/O requirement is increasing. I/O congestion caused by concurrent storage accesses from multiple applications is inevitable and severely harms the performance. Conventional approaches either focus on optimizing an application's access pattern individually or handle I/O requests on a low-level storage layer without any knowledge from the upper-level applications. In this paper, we present a novel I/O-aware batch scheduling framework to coordinate ongoing I/O requests on petascale computing systems. The motivation behind this innovation is that the batch scheduler has a holistic view of both the system state and jobs' activities and can control the jobs' status on the fly during their execution. We treat a job's I/O requests as periodical subjobs within its lifecycle and transform the I/O congestion issue into a classical scheduling problem. We design two scheduling polices with different scheduling objectives either on user-oriented metrics or system performance. We conduct extensive trace-based simulations using real job traces and I/O traces from a production IBM Blue Gene/Q system. Experimental results demonstrate that our design can improve job performance by more than 30%, as well as increasing system performance.","PeriodicalId":187042,"journal":{"name":"2015 IEEE International Conference on Cluster Computing","volume":"16 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-09-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133751156","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 43

Building Bridges from the Campus to XSEDE 搭建从校园到XSEDE的桥梁

2015 IEEE International Conference on Cluster Computing Pub Date : 2015-09-08 DOI: 10.1109/CLUSTER.2015.144

Liming Lee, Ian T Foster, S. Tuecke

引用次数: 1

monBench: A Database Performance Benchmark for Cloud Monitoring System monBench:用于云监控系统的数据库性能基准

2015 IEEE International Conference on Cluster Computing Pub Date : 2015-09-08 DOI: 10.1109/CLUSTER.2015.94

Xinkui Zhao, Jianwei Yin, Chen Zhi, Pengxiang Lin, Shichun Feng, Hao Wu, Zuoning Chen

引用次数: 1

Detecting and Correcting Data Corruption in Stencil Applications through Multivariate Interpolation 通过多元插值检测和纠正模板应用中的数据损坏

2015 IEEE International Conference on Cluster Computing Pub Date : 2015-09-08 DOI: 10.1109/CLUSTER.2015.108

L. Bautista-Gomez, F. Cappello

{"title":"Detecting and Correcting Data Corruption in Stencil Applications through Multivariate Interpolation","authors":"L. Bautista-Gomez, F. Cappello","doi":"10.1109/CLUSTER.2015.108","DOIUrl":"https://doi.org/10.1109/CLUSTER.2015.108","url":null,"abstract":"High-performance computing is a powerful tool that allows scientists to study complex natural phenomena. Extreme-scale supercomputers promise orders of magnitude higher performance compared with that of current systems. However, power constrains in future exascale systems might limit the level of resilience of those machines. In particular, data could get corrupted silently, that is, without the hardware detecting the corruption. This situation is clearly unacceptable: simulation results must be within the error margin specified by the user. In this paper, we exploit multivariate interpolation in order to detect and correct data corruption in stencil applications. We evaluate this technique with a turbulent fluid application, and we demonstrate that the prediction error using multivariate interpolation is on the order of 0.01. Our results show that this mechanism can detect and correct most important corruptions and keep the error deviation under 1% during the entire execution while injecting one corruption per minute. In addition, we stress test the detector by injecting more than ten corruptions per minute and observe that our strategy allows the application to produce results with an error deviation under 10% in such a stressful scenario.","PeriodicalId":187042,"journal":{"name":"2015 IEEE International Conference on Cluster Computing","volume":"1947 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-09-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124912132","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 43

Analysis of XDMoD/SUPReMM Data Using Machine Learning Techniques 使用机器学习技术分析XDMoD/SUPReMM数据

2015 IEEE International Conference on Cluster Computing Pub Date : 2015-09-08 DOI: 10.1109/CLUSTER.2015.114

S. Gallo, Joseph P. White, R. L. Deleon, T. Furlani, Helen Ngo, A. Patra, Matthew D. Jones, Jeffrey T. Palmer, N. Simakov, Jeanette M. Sperhac, Martins D. Innus, Thomas Yearke, Ryan Rathsam

引用次数: 11

Taming Non-local Stragglers Using Efficient Prefetching in MapReduce MapReduce中使用高效预取控制非本地掉队者

2015 IEEE International Conference on Cluster Computing Pub Date : 2015-09-08 DOI: 10.1109/CLUSTER.2015.16

Ze Yu, Min Li, Xin Yang, Han Zhao, Xiaolin Li

{"title":"Taming Non-local Stragglers Using Efficient Prefetching in MapReduce","authors":"Ze Yu, Min Li, Xin Yang, Han Zhao, Xiaolin Li","doi":"10.1109/CLUSTER.2015.16","DOIUrl":"https://doi.org/10.1109/CLUSTER.2015.16","url":null,"abstract":"MapReduce has been widely adopted as a programming model to process big data. However, parallel jobs in MapReduce are prone to be plagued by stragglers caused by non-local tasks for two reasons: first, system logs from production clusters show that a non-local task can be two times slower than a local task; second, a job's completion time is bottlenecked by its slowest parallel tasks. As a result, even one single non-local task can become the straggler of the whole job, causing significant delay of the whole job. In this paper, we propose to alleviate this problem by proactively prefetching input data for non-local tasks. However, performing such prefetching efficiently in MapReduce is difficult, because it requires both application-level information to generate accurate prefetching requests at runtime, and an appropriate network flow scheduling mechanism to guarantee the timeliness of prefetching flows. To address these challenges, we design and implement FlexFetch, which 1) leverages a novel mechanism called speculative scheduling to accurately generate prefetching flows, 2) explicitly allocates network resources to prefetching flows using a criticality-aware deadline-driven flow scheduling algorithm. We evaluate FlexFetch through both testbed experiments and large-scale simulations using production workloads. The results show that FlexFetch reduces the completion time by 41.8% for small jobs and 26.8% on average, compared with the default MapReduce implementation in Hadoop.","PeriodicalId":187042,"journal":{"name":"2015 IEEE International Conference on Cluster Computing","volume":"33 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-09-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124642289","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 7

Modeling a Large Data-Acquisition Network in a Simulation Framework 基于仿真框架的大型数据采集网络建模

2015 IEEE International Conference on Cluster Computing Pub Date : 2015-09-08 DOI: 10.1109/CLUSTER.2015.137

T. Colombo, H. Fröning, P. García, W. Vandelli

{"title":"Modeling a Large Data-Acquisition Network in a Simulation Framework","authors":"T. Colombo, H. Fröning, P. García, W. Vandelli","doi":"10.1109/CLUSTER.2015.137","DOIUrl":"https://doi.org/10.1109/CLUSTER.2015.137","url":null,"abstract":"The ATLAS detector at CERN records particle collision \"events\" delivered by the Large Hadron Collider. Its data-acquisition system identifies, selects, and stores interesting events in near real-time, with an aggregate throughput of several 10 GB/s. It is a distributed software system executed on a farm of roughly 2000 commodity worker nodes communicating via TCP/IP on an Ethernet network. Event data fragments are received from the many detector readout channels and are buffered, collected together, analyzed and either stored permanently or discarded. This system, and data-acquisition systems in general, are sensitive to the latency of the data transfer from the readout buffers to the worker nodes. Challenges affecting this transfer include the many-to-one communication pattern and the inherently bursty nature of the traffic. In this paper we introduce the main performance issues brought about by this workload, focusing in particular on the so-called TCP incast pathology. Since performing systematic studies of these issues is often impeded by operational constraints related to the mission-critical nature of these systems, we focus instead on the development of a simulation model of the ATLAS data-acquisition system, used as a case study. The simulation is based on the well-established OMNeT++ framework. Its results are compared with existing measurements of the system's behavior. The successful reproduction of the measurements by the simulations validates the modeling approach. We share some of the preliminary findings obtained from the simulation, as an example of the additional possibilities it enables, and outline the planned future investigations.","PeriodicalId":187042,"journal":{"name":"2015 IEEE International Conference on Cluster Computing","volume":"173 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-09-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115997240","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 4

Fast Fault Injection and Sensitivity Analysis for Collective Communications 集体通信快速故障注入与灵敏度分析

2015 IEEE International Conference on Cluster Computing Pub Date : 2015-09-08 DOI: 10.1109/CLUSTER.2015.31

Kun Feng, Manjunath Gorentla Venkata, Dong Li, Xian-He Sun

{"title":"Fast Fault Injection and Sensitivity Analysis for Collective Communications","authors":"Kun Feng, Manjunath Gorentla Venkata, Dong Li, Xian-He Sun","doi":"10.1109/CLUSTER.2015.31","DOIUrl":"https://doi.org/10.1109/CLUSTER.2015.31","url":null,"abstract":"The collective communication operations, which are widely used in parallel applications for global communication and synchronization are critical for application's performance and scalability. However, how faulty collective communications impact the application and how errors propagate between the application processes is largely unexplored. One of the critical reasons for this situation is the lack of fast evaluation method to investigate the impacts of faulty collective operations. The traditional random fault injection methods relying on a large amount of fault injection tests to ensure statistical significance require a significant amount of resources and time. These methods result in prohibitive evaluation cost when applied to the collectives. In this paper, we introduce a novel tool named Fast Fault Injection and Sensitivity Analysis Tool (FastFIT) to conduct fast fault injection and characterize the application sensitivity to faulty collectives. The tool achieves fast exploration by reducing the exploration space and predicting the application sensitivity using Machine Learning (ML) techniques. A basis for these techniques are implicit correlations between MPI semantics, application context, critical application features, and application responses to faulty collective communications. The experimental results show that our approach reduces the fault injection points and tests by 97% for representative benchmarks (NAS Parallel Benchmarks (NPB)) and a realistic application (Large-scale Atomic/Molecular Massively Parallel Simulator (LAMMPS)) on a production supercomputer. Further, we statistically generalize the application sensitivity to faulty collective communications for these workloads, and present correlation between application features and the sensitivity.","PeriodicalId":187042,"journal":{"name":"2015 IEEE International Conference on Cluster Computing","volume":"223 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-09-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116297197","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 3