Proceedings of the 2007 ACM/IEEE Conference on Supercomputing (SC '07): Latest Papers

Efficient operating system scheduling for performance-asymmetric multi-core architectures
Pub Date: 2007-11-16 | DOI: 10.1145/1362622.1362694
Authors: Tong Li, Dan P. Baumberger, David A. Koufaty, Scott Hahn
Abstract: Recent research advocates asymmetric multi-core architectures, where cores in the same processor can have different performance. These architectures support single-threaded performance and multithreaded throughput at lower costs (e.g., die size and power). However, they also pose unique challenges to operating systems, which traditionally assume homogeneous hardware. This paper presents AMPS, an operating system scheduler that efficiently supports both SMP- and NUMA-style performance-asymmetric architectures. AMPS contains three components: asymmetry-aware load balancing, faster-core-first scheduling, and NUMA-aware migration. We have implemented AMPS in Linux kernel 2.6.16 and used CPU clock modulation to emulate performance asymmetry on an SMP and NUMA system. For various workloads, we show that AMPS achieves a median speedup of 1.16 with a maximum of 1.44 over stock Linux on the SMP, and a median of 1.07 with a maximum of 2.61 on the NUMA system. Our results also show that AMPS improves fairness and repeatability of application performance measurements. (A rough sketch of the faster-core-first idea follows this entry.)
Citations: 264
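The AMPS scheduler itself is a modification to the Linux kernel; the abstract above only names its components. Purely as an illustration of what faster-core-first placement with asymmetry-aware (scaled) load might mean, here is a minimal user-level Python sketch. The core speeds, task names, and the tie-breaking rule are invented for the example and are not taken from the paper.

from dataclasses import dataclass, field

@dataclass
class Core:
    cid: int
    speed: float                      # relative performance; 1.0 = baseline slow core
    tasks: list = field(default_factory=list)

    def scaled_load(self):
        # Asymmetry-aware load: queue length divided by core speed, so a
        # faster core is expected to absorb proportionally more work.
        return len(self.tasks) / self.speed

def place(task, cores):
    # Faster-core-first: among the least (scaled-)loaded cores, prefer the fastest.
    target = min(cores, key=lambda c: (c.scaled_load(), -c.speed))
    target.tasks.append(task)
    return target.cid

cores = [Core(0, 2.0), Core(1, 2.0), Core(2, 1.0), Core(3, 1.0)]   # two fast, two slow cores
for t in range(10):
    print(f"task{t} -> core {place(f'task{t}', cores)}")
for c in cores:
    print(f"core {c.cid} (speed {c.speed}): {len(c.tasks)} tasks, scaled load {c.scaled_load():.2f}")

Running the sketch shows the two fast cores absorbing more tasks than the two slow ones while the scaled loads stay balanced.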
Falkon: a Fast and Light-weight tasK executiON framework
Pub Date: 2007-11-16 | DOI: 10.1145/1362622.1362680
Authors: I. Raicu, Yong Zhao, C. Dumitrescu, Ian T Foster, M. Wilde
Abstract: To enable the rapid execution of many tasks on compute clusters, we have developed Falkon, a Fast and Light-weight tasK executiON framework. Falkon integrates (1) multi-level scheduling to separate resource acquisition (via, e.g., requests to batch schedulers) from task dispatch, and (2) a streamlined dispatcher. Falkon's integration of multi-level scheduling and streamlined dispatchers delivers performance not provided by any other system. We describe Falkon architecture and implementation, and present performance results for both microbenchmarks and applications. Microbenchmarks show that Falkon throughput (487 tasks/sec) and scalability (to 54,000 executors and 2,000,000 tasks processed in just 112 minutes) are one to two orders of magnitude better than other systems used in production Grids. Large-scale astronomy and medical applications executed under Falkon by the Swift parallel programming system achieve up to 90% reduction in end-to-end run time, relative to versions that execute tasks via separate scheduler submissions. (A toy sketch of the multi-level scheduling split follows this entry.)
Citations: 380
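Falkon's central idea, per the abstract, is to separate resource acquisition (asking a batch scheduler for workers) from task dispatch (pushing queued tasks onto workers that are already acquired). The Python sketch below imitates that split with threads and an in-memory queue; the provisioner, the 0.01 s task cost, and all names are invented for the illustration and are not Falkon's actual API.

import queue
import threading
import time

task_queue = queue.Queue()

def executor_loop(eid):
    # An "executor": once acquired, it repeatedly pulls tasks from the dispatcher's queue.
    while True:
        task = task_queue.get()
        time.sleep(0.01)               # stand-in for running the task
        print(f"executor {eid} finished {task}")
        task_queue.task_done()

def provision(n_executors):
    # Resource acquisition, done once and separately from per-task dispatch
    # (in Falkon this would be requests to a batch scheduler, not local threads).
    for i in range(n_executors):
        threading.Thread(target=executor_loop, args=(i,), daemon=True).start()

provision(4)                           # level 1: acquire resources
for i in range(20):
    task_queue.put(f"task-{i}")        # level 2: lightweight dispatch of many small tasks
task_queue.join()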
User-friendly and reliable grid computing based on imperfect middleware
Pub Date: 2007-11-16 | DOI: 10.1145/1362622.1362668
Authors: R. V. Nieuwpoort, T. Kielmann, H. Bal
Abstract: Writing grid applications is hard. First, interfaces to existing grid middleware often are too low-level for application programmers who are domain experts rather than computer scientists. Second, grid APIs tend to evolve too quickly for applications to follow. Third, failures and configuration incompatibilities require applications to use different solutions to the same problem, depending on the actual sites in use. This paper describes the Java Grid Application Toolkit (JavaGAT) that provides a high-level, middleware-independent and site-independent interface to the grid. The JavaGAT uses nested exceptions and intelligent dispatching of method invocations to handle errors and to automatically select suitable grid middleware implementations for requested operations. The JavaGAT's adaptor writing framework simplifies the implementation of interfaces to new middleware releases by combining nested exceptions and intelligent dispatching with rich default functionality. The many applications and middleware adaptors that have been provided by third-party developers indicate the viability of our approach. (A small analogue of the adaptor dispatching follows this entry.)
Citations: 58
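The abstract names two mechanisms: intelligent dispatching of a call over several middleware adaptors, and nested exceptions that preserve every failure when all adaptors fail. JavaGAT is a Java toolkit; the short Python analogue below only mirrors the control flow, with the adaptor names and the job made up for the example.

class AdaptorError(Exception):
    # Analogue of a nested exception: it carries every per-adaptor failure.
    def __init__(self, failures):
        self.failures = failures       # list of (adaptor_name, exception)
        details = "; ".join(f"{name}: {exc}" for name, exc in failures)
        super().__init__(f"all adaptors failed ({details})")

def submit_job(adaptors, job):
    # Intelligent dispatching: try each adaptor in turn, return the first success.
    failures = []
    for name, adaptor in adaptors:
        try:
            return adaptor(job)
        except Exception as exc:
            failures.append((name, exc))
    raise AdaptorError(failures)

def broken_adaptor(job):
    raise RuntimeError("remote site unreachable")    # simulated middleware failure

adaptors = [("globus", broken_adaptor), ("local", lambda job: f"ran '{job}' locally")]
print(submit_job(adaptors, "render-frame-42"))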
Using MPI file caching to improve parallel write performance for large-scale scientific applications
Pub Date: 2007-11-16 | DOI: 10.1145/1362622.1362634
Authors: W. Liao, A. Ching, Kenin Coloma, Arifa Nisar, A. Choudhary, Jacqueline H. Chen, R. Sankaran, S. Klasky
Abstract: Typical large-scale scientific applications periodically write checkpoint files to save the computational state throughout execution. Existing parallel file systems improve such write-only I/O patterns through the use of client-side file caching and write-behind strategies. In distributed environments where files are rarely accessed by more than one client concurrently, file caching has achieved significant success; however, in parallel applications where multiple clients manipulate a shared file, cache coherence control can serialize I/O. We have designed a thread-based caching layer for the MPI I/O library, which adds a portable caching system closer to user applications so more information about the application's I/O patterns is available for better coherence control. We demonstrate the impact of our caching solution on parallel write performance with a comprehensive evaluation that includes a set of widely used I/O benchmarks and production application I/O kernels. (A loose illustration of write-behind caching follows this entry.)
Citations: 26
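The caching layer in the paper sits inside the MPI-IO library and manages coherence among MPI processes; the small sketch below only illustrates the write-behind part of the idea at user level, where small writes are collected into page-sized buffers and issued later as larger aligned writes. The page size, file name, and class are invented, and POSIX os.pwrite stands in for the MPI file operations the real layer would use.

import os

PAGE = 4096                                  # assumed cache page size for the sketch

class WriteBehindCache:
    def __init__(self, path):
        self.fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
        self.pages = {}                      # page index -> bytearray of buffered data

    def write(self, offset, data):
        # Absorb the write into page-aligned buffers instead of issuing it now.
        while data:
            page, off = divmod(offset, PAGE)
            buf = self.pages.setdefault(page, bytearray(PAGE))
            n = min(PAGE - off, len(data))
            buf[off:off + n] = data[:n]
            offset, data = offset + n, data[n:]

    def flush(self):
        # Write-behind: one large aligned write per buffered page (POSIX pwrite,
        # available on Unix; the paper's layer would issue MPI-IO calls instead).
        for page in sorted(self.pages):
            os.pwrite(self.fd, bytes(self.pages[page]), page * PAGE)
        self.pages.clear()

    def close(self):
        self.flush()
        os.close(self.fd)

cache = WriteBehindCache("checkpoint.dat")
for i in range(100):
    cache.write(i * 100, b"x" * 100)         # many small sequential writes
cache.close()
print("bytes on disk:", os.path.getsize("checkpoint.dat"))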
Parallel hierarchical visualization of large time-varying 3D vector fields
Pub Date: 2007-11-16 | DOI: 10.1145/1362622.1362655
Authors: Hongfeng Yu, Chaoli Wang, K. Ma
Abstract: We present the design of a scalable parallel pathline construction method for visualizing large time-varying 3D vector fields. A 4D (i.e., time and the 3D spatial domain) representation of the vector field is introduced to make a time-accurate depiction of the flow field. This representation also allows us to obtain pathlines through streamline tracing in the 4D space. Furthermore, a hierarchical representation of the 4D vector field, constructed by clustering the 4D field, makes possible interactive visualization of the flow field at different levels of abstraction. Based on this hierarchical representation, a data partitioning scheme is designed to achieve high parallel efficiency. We demonstrate the performance of parallel pathline visualization using data sets obtained from terascale flow simulations. This new capability will enable scientists to study their time-varying vector fields at the resolution and interactivity previously unavailable to them. (A tiny sketch of 4D pathline tracing follows this entry.)
Citations: 80
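The abstract's central observation is that a pathline of a time-varying field is a streamline of the 4D field (x, y, z, t) in which time advances at unit rate. The sketch below integrates one such pathline with plain Euler steps over a made-up analytic velocity field; the field, seed, and step size are all invented, and the paper's hierarchical representation and parallel partitioning are not modeled.

import math

def velocity(x, y, z, t):
    # Made-up time-varying swirl around the z-axis whose strength oscillates in time.
    s = 1.0 + 0.5 * math.sin(t)
    return (-s * y, s * x, 0.1)

def trace_pathline(seed, t0, t1, dt=0.01):
    x, y, z = seed
    t = t0
    points = [(x, y, z, t)]
    while t < t1:
        u, v, w = velocity(x, y, z, t)
        # 4D streamline step: the spatial components advance by the velocity,
        # and the time component always advances with unit "velocity".
        x, y, z, t = x + u * dt, y + v * dt, z + w * dt, t + dt
        points.append((x, y, z, t))
    return points

path = trace_pathline(seed=(1.0, 0.0, 0.0), t0=0.0, t1=2.0 * math.pi)
print(f"{len(path)} points; last = ({path[-1][0]:.3f}, {path[-1][1]:.3f}, {path[-1][2]:.3f})")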
Evaluating NIC hardware requirements to achieve high message rate PGAS support on multi-core processors
Pub Date: 2007-11-16 | DOI: 10.1145/1362622.1362671
Authors: K. Underwood, M. Levenhagen, R. Brightwell
Abstract: Partitioned global address space (PGAS) programming models have been identified as one of the few viable approaches for dealing with emerging many-core systems. These models tend to generate many small messages, which requires specific support from the network interface hardware to enable efficient execution. In the past, Cray included E-registers on the Cray T3E to support the SHMEM API; however, with the advent of multi-core processors, the balance of computation to communication capabilities has shifted toward computation. This paper explores the message rates that are achievable with multi-core processors and simplified PGAS support on a more conventional network interface. For message rate tests, we find that simple network interface hardware is more than sufficient. We also find that even typical data distributions, such as cyclic or block-cyclic, do not need specialized hardware support. Finally, we assess the impact of such support on the well known RandomAccess benchmark. (A sketch of block-cyclic address arithmetic follows this entry.)
Citations: 14
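One concrete claim in the abstract is that common distributions such as cyclic or block-cyclic need no specialized hardware, because the owner and local offset of any element follow from simple integer arithmetic that the initiating core can do itself. A minimal version of that translation for a 1-D block-cyclic layout (block size and process count chosen arbitrarily for the example) is:

def block_cyclic_owner(i, block, nprocs):
    # Global index i -> (owning process, local offset) under a 1-D
    # block-cyclic distribution with the given block size.
    blk = i // block                   # which block the element falls in
    owner = blk % nprocs               # blocks are dealt out round-robin
    local_block = blk // nprocs        # blocks of this owner that precede it
    local_off = local_block * block + (i % block)
    return owner, local_off

BLOCK, NPROCS = 4, 3
for i in range(16):
    owner, off = block_cyclic_owner(i, BLOCK, NPROCS)
    print(f"element {i:2d} -> rank {owner}, local offset {off}")

This is the kind of address translation a PGAS runtime performs before issuing a small put or get, which is why no special NIC support is needed for the mapping itself.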
A case for low-complexity MP architectures
Pub Date: 2007-11-16 | DOI: 10.1145/1362622.1362648
Authors: Håkan Zeffer, Erik Hagersten
Abstract: Advances in semiconductor technology have driven shared-memory servers toward processors with multiple cores per die and multiple threads per core. This paper presents simple hardware primitives enabling flexible and low-complexity multi-chip designs supporting an efficient inter-node coherence protocol implemented in software. We argue that our primitives and the example design presented in this paper have lower hardware overhead, have easier (and later) verification requirements, and provide the opportunity for flexible coherence protocols and simpler protocol bug corrections than traditional designs. Our evaluation is based on detailed full-system simulations of modern chip-multiprocessors and both commercial and HPC workloads. We compare a low-complexity system based on the proposed primitives with aggressive hardware multi-chip shared-memory systems and show that the performance is competitive across a large design space. (A reduced illustration of a software coherence state machine follows this entry.)
Citations: 7
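The design argued for above keeps the inter-node coherence protocol in software on top of simple hardware primitives. As a very reduced picture of the kind of state such software manages, here is a three-state (MSI-style) directory for a single memory block; the states, printouts, and node numbering are generic textbook material, not the authors' primitives or protocol.

class Directory:
    # Software directory for one memory block: I = uncached, S = shared, M = modified.
    def __init__(self):
        self.state = "I"
        self.sharers = set()                     # nodes currently holding a copy

    def read(self, node):
        if self.state == "M":
            (owner,) = self.sharers
            if owner != node:
                print(f"  node {owner}: write back and downgrade M -> S")
                self.state = "S"
        else:
            self.state = "S"
        self.sharers.add(node)
        print(f"read  by node {node}: state={self.state}, sharers={sorted(self.sharers)}")

    def write(self, node):
        for other in sorted(self.sharers - {node}):
            print(f"  node {other}: invalidate copy")
        self.state, self.sharers = "M", {node}
        print(f"write by node {node}: state={self.state}, sharers={sorted(self.sharers)}")

d = Directory()
d.read(0)        # I -> S, node 0 caches the block
d.read(1)        # second sharer
d.write(2)       # invalidates nodes 0 and 1; node 2 holds it modified
d.read(0)        # node 2 writes back and downgrades; block shared by {0, 2}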
The ghost in the machine: observing the effects of kernel operation on parallel application performance
Pub Date: 2007-11-16 | DOI: 10.1145/1362622.1362662
Authors: A. Nataraj, A. Morris, A. Malony, M. Sottile, P. Beckman
Abstract: The performance of a parallel application on a scalable HPC system is determined by both user-level execution of the application code and system-level (OS kernel) operations. To understand the influences of system-level factors on application performance, the measurement of OS kernel activities is key. We describe a technology to observe kernel actions and make this information available to application-level performance measurement tools. The benefits of merged application and OS performance information and its use in parallel performance analysis are demonstrated, both for profiling and tracing methodologies. In particular, we focus on the problem of kernel noise assessment as a stress test of the approach. We show new results for characterizing noise and introduce new techniques for evaluating noise interference and its effects on application execution. Our kernel measurement and noise analysis technologies are being developed as part of Linux OS environments for scalable parallel systems. (A toy user-level noise probe follows this entry.)
Citations: 62
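The paper measures kernel activity directly and merges it with application-level data; the crude user-level probe below only shows the symptom that work targets. It times many identical small work units and reports how far slow runs stray from the fastest one, which is roughly how fixed-work noise microbenchmarks expose OS interference. The work size and repetition count are arbitrary, and this is not the authors' kernel measurement tooling.

import statistics
import time

def fixed_work(n=20000):
    # A fixed amount of computation; every run should take the same time
    # unless the kernel or another process interrupts it.
    s = 0
    for i in range(n):
        s += i * i
    return s

samples = []
for _ in range(2000):
    t0 = time.perf_counter_ns()
    fixed_work()
    samples.append(time.perf_counter_ns() - t0)

base = min(samples)                       # best case, roughly noise-free execution
overhead = [s - base for s in samples]
print(f"min {base / 1e3:.1f} us, median overhead {statistics.median(overhead) / 1e3:.1f} us, "
      f"max overhead {max(overhead) / 1e3:.1f} us")
print("slow outliers (>2x min):", sum(1 for s in samples if s > 2 * base))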
Multi-threading and one-sided communication in parallel LU factorization
Pub Date: 2007-11-16 | DOI: 10.1145/1362622.1362664
Authors: P. Husbands, K. Yelick
Abstract: Dense LU factorization has a high ratio of computation to communication and, as evidenced by the High Performance Linpack (HPL) benchmark, this property makes it scale well on most parallel machines. Nevertheless, the standard algorithm for this problem has non-trivial dependence patterns which limit parallelism, and local computations require large matrices in order to achieve good single processor performance. We present an alternative programming model for this type of problem, which combines UPC's global address space with lightweight multithreading. We introduce the concept of memory-constrained lookahead where the amount of concurrency managed by each processor is controlled by the amount of memory available. We implement novel techniques for steering the computation to optimize for high performance and demonstrate the scalability and portability of UPC with Teraflop level performance on some machines, comparing favourably to other state-of-the-art MPI codes. (A stripped-down sketch of memory-constrained lookahead follows this entry.)
Citations: 53
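Memory-constrained lookahead, as described above, lets each processor run ahead and factor future panels only while it has memory to hold them. Stripped of UPC and of the actual linear algebra, that control amounts to a counting semaphore sized by the buffer budget; in the sketch below (panel count, budget, and sleep times invented, and no real factorization performed) the lookahead thread stalls exactly when the budget is exhausted.

import queue
import threading
import time

N_PANELS = 8
BUFFER_BUDGET = 3                              # memory allows 3 panels in flight
buffers = threading.Semaphore(BUFFER_BUDGET)   # free panel buffers
factored = queue.Queue()                       # panels factored, awaiting trailing updates

def lookahead_factor():
    for k in range(N_PANELS):
        buffers.acquire()                      # lookahead stalls when the budget is used up
        print(f"panel {k}: factored ahead (buffer taken)")
        factored.put(k)

threading.Thread(target=lookahead_factor, daemon=True).start()

for _ in range(N_PANELS):
    k = factored.get()                         # a panel can only be consumed once factored
    time.sleep(0.05)                           # stand-in for the trailing-matrix update
    buffers.release()                          # its buffer can now be reused
    print(f"panel {k}: trailing update applied (buffer freed)")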
Automatic software interference detection in parallel applications
Pub Date: 2007-11-16 | DOI: 10.1145/1362622.1362642
Authors: V. Tabatabaee, J. Hollingsworth
Abstract: We present an automated software interference detection methodology for Single Program, Multiple Data (SPMD) parallel applications. Interference comes from the system and unexpected processes. If not detected and corrected, such interference may result in performance degradation. Our goal is to provide a reliable metric for software interference that can be used in soft-failure protection and recovery systems. A unique feature of our algorithm is that we measure the relative timing of application events (i.e., time between MPI calls) rather than system-level events such as CPU utilization. This approach lets our system automatically accommodate natural variations in an application's utilization of resources. We use performance irregularities and degradation as signs of software interference. However, instead of relying on temporal changes in performance, our system detects spatial performance degradation across multiple processors. We also include a case study that demonstrates our technique's effectiveness, resilience and robustness. (A toy spatial outlier check follows this entry.)
Citations: 9
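The detection metric described above compares the relative timing of application events (time between MPI calls) across processors and looks for spatial outliers rather than changes over time. The sketch below runs that kind of robust cross-rank comparison on synthetic timings, with one rank slowed artificially to stand in for interference; the threshold and all numbers are invented, and real input would come from MPI instrumentation rather than a random generator.

import random
import statistics

random.seed(0)
nranks = 16
# Synthetic samples of "time between successive MPI calls" (ms) per rank;
# rank 5 is made 40% slower to play the role of an interfered node.
gaps = {r: [random.gauss(10.0, 0.5) for _ in range(200)] for r in range(nranks)}
gaps[5] = [g * 1.4 for g in gaps[5]]

per_rank = {r: statistics.median(v) for r, v in gaps.items()}     # typical gap per rank
center = statistics.median(per_rank.values())                     # cluster-wide typical gap
mad = statistics.median(abs(m - center) for m in per_rank.values())

for r, m in sorted(per_rank.items()):
    if abs(m - center) > 10 * mad:                                # robust, tunable threshold
        print(f"rank {r}: median gap {m:.2f} ms vs cluster {center:.2f} ms -> possible interference")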