2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS)最新文献

筛选
英文 中文
NVIDIA Deep Learning Tutorial NVIDIA深度学习教程
J. Bernauer
{"title":"NVIDIA Deep Learning Tutorial","authors":"J. Bernauer","doi":"10.1109/IPDPS.2017.7","DOIUrl":"https://doi.org/10.1109/IPDPS.2017.7","url":null,"abstract":"Learn how hardware and software stacks enable not only quick prototyping, but also efficient large-scale production deployments. The tutorial will conclude with a discussion about hands-on deep learning training opportunities as well as free academic teaching materials and GPU cloud platforms for university faculty.","PeriodicalId":209524,"journal":{"name":"2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2017-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132925360","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 1
Container-Based Cloud Platform for Mobile Computation Offloading 基于容器的移动计算卸载云平台
Song Wu, Chao Niu, J. Rao, Hai Jin, Xiaohai Dai
{"title":"Container-Based Cloud Platform for Mobile Computation Offloading","authors":"Song Wu, Chao Niu, J. Rao, Hai Jin, Xiaohai Dai","doi":"10.1109/IPDPS.2017.47","DOIUrl":"https://doi.org/10.1109/IPDPS.2017.47","url":null,"abstract":"With the explosive growth of smartphones and cloud computing, mobile cloud, which leverages cloud resource to boost the performance of mobile applications, becomes attrac- tive. Many efforts have been made to improve the performance and reduce energy consumption of mobile devices by offloading computational codes to the cloud. However, the offloading cost caused by the cloud platform has been ignored for many years. In this paper, we propose Rattrap, a lightweight cloud platform which improves the offloading performance from cloud side. To achieve such goals, we analyze the characteristics of typical of- floading workloads and design our platform solution accordingly. Rattrap develops a new runtime environment, Cloud Android Container, for mobile computation offloading, replacing heavy- weight virtual machines (VMs). Our design exploits the idea of running operating systems with differential kernel features inside containers with driver extensions, which partially breaks the limitation of OS-level virtualization. With proposed resource sharing and code cache mechanism, Rattrap fundamentally improves the offloading performance. Our evaluation shows that Rattrap not only reduces the startup time of runtime environments and shows an average speedup of 16x, but also saves a large amount of system resources such as 75% memory footprint and at least 79% disk capacity. Moreover, Rattrap improves offloading response by as high as 63% over the cloud platform based on VM, and thus saving the battery life.","PeriodicalId":209524,"journal":{"name":"2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2017-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132973209","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 47
Efficient and Deterministic Scheduling for Parallel State Machine Replication 并行状态机复制的高效和确定性调度
O. Mendizabal, R. S. D. Moura, F. Dotti, F. Pedone
{"title":"Efficient and Deterministic Scheduling for Parallel State Machine Replication","authors":"O. Mendizabal, R. S. D. Moura, F. Dotti, F. Pedone","doi":"10.1109/IPDPS.2017.29","DOIUrl":"https://doi.org/10.1109/IPDPS.2017.29","url":null,"abstract":"Many services used in large scale web applications should be able to tolerate faults without impacting their performance. State machine replication is a well-known approach to implementing fault-tolerant services, providing high availability and strong consistency. To boost the performance of state machine replication, recent proposals have introduced parallel execution of commands. In parallel state machine replication, incoming commands may or may not depend on other commands that are waiting for execution. Although dependent commands must be processed in the same relative order at every replica to avoid inconsistencies, independent commands can be executed in parallel and benefit from multi-core architectures. Since many application workloads are mostly composed of independent commands, these parallel models promise high throughput without sacrificing strong consistency. The efficient execution of commands in such environments, however, requires effective scheduling strategies. Existing approaches rely on dependency tracking based on pairwise comparison between commands, which introduces scheduling contention. In this paper, we propose a new and highly efficient scheduler for parallel state machine replication. Our scheduler considers batches of commands, instead of commands individually. Moreover, each batch of commands is augmented with a compact data structure that encodes commands information needed to the dependency analysis. We show, by means of experimental evaluation, that our technique outperforms schedulers for parallel state machine replication by a fairly large margin.","PeriodicalId":209524,"journal":{"name":"2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2017-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114002334","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 17
Large Scale Manycore-Aware PIC Simulation with Efficient Particle Binning 基于高效粒子分割的大规模多核感知PIC仿真
H. Nakashima, Yoshiki Summura, Keisuke Kikura, Y. Miyake
{"title":"Large Scale Manycore-Aware PIC Simulation with Efficient Particle Binning","authors":"H. Nakashima, Yoshiki Summura, Keisuke Kikura, Y. Miyake","doi":"10.1109/IPDPS.2017.65","DOIUrl":"https://doi.org/10.1109/IPDPS.2017.65","url":null,"abstract":"We are now developing a manycore-aware implementation of multiprocessed PIC (particle-in-cell) simulation code with automatic load balancing. A key issue of the implementation is how to exploit the wide SIMD mechanism of manycore processors such as Intel Xeon Phi. Our solution is \"particle binning\" to rank all particles in a cell (voxel) in a chunk of SOA (structure-of-arrays) type one-dimensional arrays so that \"particle-push\" and \"current-scatter\" operations on them are efficiently SIMD-vectorized by our compiler. In addition, our sophisticated binning mechanism performs sorting of particles according to their positions \"on-the-fly\", efficiently coping with occasional \"bin overflow\" in a fully multithreaded manner. Our performance evaluation with up to 64 nodes of Cray XC30 and XC40 supercomputers, equipped with Xeon Phi 5120D (Knights Corner) and 7250 (Knights Landing) respectively, not only exhibited good parallel performance, but also proved the effectiveness of our binning mechanism.","PeriodicalId":209524,"journal":{"name":"2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2017-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116168905","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 5
Elastic-Cache: GPU Cache Architecture for Efficient Fine- and Coarse-Grained Cache-Line Management 弹性缓存:高效的细粒度和粗粒度缓存线管理的GPU缓存架构
Bingchao Li, Ji-zhou Sun, M. Annavaram, N. Kim
{"title":"Elastic-Cache: GPU Cache Architecture for Efficient Fine- and Coarse-Grained Cache-Line Management","authors":"Bingchao Li, Ji-zhou Sun, M. Annavaram, N. Kim","doi":"10.1109/IPDPS.2017.81","DOIUrl":"https://doi.org/10.1109/IPDPS.2017.81","url":null,"abstract":"GPUs provide high-bandwidth/low-latency on-chip shared memory and L1 cache to efficiently service a large number of concurrent memory requests (to contiguous memory space). To support warp-wide accesses to L1 cache, GPU L1 cache lines are very wide. However, such L1 cache architecture cannot always be efficiently utilized when applications generate many memory requests with irregular access patterns especially due to branch and memory divergences. In this paper, we propose Elastic-Cache that can efficiently support both fine- and coarse-grained L1 cache-line management for applications with both regular and irregular memory access patterns. Specifically, it can store 32- or 64-byte words in non-contiguous memory space to a single 128-byte cache line. Furthermore, it neither requires an extra tag storage structure nor reduces the capacity of L1 cache since it stores auxiliary tags for fine-grained L1 cache-line managements in sharedmemory space that is not fully used in many applications. Our experiment shows that Elastic-Cache improves the geo-mean performance of applications with irregular memory access patterns by 58% without degrading performance of applications with regular memory access patterns.","PeriodicalId":209524,"journal":{"name":"2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2017-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114087177","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 9
High-Performance Virtual Machine Migration Framework for MPI Applications on SR-IOV Enabled InfiniBand Clusters 支持SR-IOV的ib集群上MPI应用的高性能虚拟机迁移框架
Jie Zhang, Xiaoyi Lu, D. Panda
{"title":"High-Performance Virtual Machine Migration Framework for MPI Applications on SR-IOV Enabled InfiniBand Clusters","authors":"Jie Zhang, Xiaoyi Lu, D. Panda","doi":"10.1109/IPDPS.2017.43","DOIUrl":"https://doi.org/10.1109/IPDPS.2017.43","url":null,"abstract":"High-speed interconnects (e.g. InfiniBand) have been widely deployed on modern HPC clusters. With the emergence of HPC in the cloud, high-speed interconnects have paved their way into the cloud with recently introduced Single Root I/O Virtualization (SR-IOV) technology, which is able to provide efficient sharing of high-speed interconnect resources and achieve near-native I/O performance. However, recent studies have shown that SR-IOV-based virtual networks prevent virtual machine migration, which is an essential virtualization capability towards high availability and resource provisioning. Although several initial solutions have been pro- posed in the literature to solve this problem, our investigations show that there are still many restrictions on these proposed approaches, such as depending on specific network adapters and/or hypervisors, which will limit the usage scope of these solutions on HPC environments. In this paper, we propose a high-performance virtual machine migration framework for MPI applications on SR-IOV enabled InfiniBand clusters. Our proposed method does not need any modification to the hypervisor and InfiniBand drivers and it can efficiently handle virtual machine (VM) migration with SR-IOV IB device. Our evaluation results indicate that the proposed design is able to not only achieve fast VM migration speed but also guarantee the high performance for MPI applications during the migration in the HPC cloud. At the application level, for NPB LU benchmark running inside VM, our proposed design is able to completely hide the migration overhead through the computation and migration overlapping. Furthermore, our proposed design shows good scaling when migrating multiple VMs.","PeriodicalId":209524,"journal":{"name":"2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2017-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114613208","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 16
On Optimizing Distributed Tucker Decomposition for Dense Tensors 密集张量分布Tucker分解的优化研究
Venkatesan T. Chakaravarthy, Jee W. Choi, Douglas J. Joseph, Xing Liu, Prakash Murali, Yogish Sabharwal, D. Sreedhar
{"title":"On Optimizing Distributed Tucker Decomposition for Dense Tensors","authors":"Venkatesan T. Chakaravarthy, Jee W. Choi, Douglas J. Joseph, Xing Liu, Prakash Murali, Yogish Sabharwal, D. Sreedhar","doi":"10.1109/IPDPS.2017.86","DOIUrl":"https://doi.org/10.1109/IPDPS.2017.86","url":null,"abstract":"The Tucker decomposition expresses a given tensor as the product of a small core tensor and a set of factor matrices. Our objective is to develop an efficient distributed implementation for the case of dense tensors. The implementation is based on the HOOI (Higher Order Orthogonal Iterator) procedure, wherein the tensor-times-matrix product forms the core routine. Prior work have proposed heuristics for reducing the computational load and communication volume incurred by the routine. We study the two metrics in a formal and systematic manner, and design strategies that are optimal under the two fundamental metrics. Our experimental evaluation on a large benchmark of tensors shows that the optimal strategies provide significant reduction in load and volume compared to prior heuristics, and provide up to 7x speed-up in the overall running time.","PeriodicalId":209524,"journal":{"name":"2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2017-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114847538","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 30
Mimir: Memory-Efficient and Scalable MapReduce for Large Supercomputing Systems Mimir:用于大型超级计算系统的内存效率和可扩展MapReduce
Tao Gao, Yanfei Guo, Boyu Zhang, Pietro Cicotti, Yutong Lu, P. Balaji, M. Taufer
{"title":"Mimir: Memory-Efficient and Scalable MapReduce for Large Supercomputing Systems","authors":"Tao Gao, Yanfei Guo, Boyu Zhang, Pietro Cicotti, Yutong Lu, P. Balaji, M. Taufer","doi":"10.1109/IPDPS.2017.31","DOIUrl":"https://doi.org/10.1109/IPDPS.2017.31","url":null,"abstract":"In this paper we present Mimir, a new implementation of MapReduce over MPI. Mimir inherits the core principles of existing MapReduce frameworks, such as MR-MPI, while redesigning the execution model to incorporate a number of sophisticated optimization techniques that achieve similar or better performance with significant reduction in the amount of memory used. Consequently, Mimir allows significantly larger problems to be executed in memory, achieving large performance gains. We evaluate Mimir with three benchmarks on two highend platforms to demonstrate its superiority compared with that of other frameworks.","PeriodicalId":209524,"journal":{"name":"2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2017-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127408696","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 24
ATM: Approximate Task Memoization in the Runtime System ATM:运行时系统中的近似任务记忆
I. Brumar, Marc Casas, Miquel Moretó, M. Valero, G. Sohi
{"title":"ATM: Approximate Task Memoization in the Runtime System","authors":"I. Brumar, Marc Casas, Miquel Moretó, M. Valero, G. Sohi","doi":"10.1109/IPDPS.2017.49","DOIUrl":"https://doi.org/10.1109/IPDPS.2017.49","url":null,"abstract":"Redundant computations appear during the execution of real programs. Multiple factors contribute to these unnecessary computations, such as repetitive inputs and patterns, calling functions with the same parameters or bad programming habits. Compilers minimize non useful code with static analysis. However, redundant execution might be dynamic and there are no current approaches to reduce these inefficiencies. Additionally, many algorithms can be computed with different levels of accuracy. Approximate computing exploits this fact to reduce execution time at the cost of slightly less accurate results. In this case, expert developers determine the desired tradeoff between performance and accuracy for each application. In this paper, we present Approximate Task Memoization (ATM), a novel approach in the runtime system that transparently exploits both dynamic redundancy and approximation at the task granularity of a parallel application. Memoization of previous task executions allows predicting the results of future tasks without having to execute them and without losing accuracy. To further increase performance improvements, the runtime system can memoize similar tasks, which leads to task approximate computing. By defining how to measure task similarity and correctness, we present an adaptive algorithm in the runtime system that automatically decides if task approximation is beneficial or not. When evaluated on a real 8-core processor with applications from different domains (financial analysis, stencil-computation, machine-learning and linear-algebra), ATM achieves a 1.4x average speedup when only applying memoization techniques. When adding task approximation, ATM achieves a 2.5x average speedup with an average 0.7% accuracy loss (maximum of 3.2%).","PeriodicalId":209524,"journal":{"name":"2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2017-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131355439","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 13
Partitioning Low-Diameter Networks to Eliminate Inter-Job Interference 划分低直径网络以消除作业间干扰
Nikhil Jain, A. Bhatele, Xiang Ni, T. Gamblin, L. Kalé
{"title":"Partitioning Low-Diameter Networks to Eliminate Inter-Job Interference","authors":"Nikhil Jain, A. Bhatele, Xiang Ni, T. Gamblin, L. Kalé","doi":"10.1109/IPDPS.2017.91","DOIUrl":"https://doi.org/10.1109/IPDPS.2017.91","url":null,"abstract":"On most supercomputers, except some torus network based systems, resource managers allocate nodes to jobs without considering the sharing of network resources by different jobs. Such network-oblivious resource allocations result in link sharing among multiple jobs that can cause significant performance variability and performance degradation for individual jobs. In this paper, we explore low-diameter networks and corresponding node allocation policies that can eliminate inter-job interference. We propose a variation to n-dimensional mesh networks called express mesh. An express mesh is denser than the corresponding mesh network, has a low diameter independent of the number of routers, and is easily partitionable. We compare structural properties and performance of express mesh with other popular low-diameter networks. We present practical node allocation policies for express mesh and fat-tree networks that not only eliminate inter-job interference and performance variability, but also improve overall performance.","PeriodicalId":209524,"journal":{"name":"2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2017-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131868253","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 16
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
相关产品
×
本文献相关产品
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术官方微信