2020 IEEE 32nd International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD): Latest Papers

Towards Pervasive Containerization of HPC Job Schedulers
C. Cérin, Nicolas Grenèche, Tarek Menouer
{"title":"Towards Pervasive Containerization of HPC Job Schedulers","authors":"C. Cérin, Nicolas Grenèche, Tarek Menouer","doi":"10.1109/SBAC-PAD49847.2020.00046","DOIUrl":"https://doi.org/10.1109/SBAC-PAD49847.2020.00046","url":null,"abstract":"In cloud computing, elasticity is defined as \"the degree to which a system is able to adapt to workload changes by provisioning and de-provisioning resources in an autonomic manner, such that at each point in time the available resources match the current demand as closely as possible\". Adding elasticity to HPC (High Performance Computing) clusters management systems remains challenging even if we deploy such HPC systems in today's cloud environments. This difficulty is caused by the fact that HPC jobs scheduler needs to rely on a fixed set of resources. Every change of topology (adding or removing computing resources) leads to a global restart of the HPC jobs scheduler. This phenomenon is not a major drawback because it provides a very effective way of sharing a fixed set of resources but we think that it could be complemented by a more elastic approach. Moreover, the elasticity issue should not be reduced to the scaling of resources issues. Clouds also enable access to various technologies that enhance the services offer to users. In this paper, our approach is to use containers technology to instantiate a tailored HPC environment based on the user's reservation constraints. We claim that the introduction and use of containers in HPC job schedulers allow better management of resources, in a more economical way. From the use case of SLURM, we release a methodology for 'containerization' of HPC jobs schedulers which is pervasive i.e. spreading widely throughout any layers of job schedulers. 
We also provide initial experiments demonstrating that our containerized SLURM system is operational and promising.","PeriodicalId":202581,"journal":{"name":"2020 IEEE 32nd International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116428604","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 7
A Robotic Communication Middleware Combining High Performance and High Reliability
Wei Liu, Hao Wu, Ziyue Jiang, Yifan Gong, Jiangming Jin
{"title":"A Robotic Communication Middleware Combining High Performance and High Reliability","authors":"Wei Liu, Hao Wu, Ziyue Jiang, Yifan Gong, Jiangming Jin","doi":"10.1109/SBAC-PAD49847.2020.00038","DOIUrl":"https://doi.org/10.1109/SBAC-PAD49847.2020.00038","url":null,"abstract":"With the significant advances of AI technology, intelligent robotic systems have achieved remarkable development and profound effects. To enable massive data transmission in an efficient and reliable way, both high performance and high reliability should be taken into account in system design. However, the conventional communication middleware used in the majority of autonomous robotic systems is based on socket-based methods, which always lead to high latency. Moreover, some sophisticated communication middleware utilizes shared memory upon ring buffers for high performance without consideration of the reliability. To obtain both high performance and high reliability, we employ shared memory for performance improvement and propose a novel socket-based communication control algorithm to improve reliability during data transmission. Furthermore, based on the proposed algorithm, we implement a novel robotic communication middleware, named Robust-Z, combining both high performance and high reliability. 
Experimental results show that (1) Robust-Z is able to gain up to 41% and 5% performance improvement compared to ROS2 and Apollo CyberRT, respectively; (2) Robust-Z is able to provide crash safety and reduce 5.2% data missing rate compared with CyberRT.","PeriodicalId":202581,"journal":{"name":"2020 IEEE 32nd International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD)","volume":"59 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128484342","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 7
OmpTracing: Easy Profiling of OpenMP Programs
Vitoria Pinho, H. Yviquel, M. Pereira, G. Araújo
{"title":"OmpTracing: Easy Profiling of OpenMP Programs","authors":"Vitoria Pinho, H. Yviquel, M. Pereira, G. Araújo","doi":"10.1109/SBAC-PAD49847.2020.00042","DOIUrl":"https://doi.org/10.1109/SBAC-PAD49847.2020.00042","url":null,"abstract":"One of the greatest challenges of modern computing is the development of software for parallel execution. To address such challenge, programmers use profiling tools to record relevant operations, like the communications that the different parts of an application carried out during its execution. Profilers can be used to analyze the execution of the application as they enable the programmer to check its performance hot spots and sources of overhead. This paper introduces the OmpTracing library, a lightweight tool that eases the task of profiling OpenMP based applications without the need to inject expensive profiling code into the program. OmpTracing leverages on OMPT, an application programming interface that provides an introspection mechanism of the OpenMP runtime, and that enables the programmer to capture execution details of the parallelized application while generating notifications about significant program events.","PeriodicalId":202581,"journal":{"name":"2020 IEEE 32nd International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129245123","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Design Space Exploration of Accelerators and End-to-End DNN Evaluation with TFLITE-SOC
Nicolas Bohm Agostini, Shi Dong, Elmira Karimi, Marti Torrents Lapuerta, José Cano, José L. Abellán, D. Kaeli
{"title":"Design Space Exploration of Accelerators and End-to-End DNN Evaluation with TFLITE-SOC","authors":"Nicolas Bohm Agostini, Shi Dong, Elmira Karimi, Marti Torrents Lapuerta, José Cano, José L. Abellán, D. Kaeli","doi":"10.1109/SBAC-PAD49847.2020.00013","DOIUrl":"https://doi.org/10.1109/SBAC-PAD49847.2020.00013","url":null,"abstract":"Recently there has been a rapidly growing demand for faster machine learning (ML) processing in data centers and migration of ML inference applications to edge devices. These developments have prompted both industry and academia to explore custom accelerators to optimize ML executions for performance and power. However, identifying which accelerator is best equipped for performing a particular ML task is challenging, especially given the growing range of ML tasks, the number of target environments, and the limited number of integrated modeling tools. To tackle this issue, it is of paramount importance to provide the computer architecture research community with a common framework capable of performing a comprehensive, uniform, and fair comparison across different accelerator designs targeting a particular ML task. To this aim, we propose a new framework named TFLITE-SOC (System On Chip) that integrates a lightweight system modeling library (SystemC) for fast design space exploration of custom ML accelerators into the build/execution environment of Tensorflow Lite (TFLite), a highly popular ML framework for ML inference. Using this approach, we are able to model and evaluate new accelerators developed in SystemC by leveraging the language's hierarchical design capabilities, resulting in faster design prototyping. Furthermore, any accelerator designed using TFLITE-SOC can be benchmarked for inference with any DNN model compatible with TFLite, which enables end-to-end DNN processing and detailed (i.e., per DNN layer) performance analysis. 
In addition to providing rapid prototyping, integrated benchmarking, and a range of platform configurations, TFLITE-SOC offers comprehensive performance analysis of accelerator occupancy and execution time breakdown as well as a rich set of modules that can be used by new accelerators to implement scaling up studies and optimized memory transfer protocols. We present our framework and demonstrate its utility by considering the design space of a TPU-like systolic array and describing possible directions for optimization. Using a compression technique, we implement an optimization targeting reducing the memory traffic between DRAM and on-device buffers. Compared to the baseline accelerator, our optimized design shows up to 1.26x speedup on accelerated operations and up to 1.19x speedup on end-to-end DNN execution.","PeriodicalId":202581,"journal":{"name":"2020 IEEE 32nd International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123368446","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 9
Reliable and Energy-aware Mapping of Streaming Series-parallel Applications onto Hierarchical Platforms
Changjiang Gou, A. Benoit, Mingsong Chen, L. Marchal, Tongquan Wei
{"title":"Reliable and Energy-aware Mapping of Streaming Series-parallel Applications onto Hierarchical Platforms","authors":"Changjiang Gou, A. Benoit, Mingsong Chen, L. Marchal, Tongquan Wei","doi":"10.1109/SBAC-PAD49847.2020.00026","DOIUrl":"https://doi.org/10.1109/SBAC-PAD49847.2020.00026","url":null,"abstract":"Streaming applications come from various application fields such as physics, and many can be represented as a series-parallel dependence graph. We aim at minimizing the energy consumption of such applications when executed on a hierarchical platform, by proposing novel mapping strategies. Dynamic voltage and frequency scaling (DVFS) is used to reduce the energy consumption, and we ensure a reliable execution by either executing a task at maximum speed, or by triplicating it. In this paper, we propose a structure rule to partition the series-parallel applications, and we prove that the optimization problem is NP-complete. We are able to derive a dynamic programming algorithm for the special case of linear chains, which provides an interesting heuristic and a building block for designing heuristics for the general case. The heuristics performance is compared to a baseline solution, where each task is executed at maximum speed. Simulations demonstrate that significant energy savings can be obtained.","PeriodicalId":202581,"journal":{"name":"2020 IEEE 32nd International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD)","volume":"171 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114626939","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
An Optimal Model for Optimizing the Placement and Parallelism of Data Stream Processing Applications on Cloud-Edge Computing
Felipe Rodrigo de Souza, M. Assunção, E. Caron, A. Veith
{"title":"An Optimal Model for Optimizing the Placement and Parallelism of Data Stream Processing Applications on Cloud-Edge Computing","authors":"Felipe Rodrigo de Souza, M. Assunção, E. Caron, A. Veith","doi":"10.1109/SBAC-PAD49847.2020.00019","DOIUrl":"https://doi.org/10.1109/SBAC-PAD49847.2020.00019","url":null,"abstract":"The Internet of Things has enabled many application scenarios where a large number of connected devices generate unbounded streams of data, often processed by data stream processing frameworks deployed in the cloud. Edge computing enables offloading processing from the cloud and placing it close to where the data is generated, thereby reducing the time to process data events and deployment costs. However, edge resources are more computationally constrained than their cloud counterparts, raising two interrelated issues, namely deciding on the parallelism of processing tasks (a.k.a. operators) and their mapping onto available resources. In this work, we formulate the scenario of operator placement and parallelism as an optimal mixed-integer linear programming problem. The proposed model is termed as Cloud-Edge data Stream Placement (CESP). 
Experimental results using discrete-event simulation demonstrate that CESP can achieve an end-to-end latency at least ≃ 80% and monetary costs at least ≃ 30% better than traditional cloud deployment.","PeriodicalId":202581,"journal":{"name":"2020 IEEE 32nd International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD)","volume":"83 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125755805","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 10
Performance Analysis and Optimization of the Vector-Kronecker Product Multiplication
Alexandre Azevedo, C. Bentes, Maria Clicia Stelling de Castro, C. Tadonki
{"title":"Performance Analysis and Optimization of the Vector-Kronecker Product Multiplication","authors":"Alexandre Azevedo, C. Bentes, Maria Clicia Stelling de Castro, C. Tadonki","doi":"10.1109/SBAC-PAD49847.2020.00044","DOIUrl":"https://doi.org/10.1109/SBAC-PAD49847.2020.00044","url":null,"abstract":"The Kronecker product, also called tensor product, is a fundamental matrix algebra operation, used to model complex systems using structured descriptions. This operation needs to be computed efficiently, since it is a critical kernel for iterative algorithms. In this work, we focus on the vector-kronecker product operation, where we present an in-depth performance analysis of a sequential and a parallel algorithm previously proposed. Based on this analysis, we proposed three optimizations: changing the memory access pattern, reducing load imbalance and manually vectorizing some portions of the code with Intel SSE4.2 intrinsics. The obtained results show better cache usage and load balance, thus improving the performance, especially for larger matrices.","PeriodicalId":202581,"journal":{"name":"2020 IEEE 32nd International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD)","volume":"95 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131907869","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 2
Scalable and Efficient Spatial-Aware Parallelization Strategies for Multimedia Retrieval
Guilherme Andrade, George Teodoro, R. Ferreira
{"title":"Scalable and Efficient Spatial-Aware Parallelization Strategies for Multimedia Retrieval","authors":"Guilherme Andrade, George Teodoro, R. Ferreira","doi":"10.1109/SBAC-PAD49847.2020.00027","DOIUrl":"https://doi.org/10.1109/SBAC-PAD49847.2020.00027","url":null,"abstract":"Similarity search is a key operation in several multimedia applications, including online Content-Based Multimedia Retrieval (CBMR) services. These applications have to deal with very large databases and are submitted to high query rates. In this context, scalability in distributed memory system is critical to assemble the required computing power and memory space. However, we have identified that the Data Equal Split (DES) parallelization and associated data partition strategy employed by the related works on the domain have limitations in terms of efficiency and scalability. Therefore, in this paper, we developed and implemented a framework for similarity search execution on distributed memory machines and proposed a novel class of data partition strategies that takes into account the data spatial organization in its distribution. This approach leads to a reduction in communication traffic and in costs associated with processing each task in local searches carried out in the distributed machine. Our approach attained a speedup of 2.4× on top of DES in the baseline case (5 nodes) and also achieves higher scalability efficiency and is 14.5× faster when 160 nodes are used. 
In fact, our novel data organization led to superlinear scalability in all configurations evaluated.","PeriodicalId":202581,"journal":{"name":"2020 IEEE 32nd International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD)","volume":"115 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124209536","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 2
Energy-Efficient Time Series Analysis Using Transprecision Computing
Ivan Fernandez, Ricardo Quislant, E. Gutiérrez, O. Plata
{"title":"Energy-Efficient Time Series Analysis Using Transprecision Computing","authors":"Ivan Fernandez, Ricardo Quislant, E. Gutiérrez, O. Plata","doi":"10.1109/SBAC-PAD49847.2020.00022","DOIUrl":"https://doi.org/10.1109/SBAC-PAD49847.2020.00022","url":null,"abstract":"Time series analysis is a key step in monitoring and predicting events over time in domains such as epidemiology, genomics, medicine, seismology, speech recognition, and economics. Matrix Profile has been recently proposed as a promising technique to perform time series analysis. For each subsequence, the matrix profile provides the most similar neighbour in the time series. This computation requires a huge amount of floating-point (FP) operations, which are a major contributor (approximately 50%) to the energy consumption in modern computing platforms. Transprecision Computing has recently emerged as a promising approach to improve energy efficiency and performance by tolerating some loss of precision in FP operations. In this work, we study how the matrix profile parallel algorithms benefit from transprecision computing using a recently proposed transprecision FPU. This FPU is intended to be integrated on embedded devices as part of RISC-V processors, FPGAs or ASICs to perform energy-efficient time series analysis. To this end, we propose an accuracy metric to compare the results with the double precision matrix profile. We use this metric to explore a wide range of exponent and mantissa combinations for a variety of datasets, as well as a mixed precision and a vectorized approach. 
Our analysis reveals that the energy consumption is reduced up to 3.3x compared with double precision approaches, while only slightly affecting the accuracy.","PeriodicalId":202581,"journal":{"name":"2020 IEEE 32nd International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD)","volume":"4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114743849","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 1
MASA-StarPU: Parallel Sequence Comparison with Multiple Scheduling Policies and Pruning
Rafael A. Lopes, Samuel Thibault, A. Melo
{"title":"MASA-StarPU: Parallel Sequence Comparison with Multiple Scheduling Policies and Pruning","authors":"Rafael A. Lopes, Samuel Thibault, A. Melo","doi":"10.1109/SBAC-PAD49847.2020.00039","DOIUrl":"https://doi.org/10.1109/SBAC-PAD49847.2020.00039","url":null,"abstract":"Sequence comparison tools based on the Smith-Waterman (SW) algorithm provide the optimal result but have high execution times when the sequences compared are long, since a huge dynamic programming (DP) matrix is computed. Block pruning is an optimization that does not compute some parts of the DP matrix and can reduce considerably the execution time when the sequences compared are similar. However, block pruning's resulting task graph is dynamic and irregular. Since different pruning scenarios lead to different pruning shapes, we advocate that no single scheduling policy will behave the best for all scenarios. This paper proposes MASA-StarPU, a sequence aligner that integrates the domain specific framework MASA to the generic programming environment StarPU, creating a tool which has the benefits of StarPU (i.e., multiple task scheduling policies) and MASA (i.e., fast sequence alignment). MASA-StarPU was executed in two different multicore platforms and the results show that a bad choice of the scheduling policy may have a great impact on the performance. For instance, using 24 cores, the 5M x 5M comparison took 1484s with the dmdas policy whereas the same comparison took 3601s with lws. 
We also show that no scheduling policy behaves the best for all scenarios.","PeriodicalId":202581,"journal":{"name":"2020 IEEE 32nd International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD)","volume":"31 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124482488","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 2