2018 30th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD): Latest Papers, Page 6

ADeLe: Rapid Architectural Simulation for Approximate Hardware

2018 30th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD) Pub Date : 2018-09-01 DOI: 10.1109/CAHPC.2018.8645875

Isaías B. Felzmann, M. M. Susin, Liana Duenha, R. Azevedo, L. Wanner

Abstract: Recent research has introduced approximate hardware units that produce incorrect outputs deterministically or probabilistically for some small subset of inputs but allow significantly higher throughput or lower power than their error-free counterparts. The integration, validation, and evaluation of these approximate units in architectures and processors, however, remain challenging. In this paper, we introduce ADeLe, a high-level language for the description, configuration, and integration of approximate hardware units into processors. ADeLe reduces the design effort for approximate hardware by modeling approximations at a high level of abstraction and automatically injecting them into a processor model for architectural simulation. Approximations in ADeLe may modify or completely replace the functional behavior of instructions according to user-defined policies. Instructions may be approximated deterministically or probabilistically (e.g., based on operating voltage and frequency). To allow for controlled testing, approximations may be enabled and disabled from software. Energy is automatically accounted for based on customizable models that consider the potential power savings of the approximations that are enabled in the system. ADeLe provides designers with a generic and flexible verification framework, allowing them to easily evaluate the energy-quality trade-offs of their designs in applications. We demonstrate the language and corresponding framework by introducing different approximation techniques into a processor model, on top of which we run selected applications. We demonstrate ADeLe using 6 approximate designs with 4 image processing and 2 floating-point applications. Our experiments show how ADeLe may be used to generate approximate CPUs and to evaluate energy-quality trade-offs for different applications with reduced effort.

Citations: 3

Energy-Efficient IaaS-PaaS Co-Design for Flexible Cloud Deployment of Scientific Applications

2018 30th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD) Pub Date : 2018-09-01 DOI: 10.1109/cahpc.2018.8645888

David Guyon, Anne-Cécile Orgerie, C. Morin

Abstract: Reducing the massive amount of energy consumed by cloud datacenters has become a major concern. In the usual approach, where resources are consolidated onto fewer servers in order to power down the others, there remain periods of time when servers are not fully utilized. Consequently, unused resources exist that are not exploited, although they could be used to execute applications compatible with the variable availability of these resources. In this work, we propose a cloud system where the Platform-as-a-Service (PaaS) and Infrastructure-as-a-Service (IaaS) layers interact to find execution trade-offs that exploit the unused resources at the IaaS level. PaaS users are involved in the energy optimization by proposing to delay their executions and adapt resource sizes in order to fit the available unused resources. Our evaluation by simulation is based on real data and represents a realistic large-scale cloud scenario. Results show that, depending on the proportion of energy-aware users, this system is able to reduce the number of servers by using resources that would otherwise have been wasted. Therefore, our solution allows datacenters to consume less energy than with usual resource managers, where all applications start their execution at submission time with their initial resource size.

Citations: 1

Highly Scalable Stencil-Based Matrix-Free Stochastic Estimator for the Diagonal of the Inverse

2018 30th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD) Pub Date : 2018-09-01 DOI: 10.1109/CAHPC.2018.8645868

F. Verbosio, Jurai Kardos, Mauro Bianco, O. Schenk

Abstract: Selected inversion problems must be addressed in several research fields, such as physics, genetics, weather forecasting, and finance, in order to extract selected entries from the inverse of large, sparse matrices. State-of-the-art algorithms are based either on the LU factorization or on an iterative process. Both approaches present computational bottlenecks related to prohibitive memory requirements or extremely high running time for large-scale matrices. In recent years, in order to overcome such limitations, an alternative approach for computing stochastic estimates of the inverse entries has been developed. In this work, we present a stochastic estimator for the diagonal of the inverse and test its performance on a dataset of symmetric, positive semidefinite matrices coming from the field of atomistic quantum transport simulations with the nonequilibrium Green's functions (NEGF) formalism. In such a framework, it is required to solve the Schrödinger equation thousands of times, demanding the computation of the diagonal of the retarded Green's function, i.e., the inverse of a large, sparse matrix including open boundary conditions. Given the nature and the structure of the NEGF matrices, our stochastic estimation framework exploits the capabilities of a stencil-based, matrix-free code, avoiding the fill-in and lack of scalability that the LU-based methods present for three-dimensional nanoelectronic devices. We also illustrate the impact of the stochastic estimator by comparing its accuracy against existing methods and demonstrate its scalability on the “Piz Daint” cluster at the Swiss National Supercomputing Center, preparing for post-petascale three-dimensional nanoscale calculations.
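To make the idea of a matrix-free stochastic estimate concrete, here is a minimal sketch. It is not the authors' code: it assumes a Hutchinson-style estimator with random ±1 probe vectors, a hand-written conjugate gradient solver, and a toy 1D stencil `[-1, 3, -1]` standing in for the NEGF matrices. All function names and parameters are hypothetical.

```python
import numpy as np

def apply_stencil(x, shift=1.0):
    """Matrix-free product with the toy 1D stencil [-1, 2 + shift, -1]."""
    y = (2.0 + shift) * x
    y[1:] -= x[:-1]
    y[:-1] -= x[1:]
    return y

def conjugate_gradient(apply_a, b, tol=1e-10, max_iter=1000):
    """Solve A x = b for a symmetric positive definite A given only x -> A x."""
    x = np.zeros_like(b)
    r = b.copy()
    p = r.copy()
    rs_old = r @ r
    for _ in range(max_iter):
        if np.sqrt(rs_old) <= tol:
            break
        ap = apply_a(p)
        alpha = rs_old / (p @ ap)
        x += alpha * p
        r -= alpha * ap
        rs_new = r @ r
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return x

def estimate_inverse_diagonal(apply_a, n, num_samples=500, seed=0):
    """Estimate diag(A^-1) as sum(v * A^-1 v) / sum(v * v) over random probes v."""
    rng = np.random.default_rng(seed)
    numerator = np.zeros(n)
    denominator = np.zeros(n)
    for _ in range(num_samples):
        v = rng.choice([-1.0, 1.0], size=n)  # random +/-1 probe vector
        numerator += v * conjugate_gradient(apply_a, v)
        denominator += v * v
    return numerator / denominator
```

The matrix is never formed or factorized, so no fill-in occurs. The cost is one linear solve per probe vector, and the solves are independent of each other, so they can in principle be distributed across processes.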

Citations: 0

Message from the General Chairs

2018 30th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD) Pub Date : 2018-09-01 DOI: 10.1109/cahpc.2018.8645950


Citations: 0

Balancing Load of GPU Subsystems to Accelerate Image Reconstruction in Parallel Beam Tomography

2018 30th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD) Pub Date : 2018-09-01 DOI: 10.1109/CAHPC.2018.8645862

S. Chilingaryan, E. Ametova, A. Kopmann, A. Mirone

Abstract: Synchrotron X-ray imaging is a powerful method to investigate internal structures down to the micro- and nanoscopic scale. Fast cameras recording thousands of frames per second allow time-resolved studies with a high temporal resolution. Fast image reconstruction is essential to provide the synchrotron instrumentation with the imaging information required to track and control the process under study. Traditionally, the Filtered Back Projection algorithm is used for tomographic reconstruction. In this article, we discuss how to implement the algorithm efficiently on current GPGPU architectures. The key is to achieve balanced utilization of the available GPU subsystems. We present two highly optimized algorithms to perform back projection on parallel hardware. One relies on the texture engine to perform reconstruction, while the other utilizes the core computational units of the GPU. Both methods significantly outperform current state-of-the-art techniques found in the standard reconstruction codes. Finally, we propose a hybrid approach combining both algorithms to better balance the load between GPU subsystems. It further boosts the performance by about 30% on the NVIDIA Pascal micro-architecture.
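For readers unfamiliar with Filtered Back Projection, the following is a plain NumPy reference sketch of the algorithm for parallel-beam geometry. It is a CPU illustration only and does not reproduce the GPU optimizations of the paper; the function names, the simple ramp filter, and the linear interpolation are assumptions made here.

```python
import numpy as np

def ramp_filter(sinogram):
    """Apply the ramp filter |f| to every projection (row) in Fourier space."""
    n_bins = sinogram.shape[1]
    freqs = np.abs(np.fft.fftfreq(n_bins))
    return np.real(np.fft.ifft(np.fft.fft(sinogram, axis=1) * freqs, axis=1))

def filtered_back_projection(sinogram, angles):
    """Reconstruct an n x n image from an (n_angles, n) parallel-beam sinogram."""
    n_bins = sinogram.shape[1]
    center = (n_bins - 1) / 2.0
    filtered = ramp_filter(sinogram)
    coords = np.arange(n_bins) - center
    x, y = np.meshgrid(coords, coords)
    bins = np.arange(n_bins)
    image = np.zeros((n_bins, n_bins))
    for projection, theta in zip(filtered, angles):
        # Detector position hit by the ray passing through each pixel.
        t = x * np.cos(theta) + y * np.sin(theta) + center
        # Linear interpolation between detector bins, zero outside the detector.
        values = np.interp(t.ravel(), bins, projection, left=0.0, right=0.0)
        image += values.reshape(t.shape)
    return image * np.pi / len(angles)
```

The inner interpolation step is the part that a texture engine can perform in hardware, which is presumably why the abstract contrasts a texture-based variant with one running on the general compute units.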

Citations: 2

A New Efficient Parallel Algorithm for Minimum Spanning Tree

2018 30th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD) Pub Date : 2018-09-01 DOI: 10.1109/CAHPC.2018.8645917

J. Vasconcellos, E. Cáceres, H. Mongelli, S. W. Song

Citations: 1

Phase-Based Data Placement Scheme for Heterogeneous Memory Systems

2018 30th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD) Pub Date : 2018-09-01 DOI: 10.1109/CAHPC.2018.8645903

M. Laghari, Najeeb Ahmad, D. Unat

Abstract: Heterogeneous memory systems are equipped with two or more types of memories, which work in tandem to complement each other's capabilities. The multiple memories can vary in latency, bandwidth, and capacity characteristics across systems, and they come in various configurations that can be managed by the programmer. This introduces added programming complexity. In this paper, we present a dynamic phase-based data placement scheme to assist the programmer in making decisions about program object allocations. We devise a cost model to assess the benefit of having an object in one type of memory over the other and apply the cost model at every application phase to capture the dynamic behaviour of an application. Our cost model takes into account the reference counts of objects and the incurred transfer overhead when making a suggestion. In addition, objects can be transferred across memories asynchronously between phases to mask some of the transfer overhead. We test our cost model with a diverse set of applications from the NAS Parallel and Rodinia benchmarks and perform experiments on Intel KNL, which is equipped with a high-bandwidth memory (MCDRAM) and a high-capacity memory (DDR). Our dynamic phase-based data placement performs better than initial placement and achieves comparable or better performance than the cache mode of MCDRAM.
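As a rough illustration of a per-phase cost model built on reference counts and transfer overhead, consider the toy sketch below. Everything in it is hypothetical: the `MemObject` type, the linear benefit formula, the constants `gain_per_ref` and `transfer_cost_per_byte`, and the greedy selection are assumptions made for illustration and are not taken from the paper.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class MemObject:
    name: str
    size: int  # bytes

def place_for_phase(objects, ref_counts, in_fast, capacity,
                    gain_per_ref=1.0, transfer_cost_per_byte=0.001):
    """Greedily choose which objects to hold in fast memory for one phase."""
    def benefit(obj):
        # Hypothetical model: each reference served from fast memory saves
        # gain_per_ref; moving an object in costs time proportional to its size.
        gain = ref_counts.get(obj.name, 0) * gain_per_ref
        move = 0.0 if obj.name in in_fast else obj.size * transfer_cost_per_byte
        return gain - move

    # Consider only objects with a positive net benefit, densest first.
    candidates = sorted((o for o in objects if benefit(o) > 0),
                        key=lambda o: benefit(o) / o.size, reverse=True)
    chosen, used = set(), 0
    for obj in candidates:
        if used + obj.size <= capacity:
            chosen.add(obj.name)
            used += obj.size
    return chosen
```

Calling `place_for_phase` once per application phase, and passing the previous result as `in_fast`, mimics the idea of re-evaluating placement as access patterns change.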

Citations: 6

Partitioning Convolutional Neural Networks for Inference on Constrained Internet-of-Things Devices

2018 30th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD) Pub Date : 2018-09-01 DOI: 10.1109/CAHPC.2018.8645927

F. M. C. D. Oliveira, E. Borin

Abstract: With the prospect of a world in which the IoT will be pervasive in the near future, the great amount of data produced by its devices will have to be processed and interpreted in an efficient and intelligent way. One approach is fog computing, in which the network infrastructure and the devices themselves can process data. Deep learning techniques have been successfully applied to the interpretation of the kind of data generated by the IoT; however, even the inference execution of convolutional neural networks may be computationally costly when resource-limited devices are considered. In order to enable the execution of neural network models on resource-constrained IoT systems, the code may be partitioned and distributed among multiple devices. Different partitioning approaches are possible; nonetheless, some of them increase the amount of communication that needs to be performed between the IoT devices. In this work, we propose KLP, a Kernighan-and-Lin-based partitioning algorithm that partitions neural network models for efficient distributed execution on multiple IoT devices. Our results show that KLP is capable of producing partitions that require up to 4.5 times less communication than partitioning approaches used by TensorFlow and other frameworks.
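The sketch below shows one pass of the classic two-way Kernighan-Lin heuristic, on which KLP is based according to the abstract. It is not KLP itself: it handles only two equally sized partitions, ignores device constraints, and treats edge weights as the communication volume between layers. The function names and the graph encoding are assumptions.

```python
from itertools import product

def cut_weight(weights, side):
    """Total weight of edges whose endpoints lie on different sides."""
    return sum(w for (u, v), w in weights.items() if side[u] != side[v])

def kl_pass(nodes, weights, side):
    """One Kernighan-Lin pass; returns an improved copy of `side`."""
    def w(u, v):
        return weights.get((u, v), 0) + weights.get((v, u), 0)

    work = dict(side)       # tentative assignment while exploring swaps
    unlocked = set(nodes)
    swaps, gains = [], []
    while True:
        left = sorted(n for n in unlocked if work[n] == 0)
        right = sorted(n for n in unlocked if work[n] == 1)
        if not left or not right:
            break
        # d[n] = external cost minus internal cost of node n.
        d = {n: sum(w(n, m) * (1 if work[m] != work[n] else -1)
                    for m in nodes if m != n)
             for n in unlocked}
        gain, a, b = max(((d[a] + d[b] - 2 * w(a, b), a, b)
                          for a, b in product(left, right)),
                         key=lambda t: t[0])
        work[a], work[b] = 1, 0   # tentatively swap and lock the pair
        unlocked -= {a, b}
        swaps.append((a, b))
        gains.append(gain)

    # Keep only the prefix of swaps with the largest cumulative gain.
    best_k, best_total, total = 0, 0, 0
    for k, g in enumerate(gains, start=1):
        total += g
        if total > best_total:
            best_k, best_total = k, total
    result = dict(side)
    for a, b in swaps[:best_k]:
        result[a], result[b] = 1, 0
    return result
```

In practice the pass is repeated until no prefix of swaps yields a positive gain. Because swaps are always pairwise, the sizes of the two partitions never change.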

Citations: 12

Runtime Management of Data Quality for Scientific Observatories Using Edge and In-Transit Resources

2018 30th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD) Pub Date : 2018-09-01 DOI: 10.1109/CAHPC.2018.8645940

A. Zamani, Daniel Balouek-Thomert, J. J. Villalobos, I. Rodero, M. Parashar

Abstract: Modern Cyberinfrastructures (CIs) operate to bring content produced by remote data sources, such as sensors and scientific instruments, and deliver it to end users and workflow applications. Maintaining data quality/resolution and on-time data delivery while considering an increasing number of computing, storage, and network resources requires a reactive system that is able to adapt to changing demands. In this paper, we propose a model of such a system by expressing the dynamic stage of resources in the context of edge and in-transit computing. By considering resource utilization, approximation techniques, and users' constraints, our proposed engine generates mappings of workflow stages onto heterogeneous geo-distributed resources. We specifically propose a runtime management layer that adapts the data resolution delivered to the users by implementing feedback loops over the resources involved in the delivery and processing of the data streams. We implement our model in a subscription-based data streaming framework, which enables integration of large facilities and advanced CIs. Experimental results show that dynamically adapting data resolution can overcome bandwidth limitations in wide-area streaming analytics.

Citations: 1

Hybrid MPI+openMP Implementation of eXtended Discrete Element Method

2018 30th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD) Pub Date : 2018-09-01 DOI: 10.1109/CAHPC.2018.8645880

Abdoul Wahid Mainassara Checkaraou, A. Rousset, Xavier Besseron, S. Varrette, B. Peters

Abstract: The Extended Discrete Element Method (XDEM) is a novel numerical simulation technique that extends the classical Discrete Element Method (DEM), which simulates the motion of granular material, with additional properties such as the chemical composition, thermodynamic state, and stress/strain of each particle. It has been applied successfully in numerous industries involving the processing of granular materials such as sand, rock, wood, or coke [16], [17]. In this context, computational simulation with (X)DEM has become an increasingly essential tool for researchers and scientific engineers to set up and explore their experimental processes. However, increasing the size or the accuracy of a model requires the use of High Performance Computing (HPC) platforms with a parallelized implementation to accommodate the growing needs in terms of memory and computation time. In practice, such a parallelization is traditionally obtained using either MPI (distributed memory computing), openMP (shared memory computing), or hybrid approaches combining both. In this paper, we present the results of our effort to implement an openMP version of XDEM allowing hybrid MPI+openMP simulations (XDEM being already parallelized with MPI). Far from the basic openMP paradigm and recommendations (which essentially amount to decorating the main computation loops with a set of openMP pragmas), the openMP parallelization of XDEM required fundamental code refactoring and careful tuning in order to reach good performance. There are two main reasons for those difficulties. Firstly, XDEM is a legacy code developed over more than 10 years, initially focused on accuracy rather than performance. Secondly, the particles in a DEM simulation are highly dynamic: they can be added or deleted, and interaction relations can change at any timestep of the simulation. This article therefore details the multiple layers of optimization applied, such as deep data structure profiling and reorganization, and the use of fast multithreaded memory allocators and of advanced process/thread-to-core pinning techniques. Experimental results evaluate the benefit of each optimization individually and validate the implementation using a real-world application executed on the HPC platform of the University of Luxembourg. Finally, we present our hybrid MPI+openMP results, with a 15%-20% performance gain, and show how the hybrid version overcomes the scalability limits of XDEM-based pure MPI simulations (by increasing the number of compute cores without a drop in performance).

Citations: 10