Proceedings of the 17th ACM International Conference on Computing Frontiers: latest papers (page 3)

Contention-aware application performance prediction for disaggregated memory systems

Proceedings of the 17th ACM International Conference on Computing Frontiers Pub Date : 2020-05-11 DOI: 10.1145/3387902.3392625

F. V. Zacarias, Rajiv Nishtala, P. Carpenter

Abstract: Disaggregated memory has recently been proposed as a way to allow flexible and fine-grained allocation of memory capacity to compute jobs. This paper makes an important step towards effective resource allocation on disaggregated memory systems. Specifically, we propose a generic approach to predict the performance degradation due to sharing of disaggregated memory. In contrast to prior work, cache capacity is not shared among multiple applications, which removes a major contributor to application performance. For this reason, our analysis is driven by the demand for memory bandwidth, which has been shown to have an important effect on application performance. We show that profiling the application slowdown often involves significant experimental error and noise, and we therefore improve the accuracy by linear smoothing of the sensitivity curves. We also show that contention is sensitive to the ratio between read and write memory accesses, and we address this sensitivity by building a family of sensitivity curves according to the read/write ratios. Our results show that the methodology predicts the slowdown in application performance subject to memory contention with an average error of 1.19% and a maximum error of 14.6%. Compared with the state of the art, the relative improvements are almost 24% on average and 33% in the worst case.

Citations: 5
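
The abstract of the entry above mentions two ideas that a short sketch can make concrete: smoothing noisy slowdown measurements, and keeping one sensitivity curve per read/write ratio. The Python below is a minimal illustration under stated assumptions, not the authors' method. It reads "linear smoothing" as an ordinary least-squares line fit, selects the curve with the nearest read/write ratio, and uses made-up sample values. The names `fit_line`, `build_curves`, and `predict_slowdown` are hypothetical.

```python
from typing import Dict, List, Tuple

# (contending memory bandwidth in GB/s, measured slowdown factor)
Sample = Tuple[float, float]
Line = Tuple[float, float]  # (slope, intercept)


def fit_line(samples: List[Sample]) -> Line:
    """Least-squares fit of slowdown = slope * bandwidth + intercept."""
    n = len(samples)
    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in samples)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in samples)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def build_curves(profiles: Dict[float, List[Sample]]) -> Dict[float, Line]:
    """One smoothed curve per read ratio (assumed keying, for illustration)."""
    return {ratio: fit_line(samples) for ratio, samples in profiles.items()}


def predict_slowdown(curves: Dict[float, Line], read_ratio: float,
                     bandwidth: float) -> float:
    """Evaluate the curve whose read ratio is closest to the requested one."""
    nearest = min(curves, key=lambda r: abs(r - read_ratio))
    slope, intercept = curves[nearest]
    return slope * bandwidth + intercept


# Made-up noisy profiles for two read ratios (fraction of accesses that are reads).
profiles = {
    0.5: [(0.0, 1.00), (10.0, 1.12), (20.0, 1.18), (30.0, 1.31)],
    0.9: [(0.0, 1.00), (10.0, 1.04), (20.0, 1.11), (30.0, 1.14)],
}
curves = build_curves(profiles)
print(predict_slowdown(curves, read_ratio=0.8, bandwidth=15.0))
```

Choosing the nearest curve is the simplest possible policy; interpolating between neighbouring curves would be an equally plausible variation.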

Design of an open-source bridge between non-coherent burst-based and coherent cache-line-based memory systems

Proceedings of the 17th ACM International Conference on Computing Frontiers Pub Date : 2020-05-11 DOI: 10.1145/3387902.3392631

Matheus A. Cavalcante, Andreas Kurth, Fabian Schuiki, L. Benini

Abstract: In heterogeneous computer architectures, the serial part of an application is coupled with domain-specific accelerators that promise high computing throughput and efficiency across a wide range of applications. In such systems, the serial part of a program is executed on a Central Processing Unit (CPU) core optimized for single-thread performance, while parallel sections are offloaded to Programmable Manycore Accelerators (PMCAs). This heterogeneity requires CPU cores and PMCAs to share data in memory efficiently, although CPUs rely on a coherent memory system where data is transferred in cache lines, while PMCAs are based on non-coherent scratchpad memories where data is transferred in bursts by DMA engines. In this paper, we tackle the challenges and hardware complexity of bridging the gap from a non-coherent, burst-based memory hierarchy to a coherent, cache-line-based one. We design and implement an open-source hardware module that reaches 97% of peak throughput over a wide range of realistic linear algebra kernels and is suited for a wide spectrum of memory architectures. Implemented in a state-of-the-art 22 nm FD-SOI technology, our module bridges up to 650 Gbps at 130 fJ/bit and has a complexity of less than 1 kGE/Gbps.

Citations: 5
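
The headline figures in the abstract of the entry above combine into two derived quantities by simple arithmetic. The sketch below is a back-of-the-envelope calculation, assuming that the 130 fJ/bit energy figure applies at the full 650 Gbps and treating 1 kGE/Gbps as an upper bound. Neither derived value is stated in the source, and the variable names are illustrative.

```python
# Figures quoted in the abstract.
throughput_bps = 650e9       # 650 Gbps
energy_per_bit_j = 130e-15   # 130 fJ/bit
max_kge_per_gbps = 1.0       # "less than 1 kGE/Gbps", used as an upper bound

# Power = bits per second * joules per bit (assumes both figures hold together).
power_w = throughput_bps * energy_per_bit_j

# Area bound = throughput in Gbps * gate equivalents per Gbps.
max_area_kge = (throughput_bps / 1e9) * max_kge_per_gbps

print(f"power at peak throughput: {power_w * 1e3:.1f} mW")
print(f"area upper bound: {max_area_kge:.0f} kGE")
```

Under these assumptions the module would draw roughly 84.5 mW at peak throughput and occupy fewer than 650 kGE.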

StoneCutter

Proceedings of the 17th ACM International Conference on Computing Frontiers Pub Date : 2020-05-11 DOI: 10.1145/3387902.3394029

J. Leidel, D. Donofrio, Frank Conlon

Abstract: As the density and capability of reconfigurable computing using FPGAs continue to increase and access to large-scale ASIC integration continues to grow, research activities associated with high-level synthesis flows have expanded at a similar rate. The goal of these research efforts is to reduce the time and effort required to construct and deploy application-specific architectures. However, these synthesis techniques often force users to consider the entire circuit design space in order to develop a successful implementation. This lack of design specificity often results in hardware design implementations that are difficult to program, difficult to reuse in future designs, and make sub-optimal use of hardware resources. In this work we introduce the StoneCutter instruction set design language and tool infrastructure. StoneCutter provides a familiar, C-like language construct by which to develop the implementation for individual, programmable instructions. The LLVM-based StoneCutter compiler performs individual instruction and whole-ISA optimizations in order to generate a high-performance Chisel HDL representation of the target design. Utilizing the existing Chisel tools, users can also generate cycle-accurate C++ simulation models as well as Verilog representations of the target design. As a result, StoneCutter provides a very rapid design environment for development and experimentation.

Citations: 0

Time-sliced quantum circuit partitioning for modular architectures

Proceedings of the 17th ACM International Conference on Computing Frontiers Pub Date : 2020-05-11 DOI: 10.1145/3387902.3392617

Jonathan M. Baker, Casey Duckering, Alexander P. Hoover, F. Chong

Citations: 27

Quantum splines for non-linear approximations

Proceedings of the 17th ACM International Conference on Computing Frontiers Pub Date : 2020-05-11 DOI: 10.1145/3387902.3394032

A. Macaluso, L. Clissa, Stefano Lodi, Claudio Sartori

Citations: 3

Freeway

Proceedings of the 17th ACM International Conference on Computing Frontiers Pub Date : 2020-05-11 DOI: 10.1145/3387902.3394028

Yifan Shen, Ke Liu, Ziting Guo, Wenli Zhang, Guanghui Zhang, V. Aggarwal, Mingyu Chen

Citations: 1

Enabling mixed-precision quantized neural networks in extreme-edge devices

Proceedings of the 17th ACM International Conference on Computing Frontiers Pub Date : 2020-05-11 DOI: 10.1145/3387902.3394038

Nazareno Bruschi, Angelo Garofalo, Francesco Conti, Giuseppe Tagliavini, D. Rossi

Abstract: The deployment of Quantized Neural Networks (QNNs) on advanced microcontrollers requires optimized software to exploit the digital signal processing (DSP) extensions of modern instruction set architectures (ISAs). As such, recent research proposed optimized libraries for QNNs (from 8-bit to 2-bit) such as CMSIS-NN and PULP-NN. This work presents an extension to the PULP-NN library targeting the acceleration of mixed-precision Deep Neural Networks, an emerging paradigm able to significantly shrink the memory footprint of deep neural networks with negligible accuracy loss. The library, composed of 27 kernels, one for each combination of input feature map, weight, and output feature map precision (considering 8-bit, 4-bit and 2-bit), enables efficient inference of QNNs on parallel ultra-low-power (PULP) clusters of RISC-V based processors featuring the RV32IMCXpulpV2 ISA. The proposed solution, benchmarked on an 8-core GAP-8 PULP cluster, reaches a peak performance of 16 MACs/cycle on 8 cores, performing 21× to 25× faster than an STM32H7 (powered by an ARM Cortex-M7 processor) with 15× to 21× better energy efficiency.

Citations: 15
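
The count of 27 kernels in the abstract of the entry above follows from three independent precision choices (input feature maps, weights, output feature maps), each drawn from three bit widths. The sketch below enumerates those combinations. The `kernel_name` naming scheme is hypothetical and is not taken from the PULP-NN library.

```python
from itertools import product

PRECISIONS = (8, 4, 2)  # bit widths named in the abstract

# One tuple per (input feature map, weight, output feature map) precision choice.
kernels = list(product(PRECISIONS, repeat=3))


def kernel_name(ifmap_bits: int, weight_bits: int, ofmap_bits: int) -> str:
    """Hypothetical naming scheme, for illustration only."""
    return f"kernel_in{ifmap_bits}_w{weight_bits}_out{ofmap_bits}"


print(len(kernels))  # 3 * 3 * 3 combinations
for combo in kernels[:3]:
    print(kernel_name(*combo))
```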

HiLSM

Proceedings of the 17th ACM International Conference on Computing Frontiers Pub Date : 2020-05-11 DOI: 10.1145/3387902.3392621

Wenjie Li, Dejun Jiang, Jin Xiong, Yungang Bao

Abstract: To ensure data durability and crash consistency, LSM-tree based key-value stores suffer from high WAL synchronization overhead. Fortunately, the advent of NVM offers an opportunity to address this issue. However, NVM is currently too expensive to meet the demand of massive storage systems. Therefore, a hybrid NVM and SSD storage system provides a more cost-efficient solution. This paper proposes HiLSM, a key-value store for hybrid NVM-SSD storage systems. According to the characteristics of the hybrid storage media, HiLSM adopts hybrid data structures consisting of the log-structured memory and the LSM-tree. To address write stalls in write-intensive scenarios, a fine-grained data migration strategy is proposed so that data migration starts as early as possible. To address the performance gap between NVM and SSD, a multi-threaded data migration strategy is proposed so that data migration completes as soon as possible. To address the LSM-tree's inherent write amplification, a data filtering strategy is proposed so that data updates are absorbed in NVM as much as possible. We compare HiLSM with state-of-the-art key-value stores via extensive experiments, and the results show that HiLSM achieves 1.3x higher write throughput, 10x higher read throughput, and 79% less write traffic under the skewed workload.

Citations: 19
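
The abstract of the entry above says that updates are absorbed in NVM to reduce write traffic, without giving the mechanism. The toy model below only illustrates the general effect, assuming a buffer that keeps the latest value per key before anything is migrated. It is not HiLSM's data filtering strategy, and the function names and the update sequence are made up for illustration.

```python
from typing import Hashable, List, Tuple

Update = Tuple[Hashable, object]  # (key, value)


def records_migrated_without_absorption(updates: List[Update]) -> int:
    """Every update is written through to the slower tier."""
    return len(updates)


def records_migrated_with_absorption(updates: List[Update]) -> int:
    """Only the latest value per key leaves the buffer (toy assumption)."""
    latest = {}
    for key, value in updates:
        latest[key] = value  # a repeated key overwrites in place
    return len(latest)


# A small skewed sequence: key "a" is updated far more often than the others.
updates = [("a", 1), ("b", 1), ("a", 2), ("a", 3), ("c", 1), ("b", 2)]
print(records_migrated_without_absorption(updates))
print(records_migrated_with_absorption(updates))
```

The more skewed the key distribution, the larger the gap between the two counts, which is consistent with the abstract reporting its write-traffic reduction under a skewed workload.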

SoundFactory

Proceedings of the 17th ACM International Conference on Computing Frontiers Pub Date : 2020-05-11 DOI: 10.1145/3387902.3394036

A. Scionti, S. Ciccia, O. Terzo

Abstract: The proliferation of smart connected devices using digital assistants activated by voice commands (e.g., Apple Siri, Google Assistant, Amazon Alexa) is raising interest in algorithms to localize and recognize audio sources. Among other approaches, deep neural networks (DNNs) are seen as a promising way to accomplish this task. Unlike other approaches, DNNs can categorize received events, thus discriminating between events of interest and others, even in the presence of noise. Despite their advantages, DNNs require large datasets to be trained. Thus, tools for generating datasets are of great value, as they can accelerate the development of advanced learning models. This paper presents SoundFactory, a framework for simulating the propagation of sound waves (also considering noise, reverberation, reflection, attenuation, and other interfering waves) and the microphone array response to such sound waves. As such, SoundFactory makes it easy to generate datasets to train the deep neural networks that underlie modern applications. SoundFactory is flexible enough to simulate many different microphone array configurations, thus covering a large set of use cases. To demonstrate the capabilities offered by SoundFactory, we generated a dataset and trained two different (rather simple) learning models on it, achieving up to 97% accuracy. The quality of the generated dataset was also assessed by comparing the microphone array model responses with the real ones.

Citations: 1

Efficient architecture design for the AES-128 algorithm on embedded systems

Proceedings of the 17th ACM International Conference on Computing Frontiers Pub Date : 2020-05-11 DOI: 10.1145/3387902.3392624

Rupam Mondal, H. Ngo, James Shey, R. Rakvic, Owens Walker, Dane Brown

Abstract: Many applications make use of the edge devices in wireless sensor networks (WSNs), including video surveillance, traffic monitoring and enforcement, personal and health care, gaming, habitat monitoring, and industrial process control. However, these edge devices are resource-limited embedded systems that require a low-cost, low-power, and high-performance encryption/decryption solution to prevent attacks such as eavesdropping, message modification, and impersonation. This paper proposes a field-programmable gate array (FPGA) based design and implementation of the Advanced Encryption Standard (AES) algorithm for encryption and decryption using a parallel-pipeline architecture with a data forwarding mechanism that efficiently utilizes on-chip memory modules and massive parallel processing units to support a high throughput rate. Hardware designs that optimize the implementation of the AES algorithm are proposed to minimize resource allocation and maximize throughput. These designs are shown to outperform existing solutions in the literature. Additionally, a rapid prototype of a complete system-on-chip (SoC) solution that employs the proposed design on a configurable platform has been developed and proven to be suitable for real-time applications.

Citations: 3