{"title":"High performance network-on-chip simulation by interval-based timing predictions","authors":"Sascha Roloff, Frank Hannig, J. Teich","doi":"10.1145/3139315.3139320","DOIUrl":"https://doi.org/10.1145/3139315.3139320","url":null,"abstract":"Current multi- and many-core computer architectures heavily use Network-on-Chip (NoC) communication in order to meet the increased bandwidth demands between the processors and for reasons of scalability. For the proper analysis of concurrency, utilization, and workload distribution of parallel multi-media applications running on such NoC-based architectures, high-speed simulation techniques are required. Apart from accurate timing simulation of compute resources, it is of utmost importance also to accurately model the delays caused by the packet-based network communication in order to reliably verify performance numbers, or to identify any bottlenecks of the underlying architecture, or to study workload distribution techniques or routing algorithms. In this paper, we present a novel simulation approach for NoCs that allows such communication delays to be simulated equally accurately but much faster on average than on a flit-by-flit basis. We propose novel algorithmic and analytical techniques that predict the transmission intervals dynamically based on the arrival of communication requests, actual congestion in the NoC, routing information, packet lengths, and other parameters. According to such predictions, the simulation time may in many cases be automatically advanced, thus reducing the number of events to process in the simulator to a large extent. The presented NoC simulation technique has been integrated into a system-level multi-core architecture simulator. Experiments in running parallel real-world and multi-media applications on a simulated scalable NoC architecture show that we are able to achieve speedups of three orders of magnitude compared to cycle-accurate NoC simulators, while preserving a timing accuracy of above 95%.","PeriodicalId":208026,"journal":{"name":"Proceedings of the 15th IEEE/ACM Symposium on Embedded Systems for Real-Time Multimedia","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-10-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129733221","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Proceedings of the 15th IEEE/ACM Symposium on Embedded Systems for Real-Time Multimedia","authors":"S. Stuijk, Akash Kumar","doi":"10.1145/3139315","DOIUrl":"https://doi.org/10.1145/3139315","url":null,"abstract":"Multimedia and camera-based technologies play an important role in our daily life, and have become among the most relevant technological innovations. These technologies have proliferated into a wide range of application domains like Internet-of-Things (IoT), Cyber-Physical Systems (CPS), Healthcare and Medical Image Processing, Security, Consumer, etc. The ever-increasing computational and communication requirements demanded by current and next-generation multimedia and image/video processing devices, together with the energy constraints that characterize portable devices, require innovative design methodologies and tools. The IEEE/ACM ESTIMedia aims to bring together people from different multimedia and imaging-related research communities who have worked separately but did not interact sufficiently to address the challenges facing the design of hardware and software layers of such highly specialized multimedia and image/video processing systems.","PeriodicalId":208026,"journal":{"name":"Proceedings of the 15th IEEE/ACM Symposium on Embedded Systems for Real-Time Multimedia","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-10-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129142808","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"ML-Gov: a machine learning enhanced integrated CPU-GPU DVFS governor for mobile gaming","authors":"Jurn-Gyu Park, N. Dutt, Sung-Soo Lim","doi":"10.1145/3139315.3139317","DOIUrl":"https://doi.org/10.1145/3139315.3139317","url":null,"abstract":"Modern heterogeneous CPU-GPU based mobile architectures that execute intensive mobile games and other graphics applications use software governors to achieve high performance with energy-efficiency. For dynamic and diverse gaming workloads on heterogeneous platforms, existing governors typically utilize statistical or heuristic models assuming linear relationships for a small set of mobile games, resulting in high prediction errors. To overcome these limitations, we propose ML-Gov: a machine learning enhanced integrated CPU-GPU governor that builds tree-based piecewise linear models offline, and deploys these models for online estimation into an integrated CPU-GPU Dynamic Voltage Frequency Scaling (DVFS) governor. Our experiments on a test set of 20 mobile games exhibiting diverse characteristics show that our governor achieved significant energy-efficiency gains of over 10% on average in energy-per-frame, with a surprising-but-modest 3% improvement in Frames-per-Second (FPS) performance, compared to a typical state-of-the-art governor that employs simple linear regression models.","PeriodicalId":208026,"journal":{"name":"Proceedings of the 15th IEEE/ACM Symposium on Embedded Systems for Real-Time Multimedia","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-10-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130727418","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Asynchronous one-sided communications and synchronizations for a clustered manycore processor","authors":"Julien Hascoët, B. Dinechin, P. G. D. Massas, Minh Quan Ho","doi":"10.1145/3139315.3139318","DOIUrl":"https://doi.org/10.1145/3139315.3139318","url":null,"abstract":"Clustered manycore architectures fitted with a Network-on-Chip (NoC) and scratchpad memories enable highly energy-efficient and time-predictable implementations. However, porting applications to such processors represents a programming challenge. Inspired by supercomputer one-sided communication libraries and by OpenCL async_work_group_copy primitives, we propose a simple programming layer for communication and synchronization on clustered manycore architectures. We discuss the design and implementation of this layer on the 2nd-generation Kalray MPPA processor, where it is available from both OpenCL and POSIX C/C++ multithreaded programming models. Our measurements show that it reaches up to 94% of the theoretical hardware throughput with a best-case round-trip latency of 2.2 μs when operating at 500 MHz.","PeriodicalId":208026,"journal":{"name":"Proceedings of the 15th IEEE/ACM Symposium on Embedded Systems for Real-Time Multimedia","volume":"39 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-10-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127899379","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Worst case delay analysis of shared resource access in partitioned multi-core systems","authors":"Donghyun Kang, Junchul Choi, S. Ha","doi":"10.1145/3139315.3139322","DOIUrl":"https://doi.org/10.1145/3139315.3139322","url":null,"abstract":"In the worst case response time (WCRT) analysis of multi-core systems with shared resources, non-deterministic arbitration delay due to resource contention should be considered conservatively. In this paper, we propose a novel technique for modeling the shared resource contention to find a more accurate upper bound of arbitration delay than the state-of-the-art technique. After computing the worst-case resource demand from each processing element based on the event stream model, we consider the possible scheduling pattern of tasks to make a tighter estimation. The performance of the proposed technique is verified by extensive experiments with MediaBench benchmark applications, synthetic task sets, and a real-life automotive example.","PeriodicalId":208026,"journal":{"name":"Proceedings of the 15th IEEE/ACM Symposium on Embedded Systems for Real-Time Multimedia","volume":"45 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-10-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134524969","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Mobile heterogeneous computing: a software perspective","authors":"T. Mitra","doi":"10.1145/3139315.3151619","DOIUrl":"https://doi.org/10.1145/3139315.3151619","url":null,"abstract":"Mobile heterogeneous computing, materialized in the form of multiprocessor system-on-chips (MPSoC) comprising various processing elements such as general-purpose cores with differing characteristics, GPUs, DSPs, non-programmable accelerators, and reconfigurable computing, is expected to dominate the current and future mobile platform landscape. The heterogeneity enables a computational kernel with specific requirements to be paired with the processing element(s) ideally suited to perform that computation, leading to substantially improved performance and energy-efficiency. While heterogeneous computing is an attractive proposition in theory, considerable software support at all levels is essential to fully realize its promises. The system software needs to orchestrate the different on-chip compute resources in a synergistic manner with minimal engagement from the application developers. The current state-of-the-art is inadequate in the software dimension despite tremendous progress and success in designing heterogeneous MPSoCs for mobile devices. This talk will put the spotlight on the software perspective of mobile heterogeneous computing, especially in the context of popular emerging applications, such as 3D gaming, multimedia processing and analytics. The talk will introduce the technology trends driving the mobile heterogeneous computing revolution, provide an overview of computationally and performance divergent compute elements, and present efforts at compiler and run-time management layers to unleash its potential towards high-performance energy-efficient computing.","PeriodicalId":208026,"journal":{"name":"Proceedings of the 15th IEEE/ACM Symposium on Embedded Systems for Real-Time Multimedia","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-10-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126657169","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Reliable mapping and partitioning of performance-constrained openCL applications on CPU-GPU MPSoCs","authors":"E. Wächter, G. Merrett, B. Al-Hashimi, A. Singh","doi":"10.1145/3139315.3157088","DOIUrl":"https://doi.org/10.1145/3139315.3157088","url":null,"abstract":"Heterogeneous Multi-Processor Systems-on-Chips (MPSoCs) containing CPU and GPU cores are typically required to execute applications concurrently. Existing approaches exploit applications executing in CPU and GPU cores at the same time taking into account performance and energy consumption for mapping and partitioning. This paper presents a proposal for mapping and partitioning of applications in CPU-GPU MPSoCs taking into account the temperature behavior of the system. We evaluate the temperature profiling to partition the applications between CPU and GPU. The profiling is done by measuring the temperature of the CPU and GPU cores while executing different applications at different partitions. Results show up to 13% savings in the average temperature of the chip while maintaining performance requirements. A lower thermal behavior represents a better long-term reliability (lifetime) of the SoC.","PeriodicalId":208026,"journal":{"name":"Proceedings of the 15th IEEE/ACM Symposium on Embedded Systems for Real-Time Multimedia","volume":"65 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-10-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128011179","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Evaluating and mitigating degradation effects in multimedia circuits","authors":"H. Amrouch, J. Henkel","doi":"10.1145/3139315.3143527","DOIUrl":"https://doi.org/10.1145/3139315.3143527","url":null,"abstract":"The nano-CMOS era continuously introduces reliability challenges with every new generation. Short-term and long-term degradation effects due to temperature and aging, respectively, can cause a considerable increase in the delay of a circuit and hence timing errors due to path violations. To overcome such degradations, designers inevitably need to employ wide timing guardbands, which manifest as reduced efficiency and performance. In fact, narrowing guardbands is one of the key optimization goals in current and upcoming technology nodes. In this work, we investigate whether designers really need to employ guardbands even in error-tolerant (e.g., multimedia) circuits. This investigation enables us to trade off guardbands with quality. In addition, we demonstrate how our proposed degradation-aware cell libraries, degradation-aware timing analysis and degradation-aware logic synthesis are indispensable, not only to link the physical level with the system level (i.e., quantifying the final impact of degradation effects on the quality of processed images) but also to effectively increase the resiliency of circuits against degradations.","PeriodicalId":208026,"journal":{"name":"Proceedings of the 15th IEEE/ACM Symposium on Embedded Systems for Real-Time Multimedia","volume":"21 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-10-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122963871","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Approximate data reuse-based processor: a case study on image compression","authors":"Hisashi Osawa, Yuko Hara-Azumi","doi":"10.1145/3139315.3139316","DOIUrl":"https://doi.org/10.1145/3139315.3139316","url":null,"abstract":"In most embedded systems, how to design accelerators of end applications under stringent design constraints has been a crucial issue. In this paper, we employ a new computation paradigm \"approximate computing\" to resolve this issue. More specifically, our work focuses on and reuses computations which have recently produced results that are expected to be similar enough to the current ones - \"approximate data reuse.\" This concept makes it possible to reduce computations by skipping instructions. We develop accelerator designs with this concept holistically from both hardware (architecture) and software (compilation) to achieve sufficient speedup and energy saving while mitigating the area overhead at the cost of some error. This paper provides three main contributions: architectural extensions applicable to a variety of processors even under a stringent constraint on circuit area, parameterization of important features of our method so that the degree of approximate data reuse can be easily tuned for different applications, and exhaustive evaluations on combinations of key parameters through our case study. A case study was quantitatively conducted using a realistic application (image compression) to demonstrate the effectiveness of our method over conventional ones.","PeriodicalId":208026,"journal":{"name":"Proceedings of the 15th IEEE/ACM Symposium on Embedded Systems for Real-Time Multimedia","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-10-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130459502","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Fluid wireless protocols: energy-efficient design and implementation","authors":"Ganapati Bhat, S. Srinivas, Vamsi Chagari, Jaehyun Park, Thomas McGiffen, Hyunseok Lee, D. Bliss, C. Chakrabarti, Ümit Y. Ogras","doi":"10.1145/3139315.3139321","DOIUrl":"https://doi.org/10.1145/3139315.3139321","url":null,"abstract":"We stand at the dawn of the next wireless revolution that is driven by 5G and internet-of-things technologies. The dramatic increase in the diversity of needs necessitates breaking the walls of rigid protocols. This paper introduces the concept of fluid wireless protocols, i.e., protocols that can change with the application requirements. We also present a protocol development kit to aid the design of these fluid protocols. Our tool set consists of a protocol recommendation engine for wireless communications and a hardware optimization framework for optimizing the implementation on a state-of-the-art system-on-chip platform. Specifically, we propose a hardware recommendation engine to generate an energy-efficient hardware implementation. We demonstrate the proposed techniques on four protocols with varying requirements, and also run air-to-air experiments on a commercial system-on-chip platform.","PeriodicalId":208026,"journal":{"name":"Proceedings of the 15th IEEE/ACM Symposium on Embedded Systems for Real-Time Multimedia","volume":"69 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-10-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114829189","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}