Sheng Yang, R. Shafik, G. Merrett, Edward A. Stott, Joshua M. Levine, James J. Davis, B. Al-Hashimi
{"title":"Adaptive energy minimization of embedded heterogeneous systems using regression-based learning","authors":"Sheng Yang, R. Shafik, G. Merrett, Edward A. Stott, Joshua M. Levine, James J. Davis, B. Al-Hashimi","doi":"10.1109/PATMOS.2015.7347594","DOIUrl":"https://doi.org/10.1109/PATMOS.2015.7347594","url":null,"abstract":"Modern embedded systems consist of heterogeneous computing resources with diverse energy and performance trade-offs. This is because these resources exercise the application tasks differently, generating varying workloads and energy consumption. As a result, minimizing energy consumption in these systems is challenging as continuous adaptation between application task mapping (i.e. allocating tasks among the computing resources) and dynamic voltage/frequency scaling (DVFS) is required. Existing approaches have limitations due to lack of such adaptation with practical validation (Table I). This paper addresses such limitation and proposes a novel adaptive energy minimization approach for embedded heterogeneous systems. Fundamental to this approach is a runtime model, generated through regression-based learning of energy/performance trade-offs between different computing resources in the system. Using this model, an application task is suitably mapped on a computing resource during runtime, ensuring minimum energy consumption for a given application performance requirement. Such mapping is also coupled with a DVFS control to adapt to performance and workload variations. The proposed approach is designed, engineered and validated on a Zynq-ZC702 platform, consisting of CPU, DSP and FPGA cores. 
Using several image processing applications as case studies, it was demonstrated that our proposed approach can achieve significant energy savings (>70%), when compared to the existing approaches.","PeriodicalId":325869,"journal":{"name":"2015 25th International Workshop on Power and Timing Modeling, Optimization and Simulation (PATMOS)","volume":"171 5","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-12-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114017906","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A versatile and reliable glitch filter for clocks","authors":"Robert Najvirt, A. Steininger","doi":"10.1109/PATMOS.2015.7347599","DOIUrl":"https://doi.org/10.1109/PATMOS.2015.7347599","url":null,"abstract":"In today's complex system-on-chip architectures the protection of the clock(s) against glitches introduced by environmental disturbances, attackers, or gating measures is becoming increasingly important. Glitch protection is a delicate issue in the digital domain, as it is inherently coupled with metastability issues. The circuit we propose in this paper outputs a clock that strictly follows an input reference clock in the regular case, but guarantees a minimum output pulse width even in case of arbitrary behavior of the reference. We will give a thorough analysis showing that, unlike most existing solutions, our circuit can handle metastability without any residual risk of upsets. Still its implementation is very simple. Our theoretical claims will be supported by simulation results. Furthermore, we will give some examples on possible use cases for such a circuit, like clock gating, clock self-repair, or defense against clock attacks.","PeriodicalId":325869,"journal":{"name":"2015 25th International Workshop on Power and Timing Modeling, Optimization and Simulation (PATMOS)","volume":"16 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-12-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133367304","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A. Keliris, Vasilis Dimitsas, O. Kremmyda, D. Gizopoulos, M. Maniatakos
{"title":"Efficient parallelization of the Discrete Wavelet Transform algorithm using memory-oblivious optimizations","authors":"A. Keliris, Vasilis Dimitsas, O. Kremmyda, D. Gizopoulos, M. Maniatakos","doi":"10.1109/PATMOS.2015.7347583","DOIUrl":"https://doi.org/10.1109/PATMOS.2015.7347583","url":null,"abstract":"As the rate of single-thread CPU performance improvement per generation has diminished due to lower transistor-speed scaling and energy related issues, researchers and industry have shifted their interest towards multi-core and many-core architectures for improving performance. Comparisons between optimized applications for parallel architectures have been quantified many times in the literature, but contradictory results have been reported mainly due to biased methods of evaluating and comparing these architectures. In this paper, we present memory-oblivious optimizations of the widely used Discrete Wavelet Transform (DWT), and provide detailed comparisons of the algorithm on Intel and AMD multi-core CPUs, Nvidia many-core GPUs, as well as the Intel's Xeon Phi many-core coprocessor. Our results indicate that, compared to their respective non-optimized single thread implementations, memory-oblivious optimization delivers up to 17.9×-197.2× performance improvement for the various architectures examined. Furthermore, compared to the state-of-the-art, the presented CPU and GPU memory-oblivious implementations are 2.6× and 1.3× faster respectively than the fastest implementations of DWT currently available in the literature. 
No comparison to the state-of-the-art can be made for the Xeon Phi, as, to the best of our knowledge, this is the first study that optimizes the DWT for this newfangled architecture.","PeriodicalId":325869,"journal":{"name":"2015 25th International Workshop on Power and Timing Modeling, Optimization and Simulation (PATMOS)","volume":"87 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-12-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128885572","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Bao Le, Djordje Maksimovic, D. Sengupta, Erhan Ergin, Ryan Berryhill, A. Veneris
{"title":"Constructing stability-based clock gating with hierarchical clustering","authors":"Bao Le, Djordje Maksimovic, D. Sengupta, Erhan Ergin, Ryan Berryhill, A. Veneris","doi":"10.1109/PATMOS.2015.7347593","DOIUrl":"https://doi.org/10.1109/PATMOS.2015.7347593","url":null,"abstract":"In modern designs, a complex clock distribution network is employed to distribute the clock signal(s) to all the sequential elements. As the functionality of these sequential elements depends heavily on usage scenarios, it is vital that the clock network is optimized for these scenarios. This paper introduces a clock network power optimization methodology based on design usage patterns and stability based clock gating. Specifically, whenever a register retains its value from the previous cycle, a clock gating implementation shuts off its clock and disables data loading to enable power reduction. We first introduce the notion of a stability pattern and its correlation with clock gating efficiency. Next, we introduce a methodology to identify efficient clock gating implementations. In this framework, a clustering algorithm leveraging stability patterns iteratively computes more effective gating implementations. Each implementation is evaluated further on area overhead and critical path delay. If it satisfies all criteria, it is implemented in the design; otherwise, it is sent back to the clustering algorithm to compute new clock gating implementations. Empirical results show 22.6% reduction in clock network power and 16.0% reduction in total power consumption. 
This confirms the practicality and robustness of the proposed methodology.","PeriodicalId":325869,"journal":{"name":"2015 25th International Workshop on Power and Timing Modeling, Optimization and Simulation (PATMOS)","volume":"89 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-12-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126401840","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Inferring custom architectures from OpenCL","authors":"Krzysztof Kepa, Ritesh Soni, P. Athanas","doi":"10.1109/PATMOS.2015.7347581","DOIUrl":"https://doi.org/10.1109/PATMOS.2015.7347581","url":null,"abstract":"OpenCL has emerged as the de facto cross-platform standard in the GPU-based HPC computing domain. However, in FPGA-based HPC systems, OpenCL-to-FPGA compilers often yield suboptimal results due to the rigid architecture, limited shared-memory, and non-existent inter-work-item communication pathways implied by the OpenCL model. In this work, a methodology of inferring application-specific OpenCL “work-item” interfaces based on kernel code analysis is explored. A proof-of-concept prototype is implemented using an OpenCL source-to-source translator, which allows automated generation of the FPGA-based hardware accelerators directly from the OpenCL sources. The type and implementation of the inferred interface is tailored to match the data access patterns within the kernel. The inferred interface outperforms limitations of the OpenCL rigid architecture and communication model. The presented approach achieves a ~30x speedup over the generic memory-based approach for a 16 work-items application. A set of OpenCL coding patterns targeting FPGA-based HPC systems is also introduced. 
This technique is demonstrated on a popular bioinformatics algorithm, yet is applicable to any such algorithm with non-standard inter-cell communications.","PeriodicalId":325869,"journal":{"name":"2015 25th International Workshop on Power and Timing Modeling, Optimization and Simulation (PATMOS)","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-12-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132096907","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Calculation of worst-case execution time for multicore processors using deterministic execution","authors":"Hamid Mushtaq, Z. Al-Ars, K. Bertels","doi":"10.1109/PATMOS.2015.7347584","DOIUrl":"https://doi.org/10.1109/PATMOS.2015.7347584","url":null,"abstract":"Safety-critical real-time systems need to meet strict timing deadlines. We use a model-checking-based approach to calculate the WCET, where we apply optimizations to reduce the number of states stored by the model checker. Furthermore, we use deterministic shared memory accesses to further reduce the calculation time, memory and number of states needed for calculating the WCET. By optimizing the model checking code, we were able to complete benchmarks that otherwise suffered from state explosion problems. Furthermore, by using deterministic execution, we significantly reduced the calculation time (up to 158×), memory (up to 89×) and states needed (up to 188×) for calculating the WCET, with a negligible increase (up to 4%) in the calculated WCET for a multicore system with 4 cores. Lastly, unlike other state-of-the-art approaches, which perform a binary search for the WCET over several iterations, our method calculates the WCET in just one iteration. Taking all these optimizations into consideration, the gain in speed was from 1775× to 2471× for 4 threads.","PeriodicalId":325869,"journal":{"name":"2015 25th International Workshop on Power and Timing Modeling, Optimization and Simulation (PATMOS)","volume":"13 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-12-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126661343","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Frequency-domain modeling of ground bounce and substrate noise for synchronous and GALS systems","authors":"M. Babić, Xin Fan, M. Krstic","doi":"10.1109/PATMOS.2015.7347597","DOIUrl":"https://doi.org/10.1109/PATMOS.2015.7347597","url":null,"abstract":"In this work, the ground bounce noise has been modeled and analyzed in frequency domain, for both synchronous and GALS (globally asynchronous, locally synchronous) systems. The analysis has been performed analytically, and validated by numerical simulations in MATLAB. Package parasitics and power distribution network have been coarsely modeled by a simple lumped model, while switching currents have been modeled as periodic triangular pulses. Dominant components of spectrum are determined, and the impact of their distribution on the requirements for substrate modeling has been discussed. It has been concluded that resistive substrate approximation introduces large errors for systems with small decoupling capacitances, while it can be satisfactory for systems with large decoupling capacitances.","PeriodicalId":325869,"journal":{"name":"2015 25th International Workshop on Power and Timing Modeling, Optimization and Simulation (PATMOS)","volume":"17 1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-12-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130920077","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
V. Canals, A. Morro, A. Oliver, M. Alomar, J. Rosselló
{"title":"An unconventional computing technique for ultra-fast and ultra-low power data mining","authors":"V. Canals, A. Morro, A. Oliver, M. Alomar, J. Rosselló","doi":"10.1109/PATMOS.2015.7347585","DOIUrl":"https://doi.org/10.1109/PATMOS.2015.7347585","url":null,"abstract":"In this work we review the basic principles of stochastic logic and propose its application to probabilistic-based pattern-recognition analysis. The proposed technique is the implementation of a parallel comparison of data with respect to various pre-stored categories. We design smart pulse-based stochastic-logic blocks to provide an efficient pattern recognition analysis. The proposed architecture can speed up the screening process of huge databases by two orders of magnitude with respect to classical software-based solutions, thus implying a great improvement in terms of total performance (speed and power dissipation).","PeriodicalId":325869,"journal":{"name":"2015 25th International Workshop on Power and Timing Modeling, Optimization and Simulation (PATMOS)","volume":"13 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-12-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123066317","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
M. Altieri, S. Lesecq, D. Puschini, O. Héron, E. Beigné, J. Rodas
{"title":"Evaluation and mitigation of aging effects on a digital on-chip voltage and temperature sensor","authors":"M. Altieri, S. Lesecq, D. Puschini, O. Héron, E. Beigné, J. Rodas","doi":"10.1109/PATMOS.2015.7347595","DOIUrl":"https://doi.org/10.1109/PATMOS.2015.7347595","url":null,"abstract":"Power efficiency is a tremendous challenge for high performance embedded systems under energy constraints. Fine-grain Dynamic Voltage and Frequency Scaling approaches are usually implemented in order to meet these conflicting objectives. Moreover, these techniques can be improved if local and on-the-fly monitoring of the dynamic variations is performed. A low-cost on-chip general purpose sensor associated with an appropriate data fusion technique has been recently developed in order to monitor local temperature and voltage conditions. However, reliability has become a major concern as the technology scales below 40nm. The aging variation is no longer negligible and must be taken into account during the monitor design and operation. This paper revisits such a sensor under both BTI and HCI aging effects in 28nm STMicroelectronics technology. A simple recalibration method is also proposed to mitigate the aging effects on the VT estimation.","PeriodicalId":325869,"journal":{"name":"2015 25th International Workshop on Power and Timing Modeling, Optimization and Simulation (PATMOS)","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-12-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127605468","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Emilie Garat, David Coriat, E. Beigné, L. Stefanazzi
{"title":"Unified Power Format (UPF) methodology in a vendor independent flow","authors":"Emilie Garat, David Coriat, E. Beigné, L. Stefanazzi","doi":"10.1109/PATMOS.2015.7347591","DOIUrl":"https://doi.org/10.1109/PATMOS.2015.7347591","url":null,"abstract":"To provide designers with an efficient low power design flow, several methodologies have been proposed such as the Unified Power Format (UPF). The main issue faced by designers is the non-interoperability of those methods across different Computer Aided Design (CAD) tools. Although the UPF standard was originally created with interoperability in mind, few of its constructs are actually supported by all CAD vendors. In this paper, we aim at providing a UPF 2.0 methodology that is compatible with different tools. The proposed case study is a circuit with three power domains and a cross-vendor UPF specification. This paper demonstrates a full low power design flow, with formal power checking, power aware simulation, synthesis and back-end.","PeriodicalId":325869,"journal":{"name":"2015 25th International Workshop on Power and Timing Modeling, Optimization and Simulation (PATMOS)","volume":"74 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-12-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132035696","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}