Shao-Chung Wang, Li-Chen Kan, Chao-Lin Lee, Yuan-Shin Hwang, Jenq-Kuen Lee
{"title":"Architecture and Compiler Support for GPUs Using Energy-Efficient Affine Register Files","authors":"Shao-Chung Wang, Li-Chen Kan, Chao-Lin Lee, Yuan-Shin Hwang, Jenq-Kuen Lee","doi":"10.1145/3133218","DOIUrl":"https://doi.org/10.1145/3133218","url":null,"abstract":"A modern GPU can simultaneously process thousands of hardware threads. These threads are grouped into fixed-size SIMD batches executing the same instruction on vectors of data in lockstep to achieve high throughput and performance. The register files are huge because each SIMD group accesses a dedicated set of vector registers for fast context switching, and consequently the power consumption of register files has become an important issue. One proposed solution is to replace some of the vector registers with scalar registers, as different threads in the same SIMD group operate on scalar values and so the redundant computations and accesses of these scalar values can be eliminated. However, it has been observed that a significant number of registers contain affine vectors υ such that υ[i] = b + i × s, which can be represented by a base b and a stride s. Therefore, this article proposes an affine register file design for GPUs that is energy efficient because it reduces the redundant execution of both uniform and affine vectors. This design uses a pair of registers to store the base and stride of each affine vector and provides specific affine ALUs to execute affine instructions. A method of compiler analysis has been developed to detect scalars and affine vectors and annotate instructions to facilitate their corresponding scalar and affine computations. Furthermore, a priority-based register allocation scheme has been implemented to assign scalars and affine vectors to appropriate scalar and affine register files. Experimental results show that this design was able to dispatch 43.56% of the computations to scalar and affine ALUs when using eight scalar and four affine registers per warp.
As a result, the design also reduced the energy consumption of the register files and ALUs to 21.86% and 26.54%, respectively, and reduced the overall energy consumption of the GPU by an average of 5.18%.","PeriodicalId":7063,"journal":{"name":"ACM Trans. Design Autom. Electr. Syst.","volume":"17 1","pages":"18:1-18:25"},"PeriodicalIF":0.0,"publicationDate":"2017-12-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"81471221","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
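The base-and-stride representation from the abstract above (υ[i] = b + i × s) can be illustrated with a short sketch. The class name `AffineVec`, the helper `uniform`, and the warp width of 32 lanes are assumptions made for illustration only; they are not taken from the paper's hardware or compiler design.

```python
from dataclasses import dataclass

WARP_SIZE = 32  # assumed lane count per SIMD group (hypothetical)


@dataclass(frozen=True)
class AffineVec:
    """An affine vector v[i] = base + i * stride, stored as two scalars."""
    base: int
    stride: int

    def expand(self, lanes=WARP_SIZE):
        # Materialize the full per-lane vector (what a vector register would hold).
        return [self.base + i * self.stride for i in range(lanes)]

    def __add__(self, other):
        # Adding two affine vectors needs two scalar additions, not one per lane.
        return AffineVec(self.base + other.base, self.stride + other.stride)

    def scale(self, k):
        # Multiplying by a scalar constant keeps the vector affine.
        return AffineVec(self.base * k, self.stride * k)


def uniform(value):
    # A uniform (scalar) vector is the special case with stride 0.
    return AffineVec(value, 0)


# Typical address computation: addr = 4 * thread_id + 1000
tid = AffineVec(0, 1)
addr = tid.scale(4) + uniform(1000)
```

In this sketch the address computation touches only two integers, while `expand()` shows the 32 per-lane values that a conventional vector register would store and compute.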
Michail Noltsis, D. Rodopoulos, N. Zompakis, F. Catthoor, D. Soudris
{"title":"Runtime Slack Creation for Processor Performance Variability using System Scenarios","authors":"Michail Noltsis, D. Rodopoulos, N. Zompakis, F. Catthoor, D. Soudris","doi":"10.1145/3152158","DOIUrl":"https://doi.org/10.1145/3152158","url":null,"abstract":"Modern microprocessors contain a variety of mechanisms used to mitigate errors in the logic and memory, referred to as Reliability, Availability, and Serviceability (RAS) techniques. Many of these techniques, such as component disabling, come at a performance cost. With the aggressive downscaling of device dimensions, it is reasonable to expect that chip-wide error rates will intensify in the future and perhaps vary throughout system lifetime. As a result, it is important to reclaim the temporal RAS overheads in a systematic way and enable dependable performance. The current article presents a closed-loop control scheme that actuates the processor's frequency based on detected timing interference to ensure performance dependability. The concepts of slack and deadline vulnerability factor are introduced to support the formulation of a discrete time control problem. Default application timing is derived using the system scenario methodology, the applicability of which is demonstrated through simulations. Additionally, the proposed concept is demonstrated on a real platform and application: a Proportional-Integral-Differential controller, implemented within the application, actuates the Dynamic Voltage and Frequency Scaling (DVFS) framework of the Linux kernel to effectively reclaim temporal overheads injected at runtime. The current article discusses the responsiveness and energy efficiency of the proposed performance dependability scheme. Finally, an additional formulation is introduced to predict the upper bound of timing interference that can be absorbed by actuating the DVFS of any processor, and it is also validated on a representative reduction to practice.","PeriodicalId":7063,"journal":{"name":"ACM Trans.
Design Autom. Electr. Syst.","volume":"36 1","pages":"24:1-24:23"},"PeriodicalIF":0.0,"publicationDate":"2017-12-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"81418577","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Flexible and Tradeoff-Aware Constraint-Based Design Space Exploration for Streaming Applications on Heterogeneous Platforms","authors":"Kathrin Rosvall, I. Sander","doi":"10.1145/3133210","DOIUrl":"https://doi.org/10.1145/3133210","url":null,"abstract":"Due to its complexity, the problem of mapping and scheduling streaming applications on heterogeneous MPSoCs under real-time and performance constraints has traditionally been tackled by incomplete heuristic algorithms. In recent years, approaches based on Constraint Programming (CP) have shown promising results as complete methods for finding optimal mappings, in particular concerning throughput. However, so far none of the available CP approaches consider the tradeoff between throughput and buffer requirements or throughput and power consumption. This article integrates tradeoff awareness into the CP model and introduces a two-step solving approach that utilizes the advantages of heuristics, while still keeping the completeness property of CP. With a number of experiments considering several streaming applications and different platform models, the article illustrates not only the efficiency of the presented model but also its suitability for solving different problems with various combinations of performance constraints.","PeriodicalId":7063,"journal":{"name":"ACM Trans. Design Autom. Electr. Syst.","volume":"19 1","pages":"21:1-21:26"},"PeriodicalIF":0.0,"publicationDate":"2017-12-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"72665097","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Maryam Shafiee, Navankur Beohar, Priyanka Bakliwal, Sidhanto Roy, Debashis Mandal, B. Bakkaloglu, S. Ozev
{"title":"A Disturbance-Free Built-In Self-Test and Diagnosis Technique for DC-DC Converters","authors":"Maryam Shafiee, Navankur Beohar, Priyanka Bakliwal, Sidhanto Roy, Debashis Mandal, B. Bakkaloglu, S. Ozev","doi":"10.1145/3152157","DOIUrl":"https://doi.org/10.1145/3152157","url":null,"abstract":"Complex electronic systems include multiple power domains and drastically varying dynamic power consumption patterns, requiring the use of multiple power conversion and regulation units. High-frequency switching converters have been gaining prominence in the DC-DC converter market due to their high efficiency and smaller form factor. Unfortunately, they are also subject to higher process variations, and faster in-field degradation, jeopardizing stable operation of the power supply. This article presents a technique to track changes in the dynamic loop characteristics of DC-DC converters without disturbing the normal mode of operation using a white noise–based excitation and correlation. Using multiple points for injection and analysis, we show that the degraded part can be diagnosed to take remedial action. White noise excitation is generated via a pseudo-random disturbance at reference, load current, and pulse-width modulation (PWM) nodes of the converter with the test signal energy being spread over a wide bandwidth, without significantly affecting the converter noise and ripple floor. The impulse response is extracted by correlating the random input sequence with the disturbed output generated. Test signal analysis is achieved by correlating the pseudo-random input sequence with the output response and thereby accumulating the desired behavior over time and pulling it above the noise floor of the measurement set-up. An off-the-shelf power converter, LM27402, is used as the device-under-test (DUT) for experimental verification. 
Experimental results show that the proposed technique can estimate the converter's natural frequency and quality factor (Q-factor) within error margins of ±2.5% and ±0.7%, respectively, over changes in load inductance and capacitance. For diagnosis, the inductor's DC resistance (DCR), which is the inductor's series resistance and is indicative of the degradation in the inductor's Q-factor, is estimated within less than a ±1.6% error margin.","PeriodicalId":7063,"journal":{"name":"ACM Trans. Design Autom. Electr. Syst.","volume":"18 1","pages":"25:1-25:22"},"PeriodicalIF":0.0,"publicationDate":"2017-12-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"89138158","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
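The correlation step described in the abstract above (extracting an impulse response by correlating a pseudo-random input with the disturbed output) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' test setup: the three-tap FIR filter `h_true` is a hypothetical stand-in for the converter loop, the ±1 sequence comes from Python's `random` module rather than a hardware pseudo-random generator, and the function names are invented for the example.

```python
import random


def prbs(n, seed=1):
    # Pseudo-random +/-1 excitation with unit variance (assumed stand-in for a PRBS source).
    rng = random.Random(seed)
    return [rng.choice((-1.0, 1.0)) for _ in range(n)]


def fir(h, x):
    # Hypothetical device under test: y[n] = sum_k h[k] * x[n - k].
    return [sum(h[k] * x[n - k] for k in range(len(h)) if n - k >= 0)
            for n in range(len(x))]


def estimate_impulse_response(x, y, taps):
    # Cross-correlation R_xy[k] = mean(x[i] * y[i + k]); for white, unit-variance x
    # this approximates h[k], and uncorrelated noise averages out over time.
    n = len(x)
    return [sum(x[i] * y[i + k] for i in range(n - k)) / (n - k)
            for k in range(taps)]


h_true = [0.5, 0.3, 0.2]
x = prbs(20000)
noise = random.Random(2)
# Additive measurement noise larger than the smallest tap of the response.
y = [v + noise.gauss(0.0, 0.5) for v in fir(h_true, x)]
h_est = estimate_impulse_response(x, y, taps=5)
```

Accumulating the product over many samples is what lifts the response above the noise floor: the noise term shrinks roughly with the square root of the record length, so longer records give tighter estimates.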
Kun Yang, Haoting Shen, Domenic Forte, S. Bhunia, M. Tehranipoor
{"title":"Hardware-Enabled Pharmaceutical Supply Chain Security","authors":"Kun Yang, Haoting Shen, Domenic Forte, S. Bhunia, M. Tehranipoor","doi":"10.1145/3144532","DOIUrl":"https://doi.org/10.1145/3144532","url":null,"abstract":"The pharmaceutical supply chain is the pathway through which prescription and over-the-counter (OTC) drugs are delivered from manufacturing sites to patients. Technological innovations, price fluctuations of raw materials, as well as tax, regulatory, and market demands are driving change and making the pharmaceutical supply chain more complex. Traditional supply chain management methods struggle to protect the pharmaceutical supply chain, maintain its integrity, enhance customer confidence, and aid regulators in tracking medicines. To develop effective measures that secure the pharmaceutical supply chain, it is important that the community be aware of the state-of-the-art capabilities available to the supply chain owners and participants. In this article, we present a survey of existing hardware-enabled pharmaceutical supply chain security schemes and their limitations. We also highlight the current challenges and point out future research directions. This survey should be of interest to government agencies, pharmaceutical companies, hospitals and pharmacies, and all others involved in the provenance and authenticity of medicines and the integrity of the pharmaceutical supply chain.","PeriodicalId":7063,"journal":{"name":"ACM Trans. Design Autom. Electr.
Syst.","volume":"50 1","pages":"23:1-23:26"},"PeriodicalIF":0.0,"publicationDate":"2017-12-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"89382696","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"DYNASCORE: DYNAmic Software COntroller to Increase REsource Utilization in Mixed-Critical Systems","authors":"A. Kritikakou, T. Marty, Matthieu Roy","doi":"10.1145/3110222","DOIUrl":"https://doi.org/10.1145/3110222","url":null,"abstract":"In real-time mixed-critical systems, Worst-Case Execution Time (WCET) analysis is required to guarantee that timing constraints are respected, at least for high-criticality tasks. However, the WCET is pessimistic compared to the real execution time, especially for multicore platforms. Because WCET computation considers the worst-case scenario, whenever a high-criticality task accesses a shared resource in multicore platforms, all cores are assumed to use the same resource concurrently. This pessimism in WCET computation leads to a dramatic underutilization of the platform resources, or even to a failure to meet the timing constraints. In order to increase resource utilization while preserving real-time guarantees for high-criticality tasks, previous works proposed a runtime control system to monitor and decide when the interferences from low-criticality tasks cannot be further tolerated. However, in the initial approaches, the points where the controller is executed were statically predefined. In this work, we propose a dynamic runtime control which adapts its observations to online temporal properties, further increasing the dynamism of the approach, and mitigating the unnecessary overhead implied by existing static approaches. Our dynamic adaptive approach allows one to control the ongoing execution of tasks based on runtime information, and further increases the gains in terms of resource utilization compared with static approaches.","PeriodicalId":7063,"journal":{"name":"ACM Trans. Design Autom. Electr.
Syst.","volume":"62 11 1","pages":"13:1-13:26"},"PeriodicalIF":0.0,"publicationDate":"2017-12-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"89232847","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Congming Gao, Liang Shi, Yejia Di, Qiao Li, C. Xue, Kaijie Wu, E. Sha
{"title":"Exploiting Chip Idleness for Minimizing Garbage Collection—Induced Chip Access Conflict on SSDs","authors":"Congming Gao, Liang Shi, Yejia Di, Qiao Li, C. Xue, Kaijie Wu, E. Sha","doi":"10.1145/3131850","DOIUrl":"https://doi.org/10.1145/3131850","url":null,"abstract":"Solid state drives (SSDs) are normally constructed with a number of parallel-accessible flash chips, where host I/O requests are processed in parallel. In addition, there are many internal activities in SSDs, such as the read, write, and erase operations induced by garbage collection and wear leveling, which address the inability to update in place and the limited lifetime. When internal activities are triggered on a chip, the chip will be blocked. Our preliminary studies on several workloads show that when internal activities are frequently triggered, the host I/O performance will be significantly impacted because of the access conflict between them. In this work, to mitigate the performance degradation induced by access conflicts, a novel access conflict minimization scheme is proposed. The basic idea of the scheme is motivated by an interesting observation in SSDs: several chips are idle when other chips are busy with internal activities and host I/O requests. Based on this observation, we propose to schedule the operations induced by internal activities so as to minimize access conflict by exploiting the idleness of the multiple chips of SSDs. This approach is realized in two steps: first, the data accessed by internal activities is read into the controller; second, that data is written back to chips that are idle during the internal activities. With this scheme, the internal activities can be processed with minimized access conflict to the host requests. Simulation results show that the proposed approach significantly reduces the access conflict, and in turn leads to a significant performance improvement of SSDs.","PeriodicalId":7063,"journal":{"name":"ACM Trans.
Design Autom. Electr. Syst.","volume":"20 1","pages":"15:1-15:29"},"PeriodicalIF":0.0,"publicationDate":"2017-12-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"74093826","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Jaeyung Jun, Kyu Hyun Choi, Hokwon Kim, Sang Ho Yu, S. Kim, Youngsun Han
{"title":"Recovering from Biased Distribution of Faulty Cells in Memory by Reorganizing Replacement Regions through Universal Hashing","authors":"Jaeyung Jun, Kyu Hyun Choi, Hokwon Kim, Sang Ho Yu, S. Kim, Youngsun Han","doi":"10.1145/3131241","DOIUrl":"https://doi.org/10.1145/3131241","url":null,"abstract":"Recently, scaling down dynamic random access memory (DRAM) has become more of a challenge, with more faults than before and a significant degradation in yield. To improve the yield in DRAM, a redundancy repair technique with intra-subarray replacement has been extensively employed to replace faulty elements (i.e., rows or columns with defective cells) with spare elements in each subarray. Unfortunately, such a technique cannot efficiently handle a biased distribution of faulty cells because each subarray has a fixed number of spare elements. In this article, we propose a novel redundancy repair technique that uses a hashing method to solve this problem. Our hashing technique reorganizes replacement regions by changing the way in which their replacement information is referenced, so that faulty cells are evenly distributed across the regions. We also propose a fast repair algorithm to find the best hash function among all possible candidates. Although our approach requires little hardware overhead, it significantly improves the yield when compared with conventional redundancy techniques. In particular, the results of our experiment show that our technique saves spare elements by about 57% and 55% for a yield of 99% at BER 1e-6 and 5e-7, respectively.","PeriodicalId":7063,"journal":{"name":"ACM Trans. Design Autom. Electr.
Syst.","volume":"66 1","pages":"16:1-16:21"},"PeriodicalIF":0.0,"publicationDate":"2017-12-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"81376622","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Multi-Objective 3D Floorplanning with Integrated Voltage Assignment","authors":"J. Knechtel, J. Lienig, I. Elfadel","doi":"10.1145/3149817","DOIUrl":"https://doi.org/10.1145/3149817","url":null,"abstract":"Voltage assignment is a well-known technique for circuit design, which has been applied successfully to reduce power consumption in classical 2D integrated circuits (ICs). Its usage in the context of 3D ICs has not been fully explored yet although reducing power in 3D designs is of crucial importance, for example, to tackle the ever-present challenge of thermal management. In this article, we investigate the effective and efficient partitioning of 3D designs into multiple voltage domains during the floorplanning step of physical design. In particular, we introduce, implement, and evaluate novel algorithms for effective integration of voltage assignment into the inner floorplanning loops. Our algorithms are compatible not only with the traditional objectives of 2D floorplanning but also with the additional objectives and constraints of 3D designs, including the planning of through-silicon vias (TSVs) and the thermal management of stacked dies. We test our 3D floorplanner extensively on the GSRC benchmarks as well as on an augmented version of the IBM-HB+ benchmarks. The 3D floorplans are shown to achieve effective trade-offs for power and delays throughout different configurations—our results surpass naïve low-power and high-performance voltage assignment by 17% and 10%, on average. Finally, we release our 3D floorplanning framework as open-source code.","PeriodicalId":7063,"journal":{"name":"ACM Trans. Design Autom. Electr. 
Syst.","volume":"134 1","pages":"22:1-22:27"},"PeriodicalIF":0.0,"publicationDate":"2017-12-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"80169493","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A. Emeretlis, G. Theodoridis, P. Alefragis, N. Voros
{"title":"Static Mapping of Applications on Heterogeneous Multi-Core Platforms Combining Logic-Based Benders Decomposition with Integer Linear Programming","authors":"A. Emeretlis, G. Theodoridis, P. Alefragis, N. Voros","doi":"10.1145/3133219","DOIUrl":"https://doi.org/10.1145/3133219","url":null,"abstract":"The proper mapping of an application on a multi-core platform and the scheduling of its tasks are key elements to achieve the maximum performance. In this article, a novel hybrid approach based on integrating the Logic-Based Benders Decomposition (LBBD) principle with a pure Integer Linear Programming (ILP) model is introduced for mapping applications described by Directed Acyclic Graphs (DAGs) on platforms consisting of heterogeneous cores. The LBBD approach combines two optimization techniques with complementary strengths, namely ILP and Constraint Programming (CP), and is employed as a cut generation scheme. The generated constraints are utilized by the ILP model to cut possible assignment combinations aiming at improving the solution or proving the optimality of the best-found one. The introduced approach was applied both on synthetic DAGs and on DAGs derived from real applications. Through the proposed approach, many problems were optimally solved that could not be solved by any of the above methods (ILP, LBBD) alone within a time limit of 2 hours, while the overall solution time was also significantly decreased. Specifically, the hybrid method exhibited speedups equal to 4.2× for the synthetic instances and 10× for the real-application DAGs over the LBBD approach and two orders of magnitude over the ILP model.","PeriodicalId":7063,"journal":{"name":"ACM Trans. Design Autom. Electr. 
Syst.","volume":"1 1","pages":"26:1-26:24"},"PeriodicalIF":0.0,"publicationDate":"2017-12-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"90550592","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}