{"title":"Efficient mode changes in multi-mode systems","authors":"Akramul Azim, S. Fischmeister","doi":"10.1109/ICCD.2016.7753345","DOIUrl":"https://doi.org/10.1109/ICCD.2016.7753345","url":null,"abstract":"Multi-mode systems work in configurations, but face the challenge of ensuring timing guarantees during mode changes. In a multi-mode system, a mode-change request occurs when the system wants to operate in a new mode, but is already running in one. One mode may include some tasks that are the same as those of another mode. Therefore, the new mode may have tasks that are the same as those of the old mode. Changing modes in such a way as to skip some already completed tasks can decrease the workload of the new mode. Traditional protocols for changing modes always look forward in time to schedule tasks, although using already completed tasks may avoid re-executing them in the new mode. Reusing common tasks reduces the time to re-execute them while switching modes. In this paper, we introduce the concept and design considerations for a mode-change technique that may use completed tasks stored in checkpoints to avoid unnecessary re-execution and facilitate faster execution of new mode tasks. Through an example case-study, experimental results demonstrate that the overhead of using checkpoints is low, and using rollback facilitates faster execution of new mode tasks if completed tasks stored in checkpoints can be reused.","PeriodicalId":297899,"journal":{"name":"2016 IEEE 34th International Conference on Computer Design (ICCD)","volume":"119 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115806706","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"CyHOP: A generic framework for real-time power-performance optimization in networked wearable motion sensors","authors":"Ramin Fallahzadeh, Hassan Ghasemzadeh","doi":"10.1109/ICCD.2016.7753320","DOIUrl":"https://doi.org/10.1109/ICCD.2016.7753320","url":null,"abstract":"Power consumption is a major obstacle in designing stringently resource-constrained wearables. Several system-level design considerations contribute to the energy consumption of these systems and must be taken into account while designing the system. We propose a power-performance optimization framework, namely CyHOP (Cyclic and Holistic Optimization framework), for connected wearable motion sensors. While existing work focuses solely on one design parameter, our approach globally trades off the performance of activity recognition and power consumption. CyHOP is capable of optimally adjusting the system to fulfill specific application needs. Using a smoothing technique, the initial multi-variate non-convex optimization problem is reduced to a convex problem and solved using our devised derivative-free optimization approach, namely, cyclic coordinate search. Our model performs a linear search by cycling through the system variables on each iteration until it converges to the global optimum. Using real-world data collected with wearable motion sensors during activity monitoring, we validate our approach with various performance thresholds ranging from 40% to 80%.","PeriodicalId":297899,"journal":{"name":"2016 IEEE 34th International Conference on Computer Design (ICCD)","volume":"28 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133008025","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"AIBA: An Automated Intra-cycle Behavioral Analysis for SystemC-based design exploration","authors":"Mehran Goli, Jannis Stoppe, R. Drechsler","doi":"10.1109/ICCD.2016.7753303","DOIUrl":"https://doi.org/10.1109/ICCD.2016.7753303","url":null,"abstract":"In order to overcome the ever increasing complexity of digital circuits, system design at the Electronic System Level (ESL) has become an area of active research. SystemC provides designers with a readily-available ESL framework, allowing them to design mixed hardware/software systems using a standardized C++ library. The analysis of the resulting designs is crucial to e.g. apply additional validation steps or assist designers during the development process. Existing approaches focus on the extraction of static information, providing designers with models that describe the structure of their system but not its behavior. In this paper, we introduce the Automated Intra-cycle Behavioral Analysis tool, AIBA. AIBA utilizes the GNU debugger to execute a two-step analysis that retrieves behavioral and architectural information of ESL designs. The proposed method is completely non-intrusive, allowing both SystemC designs and the standard tool flow to be used without any modification. Case studies confirm the benefits of the approach.","PeriodicalId":297899,"journal":{"name":"2016 IEEE 34th International Conference on Computer Design (ICCD)","volume":"28 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122007863","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Quantifying the difference in resource demand among classic and modern NoC workloads","authors":"Amirhossein Mirhosseini, Mohammad Sadrosadati, Maryam Zare, H. Sarbazi-Azad","doi":"10.1109/ICCD.2016.7753314","DOIUrl":"https://doi.org/10.1109/ICCD.2016.7753314","url":null,"abstract":"This paper quantifies the difference in resource demand between modern and classic NoC workloads. In the paper, we show that modern workloads are able to better utilize higher numbers of VCs and smaller C factors in order to attain performance and energy efficiency. This is because of the high throughput and possible local congestions in their traffic pattern. As a result, such workloads are more suitable for concurrency and redundancy energy reduction techniques where the voltage and frequency are reduced simultaneously and the increased power budget is used for introducing additional resources to the network in order to improve the performance.","PeriodicalId":297899,"journal":{"name":"2016 IEEE 34th International Conference on Computer Design (ICCD)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122042643","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Understanding and alleviating intra-die and intra-DIMM parameter variation in the memory system","authors":"Meysam Taassori, Ali Shafiee, R. Balasubramonian","doi":"10.1109/ICCD.2016.7753283","DOIUrl":"https://doi.org/10.1109/ICCD.2016.7753283","url":null,"abstract":"Continued process scaling must overcome several manufacturing challenges. At the same time, industry is exploring many new memory technologies that require new manufacturing processes. In such challenging fabrication regimes, parameter variation (PV) and yield will be important problems. While many recent bodies of work have targeted PV in processors, few have targeted PV in the memory system. Mitigation techniques have either focused on refresh, or have focused on inter-die variation. In this work, with empirical measurements, we first show that PV and specifically intra-die PV is indeed a real phenomenon in modern DRAM chips. We show that this intra-die PV can impact timing parameters for different banks within a DRAM chip. In response to growing PV, memory timing parameters will likely be set very conservatively to accommodate the worst case. To overcome these worst-case limitations, we propose the design of a reconfigurable memory module that detects PV in the field and organizes the memory system into fast/slow regions. This requires changes to the memory controller and to buffer chips on DIMMs. Further, OS migration policies can move frequently accessed pages to the fast regions. This overall approach not only improves performance and energy, it also provides a configurable platform for systems that can tolerate errors or approximation. The proposed system yields an average performance improvement of 12.6% in DRAM systems, and 25.5% in NVM systems.","PeriodicalId":297899,"journal":{"name":"2016 IEEE 34th International Conference on Computer Design (ICCD)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130403498","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Speculative path power estimation using trace-driven simulations during high-level design phase","authors":"Saumya Chandra, R. Jayaseelan, Ravi Bhargava","doi":"10.1109/ICCD.2016.7753350","DOIUrl":"https://doi.org/10.1109/ICCD.2016.7753350","url":null,"abstract":"Today power is an important design metric and the ongoing goal of microprocessor designers is to maximize performance within specified power targets. The key to achieving this goal is the ability to accurately estimate power and performance design points of future products during the high-level micro-architectural design phase (HLD). These estimates are heavily used for feature analysis and product feasibility studies. Most performance and power simulators across the industry use the trace-driven simulation model (TDM) as opposed to an execution driven model (EDM). This is because, in general, trace-driven models: (i) have faster turnaround time; (ii) require significantly lower resources in terms of disk space, CPU time and memory footprint; and (iii) are more robust, portable and well understood. However, TDM simulations lack the ability to accurately capture the flow of speculative path (or wrong path) execution following a branch mispredict in an out-of-order processor pipeline. This leads to inaccuracies in power and performance estimates. On the other hand, in the EDM method, input is an executable and the model can fetch and execute instructions down the speculative path on a branch mispredict. As such it enables us to accurately account for the impact of the speculative path activity. However, it is slower, prone to failures, and has much higher development and validation effort. In this paper we compare and analyze performance and power estimates from TDM and EDM simulations for the same workload phases. We observe that the impact of wrong path on power estimates is significantly higher than on the performance estimates. Using results from our analysis, we develop a methodology to account for power consumption during wrong path execution in TDM simulations. We show that the proposed methodology can provide power estimates approaching EDM-based accuracy while not sacrificing the speed and flexibility of the trace-driven models.","PeriodicalId":297899,"journal":{"name":"2016 IEEE 34th International Conference on Computer Design (ICCD)","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130033364","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Refresh-aware loop scheduling for high performance low power volatile STT-RAM","authors":"Keni Qiu, Junpeng Luo, Zhiyao Gong, Wei-gong Zhang, Jing Wang, Yuanchao Xu, Tao Li, C. Xue","doi":"10.1109/ICCD.2016.7753282","DOIUrl":"https://doi.org/10.1109/ICCD.2016.7753282","url":null,"abstract":"The highlighted advantages of low leakage power, high storage density and immunity to electromagnetic radiation make STT-RAM a promising candidate to build cache, SPM or main memory in embedded systems. However, write operations on STT-RAM have considerably longer latency and higher energy consumption than conventional SRAM. To solve this problem, researchers have proposed to relax STT-RAM's non-volatility and to have it work in a fast and low power mode. Under this volatile mode, refresh operations are needed to guarantee data correctness if the data's lifespan is longer than the retention time. It is observed that this refresh overhead is significant for data in stencil loops with the characteristic of constant read and write dependencies. This paper proposes a loop scheduling technique which can traverse loops in a new direction such that data lifespan can be greatly shortened. Therefore, overall refresh overhead can be efficiently mitigated so as to improve performance and reduce power consumption. The experimental results indicate that access latency and dynamic energy can be improved by 21.4~96.0% and 22.0~95.5% respectively by the proposed scheduling scheme.","PeriodicalId":297899,"journal":{"name":"2016 IEEE 34th International Conference on Computer Design (ICCD)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123933871","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Process variations-aware resistive associative processor design","authors":"Hasan Erdem Yantır, M. Fouda, A. Eltawil, F. Kurdahi","doi":"10.1109/ICCD.2016.7753260","DOIUrl":"https://doi.org/10.1109/ICCD.2016.7753260","url":null,"abstract":"Recent breakthroughs in memristive devices have demonstrated the potential of using resistive content addressable memories for associative processing. These architectures enable ultra-high density integrated circuits along with low-power computation. However, the reliability of memristive elements is limiting the widespread adoption of these architectures. In this study, we address the reliability issues that arise in high density, resistive associative processor architectures. We propose methods to design process variation immune resistive content addressable memories and minimize the error probabilities. According to SPICE-based circuit simulations, the reliability of the circuit increases significantly and thus positively influences the accuracy of arithmetic operations as well.","PeriodicalId":297899,"journal":{"name":"2016 IEEE 34th International Conference on Computer Design (ICCD)","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124269635","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Energy aware routing of multi-level Network-on-Chip traffic","authors":"Vasil Pano, I. Yilmaz, A. More, B. Taskin","doi":"10.1109/ICCD.2016.7753330","DOIUrl":"https://doi.org/10.1109/ICCD.2016.7753330","url":null,"abstract":"The emergence of Network-on-Chip (NoC) as a communication paradigm for Multi-Processor System-on-Chips (MPSoCs) significantly exacerbates the need to provide a methodology that optimizes the energy consumption of the overall system. This is especially important when factoring in current Network-on-Chip advances which have multiple communication media such as on-chip wireless or nano-photonics links, hybrid with traditional wired links. All of these media have different energy profiles, and if not taken into consideration the system will incur a higher power consumption throughout the runtime of the application. In this work, the case for EDP (energy-delay product) optimization between different levels of a multi-level Network-on-Chip is presented. Using a dynamic, energy aware algorithm, the EDP improvement is compared to a multi-level Network-on-Chip using a statically optimized routing. The proposed routing algorithm handles the different types of energy-delay profiles of multiple links. The end product is a methodology that lowers the overall energy consumption by optimizing the energy profile of the Network-on-Chip while also minimizing the network delay.","PeriodicalId":297899,"journal":{"name":"2016 IEEE 34th International Conference on Computer Design (ICCD)","volume":"24 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115029128","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"ONAC: Optimal number of active cores detector for energy efficient GPU computing","authors":"Xian Zhu, Mihir Awatramani, D. Rover, Joseph Zambreno","doi":"10.1109/ICCD.2016.7753335","DOIUrl":"https://doi.org/10.1109/ICCD.2016.7753335","url":null,"abstract":"Graphics Processing Units (GPUs) have become a prevalent platform for high throughput general purpose computing. The peak computational throughput of GPUs has been steadily increasing with each technology node by scaling the number of cores on the chip. Although this vastly improves the performance of several compute-intensive applications, our experiments show that some applications can achieve peak performance without utilizing all cores on the chip. We refer to the number of cores at which performance of an application saturates as the optimal number of active cores (Nopt). We propose executing the application on Nopt cores, and power-gating the unused cores to reduce static power consumption. Towards this target, we present ONAC (Optimal Number of Active Cores detector), a runtime technique to detect Nopt. ONAC uses a novel estimation model, which significantly reduces the number of hardware samples taken to detect the optimal core count, compared to a sequential detection technique (Seq-Det). We implement ONAC and Seq-Det in a cycle-level GPU performance simulator and analyze their effect on performance, power and energy. Our evaluation shows that ONAC and Seq-Det can reduce energy consumption by 20% and 10% on average for memory-intensive applications, without sacrificing more than 2% performance. The higher energy savings for ONAC come from reducing the detection time by 45% as compared to Seq-Det.","PeriodicalId":297899,"journal":{"name":"2016 IEEE 34th International Conference on Computer Design (ICCD)","volume":"10 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114386236","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}