{"title":"MFAP: Fair Allocation between fully backlogged and non-fully backlogged applications","authors":"Yan Sui, Chun Yang, Dong Tong, Xianhua Liu, Xu Cheng","doi":"10.1109/ICCD.2016.7753343","DOIUrl":"https://doi.org/10.1109/ICCD.2016.7753343","url":null,"abstract":"In this paper, we consider the problem of ensuring fairness in systems serving a mixture of fully backlogged applications, which continuously demand resources, and non-fully backlogged applications. We introduce a fairness metric, called interference fairness, the basic idea underlying which is that the interference caused by application A for another application B should be equal to that caused by B for A. To effectively and efficiently guarantee this fairness metric, we propose Mutual Fair Allocation Policy (MFAP), a simple and powerful resource sharing policy, and show how it guarantees interference fairness between any pair of applications. We also show that MFAP, unlike other viable policies, satisfies several highly desirable properties, including some from game theory, as well as common sense intuitions. As a use case, we implemented MFAP on a disk scheduling framework. The experimental results based on synthetic and real workloads show how our implementation achieved interference fairness and improved the performance of non-fully backlogged applications.","PeriodicalId":297899,"journal":{"name":"2016 IEEE 34th International Conference on Computer Design (ICCD)","volume":"27 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126607219","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"AIBA: An Automated Intra-cycle Behavioral Analysis for SystemC-based design exploration","authors":"Mehran Goli, Jannis Stoppe, R. Drechsler","doi":"10.1109/ICCD.2016.7753303","DOIUrl":"https://doi.org/10.1109/ICCD.2016.7753303","url":null,"abstract":"In order to overcome the ever increasing complexity of digital circuits, system design at the Electronic System Level (ESL) has become an area of active research. SystemC provides designers with a readily-available ESL framework, allowing them to design mixed hardware/software systems using a standardized C++ library. The analysis of the resulting designs is crucial to e.g. apply additional validation steps or assist designers during the development process. Existing approaches focus on the extraction of static information, providing designers with models that describe the structure of their system but not its behavior. In this paper, we introduce the Automated Intra-cycle Behavioral Analysis tool, AIBA. AIBA utilizes the GNU debugger to execute a two-step analysis that retrieves behavioral and architectural information of ESL designs. The proposed method is completely non-intrusive, allowing both SystemC designs and the standard tool flow to be used without any modification. Case studies confirm the benefits of the approach.","PeriodicalId":297899,"journal":{"name":"2016 IEEE 34th International Conference on Computer Design (ICCD)","volume":"28 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122007863","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Quantifying the difference in resource demand among classic and modern NoC workloads","authors":"Amirhossein Mirhosseini, Mohammad Sadrosadati, Maryam Zare, H. Sarbazi-Azad","doi":"10.1109/ICCD.2016.7753314","DOIUrl":"https://doi.org/10.1109/ICCD.2016.7753314","url":null,"abstract":"This paper quantifies the difference in resource demand between modern and classic NoC workloads. In the paper, we show that modern workloads are able to better utilize higher numbers of VCs and smaller C factors in order to attain performance and energy efficiency. This is because of the high throughput and possible local congestion in their traffic pattern. As a result, such workloads are more suitable for concurrency and redundancy energy reduction techniques where the voltage and frequency are reduced simultaneously and the increased power budget is used for introducing additional resources to the network in order to improve the performance.","PeriodicalId":297899,"journal":{"name":"2016 IEEE 34th International Conference on Computer Design (ICCD)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122042643","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Understanding and alleviating intra-die and intra-DIMM parameter variation in the memory system","authors":"Meysam Taassori, Ali Shafiee, R. Balasubramonian","doi":"10.1109/ICCD.2016.7753283","DOIUrl":"https://doi.org/10.1109/ICCD.2016.7753283","url":null,"abstract":"Continued process scaling must overcome several manufacturing challenges. At the same time, industry is exploring many new memory technologies that require new manufacturing processes. In such challenging fabrication regimes, parameter variation (PV) and yield will be important problems. While many recent bodies of work have targeted PV in processors, few have targeted PV in the memory system. Mitigation techniques have either focused on refresh, or have focused on inter-die variation. In this work, with empirical measurements, we first show that PV and specifically intra-die PV is indeed a real phenomenon in modern DRAM chips. We show that this intra-die PV can impact timing parameters for different banks within a DRAM chip. In response to growing PV, memory timing parameters will likely be set very conservatively to accommodate the worst case. To overcome these worst-case limitations, we propose the design of a reconfigurable memory module that detects PV in the field and organizes the memory system into fast/slow regions. This requires changes to the memory controller and to buffer chips on DIMMs. Further, OS migration policies can move frequently accessed pages to the fast regions. This overall approach not only improves performance and energy, it also provides a configurable platform for systems that can tolerate errors or approximation. The proposed system yields an average performance improvement of 12.6% in DRAM systems, and 25.5% in NVM systems.","PeriodicalId":297899,"journal":{"name":"2016 IEEE 34th International Conference on Computer Design (ICCD)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130403498","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Speculative path power estimation using trace-driven simulations during high-level design phase","authors":"Saumya Chandra, R. Jayaseelan, Ravi Bhargava","doi":"10.1109/ICCD.2016.7753350","DOIUrl":"https://doi.org/10.1109/ICCD.2016.7753350","url":null,"abstract":"Today, power is an important design metric and the ongoing goal of microprocessor designers is to maximize performance within specified power targets. The key to achieving this goal is the ability to accurately estimate power and performance design points of future products during the high-level micro-architectural design phase (HLD). These estimates are heavily used for feature analysis and product feasibility studies. Most performance and power simulators across the industry use the trace-driven simulation model (TDM) as opposed to an execution driven model (EDM). This is because, in general, trace-driven models: (i) have faster turnaround time; (ii) require significantly lower resources in terms of disk space, CPU time and memory footprint; and (iii) are more robust, portable and well understood. However, TDM simulations lack the ability to accurately capture the flow of speculative path (or wrong path) execution following a branch mispredict in an out-of-order processor pipeline. This leads to inaccuracies in power and performance estimates. On the other hand, in the EDM method, input is an executable and the model can fetch and execute instructions down the speculative path on a branch mispredict. As such it enables us to accurately account for the impact of the speculative path activity. However, it is slower, prone to failures, and has much higher development and validation effort. In this paper we compare and analyze performance and power estimates from TDM and EDM simulations for the same workload phases. We observe that the impact of wrong path on power estimates is significantly higher than on the performance estimates. Using results from our analysis, we develop a methodology to account for power consumption during wrong path execution in TDM simulations. We show that the proposed methodology can provide power estimates approaching EDM-based accuracy while not sacrificing the speed and flexibility of the trace-driven models.","PeriodicalId":297899,"journal":{"name":"2016 IEEE 34th International Conference on Computer Design (ICCD)","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130033364","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Refresh-aware loop scheduling for high performance low power volatile STT-RAM","authors":"Keni Qiu, Junpeng Luo, Zhiyao Gong, Wei-gong Zhang, Jing Wang, Yuanchao Xu, Tao Li, C. Xue","doi":"10.1109/ICCD.2016.7753282","DOIUrl":"https://doi.org/10.1109/ICCD.2016.7753282","url":null,"abstract":"The highlighted advantages of low leakage power, high storage density and immunity to electronic magnetic radiation make STT-RAM a promising candidate to build cache, SPM or main memory in embedded systems. However, write operations on STT-RAM have considerably longer latency and higher energy consumption than conventional SRAM. To solve this problem, researchers have proposed to relax STT-RAM's non-volatility and to have it work in a fast and low power mode. Under this volatile mode, refresh operations are needed to guarantee data correctness if their lifespan is larger than the retention time. It is observed that this refresh overhead is significant for data in stencil loops with the characteristic of constant read and write dependencies. This paper proposes a loop scheduling technique which can traverse loops in a new direction such that data lifespan can be greatly shortened. Therefore, overall refresh overhead can be efficiently mitigated so as to improve performance and reduce power consumption. The experimental results indicate that access latency and dynamic energy can be improved by 21.4~96.0% and 22.0~95.5% respectively by the proposed scheduling scheme.","PeriodicalId":297899,"journal":{"name":"2016 IEEE 34th International Conference on Computer Design (ICCD)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123933871","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Process variations-aware resistive associative processor design","authors":"Hasan Erdem Yantır, M. Fouda, A. Eltawil, F. Kurdahi","doi":"10.1109/ICCD.2016.7753260","DOIUrl":"https://doi.org/10.1109/ICCD.2016.7753260","url":null,"abstract":"Recent breakthroughs in memristive devices have demonstrated the potential of using resistive content addressable memories for associative processing. These architectures enable ultra-high density integrated circuits along with low-power computation. However, the reliability of memristive elements is limiting the widespread adoption of these architectures. In this study, we address the reliability issues that arise in high density, resistive associative processor architectures. We propose methods to design process variation immune resistive content addressable memories and minimize the error probabilities. According to SPICE-based circuit simulations, the reliability of the circuit increases significantly and thus positively influences the accuracy of arithmetic operations as well.","PeriodicalId":297899,"journal":{"name":"2016 IEEE 34th International Conference on Computer Design (ICCD)","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124269635","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Energy aware routing of multi-level Network-on-Chip traffic","authors":"Vasil Pano, I. Yilmaz, A. More, B. Taskin","doi":"10.1109/ICCD.2016.7753330","DOIUrl":"https://doi.org/10.1109/ICCD.2016.7753330","url":null,"abstract":"The emergence of Network-on-Chip (NoC) as a communication paradigm for Multi-Processor System-on-Chips (MPSoCs) significantly exacerbates the need to provide a methodology that optimizes the energy consumption of the overall system. This is especially important when factoring in current Network-on-Chip advances which have multiple communication media such as on-chip wireless or nano-photonics links, hybrid with traditional wired links. All of these media have different energy profiles, and if not taken into consideration the system will incur a higher power consumption throughout the runtime of the application. In this work, the case for EDP (energy-delay product) optimization between different levels of a multi-level Network-on-Chip is presented. Using a dynamic, energy aware algorithm, the EDP improvement is compared to a multi-level Network-on-Chip using a statically optimized routing. The proposed routing algorithm handles the different types of energy-delay profiles of multiple links. The end product is a methodology that lowers the overall energy consumption by optimizing the energy profile of the Network-on-Chip while also minimizing the network delay.","PeriodicalId":297899,"journal":{"name":"2016 IEEE 34th International Conference on Computer Design (ICCD)","volume":"24 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115029128","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"ONAC: Optimal number of active cores detector for energy efficient GPU computing","authors":"Xian Zhu, Mihir Awatramani, D. Rover, Joseph Zambreno","doi":"10.1109/ICCD.2016.7753335","DOIUrl":"https://doi.org/10.1109/ICCD.2016.7753335","url":null,"abstract":"Graphics Processing Units (GPUs) have become a prevalent platform for high throughput general purpose computing. The peak computational throughput of GPUs has been steadily increasing with each technology node by scaling the number of cores on the chip. Although this vastly improves the performance of several compute-intensive applications, our experiments show that some applications can achieve peak performance without utilizing all cores on the chip. We refer to the number of cores at which performance of an application saturates as the optimal number of active cores (Nopt). We propose executing the application on Nopt cores, and power-gating the unused cores to reduce static power consumption. Towards this target, we present ONAC (Optimal Number of Active Cores detector), a runtime technique to detect Nopt. ONAC uses a novel estimation model, which significantly reduces the number of hardware samples taken to detect the optimal core count, compared to a sequential detection technique (Seq-Det). We implement ONAC and Seq-Det in a cycle-level GPU performance simulator and analyze their effect on performance, power and energy. Our evaluation shows that ONAC and Seq-Det can reduce energy consumption by 20% and 10% on average for memory-intensive applications, without sacrificing more than 2% performance. The higher energy savings for ONAC come from reducing the detection time by 45% as compared to Seq-Det.","PeriodicalId":297899,"journal":{"name":"2016 IEEE 34th International Conference on Computer Design (ICCD)","volume":"10 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114386236","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Unveiling difficult bugs in address translation caching arrays for effective post-silicon validation","authors":"G. Papadimitriou, D. Gizopoulos, Athanasios Chatzidimitriou, Tom Kolan, A. Koyfman, Ronny Morad, V. Sokhin","doi":"10.1109/ICCD.2016.7753339","DOIUrl":"https://doi.org/10.1109/ICCD.2016.7753339","url":null,"abstract":"Post-silicon validation is one of the most important parts of the microprocessor prototype chip lifecycle. It is the last chance for debug engineers to detect defects and bugs that escaped pre-silicon verification, before the chip is released to the market. Effective solutions are required to harness the peak performance of the hardware prototype and evaluate whether the microprocessor chip is fully compliant with the instruction set and other specifications. We perform a comprehensive experimental study on a state-of-the-art microarchitecture to assess and identify the most difficult bugs in address translation caching arrays (multi-level TLBs and MMU Caches), and explain why these bugs persist across generations. We also categorize them into distinct bug scenarios. We then propose a novel methodology for generating random self-checking stimuli programs, which expose and detect such bug scenarios. Our experimental results show that the proposed method can detect difficult bugs that are likely to be missed by traditional post-silicon validation techniques.","PeriodicalId":297899,"journal":{"name":"2016 IEEE 34th International Conference on Computer Design (ICCD)","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133777274","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}