{"title":"Improving the interface performance of synthesized structural FAME simulators through scheduling","authors":"D. Penry","doi":"10.1109/ICCD.2015.7357086","DOIUrl":"https://doi.org/10.1109/ICCD.2015.7357086","url":null,"abstract":"Computer designers rely upon near-cycle-accurate microarchitectural simulators to explore the design space of new systems. Hybrid simulators which offload simulation work onto FPGAs (also known as FAME simulators) can overcome the speed limitations of software-only simulators. However such simulators must be automatically synthesized or the time to design them becomes prohibitive. Previous work has shown that synthesized simulators should use a latency-insensitive design style in the hardware and a concurrent interface with the software. We show that the performance of the interface in such a simulator can be improved significantly by scheduling all communication between hardware and software. Scheduling reduces the amount of hardware/software communication and reduces software overhead. Scheduling is made possible by exploiting the properties of the latency-insensitive design technique recommended in previous work. We observe speedups of up to 1.54 versus the former interface for a multi-core simulator.","PeriodicalId":129506,"journal":{"name":"2015 33rd IEEE International Conference on Computer Design (ICCD)","volume":"136 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-10-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133533744","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Runtime multi-optimizations for energy efficient on-chip interconnections","authors":"Yuan He, Masaaki Kondo, Takashi Nakada, Hiroshi Sasaki, Shinobu Miwa, Hiroshi Nakamura","doi":"10.1109/ICCD.2015.7357147","DOIUrl":"https://doi.org/10.1109/ICCD.2015.7357147","url":null,"abstract":"On-chip interconnection (or NoC) is a major performance and power contributor to modern and future multicore processors. So far, many optimization techniques have been developed to improve its bandwidth, latency and power consumption. But it is not clear how energy efficiency is affected since an optimization technique normally comes with overheads. This paper thus attempts to address when and how such optimization techniques should be applied and tuned to help achieve better energy efficiency. We first model the performance and energy impacts of representative NoC optimization techniques. These models help us more easily understand the consequences when applying these optimization techniques and their combinations under different circumstances. Moreover, based on such modeling, we propose and implement an adaptive control over these NoC optimization techniques to improve both performance and energy efficiency of the network. Our results show that this proposal can achieve an average improvement of 26% and 57% on network performance and energy delay product, respectively.","PeriodicalId":129506,"journal":{"name":"2015 33rd IEEE International Conference on Computer Design (ICCD)","volume":"135 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-10-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132351555","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"FDRAM: DRAM architecture flexible in successive row and column accesses","authors":"Jeongjae Yu, Wooyoung Jang","doi":"10.1109/ICCD.2015.7357146","DOIUrl":"https://doi.org/10.1109/ICCD.2015.7357146","url":null,"abstract":"Depending on the screen orientations of mobile devices, image and graphic applications access dynamic random access memory (DRAM) data in a complicated manner. Since conventional DRAMs cannot provide successive data in the same column, however, they achieve poor performance and high power consumption in any screen orientation. This paper presents a novel DRAM efficient for both portrait and landscape screen orientations of mobile systems. We develop a simple, yet effective memory cell that is activated not only by a row activation command, but also by a new column activation command. Next, our DRAM architecture makes all the memory cells either on the same row, or on the same column successively accessed. The proposed DRAM commands and states are completely compatible with the conventional DRAM. Experimental results show the proposed DRAM is 2.6% more expensive, but achieves 5.8% higher memory utilization, 14.8% shorter memory latency, and 4.4% less memory power consumption than the conventional DRAM on average.","PeriodicalId":129506,"journal":{"name":"2015 33rd IEEE International Conference on Computer Design (ICCD)","volume":"97 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-10-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115228815","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Exploring multiple sleep modes in on/off based energy efficient HPC networks","authors":"Karthikeyan P. Saravanan, P. Carpenter, Alex Ramírez","doi":"10.1109/ICCD.2015.7357084","DOIUrl":"https://doi.org/10.1109/ICCD.2015.7357084","url":null,"abstract":"Energy efficiency is one of the key challenges in high-performance computing (HPC). The current target of 1 ExaFlop in 20 MW requires a ten-fold improvement in energy efficiency, which is only possible through significant improvements in the energy efficiency throughout the system. Interconnects are particularly inefficient, since their links are always on, consuming full power in order to provide low latency, even though the average interconnect utilization is low. To address the above, the Ethernet standards committee in charge of 40/100/400Gb Ethernet has opted to include protocols that define low power modes, specifically Fast-Wake, alongside the older Deep-Sleep, to make interconnect links energy proportional. With these standards ratified as recently as March 2014, it is unclear how these low power modes can be used in HPC. While energy efficiency is critical, techniques with excessive performance overheads are unlikely to be adopted in HPC. To this end, this paper performs the first detailed analysis of Fast-Wake mode for link energy savings in the context of HPC. Our results show that a combination of Fast-Wake and Deep-Sleep can reduce link energy by up to 70% with less than 1% performance overheads. However, we show how the parameters of these low power modes must be carefully configured to obtain the right trade-offs in energy and performance. We believe that our analysis could benefit interconnect vendors looking to use these low power modes for deployment in HPC.","PeriodicalId":129506,"journal":{"name":"2015 33rd IEEE International Conference on Computer Design (ICCD)","volume":"53 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-10-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114557941","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Pool directory: Efficient coherence tracking with dynamic directory allocation in many-core systems","authors":"Sudhanshu Shukla, Mainak Chaudhuri","doi":"10.1109/ICCD.2015.7357165","DOIUrl":"https://doi.org/10.1109/ICCD.2015.7357165","url":null,"abstract":"The coherence directory in a chip-multiprocessor keeps track of each memory block inside the cache hierarchy and plays a significant role in offering a scalable shared memory abstraction in many-core systems. Multi-threaded applications typically require two types of directory entries, namely, limited pointer entries tracking a few sharers of a block and bitvector entries tracking a larger number of sharers for widely shared blocks. Recent proposals aiming to optimize the average number of bits per directory entry have organized the directory as either a static mix of these two types of entries or a collection of relatively short bitvector entries that can encode either a limited number of sharer pointers or a larger number of sharers hierarchically. In this paper, we present a directory organization that facilitates allocation of two different types of directory entries dynamically. Our design maintains a pool of limited pointer entries, where each entry can also double as a segment directory entry encoding the sharers in a cluster of cores. Each tag in the primary sparse directory array has a pointer that can either represent a sharer or point to an entry in the pool. When multiple segment directory entries are needed to encode all the sharers of a block, our pool management protocol guarantees that all these entries are allocated contiguously so that maintaining a pointer to the head entry is enough. Such a design offers significant flexibility in sharer encoding and allows us to independently size the sparse directory array and the pool. Detailed simulation results show that our proposal incorporated in a 128-core system running multi-threaded applications drawn from scientific, general-purpose, and commercial computing domains can offer, on average, 5% improvement in performance and 20% savings in interconnect traffic compared to the state-of-the-art scalable coherence directory (SCD) proposal when using a 1/16 × sparse directory.","PeriodicalId":129506,"journal":{"name":"2015 33rd IEEE International Conference on Computer Design (ICCD)","volume":"106 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-10-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"113954494","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Sequential C-code to distributed pipelined heterogeneous MPSoC synthesis for streaming applications","authors":"Jude Angelo Ambrose, Yusuke Yachide, Kapil Batra, Jorgen Peddersen, S. Parameswaran","doi":"10.1109/ICCD.2015.7357106","DOIUrl":"https://doi.org/10.1109/ICCD.2015.7357106","url":null,"abstract":"Pipelines of processors allow the execution of a sequential streaming program on multiple processors. However, partitioning sequential code for Multiprocessor Systems-on-Chips (MPSoCs), and then creating the MPSoC platform for the sequential code to execute, is a challenging problem. Parallelizing/pipelining statements within a control loop will improve the throughput of each iteration and the overall performance. Existing techniques, such as OpenMP, for parallelizing control loops are agnostic of the underlying MPSoC architecture, thus limiting the possibilities for further parallelisation. Previous techniques related to distribution of statements to MPSoCs considered homogeneous processors and were not automated. In this paper, we propose a novel automated parallelization/pipelining approach to synthesize a heterogeneous distributed pipelined MPSoC to improve the throughput of a loop (critical for streaming applications). An Integer Linear Programming (ILP)-based formulation to map statements to processor configurations is presented, in order to find the most suitable heterogeneous processor configurations for maximal throughput. Our approach complements state-of-the-art parallelization techniques, such as OpenMP, to further improve the performance of an application. A complete MPSoC platform, for the Tensilica framework, is automatically generated within minutes using our approach for the tested applications.","PeriodicalId":129506,"journal":{"name":"2015 33rd IEEE International Conference on Computer Design (ICCD)","volume":"36 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-10-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125404201","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Logic simplification by minterm complement for error tolerant application","authors":"H. Ichihara, Tomoya Inaoka, T. Iwagaki, Tomoo Inoue","doi":"10.1109/ICCD.2015.7357089","DOIUrl":"https://doi.org/10.1109/ICCD.2015.7357089","url":null,"abstract":"Error tolerant applications can tolerate specific errors, whose frequency and/or severity are within certain limits. This error tolerability is greatly instrumental in simplifying logic circuits for such applications. In this paper, we propose a logic simplification method for error tolerant applications. Owing to error tolerance, we have an opportunity to complement several minterms of a given logic function within a threshold; if appropriate minterms are selected to be complemented, the given function can be greatly simplified. To select such minterms, we focus on two transformations, expansion and reduction, of prime implicants of the given logic function, and discuss the effect of the transformations on logic simplification. The proposed algorithm utilizing such transformations can efficiently find effective minterm complements. Experimental results show that, compared with a previous method, which employs only expansion of prime implicants, the proposed algorithm can produce smaller logic circuits with reasonable computational effort.","PeriodicalId":129506,"journal":{"name":"2015 33rd IEEE International Conference on Computer Design (ICCD)","volume":"27 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-10-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125436098","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Immediate sleep: Reducing energy impact of peripheral circuits in STT-MRAM caches","authors":"Eishi Arima, H. Noguchi, Takashi Nakada, Shinobu Miwa, S. Takeda, S. Fujita, Hiroshi Nakamura","doi":"10.1109/ICCD.2015.7357096","DOIUrl":"https://doi.org/10.1109/ICCD.2015.7357096","url":null,"abstract":"Implementing last level caches (LLCs) with STT-MRAM is a promising approach for designing energy efficient microprocessors due to the high density and low leakage power of its memory cells. However, peripheral circuits of an STT-MRAM cache still suffer from leakage power because large and leaky transistors are required to drive the large write current to the STT-MRAM element. To overcome this problem, we propose a new power management scheme called Immediate Sleep (IS). IS immediately turns off a subarray of an STT-MRAM cache if the next access is predicted to be not critical in performance. Thus, IS can effectively reduce leakage energy with little impact on performance. Our experimental results show that our technique can reduce the leakage energy of an STT-MRAM LLC by 32% compared to an STT-MRAM LLC with the conventional scheme at the same performance.","PeriodicalId":129506,"journal":{"name":"2015 33rd IEEE International Conference on Computer Design (ICCD)","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-10-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129688283","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An orchestrated approach to efficiently manage resources in heterogeneous system architectures","authors":"C. Bolchini, Gianluca Durelli, A. Miele, G. Pallotta, M. Santambrogio","doi":"10.1109/ICCD.2015.7357104","DOIUrl":"https://doi.org/10.1109/ICCD.2015.7357104","url":null,"abstract":"Nowadays, we are witnessing trends in technology, fabrication processes and computing architectures that lead to the design and development of processing systems constituted by a relevant number of independent, heterogeneous execution resources. The aim is to achieve high performance while leveraging on other aspects, such as energy consumption. Indeed, heterogeneity comes at the cost of greater design and management complexity. To reach an optimal solution, system architects need to take into account the efficiency of systems' units, i.e., general purpose processors possibly with one or more kinds of accelerators (e.g., GPUs or FPGAs), as well as the workload. This often leads to inefficiency in the exploitation of such resources, and therefore in performance/energy. Within this context, we propose a runtime resource manager able to observe the system execution and to dynamically optimise its behaviour with respect to one or more identified functional parameters, according to the architectural characteristics, and the users' and the applications' needs. Such an adaptation characteristic is intrinsically embedded in the device as a software layer, called Orchestrator, able to adapt the runtime resource management according to the target objectives and to the inputs from the external environment.","PeriodicalId":129506,"journal":{"name":"2015 33rd IEEE International Conference on Computer Design (ICCD)","volume":"29 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-10-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129774046","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Reactive clocks with variability-tracking jitter","authors":"J. Cortadella, L. Lavagno, P. López, Marc Lupon, A. Moreno-Conde, Antoni Roca, S. Sapatnekar","doi":"10.1109/ICCD.2015.7357159","DOIUrl":"https://doi.org/10.1109/ICCD.2015.7357159","url":null,"abstract":"The growing variability in nanoelectronic devices, due to uncertainties from the manufacturing process and environmental conditions (power supply, temperature, aging), requires increasing design guardbands, forcing circuits to work with conservative clock frequencies. Various schemes for clock generation based on ring oscillators and adaptive clocks have been proposed with the goal to mitigate the power and performance losses attributable to variability. However, there has been no systematic analysis to quantify the benefits of such schemes and no sign-off method has been proposed for timing correctness. This paper presents and analyzes a Reactive Clocking scheme with Variability-Tracking Jitter (RClk) that uses variability as an opportunity to reduce power by continuously adjusting the clock frequency to the varying environmental conditions, and thus, reduces guardband margins significantly. Power can be reduced between 20% and 40% at iso-performance and performance can be boosted by similar amounts at iso-power. Additionally, energy savings can be translated to substantial advantages in terms of reliability and thermal management. More importantly, the technology can be adopted with minimal modifications to conventional EDA flows.","PeriodicalId":129506,"journal":{"name":"2015 33rd IEEE International Conference on Computer Design (ICCD)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-10-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129461640","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}