{"title":"Implicit hints: Embedding hint bits in programs without ISA changes","authors":"H. Vandierendonck, K. D. Bosschere","doi":"10.1109/ICCD.2010.5647699","DOIUrl":"https://doi.org/10.1109/ICCD.2010.5647699","url":null,"abstract":"There is a large gap in knowledge about a program between the compiler, which can afford expensive analysis, and the processor, which by nature is constrained in the types of analysis it can perform. To increase processor performance, ISAs have been extended with hint bits to communicate some of the compiler's knowledge to the processor.","PeriodicalId":182350,"journal":{"name":"2010 IEEE International Conference on Computer Design","volume":"3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-11-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122058747","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A control-theoretic energy management for fault-tolerant hard real-time systems","authors":"Ali Sharif Ahmadian, Mahdieh Hosseingholi, A. Ejlali","doi":"10.1109/ICCD.2010.5647798","DOIUrl":"https://doi.org/10.1109/ICCD.2010.5647798","url":null,"abstract":"Recently, the tradeoff between low energy consumption and high fault-tolerance has attracted much attention as a key issue in the design of real-time embedded systems. Dynamic Voltage Scaling (DVS) is one of the most effective low-energy techniques for real-time systems, and it has been observed that control-theoretic methods can improve the effectiveness of DVS-enabled systems. In this paper, we investigate reducing the energy consumption of fault-tolerant hard real-time systems using feedback control theory. Our proposed feedback-based DVS method enables the system to select the proper frequency and voltage settings to reduce energy consumption while guaranteeing hard real-time requirements in the presence of unpredictable workload fluctuations and faults. In the proposed method, the available slack time is exploited by feedback-based DVS at runtime to reduce energy consumption, and some slack time is reserved for re-execution in case of faults. Simulation results show that, compared with traditional DVS methods without fault-tolerance, our proposed approach not only significantly reduces energy consumption but also satisfies hard real-time constraints in the presence of faults. The transition overhead (both time and energy) caused by changing the system supply voltage is also taken into account in our simulation experiments.","PeriodicalId":182350,"journal":{"name":"2010 IEEE International Conference on Computer Design","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-11-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123957828","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Recent additions to the ARMv7-A architecture","authors":"D. Brash","doi":"10.1109/ICCD.2010.5647549","DOIUrl":"https://doi.org/10.1109/ICCD.2010.5647549","url":null,"abstract":"This talk will be based on recent announcements regarding new address translation and virtualization support in the ARM architecture. It will also raise awareness around the promise of new power/performance points and opportunities that are expected from the first ARM implementation, the Cortex-A15 core.","PeriodicalId":182350,"journal":{"name":"2010 IEEE International Conference on Computer Design","volume":"43 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-11-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126476389","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Package-Aware Scheduling of embedded workloads for temperature and Energy management on heterogeneous MPSoCs","authors":"Shervin Sharifi, T. Simunic","doi":"10.1109/ICCD.2010.5647628","DOIUrl":"https://doi.org/10.1109/ICCD.2010.5647628","url":null,"abstract":"In this paper, we present PASTEMP, a solution for Package-Aware Scheduling for Thermal and Energy management using Multi-Parametric programming in heterogeneous embedded multiprocessor SoCs (MPSoCs). Based on the current thermal state of the system and the current performance requirements of the workload, PASTEMP finds thermally safe and energy-efficient voltage/frequency configurations for the cores on an MPSoC. Tasks are assigned to cores depending on their performance demand and the current voltage/frequency of each core. The voltage/frequency settings of the cores are chosen through an optimization process based on an instantaneous thermal model we introduce to decouple the effect of package temperature from the temperature changes caused by the power consumption of the cores. To find the best voltage/frequency settings at runtime, we use multi-parametric programming to separate the optimization into offline and online phases. According to our experimental results, compared to similar DTM techniques, PASTEMP yields up to 23% energy savings and 26% throughput improvement and reduces deadline misses by more than half while meeting all thermal constraints.","PeriodicalId":182350,"journal":{"name":"2010 IEEE International Conference on Computer Design","volume":"79 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-11-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124282047","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An energy model for graphics processing units","authors":"Jeff Pool, A. Lastra, Montek Singh","doi":"10.1109/ICCD.2010.5647678","DOIUrl":"https://doi.org/10.1109/ICCD.2010.5647678","url":null,"abstract":"We present an energy model for a graphics processing unit (GPU) that is based on the amount and type of work performed in various parts of the unit. By designing and running directed tests on a GPU, we measure the energy consumed when performing different arithmetic and memory operations, allowing us to accurately predict the energy that any arbitrary mix of operations will take. With some knowledge of how data travels through and is transformed by the graphics pipeline, we can predict how many of each operation will occur for a given scene, leading to an estimate of the energy usage. We validate our model against different types of existing graphical applications. With an average difference of 3% from measured energy under typical workloads, our model can be used for various purposes. In this work, we explore and present two use cases: 1) predicting the energy performance of applications on a different architecture, and 2) exploring the energy efficiency of different algorithms to achieve the same graphical effect.","PeriodicalId":182350,"journal":{"name":"2010 IEEE International Conference on Computer Design","volume":"446 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-11-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123055405","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Spintronic logic gates for spintronic data using magnetic tunnel junctions","authors":"S. Patil, A. Lyle, J. Harms, D. Lilja, Jianping Wang","doi":"10.1109/ICCD.2010.5647611","DOIUrl":"https://doi.org/10.1109/ICCD.2010.5647611","url":null,"abstract":"The emerging field of spintronics is undergoing exciting developments with the recent advances in spintronic devices, such as magnetic tunnel junctions (MTJs). While MTJs make excellent memory devices, they have recently also been used to implement logic functions. The properties of MTJs differ greatly from those of electronic devices such as CMOS transistors, which makes it challenging to design circuits that can efficiently leverage their spintronic capabilities. Current approaches to achieving logic functionality with MTJs integrate CMOS and MTJ circuits, where CMOS devices implement the required intermediate read and write circuitry. The problem with this approach is that such intermediate circuitry adds area, delay, and power overheads to the logic circuit. In this paper, we present a circuit that performs logic operations using MTJs on data stored in other MTJs, without intermediate electronic circuitry. This reduces the performance overheads of the spintronic circuit while also simplifying fabrication. With this circuit, we discuss the notion of performing logic operations within a non-volatile memory device and compare it with the traditional method of computation with separate logic and memory units. We find that the MTJ-based logic unit has the potential to offer higher energy-delay efficiency than a CMOS-based logic operation on data stored in a separate memory module.","PeriodicalId":182350,"journal":{"name":"2010 IEEE International Conference on Computer Design","volume":"55 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-11-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122891529","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A study on performance benefits of core morphing in an asymmetric multicore processor","authors":"Anup Das, Rance Rodrigues, I. Koren, S. Kundu","doi":"10.1109/ICCD.2010.5647566","DOIUrl":"https://doi.org/10.1109/ICCD.2010.5647566","url":null,"abstract":"Multicore architectures are designed to provide an acceptable level of performance per unit power for the majority of applications. Consequently, we must occasionally expect applications that could have benefited from a more powerful core, in terms of either lower execution time and/or lower energy consumed. Fusing some of the resources of two (or more) cores to configure a more powerful core for such instances is a natural approach to deal with those few applications that have very high performance demands. However, a recent study has shown that fusing homogeneous cores is unlikely to benefit applications. In this paper we study the potential performance benefits of core morphing in a heterogeneous multicore processor that can be reconfigured at runtime. We consider as an example a dual-core processor with one of the two cores designed to target integer-intensive applications while the other is better suited to floating-point-intensive applications. These two cores can be fused into a single powerful core when an application that can benefit from such fusion is executing. We first discuss the design principles of the two individual cores so that the majority of the benchmarks that we consider execute in a satisfactory way. We then show that a small subset of the considered applications can greatly benefit from core morphing, even in the case where two applications that could have been executed in parallel on the two cores are run, for some percentage of time, on the single morphed core. Our results indicate that a performance gain of up to 100% is achievable at a small hardware overhead of less than 1%.","PeriodicalId":182350,"journal":{"name":"2010 IEEE International Conference on Computer Design","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-11-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126775871","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Insertion policy selection using Decision Tree Analysis","authors":"S. Khan, Daniel A. Jiménez","doi":"10.1109/ICCD.2010.5647608","DOIUrl":"https://doi.org/10.1109/ICCD.2010.5647608","url":null,"abstract":"The last-level cache (LLC) mitigates the impact of long memory access latencies in today's microarchitectures. The insertion policy in the LLC has a significant impact on cache efficiency. A fixed insertion policy can allow useless blocks to remain in the cache longer than necessary, resulting in inefficiency. We introduce insertion policy selection using Decision Tree Analysis (DTA). The technique requires minimal hardware modification over the least-recently-used (LRU) replacement policy. This policy exploits the fact that the LLC filters temporal locality: many of the lines brought into the cache are never accessed again, and even when they are reaccessed they do not experience bursts, but rather are reused when they are near the LRU position in the LRU stack. We use decision tree analysis of multi-set-dueling to choose the optimal insertion position in the LRU stack. Inserting in this position, zero-reuse lines minimize their dead time while non-zero-reuse lines remain in the cache long enough to be reused and avoid a miss. For a 1MB, 16-way set-associative last-level cache in a single-core processor, our policy uses only 2,069 additional bits over the LRU replacement policy. On average it reduces misses by 5.16% and achieves a 7.19% IPC improvement over LRU.","PeriodicalId":182350,"journal":{"name":"2010 IEEE International Conference on Computer Design","volume":"42 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-11-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126552643","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A lightweight run-time scheduler for multitasking multicore stream applications","authors":"Michael A. Baker, Karam S. Chatha","doi":"10.1109/ICCD.2010.5647732","DOIUrl":"https://doi.org/10.1109/ICCD.2010.5647732","url":null,"abstract":"Stream programming models promise dramatic improvements in developers' ability to express parallelism in their applications while enabling extremely efficient implementations on modern many-core processors. Unfortunately, the wide variation in the architectural features of available multi-core processors implies that a single compiler may be incapable of generating general solutions that can run on many target systems, or even on different configurations of the same system. In particular, offline approaches for finding optimal mappings and schedules for a stream program on a specific processor are limited by their lack of portability across different processors, and by a lack of flexibility for runtime variations in resource availability in typical multitasking environments. The paper presents a scheme that includes a lightweight compile-time sequencer and a dynamic scheduler capable of mapping stream programs onto the available cores of a multi-core processor at runtime. Unlike previous implementations, our scheme requires limited knowledge of the target architecture's resources at compile time. The offline portion of the scheme generates canonical scheduling information about the stream program. This information is utilized by the lightweight run-time scheduling algorithm to generate application mappings in linear time based on available resources, giving near-optimal throughput. Evaluations of schedules generated for twelve streaming benchmarks give an average of 96% and 93% of the theoretical optimum throughput for schedules with up to 4 and 128 cores, respectively.","PeriodicalId":182350,"journal":{"name":"2010 IEEE International Conference on Computer Design","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-11-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126590098","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A simple pipelined logarithmic multiplier","authors":"P. Bulić, Z. Babic, A. Avramović","doi":"10.1109/ICCD.2010.5647767","DOIUrl":"https://doi.org/10.1109/ICCD.2010.5647767","url":null,"abstract":"Digital signal processing algorithms often rely heavily on a large number of multiplications, which is both time and power consuming. However, there are many practical solutions to simplify multiplication, such as truncated and logarithmic multipliers. These methods consume less time and power but introduce errors. Nevertheless, they can be used in situations where a shorter time delay is more important than accuracy. In digital signal processing, these conditions are often met, especially in video compression and tracking, where integer arithmetic gives satisfactory results. This paper presents and compares different multipliers in a logarithmic number system. For the hardware implementation assessment, the multipliers are implemented on a Spartan 3 FPGA chip and compared in terms of speed, resources required for implementation, power consumption, and error rate. We also propose a simple and efficient logarithmic multiplier that can achieve arbitrary accuracy through an iterative procedure. In this way, the error correction can be done almost in parallel with the basic multiplication (in practice, through pipelining). The hardware solution involves only adders and shifters, so it consumes little area and power. For operands ranging from 8 to 16 bits, the proposed multiplier exhibits a very low relative error percentage.","PeriodicalId":182350,"journal":{"name":"2010 IEEE International Conference on Computer Design","volume":"109 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-11-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124239359","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}