{"title":"Improving processor performance by simplifying and bypassing trivial computations","authors":"J. Yi, D. Lilja","doi":"10.1109/ICCD.2002.1106814","DOIUrl":"https://doi.org/10.1109/ICCD.2002.1106814","url":null,"abstract":"During the course of a program's execution, a processor performs many trivial computations; that is, computations that can be simplified or where the result is zero, one, or equal to one of the input operands. This paper shows that, despite compiling a program with aggressive optimizations (-O3), approximately 30% of all arithmetic instructions, which account for 12% of all dynamic instructions, are trivial computations. The amount of trivial computation is not heavily dependent on the program's specific input values. Our results show that eliminating trivial computations dynamically at run-time yields an average speedup of 8% for a typical processor. Even for a very aggressive processor (i.e. one with no functional unit constraints), the average speedup is still 6%. It also is important to note that the area cost (i.e. hardware) required to dynamically detect and eliminate these trivial computations is very low, consisting of only a few comparators and multiplexers.","PeriodicalId":164768,"journal":{"name":"Proceedings. IEEE International Conference on Computer Design: VLSI in Computers and Processors","volume":"42 6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129188094","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A low energy set-associative I-Cache with extended BTB","authors":"Koji Inoue, V. Moshnyaga, K. Murakami","doi":"10.1109/ICCD.2002.1106768","DOIUrl":"https://doi.org/10.1109/ICCD.2002.1106768","url":null,"abstract":"This paper proposes a low-energy instruction-cache architecture, called history-based tag-comparison (HBTC) cache. The HBTC cache attempts to re-use tag-comparison results for avoiding unnecessary way activation in set-associative caches. The cache records tag-comparison results in an extended BTB, and re-uses them for directly selecting only the hit-way which includes the target instruction. In our simulation, it is observed that the HBTC cache can achieve 62% of energy reduction, with less than 1% performance degradation, compared with a conventional cache.","PeriodicalId":164768,"journal":{"name":"Proceedings. IEEE International Conference on Computer Design: VLSI in Computers and Processors","volume":"4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121193261","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Low-power, high-speed CMOS VLSI design","authors":"T. Kuroda","doi":"10.1109/ICCD.2002.1106787","DOIUrl":"https://doi.org/10.1109/ICCD.2002.1106787","url":null,"abstract":"Ubiquitous computing is a next generation information technology where computers and communications will be scaled further, merged together, and materialized in consumer applications. Computers will be invisible behind broadband networks as servers, while terminals will come closer to us as wearable/implantable devices, more friendly devices with sophisticated human-computer interactions. IC chips will be implanted everywhere so that things can think and talk for distributed information processing. Key technologies here are low power, low cost, and good interfaces, especially for wireless data communications. Low-power, high-speed CMOS circuit techniques are presented in this paper, including low-voltage design with variable/multiple V_DD/V_TH control, embedded memory technology for reducing capacitance, and low-switching activity design.","PeriodicalId":164768,"journal":{"name":"Proceedings. IEEE International Conference on Computer Design: VLSI in Computers and Processors","volume":"19 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128982781","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Subword sorting with versatile permutation instructions","authors":"Z. Shi, R. Lee","doi":"10.1109/ICCD.2002.1106776","DOIUrl":"https://doi.org/10.1109/ICCD.2002.1106776","url":null,"abstract":"Subword parallelism has succeeded in accelerating many multimedia applications. Subword permutation instructions have been proposed to efficiently rearrange subwords in or among registers. Bit-level permutation instructions have also been proposed recently for their importance in cryptography. However, important algorithms, especially those with many conditional control dependencies such as sorting, have not exploited the advantage of subword parallel instructions. In this paper, we show how one of the bit permutation instructions, GRP, can be used for fast sorting. In the process, we demonstrate the versatility of this permutation instruction for uses other than bit permutations. This versatility is important in considering the addition of a new instruction to a general-purpose processor. The results show that our sorting methods have a significant speedup even when compared with the fastest sorting algorithms. We also discuss the hardware implementation of the GRP instruction and compare its latency to a typical processor's cycle time.","PeriodicalId":164768,"journal":{"name":"Proceedings. IEEE International Conference on Computer Design: VLSI in Computers and Processors","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129732400","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Locating tiny sensors in time and space: a case study","authors":"Lewis Girod, Vladimir Bychkovskiy, J. Elson, D. Estrin","doi":"10.1109/ICCD.2002.1106773","DOIUrl":"https://doi.org/10.1109/ICCD.2002.1106773","url":null,"abstract":"As the cost of embedded sensors and actuators drops, new applications will arise that exploit high density networks of small devices capable of a variety of sensing tasks. Although individual devices may have limited functionality, the true value of the system comes from the emergent behavior that arises when data from many places in the system is combined. This type of data fusion has a number of requirements, but two of the most important are: 1) synchronized time, precise enough to resolve movement in the sensed phenomenon (e.g., sound); and 2) known geographic locations, on a similar scale to the sensors' size and deployment density. However, the installation cost of a localization system with sufficient granularity is considerable, because of the large amount of effort required to deploy such a system and make all the measurements required to tune it. In this paper, we describe a system based on COTS components that incorporates our novel time synchronization and acoustic ranging techniques. The result is a low-cost, readily available platform for distributed, coherent signal processing.","PeriodicalId":164768,"journal":{"name":"Proceedings. IEEE International Conference on Computer Design: VLSI in Computers and Processors","volume":"5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128087691","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}