2008 IEEE International Conference on Computer Design: Latest Papers

A simple latency tolerant processor
2008 IEEE International Conference on Computer Design Pub Date : 2008-10-01 DOI: 10.1109/ICCD.2008.4751889
Satyanarayana Nekkalapu, Haitham Akkary, K. Jothi, Renjith Retnamma, Xiaoyu Song
Abstract: The advent of multi-core processors and the emergence of new parallel applications that take advantage of such processors pose difficult challenges to designers. With relatively constant die sizes, limited on-chip cache, and scarce pin bandwidth, placing more cores on a chip reduces the amount of available cache and bus bandwidth per core, thereby exacerbating the memory wall problem. How can a designer build a processor that provides a core with good single-thread performance in the presence of long-latency cache misses, while enabling as many of these cores as possible to be placed on the same die for high throughput? Conventional latency tolerant architectures that use out-of-order superscalar execution have become too complex and power hungry for the multi-core era. Instead, we present a simple, non-blocking architecture that achieves memory latency tolerance without requiring complex out-of-order execution hardware or large, cycle-critical, and power-hungry structures such as dynamic schedulers, fully associative load and store queues, and reorder buffers. The non-blocking property of this architecture provides tolerance to hundreds of cycles of cache miss latency on a simple in-order issue core, thus allowing many more such cores to be integrated on the same die than is possible with conventional out-of-order superscalar architectures.
Citations: 21
Fine-grained parallel application specific computing for RNA secondary structure prediction on FPGA
2008 IEEE International Conference on Computer Design Pub Date : 2008-10-01 DOI: 10.1142/S0218126614500315
Qianghua Zhu, Fei Xia, Guoqing Jin
Abstract: In the field of RNA secondary structure prediction, the Zuker algorithm is one of the most popular methods using free energy minimization. However, general-purpose computers, including parallel computers and multi-core computers, exhibit a parallel efficiency of no more than 50% on Zuker. FPGA chips provide a new approach to accelerating the Zuker algorithm by exploiting fine-grained custom design. Zuker shows complicated data dependences, in which the dependence distance is variable and the dependence direction spans two dimensions. We propose a systolic array structure, including one master PE and multiple slave PEs, for fine-grained hardware implementation on FPGA. We exploit data reuse schemes to reduce the need to load energy matrices from external memory. We also propose several methods to reduce the energy table parameter size by 85%. To our knowledge, our implementation with 16 PEs is the only FPGA accelerator implementing the complete Zuker algorithm. The experimental results show a factor of 14 speedup over the ViennaRNA-1.6.5 software for a 2981-residue RNA sequence running on a PC platform with a Pentium 4 2.6 GHz CPU.
Citations: 11
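The two-dimensional, variable-distance data dependences mentioned in the RNA secondary structure abstract above can be illustrated with a much simpler dynamic program. The sketch below is not the Zuker algorithm: it implements the Nussinov base-pair maximization recurrence, swapped in here because it has the same triangular fill pattern without the free-energy tables. The function name `nussinov`, the `min_loop` parameter, and the set of allowed pairs are assumptions made for illustration.

```python
# Nussinov base-pair maximization (a simplified stand-in, NOT Zuker).
# Allowed pairs: Watson-Crick plus G-U wobble (an illustrative assumption).
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def nussinov(seq, min_loop=3):
    """Return the maximum number of nested base pairs in `seq`.

    `min_loop` is the minimum number of unpaired bases enclosed by a pair.
    """
    n = len(seq)
    if n == 0:
        return 0
    score = [[0] * n for _ in range(n)]
    # Fill the upper triangle diagonal by diagonal (increasing span j - i).
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            # Base i or base j is left unpaired.
            best = max(score[i + 1][j], score[i][j - 1])
            # Bases i and j pair with each other.
            if (seq[i], seq[j]) in PAIRS:
                best = max(best, score[i + 1][j - 1] + 1)
            # Bifurcation: split into two independent sub-intervals at k.
            for k in range(i + 1, j):
                best = max(best, score[i][k] + score[k + 1][j])
            score[i][j] = best
    return score[0][n - 1]
```

Each cell (i, j) reads row entries `score[i][k]` and column entries `score[k + 1][j]` for every split point k, so the dependence distance grows with the span. This is the kind of access pattern that a systolic array design has to schedule around.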
Run-time Active Leakage Reduction by power gating and reverse body biasing: An Energy View
2008 IEEE International Conference on Computer Design Pub Date : 2008-10-01 DOI: 10.1109/ICCD.2008.4751925
Hao Xu, R. Vemuri, W. Jone
Abstract: Run-time active leakage reduction (RALR) is a recent technique that aims at aggressively reducing leakage power consumption. This paper studies the feasibility of RALR from the energy perspective, for both power gating (PG) and reverse body bias (RBB) implementations. We develop two energy saving models, for PG and RBB respectively. These models can accurately estimate the circuit energy saving at any time, even when the circuit is in state transition. In PG modeling, we discover a physical phenomenon called "instant saving", which can affect the model accuracy by 30%-50%. Based on the RBB model, we derive the optimum design point of RBB for RALR. Finally, in terms of energy saving, we define four figures of merit to compare the efficacy of using PG and RBB to implement RALR.
Citations: 21
Near-optimal oblivious routing on three-dimensional mesh networks
2008 IEEE International Conference on Computer Design Pub Date : 2008-10-01 DOI: 10.1109/ICCD.2008.4751852
R. Ramanujam, Bill Lin
Abstract: The increasing viability of three-dimensional (3D) silicon integration technology has opened new opportunities for chip architecture innovations. One direction is the extension of two-dimensional (2D) mesh-based tiled chip-multiprocessor architectures into three dimensions. In this paper, we focus on efficient routing algorithms for such 3D mesh networks. As in the case of 2D mesh networks, throughput and latency are important design metrics for routing algorithms. Existing routing algorithms suffer from either poor worst-case throughput (DOR, ROMM) or poor latency (VAL). Although the previously proposed minimal routing algorithm O1TURN already achieves near-optimal worst-case throughput for the 2D case, the optimality result does not extend to higher dimensions. For 3D and higher-dimensional meshes, the worst-case throughput of O1TURN degrades tremendously. The main contribution of this paper is the design of a new oblivious routing algorithm for 3D mesh networks called randomized partially-minimal (RPM) routing. RPM provably achieves optimal worst-case throughput for 3D meshes when the network radix k is even, and is within a factor of 1/k^2 of optimal worst-case throughput when k is odd. RPM also outperforms VAL, DOR, ROMM, and O1TURN in average-case throughput by 33.3%, 111%, 47%, and 30%, respectively, when averaged over one million random traffic patterns on an 8 × 8 × 8 topology. Finally, whereas VAL achieves optimal worst-case throughput at a penalty factor of 2 in average latency over DOR, RPM achieves (near) optimal worst-case throughput with a much smaller factor of 1.33. In practice, the average latency of RPM is expected to be closer to that of minimal routing because 3D mesh networks are not expected to be symmetric in 3D chip designs. The number of available device layers is expected to be much smaller than the number of processor tiles that can be placed along an edge of a device layer. For practical asymmetric 3D mesh configurations, the average latency of RPM reduces to just a factor of 1.11 of DOR.
Citations: 28
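To make the latency trade-off in the 3D mesh routing abstract above concrete, the following sketch compares dimension-order routing (DOR) with Valiant routing (VAL) on a small k × k × k mesh. It uses hop count as a stand-in for latency and averages over all source/destination pairs (including a node sending to itself), both of which are simplifying assumptions. The function names are hypothetical, and RPM itself is not implemented because the abstract does not describe its steps.

```python
from itertools import product


def dor_path(src, dst):
    """Dimension-order route: fully correct X, then Y, then Z."""
    path = [tuple(src)]
    cur = list(src)
    for dim in range(3):
        step = 1 if dst[dim] > cur[dim] else -1
        while cur[dim] != dst[dim]:
            cur[dim] += step
            path.append(tuple(cur))
    return path


def hops(src, dst):
    """Number of links traversed by the DOR path (a minimal path in a mesh)."""
    return len(dor_path(src, dst)) - 1


def average_hops(k):
    """Return (DOR average, VAL average) hop counts on a k x k x k mesh.

    VAL is modelled as two DOR phases through an intermediate node chosen
    uniformly at random, so its average is taken over every intermediate.
    """
    nodes = list(product(range(k), repeat=3))
    n = len(nodes)
    dist = {(s, t): hops(s, t) for s in nodes for t in nodes}
    dor = sum(dist.values()) / n**2
    val = sum(
        dist[s, m] + dist[m, t] for s in nodes for m in nodes for t in nodes
    ) / n**3
    return dor, val
```

Under these assumptions the VAL average is twice the DOR average, which matches the penalty factor of 2 that the abstract attributes to VAL.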
Exploiting spare resources of in-order SMT processors executing hard real-time threads
2008 IEEE International Conference on Computer Design Pub Date : 2008-10-01 DOI: 10.1109/ICCD.2008.4751887
Jörg Mische, S. Uhrig, Florian Kluge, T. Ungerer
Abstract: We developed an SMT processor that allows a static WCET analysis of several hard real-time threads and uses the remaining resources for soft or non-real-time threads. The analysis is possible because one Dominant Meta Thread (DMT) is executed as if it were the only thread on the processor, and thus single-threaded WCET techniques can be applied. To provide more than one hard real-time thread, the execution time of the Dominant Meta Thread is distributed by time sharing, whereby the length of the time slices and periods can be adjusted at runtime. Our technique, called Dominant Time Sharing (DTS), can be used to minimize the number of control units in embedded hard real-time systems and hence reduces the overall energy consumption and material demand. In contrast to many other studies, we are able to handle multicycle memory latencies while preserving analyzability. The proposed technique can easily be extended to access other external resources such as coprocessors or reconfigurable arrays.
Citations: 20
Frequency and voltage planning for multi-core processors under thermal constraints
2008 IEEE International Conference on Computer Design Pub Date : 2008-10-01 DOI: 10.1109/ICCD.2008.4751902
M. Kadin, S. Reda
Abstract: Clock frequency and transistor density increases have resulted in elevated chip temperatures. In order to meet temperature constraints while still exploiting the performance opportunities enabled by continued scaling, chip designers have migrated towards multi-core architectures. Multi-core architectures use multiple cores running at moderate clock frequencies to run several threads concurrently, which increases overall system throughput. In this work, we propose novel methods to find the optimal operating parameters, i.e., frequency and voltage, that maximize multi-core system throughput under thermal constraints. By adjusting core clock frequencies and voltages, on-chip power dissipation can be spatially and temporally distributed to maximize the chip's physical performance during runtime. We propose a simple yet efficient model that accurately characterizes the effects that changes in clock frequency and voltage have on on-chip temperatures. Using the model, we find the optimal operating conditions for the following scenarios: (1) standard processor performance, where various cores operate using identical operating parameters, (2) optimal processor performance, where each core can have its own frequency and voltage, and (3) optimal processor performance with thread priorities, where each core runs a thread of varied importance. We run several experiments across six different technology nodes to validate the work, assuring that our models and methods are accurate. Our methods demonstrate that the total physical performance of a multi-core system can be increased by up to 33.4% without violating the maximum temperature constraints.
Citations: 21
Variation-aware thermal characterization and management of multi-core architectures
2008 IEEE International Conference on Computer Design Pub Date : 2008-10-01 DOI: 10.1109/ICCD.2008.4751874
E. Kursun, Chen-Yong Cher
Abstract: The accuracy and efficiency of dynamic power and thermal management are both affected by the increased levels of on-chip variation, mainly because dynamic thermal management schemes are oblivious to the variation characteristics of the underlying hardware. We propose a technique that utilizes the existing on-chip sensor infrastructure to improve the inherent thermal imbalances among different cores in a multi-core architecture. Thermal sensor readings are compiled to generate an on-chip variation map, which is provided to the system power/thermal management to effectively manage the existing on-chip variation. Experimental analysis based on live measurements on a special test chip shows reduced on-chip heating with no performance loss, which improves the power/thermal efficiency of the chip at no cost.
Citations: 35
Fast arbiters for on-chip network switches
2008 IEEE International Conference on Computer Design Pub Date : 2008-10-01 DOI: 10.1109/ICCD.2008.4751932
G. Dimitrakopoulos, N. Chrysos, C. Galanopoulos
Abstract: The need for efficient implementation of simple crossbar schedulers has increased in recent years due to the advent of on-chip interconnection networks that require low-latency message delivery. The core function of any crossbar scheduler is arbitration, which resolves conflicting requests for the same output. Since the delay of the arbiters directly determines the operation speed of the scheduler, the design of faster arbiters is of paramount importance. In this paper, we present a new bit-level algorithm and new circuit techniques for the design of programmable priority arbiters that offer significantly more efficient implementations compared to already-known solutions. Experimental results show that the proposed circuits are more than 15% faster than the most efficient previous implementations, which, under equal-delay comparisons, translates to 40% less energy.
Citations: 47
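The arbitration function described in the arbiter abstract above can be modelled in software. The sketch below shows the generic behaviour of a programmable priority arbiter (grant the first active request at or after a movable priority position, wrapping around) and uses it for round-robin scheduling. It is a behavioural model written for illustration, not the bit-level algorithm or circuits proposed in the paper; the function names and the integer bit-vector encoding are assumptions.

```python
def programmable_priority_arbiter(requests, priority, n):
    """Return a one-hot grant for an n-bit request vector.

    Bit `priority` has the highest priority; priority then decreases
    towards higher bit indices and wraps around to bit 0.
    """
    mask = (1 << n) - 1
    requests &= mask
    if requests == 0:
        return 0
    # Duplicate the vector so the wrap-around search becomes a linear one.
    doubled = (requests << n) | requests
    # Discard every request that sits below the priority position.
    eligible = doubled & ~((1 << priority) - 1)
    # Isolate the lowest remaining set bit (two's-complement trick).
    lowest = eligible & -eligible
    # Fold a grant found in the upper copy back into the n-bit range.
    return (lowest | (lowest >> n)) & mask


def round_robin(request_stream, n):
    """Yield grants, moving the priority to just after each winner."""
    priority = 0
    for requests in request_stream:
        grant = programmable_priority_arbiter(requests, priority, n)
        if grant:
            # For a one-hot grant at bit i, bit_length() equals i + 1.
            priority = grant.bit_length() % n
        yield grant
```

For example, with four requesters that all keep requesting, `list(round_robin([0b1111] * 5, 4))` serves bits 0, 1, 2, 3 and then bit 0 again, so no requester is starved.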
CrashTest: A fast high-fidelity FPGA-based resiliency analysis framework
2008 IEEE International Conference on Computer Design Pub Date : 2008-10-01 DOI: 10.1109/ICCD.2008.4751886
Andrea Pellegrini, Kypros Constantinides, Dan Zhang, Shobana Sudhakar, V. Bertacco, T. Austin
Abstract: Extreme scaling practices in silicon technology are quickly leading to integrated circuit components with limited reliability, where phenomena such as early transistor failures, gate-oxide wearout, and transient faults are becoming increasingly common. In order to overcome these issues and develop robust design techniques for large-market silicon ICs, it is necessary to rely on accurate failure analysis frameworks which enable design houses to faithfully evaluate both the impact of a wide range of potential failures and the ability of candidate reliable mechanisms to overcome them. Unfortunately, while failure rates are already growing beyond economically viable limits, no fault analysis framework is yet available that is both accurate and able to operate on a complex integrated system. To address this void, we present CrashTest, a fast, high-fidelity, and flexible resiliency analysis system. Given a hardware description model of the design under analysis, CrashTest is capable of orchestrating and performing a comprehensive design resiliency analysis by examining how the design reacts to faults while running software applications. Upon completion, CrashTest provides a high-fidelity analysis report obtained by performing a fault injection campaign at the gate-level netlist of the design. The fault injection and analysis process is significantly accelerated by the use of an FPGA hardware emulation platform. We conducted experimental evaluations on a range of systems, including a complex LEON-based system-on-chip, and evaluated the impact of gate-level injected faults at the system level. We found that CrashTest is 16-90x faster than an equivalent software-based framework when analyzing designs through direct primary I/Os. As shown by our LEON-based SoC experiments, CrashTest exhibits emulation speeds that are six orders of magnitude faster than simulation.
Citations: 69
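The idea of a gate-level fault injection campaign, as mentioned in the CrashTest abstract above, can be sketched in a few lines of software. The example below injects single stuck-at faults into a toy full-adder netlist and counts how many input vectors expose each fault at the outputs. Everything here is an assumption made for illustration: the netlist, the stuck-at fault model, and the function names are hypothetical, and the code is a plain software simulation rather than the FPGA-emulated flow the paper describes.

```python
from itertools import product

# Hypothetical gate-level netlist of a 1-bit full adder, in topological
# order: (output net, gate type, input nets).
NETLIST = [
    ("x1", "XOR", ("a", "b")),
    ("sum", "XOR", ("x1", "cin")),
    ("a1", "AND", ("a", "b")),
    ("a2", "AND", ("x1", "cin")),
    ("cout", "OR", ("a1", "a2")),
]
GATES = {
    "AND": lambda x, y: x & y,
    "OR": lambda x, y: x | y,
    "XOR": lambda x, y: x ^ y,
}
PRIMARY_INPUTS = ("a", "b", "cin")


def simulate(inputs, fault=None):
    """Evaluate the netlist; `fault` is (net, stuck_value) or None."""
    nets = dict(inputs)
    if fault and fault[0] in nets:
        nets[fault[0]] = fault[1]  # fault on a primary input
    for out, kind, ins in NETLIST:
        value = GATES[kind](*(nets[i] for i in ins))
        # Override the gate output if this net carries the injected fault.
        nets[out] = fault[1] if fault and fault[0] == out else value
    return nets["sum"], nets["cout"]


def fault_campaign():
    """Map each (net, stuck_value) to the number of vectors that expose it."""
    all_nets = list(PRIMARY_INPUTS) + [out for out, _, _ in NETLIST]
    vectors = [
        dict(zip(PRIMARY_INPUTS, bits)) for bits in product((0, 1), repeat=3)
    ]
    report = {}
    for net in all_nets:
        for stuck in (0, 1):
            report[(net, stuck)] = sum(
                simulate(v, (net, stuck)) != simulate(v) for v in vectors
            )
    return report
```

A fault with a low count in the report is hard to observe at the outputs. A real campaign would run application workloads instead of exhaustive input vectors, which is where hardware emulation becomes attractive.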
Acceleration of a 3D target tracking algorithm using an application specific instruction set processor
2008 IEEE International Conference on Computer Design Pub Date : 2008-10-01 DOI: 10.1109/ICCD.2008.4751870
S. Fontaine, Sylvain Goyette, J. Langlois, G. Bois
Abstract: In today's high-tech world, intelligent video surveillance is becoming a part of everyday life. In addition to minimizing the need for constant monitoring by an operator, it can automatically perform tasks such as accident detection or estimation of vehicle speed. A particularly useful algorithm for video surveillance is three-dimensional target tracking, but since it is both computationally expensive and requires the use of two cameras, it is seldom used. In this paper, we concentrate on accelerating an implementation of 3D tracking using a multiprocessor ASIP architecture based on the Tensilica Xtensa processor. Our experiments show that a speedup factor of 22 can be achieved using an extensible platform expressly optimized for this application, as opposed to using a general-purpose processor.
Citations: 9