2008 IEEE International Conference on Computer Design最新文献_第10页

A simple latency tolerant processor 一个简单的延迟容忍处理器

2008 IEEE International Conference on Computer Design Pub Date : 2008-10-01 DOI: 10.1109/ICCD.2008.4751889

Satyanarayana Nekkalapu, Haitham Akkary, K. Jothi, Renjith Retnamma, Xiaoyu Song

{"title":"A simple latency tolerant processor","authors":"Satyanarayana Nekkalapu, Haitham Akkary, K. Jothi, Renjith Retnamma, Xiaoyu Song","doi":"10.1109/ICCD.2008.4751889","DOIUrl":"https://doi.org/10.1109/ICCD.2008.4751889","url":null,"abstract":"The advent of multi-core processors and the emergence of new parallel applications that take advantage of such processors pose difficult challenges to designers. With relatively constant die sizes, limited on chip cache, and scarce pin bandwidth, more cores on chip reduces the amount of available cache and bus bandwidth per core, therefore exacerbating the memory wall problem. How can a designer build a processor that provides a core with good single-thread performance in the presence of long latency cache misses, while enabling as many of these cores to be placed on the same die for high throughput. Conventional latency tolerant architectures that use out-of-order superscalar execution have become too complex and power hungry for the multi-core era. Instead, we present a simple, non-blocking architecture that achieves memory latency tolerance without requiring complex out-of-order execution hardware or large, cycle-critical and power hungry structures, such as dynamic schedulers, fully associative load and store queues, and reorder buffers. The non-blocking property of this architecture provides tolerance to hundreds of cycles of cache miss latency on a simple in-order issue core, thus allowing many more such cores to be integrated on the same die than is possible with conventional out-of-order superscalar architecture.","PeriodicalId":345501,"journal":{"name":"2008 IEEE International Conference on Computer Design","volume":"2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131166118","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 21

Fine-grained parallel application specific computing for RNA secondary structure prediction on FPGA 基于FPGA的RNA二级结构预测的细粒度并行专用计算

2008 IEEE International Conference on Computer Design Pub Date : 2008-10-01 DOI: 10.1142/S0218126614500315

Qianghua Zhu, Fei Xia, Guoqing Jin

{"title":"Fine-grained parallel application specific computing for RNA secondary structure prediction on FPGA","authors":"Qianghua Zhu, Fei Xia, Guoqing Jin","doi":"10.1142/S0218126614500315","DOIUrl":"https://doi.org/10.1142/S0218126614500315","url":null,"abstract":"In the field of RNA secondary structure prediction, the Zuker algorithm is one of the most popular methods using free energy minimization. However, general-purpose computers including parallel computers or multi-core computers exhibit parallel efficiency of no more than 50% on Zuker. FPGA chips provide a new approach to accelerate the Zuker algorithm by exploiting fine-grained custom design. Zuker shows complicated data dependences, in which the dependence distance is variable, and the dependence direction is also across two dimensions. We propose a systolic array structure including one master PE and multiple slave PEs for fine grain hardware implementation on FPGA. We exploit data reuse schemes to reduce the need to load energy matrices from external memory. We also propose several methods to reduce energy table parameter size by 85%. To our knowledge, our implementation with 16 PEs is the only FPGA accelerator implementing the complete Zuker algorithm. The experimental results show a factor of 14 speedup over the ViennaRNA-1.6.5 software for 2981-residue RNA sequence running on a PC platform with Pentium 4 2.6 GHz CPU.","PeriodicalId":345501,"journal":{"name":"2008 IEEE International Conference on Computer Design","volume":"179 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130234332","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 11

Run-time Active Leakage Reduction by power gating and reverse body biasing: An eNERGY vIEW 运行时主动泄漏减少功率门控和反向体偏置:一个能源的观点

2008 IEEE International Conference on Computer Design Pub Date : 2008-10-01 DOI: 10.1109/ICCD.2008.4751925

Hao Xu, R. Vemuri, W. Jone

引用次数: 21

Near-optimal oblivious routing on three-dimensional mesh networks 三维网状网络的近最优遗忘路由

2008 IEEE International Conference on Computer Design Pub Date : 2008-10-01 DOI: 10.1109/ICCD.2008.4751852

R. Ramanujam, Bill Lin

{"title":"Near-optimal oblivious routing on three-dimensional mesh networks","authors":"R. Ramanujam, Bill Lin","doi":"10.1109/ICCD.2008.4751852","DOIUrl":"https://doi.org/10.1109/ICCD.2008.4751852","url":null,"abstract":"The increasing viability of three dimensional (3D) silicon integration technology has opened new opportunities for chip architecture innovations. One direction is in the extension of two-dimensional (2D) mesh-based tiled chip-multiprocessor architectures into three dimensions. In this paper, we focus on efficient routing algorithms for such 3D mesh networks. As in the case of 2D mesh networks, throughput and latency are important design metrics for routing algorithms. Existing routing algorithms suffer from either poor worst-case throughput (DOR , ROMM) or poor latency (VAL). Although the minimal routing algorithm O1TURN proposed in already achieves near-optimal worst-case throughput for the 2D case, the optimality result does not extend to higher dimensions. For 3D and higher dimensional meshes, the worst-case throughput of O1TURN degrades tremendously. The main contribution of this paper is the design of a new oblivious routing algorithm for 3D mesh networks called randomized partially-minimal (RPM) routing. RPM provably achieves optimal worst-case throughput for 3D meshes when the network radix k is even and within a factor of 1/k2 of optimal worst-case throughput when k is odd. RPM also outperforms VAL, DOR, ROMM, and O1TURN in average-case throughput by 33.3%, 111%, 47%, and 30%, respectively when averaged over one million random traffic patterns on an 8 times 8 times 8 topology. Finally, whereas VAL achieves optimal worst-case throughput at a penalty factor of 2 in average latency over DOR, RPM achieves (near) optimal worst-case throughput with a much smaller factor of 1.33. In practice, the average latency of RPM is expected to be closer to minimal routing because 3D mesh networks are not expected to be symmetric in 3D chip designs. The number of available device layers is expected to be much less than the number of processor tiles that can be placed along an edge of a device layer. For practical asymmetric 3D mesh configurations, the average latency of RPM reduces to just a factor of 1.11 of DOR.","PeriodicalId":345501,"journal":{"name":"2008 IEEE International Conference on Computer Design","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130989546","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 28

Exploiting spare resources of in-order SMT processors executing hard real-time threads 利用执行硬实时线程的有序SMT处理器的空闲资源

2008 IEEE International Conference on Computer Design Pub Date : 2008-10-01 DOI: 10.1109/ICCD.2008.4751887

Jörg Mische, S. Uhrig, Florian Kluge, T. Ungerer

引用次数: 20

Frequency and voltage planning for multi-core processors under thermal constraints 热约束下多核处理器的频率和电压规划

2008 IEEE International Conference on Computer Design Pub Date : 2008-10-01 DOI: 10.1109/ICCD.2008.4751902

M. Kadin, S. Reda

{"title":"Frequency and voltage planning for multi-core processors under thermal constraints","authors":"M. Kadin, S. Reda","doi":"10.1109/ICCD.2008.4751902","DOIUrl":"https://doi.org/10.1109/ICCD.2008.4751902","url":null,"abstract":"Clock frequency and transistor density increases have resulted in elevated chip temperatures. In order to meet temperature constraints while still exploiting the performance opportunities enabled by continued scaling, chip designers have migrated towards multi-core architectures. Multi-core architectures use multiple cores running at moderate clock frequencies to run several threads concurrently, which increases overall system throughput. In this work, we propose novel methods to find the optimal operating parameters, i.e., frequency and voltage, that maximize a multi-core system throughput under thermal constraints. By adjusting core clock frequencies and voltages, on-chip power dissipation can be spatially and temporally distributed to maximize the chippsilas physical performance during runtime. We propose a simple, yet efficient model that accurately characterize the effects that changes in clock frequency and voltage have on on-chip temperatures. Using the model, we find the optimal operating conditions for the following scenarios: (1) standard processor performance, where various cores operate using identical operating parameters, (2) optimal processor performance where each core can have its own frequency and voltage, and (3) optimal processor performance with thread priorities, where each core runs a thread of varied importance. We run several experiments across six different technology nodes to validate the work, assuring that our models and methods are accurate. Our methods demonstrate the total physical performance of a multi-core system can be increased by up to 33.4% without violating the maximum temperature constraints.","PeriodicalId":345501,"journal":{"name":"2008 IEEE International Conference on Computer Design","volume":"10 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133541451","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 21

Variation-aware thermal characterization and management of multi-core architectures 多核架构的变化感知热特性和管理

2008 IEEE International Conference on Computer Design Pub Date : 2008-10-01 DOI: 10.1109/ICCD.2008.4751874

E. Kursun, Chen-Yong Cher

引用次数: 35

Fast arbiters for on-chip network switches 片上网络交换机的快速仲裁器

2008 IEEE International Conference on Computer Design Pub Date : 2008-10-01 DOI: 10.1109/ICCD.2008.4751932

G. Dimitrakopoulos, N. Chrysos, C. Galanopoulos

引用次数: 47

CrashTest: A fast high-fidelity FPGA-based resiliency analysis framework CrashTest:基于fpga的快速高保真弹性分析框架

2008 IEEE International Conference on Computer Design Pub Date : 2008-10-01 DOI: 10.1109/ICCD.2008.4751886

Andrea Pellegrini, Kypros Constantinides, Dan Zhang, Shobana Sudhakar, V. Bertacco, T. Austin

{"title":"CrashTest: A fast high-fidelity FPGA-based resiliency analysis framework","authors":"Andrea Pellegrini, Kypros Constantinides, Dan Zhang, Shobana Sudhakar, V. Bertacco, T. Austin","doi":"10.1109/ICCD.2008.4751886","DOIUrl":"https://doi.org/10.1109/ICCD.2008.4751886","url":null,"abstract":"Extreme scaling practices in silicon technology are quickly leading to integrated circuit components with limited reliability, where phenomena such as early-transistor failures, gate-oxide wearout, and transient faults are becoming increasingly common. In order to overcome these issues and develop robust design techniques for large-market silicon ICs, it is necessary to rely on accurate failure analysis frameworks which enable design houses to faithfully evaluate both the impact of a wide range of potential failures and the ability of candidate reliable mechanisms to overcome them. Unfortunately, while failure rates are already growing beyond economically viable limits, no fault analysis framework is yet available that is both accurate and can operate on a complex integrated system. To address this void, we present CrashTest, a fast, high-fidelity and flexible resiliency analysis system. Given a hardware description model of the design under analysis, CrashTest is capable of orchestrating and performing a comprehensive design resiliency analysis by examining how the design reacts to faults while running software applications. Upon completion, CrashTest provides a high-fidelity analysis report obtained by performing a fault injection campaign at the gate-level netlist of the design. The fault injection and analysis process is significantly accelerated by the use of an FPGA hardware emulation platform. We conducted experimental evaluations on a range of systems, including a complex LEON-based system-on-chip, and evaluated the impact of gate-level injected faults at the system level. We found that CrashTest is 16-90x faster than an equivalent software-based framework, when analyzing designs through direct primary I/Os. As shown by our LEON-based SoC experiments, CrashTest exhibits emulation speeds that are six orders of magnitude faster than simulation.","PeriodicalId":345501,"journal":{"name":"2008 IEEE International Conference on Computer Design","volume":"122 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131725245","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 69

Acceleration of a 3D target tracking algorithm using an application specific instruction set processor 使用特定指令集处理器的3D目标跟踪算法的加速

2008 IEEE International Conference on Computer Design Pub Date : 2008-10-01 DOI: 10.1109/ICCD.2008.4751870

S. Fontaine, Sylvain Goyette, J. Langlois, G. Bois

引用次数: 9