{"title":"SIMD optimization in COINS compiler infrastructure","authors":"Mitsugu Suzuki, Nobuhisa Fujinami, Takeaki Fukuoka, Tan Watanabe, I. Nakata","doi":"10.1109/IWIA.2005.40","DOIUrl":"https://doi.org/10.1109/IWIA.2005.40","url":null,"abstract":"COINS is a compiler infrastructure that makes it easy to construct a new compiler by adding/modifying only part of the COINS of compiling/optimization features. SIMD optimization is a major advantage. We present an overview of COINS and some topics on its SIMD optimization.","PeriodicalId":103456,"journal":{"name":"Innovative Architecture for Future Generation High-Performance Processors and Systems (IWIA'05)","volume":"21 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2005-01-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125206630","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A multi-thread processor architecture based on the continuation model","authors":"T. Matsuzaki, S. Amamiya, M. Izumi, M. Amamiya","doi":"10.1109/IWIA.2005.22","DOIUrl":"https://doi.org/10.1109/IWIA.2005.22","url":null,"abstract":"We are developing the Fuce processor based on the dataflow computing model. Fuce means fusion of communication and execution. In order to execute many threads with multiple thread execution units efficiently, the Fuce processor executes multiple threads using the exclusive multi-thread execution model. The core concept of the exclusive multi-thread execution model is continuation based multi-thread execution, which is derived from dataflow computing. The Fuce processor aims to fuse the intra-processor execution and inter-processor communication. The Fuce processor unifies processing inside the processor and communication with processors outside as events, and executes the event as a thread. In this paper, we introduce the architecture of the Fuce processor and evaluate the concurrency performance of a Fuce processor which we described in VHDL. As a result, we understood that the processor has concurrency capability when there is sufficient thread level parallelism.","PeriodicalId":103456,"journal":{"name":"Innovative Architecture for Future Generation High-Performance Processors and Systems (IWIA'05)","volume":"18 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2005-01-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125933408","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Preliminary evaluations of a FPGA-based-prototype of DIMMnet-2 network interface","authors":"N. Tanabe, A. Kitamura, T. Miyashiro, Y. Miyabe, T. Izawa, Y. Hamada, H. Nakajo, H. Amano","doi":"10.1109/IWIA.2005.38","DOIUrl":"https://doi.org/10.1109/IWIA.2005.38","url":null,"abstract":"Performance improvement of interconnection networks for a PC cluster brings a bottleneck in a standard I/O bus such as PCI bus. DIMMnet is a network interface plugged into a memory slot instead of standard I/O buses. This strategy is one of the solutions in order to balance growing performance with future micro processors. DIMMnet-2 is a prototype which can be plugged into a DDR-DIMM slot to confirm its functions. In this paper, outline of FPGA-based DIMMnet-2 prototype and improvements from DIMMnet-1 to DIMMnet-2 are mentioned. Although the DIMMnet-2 uses an FPGA instead of an ASIC, the latency for writing 8 bytes into remote memory is only 0.948 /spl mu/s. It is about 3 times fewer latency than that of a high performance commercial network interface QsNET II plugged into PCI-X bus on Intel-based IA32 PC. The delay of CoreLogic part for BOTF sending of FPGA based DIMMnet-2 is 5.75 times as fast as that of DIMMnet-1.","PeriodicalId":103456,"journal":{"name":"Innovative Architecture for Future Generation High-Performance Processors and Systems (IWIA'05)","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2005-01-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114903462","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"PRESTOR-1: a processor extending multithreaded architecture","authors":"K. Tanaka","doi":"10.1109/IWIA.2005.39","DOIUrl":"https://doi.org/10.1109/IWIA.2005.39","url":null,"abstract":"Multithreaded processors are globally spreading. Multithreaded architecture enables fast context switching for tolerating memory access latency and bridging synchronization gap, and thus enables efficient utilization of execution pipelines. However, it cannot avoid all pipeline stalls; stalls still occur when all processor built-in threads are in a wait state or there are not enough threads in a task/process to fill up all available context slots, since the mechanism for switching active threads is effective only for processor built-in threads' contexts. We developed a new multithreaded processor, PRESTOR-1, that increases the virtual number of built-in threads' contexts and enables seamless task/thread switching by allocating and swapping task/thread contexts hierarchically between processor and memory in a multitasking environment. The processor supports real-time applications through hierarchical task/thread allocation based on the task/thread priority and fast response mechanisms for interrupt requests exploiting the multiple-context architecture. Moreover, the processor has reconfigurable caches that provide a priority-based partitioning cache and a FIFO buffer. In this paper, we describe the details of PRESTOR-1.","PeriodicalId":103456,"journal":{"name":"Innovative Architecture for Future Generation High-Performance Processors and Systems (IWIA'05)","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2005-01-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134560599","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Superscalar processor with multi-bank register file","authors":"T. Hironaka, M. Maeda, K. Tanigawa, T. Sueyoshi, K. Aoyama, T. Koide, H. Mattausch, T. Saito","doi":"10.1109/IWIA.2005.42","DOIUrl":"https://doi.org/10.1109/IWIA.2005.42","url":null,"abstract":"Register files in highly parallel superscalar processors tend to have large chip area and many access ports. This trend causes problems with chip-size, access time and power consumption. As one of the methods for solving these problems, we have proposed a multi-bank register file which realizes small area, high speed and low power consumption. We have proved effectiveness of this method by software simulation, and by detail designing it as synthesizable Verilog-HDL description with a full custom designed multi-bank register file. In this paper, we show the detail architecture of a superscalar processor with the multi-bank register file and its evaluation results.","PeriodicalId":103456,"journal":{"name":"Innovative Architecture for Future Generation High-Performance Processors and Systems (IWIA'05)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2005-01-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129231984","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A New Kind of Processor Interface for a System-on-Chip Processor with TIE Ports and TIE Queues of Xtensa LX","authors":"T. Tohara","doi":"10.1109/IWIA.2005.23","DOIUrl":"https://doi.org/10.1109/IWIA.2005.23","url":null,"abstract":"Today, most System-on-a-Chip (SoC) ASIC chips integrate multiple processor cores as well as hard-wired RTL blocks to realize very complex applications. While computation performance of processors increases, data throughput becomes the bottleneck. Moreover, as processors and RTL blocks need to share data and control/status, inter processors/RTL communications become a serious issue. While various system interconnects have been introduced, processor interface architecture remains conceptually the same. To overcome the communication bottleneck, this paper presents a new type of embedded processor interface for SoC design. And, as the actual realization of such an interface, the TIE ports and TIE queues of XtensaLX processor from Tensilica, Inc. is introduced in this paper.","PeriodicalId":103456,"journal":{"name":"Innovative Architecture for Future Generation High-Performance Processors and Systems (IWIA'05)","volume":"20 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2005-01-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121505684","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Performance evaluation of dynamic network reconfiguration using Detour-UD routing","authors":"T. Yoshinaga, Y. Nishimura","doi":"10.1109/IWIA.2005.37","DOIUrl":"https://doi.org/10.1109/IWIA.2005.37","url":null,"abstract":"Fault-tolerance is an emerging issue for massively parallel computers. This paper describes the performance impact of dynamic network reconfiguration protocols using a fault-tolerant, adaptive deadlock-recovery routing algorithm, Detour-UD, for k-ary n-cubes. We propose a scheme to specify unroutable packets by managing drain-flags in routing tables. We also propose two selective drainage protocols. One protocol drains the unroutable packets specified by the drain-flags after the reconfiguration process. The other protocol drains deadlocked packets to reduce the network load during the reconfiguration process. Our simulation results show that the first protocol helps reduce the number of drainage packets, and the second one keeps the network throughput during the reconfiguration process.","PeriodicalId":103456,"journal":{"name":"Innovative Architecture for Future Generation High-Performance Processors and Systems (IWIA'05)","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2005-01-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122063769","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Continuum computer architecture for nano-scale and ultra-high clock rate technologies","authors":"T. Sterling, M. Brodowicz","doi":"10.1109/IWIA.2005.27","DOIUrl":"https://doi.org/10.1109/IWIA.2005.27","url":null,"abstract":"Continuum computer architecture (CCA) is a non-von Neumann architecture that offers an alternative to conventional structures as digital technology evolves towards nano-scale and the ultimate flat-lining of Moore's law. Coincidentally, it also defines a model of architecture particularly well suited to logic classes that exhibit ultra-high clock rates (> 100 GHz) such as rapid single flux quantum (RSFQ) gates. CCA eliminates the concept of the \"CPU\" that has dominated computer architecture since its inception more than half a century ago and establishes a new local element that merges the properties of state storage, state transfer, and state operation. A CCA system architecture is a simple multidimensional organization of these elemental blocks and physically may be considered as a new family of cellular computer. But CCA differs dramatically from conventional cellular automata. While both deliver emergent global behavior from the aggregation of local rules and ensuing operation. The CCA emergent behavior is a global general-purpose model of parallel computation, as opposed to simply mimicking some limited phenomenon like heat and mass transfer as do conventional cellular automata. This paper presents the motivation and foundation concepts of CCA and exposes key issues for further work.","PeriodicalId":103456,"journal":{"name":"Innovative Architecture for Future Generation High-Performance Processors and Systems (IWIA'05)","volume":"99 ","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2005-01-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133817890","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Performance comparison of vector-calculations between Itanium2 and other processors","authors":"T. Nanri, Y. Watanabe, H. Sato","doi":"10.1109/IWIA.2005.36","DOIUrl":"https://doi.org/10.1109/IWIA.2005.36","url":null,"abstract":"This paper examines the performance similarity of the Intel Itanium2 processor and a vector processor. From the measurements of vector-calculations on latest scalar processors, Itanium2 shares similar strong points and weak points of performance with VPP5000. For multiplications of dense matrices, Itanium2 and VPP5000 show relatively high sustained-performance to the theoretical peak. For matrix-vector multiplications with sparse matrices, on the other hand, those two processors show poor performance.","PeriodicalId":103456,"journal":{"name":"Innovative Architecture for Future Generation High-Performance Processors and Systems (IWIA'05)","volume":"228 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2005-01-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122151871","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An exploration of the technology space for multi-core memory/logic chips for highly scalable parallel systems","authors":"P. Kogge","doi":"10.1109/IWIA.2005.24","DOIUrl":"https://doi.org/10.1109/IWIA.2005.24","url":null,"abstract":"Chip-level multi-processing, where more than one CPU \"core\" share the same die with significant parts of the memory hierarchy, is appearing with increasing frequency as standard design practice. This paper takes a broader look at how such mixed logic/memory dies may evolve in the future by walking through the latest CMOS roadmap projections, and casting them in terms of the key chip-level system level building blocks. Given the increasing importance of memory density in such systems, especially as we move to single chip-type designs, we pay particular attention to the potential use of not SRAM but leading edge DRAM for many memory structures. The roles of other factors, such as interconnect and power, is also considered.","PeriodicalId":103456,"journal":{"name":"Innovative Architecture for Future Generation High-Performance Processors and Systems (IWIA'05)","volume":"65 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2005-01-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114280603","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}