{"title":"Adaptive interfacing with reconfigurable computers","authors":"N. Bergmann, Anwar S. Dawood","doi":"10.1109/ACAC.2001.903347","DOIUrl":"https://doi.org/10.1109/ACAC.2001.903347","url":null,"abstract":"A reconfigurable computer consists of reconfigurable logic circuits added to a conventional processor to give a computer where both the hardware and the software can be programmed on an application by application basis. Despite significant research, reconfigurable computers have failed to gain widespread acceptance as a high-speed computing replacement for conventional supercomputers. This paper describes the reasons for this failure and argues that the domain of real-time, reactive computer systems provides a better potential application area. An experimental Adaptive Instrument Module, based on reconfigurable reactive computing technology will be flown on the FedSat low earth orbit satellite to test out these ideas.","PeriodicalId":230403,"journal":{"name":"Proceedings 6th Australasian Computer Systems Architecture Conference. ACSAC 2001","volume":"79 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2001-01-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117122045","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"DStride: data-cache miss-address-based stride prefetching scheme for multimedia processors","authors":"G. Hariprakash, R. Achutharaman, A. Omondi","doi":"10.1109/ACAC.2001.903360","DOIUrl":"https://doi.org/10.1109/ACAC.2001.903360","url":null,"abstract":"Prefetching reduces cache miss latency by moving data up in memory hierarchy before they are actually needed. Recent hardware-based stride prefetching techniques mostly rely on the processor pipeline information (e.g. program counter and branch prediction table) for prediction. Continuing developments in processor microarchitecture drastically change core pipeline design and require that existing hardware-based stride prefetching techniques be adapted to the evolving new processor architectures. In this paper we present a new hardware-based stride prefetching technique, called DStride, that is independent of processor pipeline design changes. In this new design, the first-level data cache miss address stream is used for the stride prediction. The miss addresses are separated into load stream and store stream to increase the efficiency of the predictor. They are checked separately against the recent miss address stream to detect the strides. The detected steady strides are maintained in a table that also performs look-ahead stride prefetching when the processor stride reference rate is higher than the prefetch request service rate. We evaluated our design with multimedia workloads using execution-driven simulation with SimpleScalar toolset. Our experiments show that DStride is very effective in reducing overall pipeline stalls due to cache miss latency, especially for stride-intensive applications such as multimedia workloads.","PeriodicalId":230403,"journal":{"name":"Proceedings 6th Australasian Computer Systems Architecture Conference. ACSAC 2001","volume":"192 7","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2001-01-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"120899124","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Fault-tolerant routing on Complete Josephus Cubes","authors":"P. Loh, W. Hsu","doi":"10.1109/ACAC.2001.903366","DOIUrl":"https://doi.org/10.1109/ACAC.2001.903366","url":null,"abstract":"This paper introduces the Complete Josephus Cube, a fault-tolerant class of the recently proposed Josephus Cube and proposes a cost-effective, fault-tolerant routing strategy for the Complete Josephus Cube. For a Complete Josephus Cube of order r, the routing algorithm can tolerate up to (r+1) encountered component faults in its message path and generates routes that are both deadlock-free and livelock-free. The message is guaranteed to be optimally (respectively, sub-optimally) delivered within a maximum of r (respectively, 2r+1) hops. The message overhead incurred is only a single (r+2)-bit routing vector accompanying the message to be communicated.","PeriodicalId":230403,"journal":{"name":"Proceedings 6th Australasian Computer Systems Architecture Conference. ACSAC 2001","volume":"23 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2001-01-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123843766","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Password-capabilities: their evolution from the password-capability system into Walnut and beyond","authors":"R. Pose","doi":"10.1109/ACAC.2001.903370","DOIUrl":"https://doi.org/10.1109/ACAC.2001.903370","url":null,"abstract":"Since we first devised and defined password capabilities as a new technique for building capability-based operating systems, a number of research systems around the world have used them as the bases for a variety of operating systems. Our original Password-Capability System was implemented on custom built hardware with a novel address translation and protection scheme specifically designed to support password-capabilities. The password-capability concept later formed the basis of Opal developed at the University of Washington, and Mungi from the University of New South Wales, both of which used commercially available hardware. A second generation password-capability based system, Walnut, was developed at Monash University in the 1990s. Walnut was designed to run on commercially available hardware. It addressed some shortcomings of the original Password-Capability System but had to sacrifice some features that depended on hardware support. A third generation system that will extend Walnut to support mandatory security policies and other advanced features is currently being considered. This paper analyses the evolution of the Password-Capability System into Walnut, examines the shortcomings of the systems, and identifies issues to be addressed in the new system.","PeriodicalId":230403,"journal":{"name":"Proceedings 6th Australasian Computer Systems Architecture Conference. ACSAC 2001","volume":"1992 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2001-01-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128602650","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Performance evaluation of a partial retraining scheme for defective multi-layer neural networks","authors":"K. Yamamori, T. Abe, S. Horiguchi","doi":"10.1109/ACAC.2001.903376","DOIUrl":"https://doi.org/10.1109/ACAC.2001.903376","url":null,"abstract":"This paper addresses an efficient stuck-defect compensation scheme for multi-layer artificial neural networks implemented in hardware devices. To compensate for stuck defects, we have proposed a two-stage partial retraining scheme that adjusts weights belonging to a neuron affected by defects based on back-propagation (BP) algorithm between two layers. For input neurons, the partial retraining scheme is applied two times; first-stage between the input layer and the hidden layer, second-stage between the hidden layer and the output layer. The partial retraining scheme does not need any additional circuits if the hardware neural network has circuits for learning. In this paper we discuss the performance of the partial retraining scheme, retraining time, network yield and generalization ability. As a result, the partial retraining scheme could compensate the neuron stuck defects about 10 times faster than the whole network retraining by BP algorithm. In addition, yields of networks are also improved. The partial retraining scheme achieved more than 80% recognition ratio for noisy input patterns when 16% neurons of the network have 0-stuck or 1-stuck defects.","PeriodicalId":230403,"journal":{"name":"Proceedings 6th Australasian Computer Systems Architecture Conference. ACSAC 2001","volume":"29 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2001-01-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122548971","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
L. Eeckhout, T. Vander Aa, B. Goeman, H. Vandierendonck, R. Lauwereins, K. De Bosschere
{"title":"Application domains for fixed-length block structured architectures","authors":"L. Eeckhout, T. Vander Aa, B. Goeman, H. Vandierendonck, R. Lauwereins, K. De Bosschere","doi":"10.1109/ACAC.2001.903353","DOIUrl":"https://doi.org/10.1109/ACAC.2001.903353","url":null,"abstract":"In order to tackle the growing complexity and interconnects problem in modern microprocessor architectures, computer architects have come up with new architectural paradigms. A fixed-length block structured architecture (BSA) is one of these paradigms. The basic idea of a BSA is to generate blocks of instructions, called BSA-blocks, statically (by the compiler) and executing these blocks on a decentralized microarchitecture. In this paper, we focus on possible application domains for this architectural paradigm. To investigate this issue, we have set up several experiments with 43 benchmarks coming from the SPECint95, the SPECfp95, the MediaBench suite, plus a set of MPEG-4 like algorithms. The main conclusion of this paper is twofold. First, multimedia applications are less control-intensive than SPECint95 benchmarks and more control-intensive than SPECfp95 benchmarks. As a result, a compiler for a BSA will find more opportunities to fill BSA-blocks with instructions from the actually executed control flow paths for SPECfp95 than for multimedia applications; and more for multimedia applications than for SPECint95. Second, 16 instructions per BSA-block is appropriate for all application domains. Larger BSA-blocks on the other hand, result in higher branch misprediction rates for most applications and lead to a less effective use of the virtual window size.","PeriodicalId":230403,"journal":{"name":"Proceedings 6th Australasian Computer Systems Architecture Conference. ACSAC 2001","volume":"78 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2001-01-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128314101","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
M. Aron, J. Liedtke, Kevin Elphinstone, Yoonho Park, T. Jaeger, Luke Deller
{"title":"The SawMill framework for virtual memory diversity","authors":"M. Aron, J. Liedtke, Kevin Elphinstone, Yoonho Park, T. Jaeger, Luke Deller","doi":"10.1109/ACAC.2001.903345","DOIUrl":"https://doi.org/10.1109/ACAC.2001.903345","url":null,"abstract":"We present a framework that allows applications to build and customize VM services on the LA microkernel. While the LA microkernel's abstractions are quite powerful, using these abstractions effectively requires higher-level paradigms. We propose the dataspace paradigm which provides a modular VM framework. The modularity introduced by the dataspace paradigm facilitates implementation and permits dynamic configurability. Initial performance results from a prototype are promising.","PeriodicalId":230403,"journal":{"name":"Proceedings 6th Australasian Computer Systems Architecture Conference. ACSAC 2001","volume":"13 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2001-01-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115121934","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Exploiting Java instruction/thread level parallelism with horizontal multithreading","authors":"Kenji Watanabe, Wanming Chu, Yamin Li","doi":"10.1109/ACAC.2001.903373","DOIUrl":"https://doi.org/10.1109/ACAC.2001.903373","url":null,"abstract":"Java bytecodes can be executed with the following three methods: a Java interpreter running on a particular machine interprets bytecodes; a Just-in-Time (JIT) compiler translates bytecodes to the native primitives of the particular machine and the machine executes the translated codes; and a Java processor executes bytecodes directly. The first two methods require no special hardware support for the execution of Java bytecodes and are widely used currently. The last method requires an embedded Java processor, picoJavaI or picoJavaII for instance. The picoJavaI and picoJavaII are simple pipelined processors with no ILP (instruction level parallelism) and TLP (thread level parallelism) supports. A so-called MAJC (microprocessor architecture for Java computing) design can exploit ILP and TLP by using a modified VLIW (very long instruction word) architecture and vertical multithreading technique, but it has its own instruction set and cannot execute Java bytecodes directly. In this paper, we investigate a processor architecture which can directly execute Java bytecodes meanwhile can exploit Java ILP and TLP simultaneously. The proposed processor consists of multiple slots implementing horizontal multithreading and multiple functional units shared by all threads executed in parallel. Our architectural simulation results show that the Java processor could achieve an average 20 IPC (instructions per cycle), or 7.33 EIPC (effective IPC), with 8 slots and a 4-instruction scheduling window for each slot. We also check other configurations and give the utilization of functional units as well as the performance improvement with various kinds of working loads.","PeriodicalId":230403,"journal":{"name":"Proceedings 6th Australasian Computer Systems Architecture Conference. ACSAC 2001","volume":"130 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2001-01-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132559843","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Implementing an efficient vector instruction set in a chip multi-processor using micro-threaded pipelines","authors":"C. Jesshope","doi":"10.1109/ACAC.2001.903363","DOIUrl":"https://doi.org/10.1109/ACAC.2001.903363","url":null,"abstract":"This paper looks at a combination of two techniques, one of which, using a vector instruction set, has a long history dating back to pipelined vector supercomputers, such as the Cray 1 and its successors. The other technique, multi-threading, is also well understood. The novel approach proposed in this paper combines both vertical and horizontal micro-threading with vector instruction descriptors. It will be shown that a family of threads can represent a vector instruction with dependencies between the instances of that family, the iterations. This technique gives a very low overhead in implementing an n-way loop and is able to tolerate high memory latency. The use of micro-threading to handle dependencies between threads provides the ability to trade-off between instruction level parallelism and loop parallelism. The paper describes the means by which instruction classes may be instanced as independent parallel micro-threads and illustrates the speed-up that may be obtained compared to using a conventional loop.","PeriodicalId":230403,"journal":{"name":"Proceedings 6th Australasian Computer Systems Architecture Conference. ACSAC 2001","volume":"18 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2001-01-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125282484","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Retargetable cache simulation using high level processor models","authors":"Rajiv A. Ravindran, R. Moona","doi":"10.1109/ACAC.2001.903371","DOIUrl":"https://doi.org/10.1109/ACAC.2001.903371","url":null,"abstract":"During processor design, it is often necessary to evaluate multiple cache configurations. This paper describes the design and implementation of a retargetable on-line cache simulator. The cache simulator has been implemented using a retargetable instruction set simulator from the Sim-nML processor description language. The retargetability helps in cache simulation and evaluation much before the actual processor design.","PeriodicalId":230403,"journal":{"name":"Proceedings 6th Australasian Computer Systems Architecture Conference. ACSAC 2001","volume":"63 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2001-01-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127098997","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}