{"title":"Effect of message length and processor speed on the performance of the bidirectional ring-based multiprocessor","authors":"Hitoshi Oi, Nagarajan Ranganathan","doi":"10.1109/ICCD.1997.628878","DOIUrl":"https://doi.org/10.1109/ICCD.1997.628878","url":null,"abstract":"This paper presents a comparative study of the performance of the bidirectional ring and the unidirectional ring multiprocessor, with emphasis on the effect of system parameters, specifically, the message length and the relative processor speed. The choice of these parameters may not be optimum due to the performance cost tradeoffs in practice. Our study shows that the use of bidirectional ring is more effective in such suboptimum system configurations and can improve the processor utilization by up to 35%.","PeriodicalId":154864,"journal":{"name":"Proceedings International Conference on Computer Design VLSI in Computers and Processors","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1997-10-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130761271","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A floating-point divider using redundant binary circuits and an asynchronous clock scheme","authors":"Hiroaki Suzuki, H. Makino, K. Mashiko, H. Hamano","doi":"10.1109/ICCD.1997.628939","DOIUrl":"https://doi.org/10.1109/ICCD.1997.628939","url":null,"abstract":"This paper describes a new floating-point divider (FDIV) using redundant binary circuits on an asynchronous clock scheme for an internal iterative operation. The redundant binary representation of +1=(1,0), 0=(0,0), -1+(0,1) is applied to the all mantissa division circuits. The simple and unified representation reduces circuit delay for the quotient determination. Additionally, the asynchronous clock reduces a clock margin overhead. The architecture design avoids post processes, whose main role is to produce the floating-point status flags. The FDIV core using proposed technologies operates at 42.1 ns with 0.35 /spl mu/m CMOS technology and triple metal interconnections. The small core of 13.5 k transistors is laid-out in 730 /spl mu/m/spl times/910 /spl mu/m area.","PeriodicalId":154864,"journal":{"name":"Proceedings International Conference on Computer Design VLSI in Computers and Processors","volume":"68 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1997-10-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132648251","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Post-layout circuit speed-up by event elimination","authors":"H. Vaishnav, Chi-Keung Lee, Massoud Pedram","doi":"10.1109/ICCD.1997.628870","DOIUrl":"https://doi.org/10.1109/ICCD.1997.628870","url":null,"abstract":"We propose a novel technique for post-layout delay optimization. This technique identifies the Boolean space corresponding to late arriving transitions at the outputs of delay-critical subcircuits within the given circuit. The transitions are eliminated from the outputs by implementing the corresponding logic separately and merging them with the original circuit through some control logic. Experimental results suggest that this technique can speed up circuits even when the circuits have already been optimized for delay to the fullest extent.","PeriodicalId":154864,"journal":{"name":"Proceedings International Conference on Computer Design VLSI in Computers and Processors","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1997-10-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122002197","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Xingbin Zhang, Ali Dasdan, M. Schulz, Rajesh K. Gupta, A. Chien
{"title":"Architectural adaptation for application-specific locality optimizations","authors":"Xingbin Zhang, Ali Dasdan, M. Schulz, Rajesh K. Gupta, A. Chien","doi":"10.1109/ICCD.1997.628862","DOIUrl":"https://doi.org/10.1109/ICCD.1997.628862","url":null,"abstract":"We propose a machine architecture that integrates programmable logic into key components of the system with the goal of customizing architectural mechanisms and policies to match an application. This approach presents an improvement over the traditional approach of exploiting programmable logic as a separate co-processor by pre-serving machine usability through software and on a traditional computer architecture by providing application-specific hardware. We present two case studies of architectural customization to enhance latency tolerance and efficiently utilize network bisection on multiprocessors for sparse matrix computations. We demonstrate that application-specific hardware and policies can provide substantial improvements in performance on a per application basis. Based on these preliminary results, we propose that an application-driven machine customization provides a promising approach to achieve high performance and combat performance fragility.","PeriodicalId":154864,"journal":{"name":"Proceedings International Conference on Computer Design VLSI in Computers and Processors","volume":"39 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1997-10-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125895102","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Instruction prefetching using branch prediction information","authors":"I-Cheng K. Chen, Chih-Chieh Lee, T. Mudge","doi":"10.1109/ICCD.1997.628926","DOIUrl":"https://doi.org/10.1109/ICCD.1997.628926","url":null,"abstract":"Instruction prefetching can effectively reduce instruction cache misses, thus improving the performance. In this paper, we propose a prefetching scheme, which employs a branch predictor to run ahead of the execution unit and to prefetch potentially useful instructions. Branch prediction-based (BP-based) prefetching has a separate small fetching unit, allowing it to compute and predict targets autonomously. Our simulations show that a 4-issue machine with BP-based prefetching achieves higher performance than a plain cache 4 times the size. In addition, BP-based prefetching outperforms other hardware instruction fetching schemes, such as next-n line prefetching and wrong-path prefetching, by a factor of 17-44% in stall overhead.","PeriodicalId":154864,"journal":{"name":"Proceedings International Conference on Computer Design VLSI in Computers and Processors","volume":"72 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1997-10-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127171236","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Formally specifying and mechanically verifying programs for the Motorola complex arithmetic processor DSP","authors":"B. Brock, W. Hunt","doi":"10.1109/ICCD.1997.628846","DOIUrl":"https://doi.org/10.1109/ICCD.1997.628846","url":null,"abstract":"We describe our formal specification of Motorola's Complex Arithmetic Processor (CAP) DSP and our subsequent use of this specification to verify the correctness of several DSP algorithms. We wrote the specification in the ACL2 logic and carried out the mechanical proofs using the ACL2 theorem-proving system. Motorola's CAP is a super-scalar, pipelined DSP with seven memories and more than 20 functional units. Our formal specification is bit-for-bit exact, and was created by hand translating Motorola's drawings for the CAP. We believe that the specification developed is the largest of its kind, as this is the only formal specification of which we are aware for a complete commercial design. Proving the correctness of the DSP algorithms (programs) required proving the correctness of programs with 317-bit instructions and a non-interlocking execution pipeline. This Motorola DSP has a 1.8 million transistor implementation. This project involved both CLI and Motorola personnel and represents more than eight man-years of effort.","PeriodicalId":154864,"journal":{"name":"Proceedings International Conference on Computer Design VLSI in Computers and Processors","volume":"34 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1997-10-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129134822","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A low power smart vision system based on active pixel sensor integrated with programmable neural processor","authors":"W. Fang, Guang Yang, B. Pain, B. Sheu","doi":"10.1109/ICCD.1997.628905","DOIUrl":"https://doi.org/10.1109/ICCD.1997.628905","url":null,"abstract":"A low power smart vision system based on a large format (currently 1 K/spl times/1 K) active pixel sensor (APS) integrated with a programmable neural processor for fast vision applications is presented. The concept of building a low power smart vision system is demonstrated by a system design which is composed with an APS sensor, a smart image window handler, and a neural processor. The paper also shows that it is feasible to put the whole smart vision system into a single chip in a standard CMOS technology. This smart vision system on-a-chip can take the combined advantages of the optics and electronics to achieve ultra-high-speed smart sensory information processing and analysis at the focal plane. The proposed system will enable many applications including robotics and machine vision, guidance and navigation, automotive applications, and consumer electronics. Future applications will also include scientific sensors such as those suitable for highly integrated imaging systems used in NASA deep space and planetary spacecraft.","PeriodicalId":154864,"journal":{"name":"Proceedings International Conference on Computer Design VLSI in Computers and Processors","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1997-10-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132071106","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A. D. Mehta, Yao-Ping Chen, N. Menezes, D. F. Wong, L. Pileggi
{"title":"Clustering and load balancing for buffered clock tree synthesis","authors":"A. D. Mehta, Yao-Ping Chen, N. Menezes, D. F. Wong, L. Pileggi","doi":"10.1109/ICCD.1997.628871","DOIUrl":"https://doi.org/10.1109/ICCD.1997.628871","url":null,"abstract":"Buffers in clock trees introduce two additional sources of skew: the first source of skew is the effect of process variations on buffer delays. The second source of skew is the imbalance in buffer loading. We propose a buffered clock tree synthesis methodology whereby we first apply a clustering algorithm to obtain clusters of approximately equal capacitance loading. We drive each of these clusters with identical buffers. A sensitivity based approach is then used for equalizing the Elmore delay from the buffer output to all of the clock nodes. The skew due to load imbalance is minimized concurrently by matching a higher-order model of the load by wire sizing and wire lengthening. We demonstrate how this algorithm can be used recursively to generate low-skew buffered clock trees.","PeriodicalId":154864,"journal":{"name":"Proceedings International Conference on Computer Design VLSI in Computers and Processors","volume":"PP 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1997-10-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126419434","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"High level test synthesis across the boundary of behavioral and structural domains","authors":"Kowen Lai, C. Papachristou, M. Baklashov","doi":"10.1109/ICCD.1997.628932","DOIUrl":"https://doi.org/10.1109/ICCD.1997.628932","url":null,"abstract":"High level test synthesis (HLTS), a term introduced in recent years, promises automatic enhancement of testability of a circuit. The authors show how HLTS can achieve higher testability for BIST oriented test methodologies. Their results show considering testability during high-level synthesis, better testability can be obtained when compared to DFT at low level. Transformation for testability, which allows behavioral modification for testability, is a very powerful HLTS technique.","PeriodicalId":154864,"journal":{"name":"Proceedings International Conference on Computer Design VLSI in Computers and Processors","volume":"1036 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1997-10-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123130878","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Continuous retiming: algorithms and applications","authors":"P. Pan","doi":"10.1109/ICCD.1997.628857","DOIUrl":"https://doi.org/10.1109/ICCD.1997.628857","url":null,"abstract":"This paper introduces a continuous version of retiming (called c-retiming). As retiming, a c-retiming of a circuit is also an assignment of values to the nodes in the circuit. However, values in c-retiming can be real numbers as opposed to integers in retiming. Retiming and c-retiming are strongly related. In fact, a c-retiming can be converted to a retiming by a simple rounding, and the potential degradation in clock period is less than the largest gate delay in a circuit. C-retiming has two very attractive properties. It can be computed much more efficiently than retiming. Consequently, one can compute a retiming by computing a proper c-retiming. Our experimental results indicate this approach can drastically speed up the solution of retiming problems. More importantly, c-retiming can be combined with circuit modifications. Because of this property, c-retiming can be used as a tool to study synthesis and optimization problems in conjunction with retiming. We demonstrate this using the classical tree mapping problem, for which we derive an algorithm that produces a solution with a clock period provably close to optimal while considering retiming.","PeriodicalId":154864,"journal":{"name":"Proceedings International Conference on Computer Design VLSI in Computers and Processors","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1997-10-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114198921","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}