{"title":"Placement of repair circuits for in-field FPGA repair","authors":"M. Wirthlin, J. E. Jensen, Alex Wilson, W. Howes, Shi-Jie Wen, R. Wong","doi":"10.1145/2435264.2435286","DOIUrl":"https://doi.org/10.1145/2435264.2435286","url":null,"abstract":"With the growing density and shrinking feature size of modern semiconductors, it is increasingly difficult to manufacture defect free semiconductors that maintain acceptable levels of reliability for long periods of time. These systems are increasingly susceptible to wear-out by failing to meet their operational specifications for an extended period of time. The reconfigurability of FPGAs can be used to repair post-manufacturing faults by configuring the FPGA to avoid a damaged resource. This paper presents a method for repairing FPGA devices with wear-out faults by precomputing a set of repair circuits that, collectively, can repair a fault found in any logic block of the FPGA. This approach relies on logic placement to create \"repair\" circuits that avoid specific logic blocks. Three repair placement algorithms will be presented that generate a complete set of repair designs during the conventional placement process. The number of repairs needed to create a complete repair set depends heavily on the utilization of the FPGA resources. The three algorithms are tested against several benchmarks and with multiple area constraints for each benchmark. The best repair placement approach described in the paper generates a full set of repair circuits at a computation cost of 16X that of a conventional placer and with circuits of comparable quality.","PeriodicalId":87257,"journal":{"name":"FPGA. 
ACM International Symposium on Field-Programmable Gate Arrays","volume":"2 1","pages":"115-124"},"PeriodicalIF":0.0,"publicationDate":"2013-02-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"87565234","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Word-length optimization beyond straight line code","authors":"D. Boland, G. Constantinides","doi":"10.1145/2435264.2435285","DOIUrl":"https://doi.org/10.1145/2435264.2435285","url":null,"abstract":"The silicon area benefits that result from word-length optimization have been widely reported by the FPGA community. However, to date, most approaches are restricted to straight line code, or code that can be converted into straight line code using techniques such as loop-unrolling. In this paper, we take the first steps towards creating analytical techniques to optimize the precision used throughout custom FPGA accelerators for algorithms that contain loops with data dependent exit conditions. To achieve this, we build on ideas emanating from the software verification community to prove program termination. Our idea is to apply word-length optimization techniques to find the minimum precision required to guarantee that a loop with data dependent exit conditions will terminate. Without techniques to analyze algorithms containing these types of loops, a hardware designer may elect to implement every arithmetic operator throughout a custom FPGA-based accelerator using IEEE-754 standard single or double precision arithmetic. With this approach, the FPGA accelerator would have comparable accuracy to a software implementation. However, we show that using our new technique to create custom fixed and floating point designs, we can obtain silicon area savings of up to 50% over IEEE standard single precision arithmetic, or 80% over IEEE standard double precision arithmetic, at the same time as providing guarantees that the created hardware designs will work in practice.","PeriodicalId":87257,"journal":{"name":"FPGA. 
ACM International Symposium on Field-Programmable Gate Arrays","volume":"30 1","pages":"105-114"},"PeriodicalIF":0.0,"publicationDate":"2013-02-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"84289593","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"High-level synthesis with LegUp: a crash course for users and researchers","authors":"J. Anderson, S. Brown, Andrew Canis, Jongsok Choi","doi":"10.1145/2435264.2435269","DOIUrl":"https://doi.org/10.1145/2435264.2435269","url":null,"abstract":"High-level synthesis (HLS) has been gaining traction recently as a design methodology for FPGAs, with the promise of raising the productivity of FPGA hardware designers, and ultimately, opening the door to the use of FPGAs as computing devices targetable by software engineers. In this tutorial, we introduce LegUp, an open-source HLS tool for FPGAs developed at the University of Toronto. With LegUp, a user can compile a C program completely to hardware, or alternately, he/she can choose to compile the program to a hybrid hardware/software system comprising a processor along with one or more accelerators. LegUp supports the synthesis of most of the C language to hardware, including loops, structs, multi-dimensional arrays, pointer arithmetic, and floating point operations. The LegUp distribution includes the CHStone HLS benchmark suite, as well as a test suite and associated infrastructure for measuring quality of results, and for verifying the functionality of LegUp-generated circuits. LegUp is freely downloadable at www.legup.org, providing a powerful platform that can be leveraged for new high-level synthesis research.","PeriodicalId":87257,"journal":{"name":"FPGA. 
ACM International Symposium on Field-Programmable Gate Arrays","volume":"25 1","pages":"7-8"},"PeriodicalIF":0.0,"publicationDate":"2013-02-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"82683345","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Defect recovery in nanodevice-based programmable interconnects (abstract only)","authors":"J. Cong, Bingjun Xiao","doi":"10.1145/2435264.2435343","DOIUrl":"https://doi.org/10.1145/2435264.2435343","url":null,"abstract":"This work focuses on defect tolerance for nanodevice-based programmable interconnects of FPGAs. A single nanodevice can function as a routing switch in place of a pass transistor and its six-transistor SRAM cell in conventional FPGAs. Defects of nanodevices in programmable interconnects are manifested as losses of configurability and can be categorized into stuck-open defects and stuck-closed defects. First, we show that the stuck-closed defects of nanodevices have a much higher impact than the stuck-open defects. Instead of simply avoiding the stuck-closed defects, we recover them by treating them as shorting constraints in the routing. We develop a scalable algorithm to perform timing-driven routing under these extra constraints. We extend the idea of the resource negotiation to balance the goals of timing and routability under shorting constraints. We also develop several techniques to guide the router to map the shorting clusters to those nets with more shared paths for better utilization of routing resources while automatically balancing it with circuit performance. We also enhance the placement algorithm to recover logic blocks which become virtually unusable due to shorted pins. Simulation results show that at the up-to-date level of nanodevice defects (10^8-10^11x higher than CMOS), compared to the simple avoidance method, our approach reduces the degradation of resource usage by 87%, improves the routability by 37%, and reduces the degradation of circuit performance by 36%, at a negligible overhead of tool runtime.","PeriodicalId":87257,"journal":{"name":"FPGA. 
ACM International Symposium on Field-Programmable Gate Arrays","volume":"2 1","pages":"277-278"},"PeriodicalIF":0.0,"publicationDate":"2013-02-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"90214192","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"High throughput and programmable online traffic classifier on FPGA","authors":"Da Tong, Lu Sun, Kiran Kumar Matam, V. Prasanna","doi":"10.1145/2435264.2435307","DOIUrl":"https://doi.org/10.1145/2435264.2435307","url":null,"abstract":"Machine learning (ML) algorithms have been shown to be effective in classifying the dynamic internet traffic today. Using additional features and sophisticated ML techniques can improve accuracy and can classify a broad range of application classes. Realizing such classifiers to meet high data rates is challenging. In this paper, we propose two architectures to realize a complete online traffic classifier using flow-level features. First, we develop a traffic classifier based on the C4.5 decision tree algorithm and the Entropy-MDL discretization algorithm. It achieves an accuracy of 97.92% when classifying a traffic trace consisting of eight application classes. Next, we accelerate our classifier using two architectures on FPGA. One architecture stores the classifier in on-chip distributed RAM. It is designed to sustain a high throughput. The other architecture stores the classifier in block RAM. It is designed to operate with a small hardware footprint and can thus be built at low hardware cost. Experimental results show that our high throughput architecture can sustain a throughput of 550 Gbps assuming a 40-byte packet size. Our low cost architecture demonstrates a 22% better resource efficiency than the high throughput design. It can be easily replicated to achieve 449 Gbps while supporting 160 input traffic streams concurrently. Both architectures are parameterizable and programmable to support any binary-tree-based traffic classifier. We develop a tool which allows users to easily map a binary-tree-based classifier to hardware. The tool takes a classifier as input and automatically generates the Verilog code for the corresponding hardware architecture.","PeriodicalId":87257,"journal":{"name":"FPGA. 
ACM International Symposium on Field-Programmable Gate Arrays","volume":"79 1","pages":"255-264"},"PeriodicalIF":0.0,"publicationDate":"2013-02-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"84805164","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Circuit optimizations to minimize energy in the global interconnect of a low-power FPGA (abstract only)","authors":"Oluseyi A. Ayorinde, B. Calhoun","doi":"10.1145/2435264.2435341","DOIUrl":"https://doi.org/10.1145/2435264.2435341","url":null,"abstract":"We compare circuit and architecture choices in the global interconnect of an FPGA in order to find the minimum energy design for low voltage operation. We look at switch box topology, number of repeaters, receiver circuit topology, and dynamic voltage selection, all with the intent of minimizing energy consumption. The results show that using a pass gate switchbox topology with repeaters in the interconnect and a custom receiver lowers delay by up to 63% and energy by up to 87% from the standard FPGA circuit choices. This work also identifies the optimal VDD choices to maximize performance under energy constraints or vice versa.","PeriodicalId":87257,"journal":{"name":"FPGA. ACM International Symposium on Field-Programmable Gate Arrays","volume":"18 1","pages":"277"},"PeriodicalIF":0.0,"publicationDate":"2013-02-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"74089841","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Towards automatic customization of interconnect and memory in the CoRAM abstraction (abstract only)","authors":"Eric S. Chung, Michael Papamichael","doi":"10.1145/2435264.2435311","DOIUrl":"https://doi.org/10.1145/2435264.2435311","url":null,"abstract":"When developing applications to run on FPGAs, we tend to expend great effort on crafting the custom hardware acceleration datapath but blindly turn to the FPGA vendor tool/library to provide default solutions for on-chip interconnect and external interfaces. This often leads to ineffective communication- or memory-bound implementations since the design and tuning of the default general-purpose solutions necessarily makes design compromises for generality. This is counterproductive as the FPGA's flexible reconfigurability should afford us great opportunities for performance gain and cost reduction through extensive application-specific customization of the interconnect and interface IPs. This work presents a compiler that generates custom interconnect topology and connectivity with appropriately scaled capacity to support an application's exact communication requirements at a minimized cost. More specifically, the compiler analyzes an application developed for the CoRAM abstraction [1,2] for its connectivity and bandwidth requirements between the hardware processing kernels and external DRAM banks. The result is an extremely fine-tuned custom-topology soft-logic network-on-chip interconnect, which is enabled by the CONNECT NoC framework [3]. We perform an extensive evaluation that benchmarks two applications against the standard CoRAM implementation flow that relies on a fixed generically-tuned general-purpose soft-logic network-on-chip. Our RTL-driven evaluation shows a large opportunity for area reduction and improved efficiency (up by 48%) without any impact on application performance.","PeriodicalId":87257,"journal":{"name":"FPGA. 
ACM International Symposium on Field-Programmable Gate Arrays","volume":"23 1","pages":"265"},"PeriodicalIF":0.0,"publicationDate":"2013-02-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"83762190","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A novel FPGA design framework with VLSI post-routing performance analysis (abstract only)","authors":"Qian Zhao, Kazuki Inoue, M. Amagasaki, M. Iida, M. Kuga, T. Sueyoshi","doi":"10.1145/2435264.2435327","DOIUrl":"https://doi.org/10.1145/2435264.2435327","url":null,"abstract":"The most widely used open-source field-programmable gate array (FPGA) placement and routing tool is VPR, which can define the target FPGA, perform placement and routing, and report area and timing information. However, it cannot be used efficiently in FPGA IP design for two reasons. First, VPR cannot directly support most newly developed FPGA architectures. Modifying the C-coded VPR to evaluate a number of new architectures takes a long time. Second, the VPR performance results are not accurate enough to evaluate a complete synthesizable FPGA IP in a design that targets LSI production. We propose an FPGA design framework that in particular improves FPGA IP design efficiency. A novel FPGA routing tool, namely EasyRouter, is developed in this framework. EasyRouter is developed using the C# language. Because an object-oriented programming method is used, the source code is smaller and easier to manage than that of VPR, which shortens the development time. By using simple HDL templates, EasyRouter can automatically generate the HDL code for the entire chip and the configuration bitstream. With these files, the FPGA IP can be evaluated with commercial VLSI CAD tools with high accuracy and reliability.","PeriodicalId":87257,"journal":{"name":"FPGA. 
ACM International Symposium on Field-Programmable Gate Arrays","volume":"7 1","pages":"271"},"PeriodicalIF":0.0,"publicationDate":"2013-02-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"82698241","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Hardware implemented real-time operating system (abstract only)","authors":"Soon Ee Ong, Siaw Chen Lee, N. Ali","doi":"10.1145/2435264.2435314","DOIUrl":"https://doi.org/10.1145/2435264.2435314","url":null,"abstract":"A Real-Time Operating System (RTOS) is usually implemented as a software component at the fundamental layer of an embedded system, where it consumes computing time and memory resources. This introduces extra overhead and latency to the system. In addition, the software layer of the RTOS indirectly raises the complexity of the system software. Shifting the RTOS from software to hardware is an inspiring idea that abstracts the RTOS layer out of the embedded system software framework. It has the advantages of reducing the system software complexity and of improving system performance by reducing the overhead and latency of the RTOS. This paper presents a Simple and Efficient hardware implemented Real-Time Operating System (SEOS) architected for high portability and scalability. SEOS operates at co-processor level as an independent hardware component. It contains all essential OS services needed for embedded system design. This includes kernel scheduler, inter-task communication and synchronization (i.e. mutex, semaphore, mailbox), timer and IRQ handler. The application software interfaces with SEOS through a set of standard Application Programming Interfaces (APIs). Furthermore, SEOS is also equipped with a Generic Bus Interface and an Interconnect Bridge to enable effortless OS porting across different processor platforms. These approaches make SEOS plug-and-play in nature. Our test results show that SEOS improves on a commercial software-based RTOS, µC/OS-II, in several areas. 
SEOS consumes 31.6% less overhead in context switching, improves task-level interrupt latency by 83.5%, shortens inter-task communication latency by 71.9%, and significantly improves on performance jitter.","PeriodicalId":87257,"journal":{"name":"FPGA. ACM International Symposium on Field-Programmable Gate Arrays","volume":"2004 1","pages":"266"},"PeriodicalIF":0.0,"publicationDate":"2013-02-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"86263897","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A high-performance, low-energy FPGA accelerator for correntropy-based feature tracking (abstract only)","authors":"P. Cooke, J. Fowers, Lee Hunt, G. Stitt","doi":"10.1145/2435264.2435344","DOIUrl":"https://doi.org/10.1145/2435264.2435344","url":null,"abstract":"Computer-vision and signal-processing applications often require feature tracking to identify and track the motion of different objects (features) across a sequence of images. Numerous algorithms have been proposed, but common measures of similarity for real-time usage are either based on correlation, mean-squared error, or sum of absolute differences, which are not robust enough for safety-critical applications. To improve robustness, a recent feature-tracking algorithm called C-Flow uses correntropy from Information Theoretic Learning to significantly improve signal-to-noise ratio. In this paper, we present an FPGA accelerator for C-Flow that is typically 3.6-8.5x faster than a GPU and show that the FPGA is the only device capable of real-time usage for large features. Furthermore, we show the FPGA accelerator is more appropriate for embedded usage, with energy consumption that is 2.5-22x less than the GPU.","PeriodicalId":87257,"journal":{"name":"FPGA. ACM International Symposium on Field-Programmable Gate Arrays","volume":"2 1","pages":"278"},"PeriodicalIF":0.0,"publicationDate":"2013-02-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"90803534","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}