ACM Trans. Embed. Comput. Syst.: Latest Articles

Automatic synthesis of physical system differential equation models to a custom network of general processing elements on FPGAs
ACM Trans. Embed. Comput. Syst. Pub Date: 2013-09-01 DOI: 10.1145/2514641.2514650
Chen-Chun Huang, F. Vahid, T. Givargis
{"title":"Automatic synthesis of physical system differential equation models to a custom network of general processing elements on FPGAs","authors":"Chen-Chun Huang, F. Vahid, T. Givargis","doi":"10.1145/2514641.2514650","DOIUrl":"https://doi.org/10.1145/2514641.2514650","url":null,"abstract":"Fast execution of physical system models has various uses, such as simulating physical phenomena or real-time testing of medical equipment. Physical system models commonly consist of thousands of differential equations. Solving such equations using software on microprocessor devices may be slow. Several past efforts implement such models as parallel circuits on special computing devices called Field-Programmable Gate Arrays (FPGAs), demonstrating large speedups due to the excellent match between the massive fine-grained local communication parallelism common in physical models and the fine-grained parallel compute elements and local connectivity of FPGAs. However, past implementation efforts were mostly manual or ad hoc. We present the first method for automatically converting a set of ordinary differential equations into circuits on FPGAs. The method uses a general Processing Element (PE) that we developed, designed to quickly solve a set of ordinary differential equations while using few FPGA resources. The method instantiates a network of general PEs, partitions equations among the PEs to minimize communication, generates each PE's custom program, creates custom connections among PEs, and maintains synchronization of all PEs in the network. Our experiments show that the method generates a 400-PE network on a commercial FPGA that executes four different models on average 15x faster than a 3 GHz Intel processor, 30x faster than a commercial 4-core ARM, 14x faster than a commercial 6-core Texas Instruments digital signal processor, and 4.4x faster than an NVIDIA 336-core graphics processing unit. We also show that the FPGA-based approach is reasonably cost effective compared to using the other platforms. The method yields 2.1x faster circuits than a commercial high-level synthesis tool that uses the traditional method for converting behavior to circuits, while using 2x fewer lookup tables, 2x fewer hardcore multiplier (DSP) units, though 3.5x more block RAM due to being programmable. Furthermore, the method does not just generate a single fastest design, but generates a range of designs that trade off size and performance, by using different numbers of PEs.","PeriodicalId":183677,"journal":{"name":"ACM Trans. Embed. Comput. Syst.","volume":"21 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133954151","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 17
Efficient compilation of CUDA kernels for high-performance computing on FPGAs
ACM Trans. Embed. Comput. Syst. Pub Date: 2013-09-01 DOI: 10.1145/2514641.2514652
Alexandros Papakonstantinou, Karthik Gururaj, John A. Stratton, Deming Chen, J. Cong, Wen-mei W. Hwu
{"title":"Efficient compilation of CUDA kernels for high-performance computing on FPGAs","authors":"Alexandros Papakonstantinou, Karthik Gururaj, John A. Stratton, Deming Chen, J. Cong, Wen-mei W. Hwu","doi":"10.1145/2514641.2514652","DOIUrl":"https://doi.org/10.1145/2514641.2514652","url":null,"abstract":"The rise of multicore architectures across all computing domains has opened the door to heterogeneous multiprocessors, where processors of different compute characteristics can be combined to effectively boost the performance per watt of different application kernels. GPUs, in particular, are becoming very popular for speeding up compute-intensive kernels of scientific, imaging, and simulation applications. New programming models that facilitate parallel processing on heterogeneous systems containing GPUs are spreading rapidly in the computing community. By leveraging these investments, the developers of other accelerators have an opportunity to significantly reduce the programming effort by supporting those accelerator models already gaining popularity. In this work, we adapt one such language, the CUDA programming model, into a new FPGA design flow called FCUDA, which efficiently maps the coarse- and fine-grained parallelism exposed in CUDA onto the reconfigurable fabric. Our CUDA-to-FPGA flow employs AutoPilot, an advanced high-level synthesis tool (available from Xilinx) which enables high-abstraction FPGA programming. FCUDA is based on a source-to-source compilation that transforms the SIMT (Single Instruction, Multiple Thread) CUDA code into task-level parallel C code for AutoPilot. We describe the details of our CUDA-to-FPGA flow and demonstrate the highly competitive performance of the resulting customized FPGA multicore accelerators. To the best of our knowledge, this is the first CUDA-to-FPGA flow to demonstrate the applicability and potential advantage of using the CUDA programming model for high-performance computing in FPGAs.","PeriodicalId":183677,"journal":{"name":"ACM Trans. Embed. Comput. Syst.","volume":"96 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121327347","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 26
Hardware architectural support for control systems and sensor processing
ACM Trans. Embed. Comput. Syst. Pub Date: 2013-09-01 DOI: 10.1145/2514641.2514643
Sudhanshu Vyas, Adwait Gupte, C. Gill, R. Cytron, Joseph Zambreno, Phillip H. Jones
{"title":"Hardware architectural support for control systems and sensor processing","authors":"Sudhanshu Vyas, Adwait Gupte, C. Gill, R. Cytron, Joseph Zambreno, Phillip H. Jones","doi":"10.1145/2514641.2514643","DOIUrl":"https://doi.org/10.1145/2514641.2514643","url":null,"abstract":"The field of modern control theory and the systems used to implement these controls have shown rapid development over the last 50 years. It was often the case that those developing control algorithms could assume the computing medium was solely dedicated to the task of controlling a plant, for example, the control algorithm being implemented in software on a dedicated Digital Signal Processor (DSP), or implemented in hardware using a simple dedicated Programmable Logic Device (PLD). As time progressed, the drive to place more system functionality in a single component (reducing power, cost, and increasing reliability) has made this assumption less often true. Thus, it has been pointed out by some experts in the field of control theory (e.g., Astrom) that those developing control algorithms must take into account the effects of running their algorithms on systems that will be shared with other tasks. One aspect of the work presented in this article is a hardware architecture that allows control developers to maintain this simplifying assumption. We focus specifically on the Proportional-Integral-Derivative (PID) controller. An on-chip coprocessor has been implemented that can scale to support servicing hundreds of plants, while maintaining microsecond-level response times, tight deterministic control loop timing, and allowing the main processor to service noncontrol tasks.\u0000 In order to control a plant, the controller needs information about the plant's state. Typically this information is obtained from sensors with which the plant has been instrumented. There are a number of common computations that may be performed on this sensor data before being presented to the controller (e.g., averaging and thresholding). Thus in addition to supporting PID algorithms, we have developed a Sensor Processing Unit (SPU) that off-loads these common sensor processing tasks from the main processor.\u0000 We have prototyped our ideas using Field Programmable Gate Array (FPGA) technology. Through our experimental results, we show our PID execution unit gives orders of magnitude improvement in response time when servicing many plants, as compared to a standard general software implementation. We also show that the SPU scales much better than a general software implementation. In addition, these execution units allow the simplifying assumption of dedicated computing medium to hold for control algorithm development.","PeriodicalId":183677,"journal":{"name":"ACM Trans. Embed. Comput. Syst.","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128653736","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 6
Multicore-based vector coprocessor sharing for performance and energy gains
ACM Trans. Embed. Comput. Syst. Pub Date: 2013-09-01 DOI: 10.1145/2514641.2514644
S. F. Beldianu, Sotirios G. Ziavras
{"title":"Multicore-based vector coprocessor sharing for performance and energy gains","authors":"S. F. Beldianu, Sotirios G. Ziavras","doi":"10.1145/2514641.2514644","DOIUrl":"https://doi.org/10.1145/2514641.2514644","url":null,"abstract":"For most of the applications that make use of a dedicated vector coprocessor, its resources are not highly utilized due to the lack of sustained data parallelism which often occurs due to vector-length variations in dynamic environments. The motivation of our work stems from: (a) the mandate for multicore designs to make efficient use of on-chip resources for low power and high performance; (b) the omnipresence of vector operations in high-performance scientific and emerging embedded applications; (c) the need to often handle a variety of vector sizes; and (d) vector kernels in application suites may have diverse computation needs. We present a robust design framework for vector coprocessor sharing in multicore environments that maximizes vector unit utilization and performance at substantially reduced energy costs. For our adaptive vector unit, which is attached to multiple cores, we propose three basic shared working policies that enforce coarse-grain, fine-grain, and vector-lane sharing. We benchmark these vector coprocessor sharing policies for a dual-core system and evaluate them using the floating-point performance, resource utilization, and power/energy consumption metrics. Benchmarking for FIR filtering, FFT, matrix multiplication, and LU factorization shows that these coprocessor sharing policies yield high utilization and performance with low energy costs. The proposed policies provide 1.2--2 speedups and reduce the energy needs by about 50% as compared to a system having a single core with an attached vector coprocessor. With the performance expressed in clock cycles, the sharing policies demonstrate 3.62--7.92 speedups compared to optimized Xeon runs. We also introduce performance and empirical power models that can be used by the runtime system to estimate the effectiveness of each policy in a hybrid system that can simultaneously implement this suite of shared coprocessor policies.","PeriodicalId":183677,"journal":{"name":"ACM Trans. Embed. Comput. Syst.","volume":"64 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125744720","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 21
Introduction to the special issue on application-specific processors
ACM Trans. Embed. Comput. Syst. Pub Date: 2013-09-01 DOI: 10.1145/2514641.2514642
P. Brisk, T. Mitra
{"title":"Introduction to the special issue on application-specific processors","authors":"P. Brisk, T. Mitra","doi":"10.1145/2514641.2514642","DOIUrl":"https://doi.org/10.1145/2514641.2514642","url":null,"abstract":"Introduction to the Special Issue on Application-Specific Processors Application-specific processors offer performance, energy, and cost benefits compared to their general-purpose counterparts for a wide variety of market segments, ranging from low-cost micro-controllers to high-end supercomputers. The design and usage of application-specific processors are strikingly different from general-purpose comput- ers as the architecture is tuned to accelerate only a specific application or a class of applications. Application-specific architectures are often derived from a high-level specification of an application through a hardware/software co-design process, which can be highly automated; such a process would be inappropriate for general-purpose processor de- sign. Moreover, code compilation targeting application-specific processors is often inter- twined with the co-design process. Lastly, the system software may provide reduced and targeted functionality and interfaces at a much lower cost than a traditional embedded or real-time operating system such as iOS or Android. Historically, application-specific processors have been part of low-cost embedded devices; however, the landscape is rapidly changing. One important factor is the emer- gence of graphics processing units (GPUs) as general-purpose high-performance computing devices, which utilize graphics-specific hardware, but can also provide sig- nificant acceleration boosts for vectorizable applications. Similarly, reconfigurable computing devices, such as FPGAs, provide a flexible hardware fabric that can be used to implement a wide variety of application-specific functionality and as such are increasingly used for code acceleration in different domains. This special issue of ACM Transactions on Embedded Computing Systems is dedi- cated to all aspects of application-specific processors. Part of this special issue presents extended versions of some of the best papers that were presented at the IEEE Sym- posium on Application-Specific Processors (SASP) in 2009 and 2010. Altogether, we received 66 submitted manuscripts for the special issue, 10 of which were accepted for inclusion in this special issue. It is our great honor to introduce the articles. Our first article, “Hardware Architectural Support for Control Systems and Sensor Processing” by Sundhanshu Vyas, Adwait Gupta, Christopher Gill, Ron K. Cytron, Joseph Zambreno and Philip Jones, describes a microcontroller that has been cus- tomized to control thousands of PID controllers concurrently. Most application-specific processors in the past have extended RISCs or VLIWs. This article represents a land- mark effort in the design of customizable application-specific microcontrollers. This special issue features three articles that focus on the architecture of application- specific processors and the design of application-specific accelerators, namely, “Multicore-Based Vector Coprocessor Sharing for Performance and Energy Gains” by Spiridon F. B","PeriodicalId":183677,"journal":{"name":"ACM Trans. Embed. Comput. 
Syst.","volume":"10 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124533546","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
A systematic approach for optimized bypass configurations for application-specific embedded processors
ACM Trans. Embed. Comput. Syst. Pub Date: 2013-09-01 DOI: 10.1145/2514641.2514645
T. Jungeblut, Boris Hübener, Mario Porrmann, U. Rückert
{"title":"A systematic approach for optimized bypass configurations for application-specific embedded processors","authors":"T. Jungeblut, Boris Hübener, Mario Porrmann, U. Rückert","doi":"10.1145/2514641.2514645","DOIUrl":"https://doi.org/10.1145/2514641.2514645","url":null,"abstract":"The diversity of today's mobile applications requires embedded processor cores with a high resource efficiency, that means, the devices should provide a high performance at low area requirements and power consumption. The fine-grained parallelism supported by multiple functional units of VLIW architectures offers a high throughput at reasonable low clock frequencies compared to single-core RISC processors. To efficiently utilize the processor pipeline, common system architectures have to cope with data hazards due to data dependencies between consecutive operations. On the one hand, such hazards can be resolved by complex forwarding circuits (i.e., a pipeline bypass) which forward intermediate results to a subsequent instruction. On the other hand, the pipeline bypass can strongly affect or even dominate the total resource requirements and degrade the maximum clock frequency. In this work the CoreVA VLIW architecture is used for the development and the analysis of application-specific bypass configurations. It is shown that many paths of a comprehensive bypass system are rarely used and may not be required for certain applications. For this reason, several strategies have been implemented to enhance the efficiency of the total system by introducing application-specific bypass configurations. The configuration can be carried out statically by only implementing required paths or at runtime by dynamically reconfiguring the hardware. An algorithm is proposed which derives an optimized configuration by iteratively disabling single bypass paths. The adaptation of these application-specific bypass configurations allows for a reduction of the critical path by 26%. As a result, the execution time and energy requirements could be reduced by up to 21.5%. Using Dynamic Frequency Scaling (DFS) and dynamic deactivation/reactivation of bypass paths allows for a runtime reconfiguration of the bypass system. This ensures the highest efficiency while processing varying applications.","PeriodicalId":183677,"journal":{"name":"ACM Trans. Embed. Comput. Syst.","volume":"20 3","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"120998470","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 4
LegUp: An open-source high-level synthesis tool for FPGA-based processor/accelerator systems
ACM Trans. Embed. Comput. Syst. Pub Date: 2013-09-01 DOI: 10.1145/2514740
Andrew Canis, Jongsok Choi, Mark Aldham, Victor Zhang, Ahmed Kammoona, Tomasz S. Czajkowski, S. Brown, J. Anderson
{"title":"LegUp: An open-source high-level synthesis tool for FPGA-based processor/accelerator systems","authors":"Andrew Canis, Jongsok Choi, Mark Aldham, Victor Zhang, Ahmed Kammoona, Tomasz S. Czajkowski, S. Brown, J. Anderson","doi":"10.1145/2514740","DOIUrl":"https://doi.org/10.1145/2514740","url":null,"abstract":"It is generally accepted that a custom hardware implementation of a set of computations will provide superior speed and energy efficiency relative to a software implementation. However, the cost and difficulty of hardware design is often prohibitive, and consequently, a software approach is used for most applications. In this article, we introduce a new high-level synthesis tool called LegUp that allows software techniques to be used for hardware design. LegUp accepts a standard C program as input and automatically compiles the program to a hybrid architecture containing an FPGA-based MIPS soft processor and custom hardware accelerators that communicate through a standard bus interface. In the hybrid processor/accelerator architecture, program segments that are unsuitable for hardware implementation can execute in software on the processor. LegUp can synthesize most of the C language to hardware, including fixed-sized multidimensional arrays, structs, global variables, and pointer arithmetic. Results show that the tool produces hardware solutions of comparable quality to a commercial high-level synthesis tool. We also give results demonstrating the ability of the tool to explore the hardware/software codesign space by varying the amount of a program that runs in software versus hardware. LegUp, along with a set of benchmark C programs, is open source and freely downloadable, providing a powerful platform that can be leveraged for new research on a wide range of high-level synthesis topics.","PeriodicalId":183677,"journal":{"name":"ACM Trans. Embed. Comput. Syst.","volume":"96 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123018214","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 316
Contextual partitioning for speech recognition
ACM Trans. Embed. Comput. Syst. Pub Date: 2013-08-01 DOI: 10.1145/2501626.2501639
Christopher G. Kent, J. M. Paul
{"title":"Contextual partitioning for speech recognition","authors":"Christopher G. Kent, J. M. Paul","doi":"10.1145/2501626.2501639","DOIUrl":"https://doi.org/10.1145/2501626.2501639","url":null,"abstract":"Many multicore computers are single-user devices, creating the potential to partition by situational usage contexts, similar to how the human brain is organized. Contextual partitioning (CP) permits multiple simplified versions of the same task to exist in parallel, with selection tied to the context in use. We introduce CP for speech recognition, specifically targeted at user interfaces in handheld embedded devices. Contexts are drawn from webpage interactions. CP results in 61% fewer decoding errors, 97% less training for vocabulary changes, near-linear scaling potential with increasing core counts, and up to a potential 90% reduction in power usage.","PeriodicalId":183677,"journal":{"name":"ACM Trans. Embed. Comput. Syst.","volume":"83 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125692852","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 3
Software thread integration for instruction-level parallelism
ACM Trans. Embed. Comput. Syst. Pub Date: 2013-08-01 DOI: 10.1145/2512466
Won So, A. Dean
{"title":"Software thread integration for instruction-level parallelism","authors":"Won So, A. Dean","doi":"10.1145/2512466","DOIUrl":"https://doi.org/10.1145/2512466","url":null,"abstract":"Multimedia applications require a significantly higher level of performance than previous workloads of embedded systems. They have driven digital signal processor (DSP) makers to adopt high-performance architectures like VLIW (Very-Long Instruction Word). Despite many efforts to exploit instruction-level parallelism (ILP) in the application, the speed is a fraction of what it could be, limited by the difficulty of finding enough independent instructions to keep all of the processor's functional units busy.\u0000 This article proposes Software Thread Integration (STI) for instruction-level parallelism. STI is a software technique for interleaving multiple threads of control into a single implicitly multithreaded one. We use STI to improve the performance on ILP processors by merging parallel procedures into one, increasing the compiler's scope and hence allowing it to create a more efficient instruction schedule. Assuming the parallel procedures are given, we define a methodology for finding the best performing integrated procedure with a minimum compilation time.\u0000 We quantitatively estimate the performance impact of integration, allowing various integration scenarios to be compared and ranked via profitability analysis. During integration of threads, different ILP-improving code transformations are selectively applied according to the control structure and the ILP characteristics of the code, driven by interactions with software pipelining. The estimated profitability is verified and corrected by an iterative compilation approach, compensating for possible estimation inaccuracy. Our modeling methods combined with limited compilation quickly find the best integration scenario without requiring exhaustive integration.","PeriodicalId":183677,"journal":{"name":"ACM Trans. Embed. Comput. Syst.","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128814380","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 3
A software-only scheme for managing heap data on limited local memory (LLM) multicore processors
ACM Trans. Embed. Comput. Syst. Pub Date: 2013-08-01 DOI: 10.1145/2501626.2501632
Ke Bai, Aviral Shrivastava
{"title":"A software-only scheme for managing heap data on limited local memory(LLM) multicore processors","authors":"Ke Bai, Aviral Shrivastava","doi":"10.1145/2501626.2501632","DOIUrl":"https://doi.org/10.1145/2501626.2501632","url":null,"abstract":"This article presents a scheme for managing heap data in the local memory present in each core of a limited local memory (LLM) multicore architecture. Although managing heap data semi-automatically with software cache is feasible, it may require modifications of other thread codes. Crossthread modifications are very difficult to code and debug, and will become more complex and challenging as we increase the number of cores. In this article, we propose an intuitive programming interface, which is an automatic and scalable scheme for heap data management. Besides, for embedded applications, where the maximum heap size can be profiled, we propose several optimizations on our heap management to significantly decrease the library overheads. Our experiments on several benchmarks from MiBench executing on the Sony Playstation 3 show that our scheme is natural to use, and if we know the maximum size of heap data, our optimizations can improve application performance by an average of 14%.","PeriodicalId":183677,"journal":{"name":"ACM Trans. Embed. Comput. Syst.","volume":"10 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127755294","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 5