{"title":"Model checking cloud rendering system for the QoS evaluation","authors":"Haoyu Liu, Huahu Xu, Honghao Gao, Danqi Chu","doi":"10.1109/ASAP.2017.7995284","DOIUrl":"https://doi.org/10.1109/ASAP.2017.7995284","url":null,"abstract":"This paper briefly introduce a method to evaluate the reliability of a cloud rendering system by using probability models. An extended discrete-time Markov chain (DTMC) is proposed considering the QoS (Quality of Service). Then, some properties defined from 3 aspects give full consideration to the processes of rendering tasks, which can be verified by performing PRISM in a quantitative way. Finally, the experimental results demonstrate that our method can ensure and improve the QoS reliability of the cloud rendering system.","PeriodicalId":405953,"journal":{"name":"2017 IEEE 28th International Conference on Application-specific Systems, Architectures and Processors (ASAP)","volume":"23 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131107325","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Massive spatial query on the Kepler architecture","authors":"Yili Gong, Jia Tang, Wenhai Li, Zihui Ye","doi":"10.1109/ASAP.2017.7995267","DOIUrl":"https://doi.org/10.1109/ASAP.2017.7995267","url":null,"abstract":"In this paper, we present an optimized framework that can efficiently perform massive spatial queries on the current GPUs. To benefit the widely adopted filter-and-verify paradigm from GPUs, the skewed workloads are first associated with certain cells in a scaled spatial grid, such that the following range verification cost against the massive spatial objects can be significantly reduced. Particularly on the Kepler architecture, we highlight a two-level scheduling method to exploit good data localities by developing a novel dynamic scheduling method. Based on this virtual warp-based scheduling method, groups of threads can compete for the unbalanced tasks to ensure good load balance. We conduct various of skewed workloads with different object positions and query distributions, to evaluate our optimized methods. Experimental results show that, as compared to the existing fixed-size allocation methods, the proposed adaptive scheduling strategies improve the query throughput by one order of magnitude.","PeriodicalId":405953,"journal":{"name":"2017 IEEE 28th International Conference on Application-specific Systems, Architectures and Processors (ASAP)","volume":"14 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125127675","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"CGRA-ME: A unified framework for CGRA modelling and exploration","authors":"S. Chin, N. Sakamoto, A. Rui, Jim Zhao, Jin Hee Kim, Yuko Hara-Azumi, J. Anderson","doi":"10.1109/ASAP.2017.7995277","DOIUrl":"https://doi.org/10.1109/ASAP.2017.7995277","url":null,"abstract":"Coarse-grained reconfigurable arrays (CGRAs) are a style of programmable logic device situated between FPGAs and custom ASICs on the spectrum of programmability, performance, power and cost. CGRAs have been proposed by both academia and industry; however, prior works have been mainly self-contained without broad architectural exploration and comparisons with competing CGRAs. We present CGRA-ME - a unified CGRA framework that encompasses generic architecture description, architecture modelling, application mapping, and physical implementation. Within this framework, we discuss our architecture description language CGRA-ADL, a generic LLVM-based simulated annealing mapper, and a standard cell flow for physical implementation. An architecture exploration case study is presented, highlighting the capabilities of CGRA-ME by exploring a variety of architectures with varying functionality, interconnect, array size, and execution contexts through the mapping of application benchmarks and the production of standard cell designs.","PeriodicalId":405953,"journal":{"name":"2017 IEEE 28th International Conference on Application-specific Systems, Architectures and Processors (ASAP)","volume":"22 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124608434","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A fast and accurate logarithm accelerator for scientific applications","authors":"Jing Chen, Xue Liu","doi":"10.1109/ASAP.2017.7995283","DOIUrl":"https://doi.org/10.1109/ASAP.2017.7995283","url":null,"abstract":"Many scientific applications rely on evaluation of elementary functions. Nowadays, high-level programming languages provide their own elementary function libraries in software by using lookup table and/or polynomial approximation. However, one downside is slow since lookup tables could keep cache thrashing and polynomial approximations require a number of iterations to converge. Thus, elementary functions evaluation becomes bottleneck for most scientific applications. With this motivation, we propose a generalized pipelined hardware architecture for elementary functions to accelerate scientific applications. This paper presents a pipelined, single precision logarithm hardware accelerator (SP-LHA). Throughput of SP-LHA is at least 2.5GFLOPS in 65nm ASICs, while the circuit consists of ≈60,000 logic gates. Average accuracy of SP-LHA is 22.5 out of 23 bits, which is achieved by using 7.8KB lookup table and parabolic interpolation.","PeriodicalId":405953,"journal":{"name":"2017 IEEE 28th International Conference on Application-specific Systems, Architectures and Processors (ASAP)","volume":"1147 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"120876653","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Design and comparative evaluation of GPGPU- and FPGA-based MPSoC ECU architectures for secure, dependable, and real-time automotive CPS","authors":"B. Poudel, N. Giri, Arslan Munir","doi":"10.1109/ASAP.2017.7995256","DOIUrl":"https://doi.org/10.1109/ASAP.2017.7995256","url":null,"abstract":"In this paper, we propose and implement two electronic control unit (ECU) architectures for real-time automotive cyber-physical systems that incorporate security and dependability primitives with low resources and energy overhead. These ECUs architectures follow the multiprocessor system-on-chip (MPSoC) design paradigm wherein the ECUs have multiple heterogeneous processing engines with specific functionalities. The first architecture, GED, leverages an ARM-based application processor and a GPGPU-based co-processor. The second architecture, RED, integrates an ARM based application processor with a FPGA-based co-processor. We quantify and compare temporal performance, energy, and error resilience of our proposed architectures for a steer-by-wire case study over CAN, CAN FD, and FlexRay in-vehicle networks. Hardware implementation results reveal that RED and GED can attain a speedup of 31.7× and 1.8×, respectively, while consuming 1.75× and 2× less energy, respectively, than contemporary ECU architectures.","PeriodicalId":405953,"journal":{"name":"2017 IEEE 28th International Conference on Application-specific Systems, Architectures and Processors (ASAP)","volume":"112 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128448578","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"High performance hardware architectures for Intra Block Copy and Palette Coding for HEVC screen content coding extension","authors":"Rishan Senanayake, Namitha Liyanage, Sasindu Wijeratne, Sachille Atapattu, Kasun Athukorala, P. Tharaka, G. Karunaratne, R. Senarath, Ishantha Perera, Ashen Ekanayake, A. Pasqual","doi":"10.1109/ASAP.2017.7995274","DOIUrl":"https://doi.org/10.1109/ASAP.2017.7995274","url":null,"abstract":"Screen content coding (SCC) extension to High Efficiency Video Coding (HEVC) offers substantial compression efficiency over the existing HEVC standard for computer generated content. However, this gain in compression efficiency is achieved at the expense of further computational complexity with several resource hungry coding tools. Hence, extension of SCC to HEVC hardware encoders can be challenging. This paper presents resource efficient hardware designs for two key SCC tools, Intra Block Copy and Palette Coding. Moreover, a new hash search approach is proposed for Intra Block Copy, while a hardware friendly palette indices coding scheme is suggested for Palette Coding. These designs are targeted to achieve the throughput necessary for an 1080p 30 frames/s encoder, and incurs coding loss of 11.4% and 5.1% respectively in all intra configurations. The designs are synthesized for a Virtex-7 VC707 evaluation platform.","PeriodicalId":405953,"journal":{"name":"2017 IEEE 28th International Conference on Application-specific Systems, Architectures and Processors (ASAP)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114730983","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"PFSI.sw: A programming framework for sea ice model algorithms based on Sunway many-core processor","authors":"Binyang Li, Bo Li, D. Qian","doi":"10.1109/ASAP.2017.7995268","DOIUrl":"https://doi.org/10.1109/ASAP.2017.7995268","url":null,"abstract":"Sea ice model is a typical high performance computing problem. CPU and GPU based parallel method has been proposed to accelerate the simulation process, but it is still hard to meet the large-scale calculation demand due to the compute-intensive nature of the model. Sunway TaihuLight supercomputer use the SW26010 processor as its computing unit and achieves high performance for large-scale scientific computing. In this paper we present a programming framework (PFSI.sw) for sea ice model algorithms based on Sunway many-core processor. Based on this framework, programmer can exploit the parallelism of existing sea ice model algorithms and achieve good performance. Several strategies are introduced to this framework, data dividing, data transfer as well as the load balance are the main aspects we currently concerned. This framework has been implemented and tested with two sea ice model algorithms by using real world dataset on Sunway many-core processors. The experiment demonstrates comparable performance to the traditional parallel implementation on Sunway many-core processor and our framework improves the performance up to 40%.","PeriodicalId":405953,"journal":{"name":"2017 IEEE 28th International Conference on Application-specific Systems, Architectures and Processors (ASAP)","volume":"18 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116245906","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Acceleration of Frequent Itemset Mining on FPGA using SDAccel and Vivado HLS","authors":"V. Dang, K. Skadron","doi":"10.1109/ASAP.2017.7995279","DOIUrl":"https://doi.org/10.1109/ASAP.2017.7995279","url":null,"abstract":"Frequent itemset mining (FIM) is a widely-used data-mining technique for discovering sets of frequently-occurring items in large databases. However, FIM is highly time-consuming when datasets grow in size. FPGAs have shown great promise for accelerating computationally-intensive algorithms, but they are hard to use with traditional HDL-based design methods. The recent introduction of Xilinx SDAccel development environment for the C/C++/OpenCL languages allows developers to utilize FPGA's potential without long development periods and extensive hardware knowledge. This paper presents an optimized implementation of an FIM algorithm on FPGA using SDAccel and Vivado HLS. Performance and power consumption are measured with various datasets. When compared to state-of-the-art solutions, this implementation offers up to 3.2× speedup over a 6-core CPU, and has a better energy efficiency as compared with a GPU. Our preliminary results on the new XCKU115 FPGA are even more promising: they demonstrate a comparable performance with a state-of-the-art HDL FPGA implementation and better performance compared to the GPU.","PeriodicalId":405953,"journal":{"name":"2017 IEEE 28th International Conference on Application-specific Systems, Architectures and Processors (ASAP)","volume":"3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126328164","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Efficiency in ILP processing by using orthogonality","authors":"Marcel Brand, Frank Hannig, Alexandru Tanase, J. Teich","doi":"10.1109/ASAP.2017.7995282","DOIUrl":"https://doi.org/10.1109/ASAP.2017.7995282","url":null,"abstract":"For the next generations of Processor-Arrays-on-Chip (e. g., coarse-grained reconfigurable or programmable arrays)—including more than 100s to 1000s of processing elements—it is very important to keep the on-chip configuration/instruction memories as small as possible. Hence, compilers must take into account the scarceness of available instruction memory and create the code as compact as possible [1]. However, Very Long Instruction Word (VLIW) processors have the well-known problem that compilers typically produce lengthy codes. A lot of unnecessary code is produced due to unused Functional Units (FUs) or repeating operations for single FUs in instruction sequences. Techniques like software pipelining can be used to improve the utilization of the FUs, yet with the risk of code explosion [2] due to the overlapped scheduling of multiple loop iterations or other control flow statements. This is, where our proposed Orthogonal Instruction Processing (OIP) architecture (see Fig. 1) shows benefits in reducing the code size of compute-intensive loop programs. The idea is, contrary to lightweight VLIW processors used in arrays like Tightly Coupled Processor Arrays (TCPAs) [4], to equip each FU with its own instruction memory, branch unit, and program counter, but still let the FUs share the register files as well as input and output signals. This enables a processor to orthogonally execute a loop program. Each FU can execute its own sub-program while exchanging data over the register files. The branch unit and its instruction format have to be slightly changed by introducing a counter to each instruction that determines how often the instruction is repeated until the specified branch is executed. This enables repeating instructions without repeating them in the code. Those kind of processors have to be carefully programmed, e. g., to not run into data dependency problems while optimizing throughput. For solving this resource-constrained modulo scheduling problem, we use techniques based on mixed integer linear programming [5], [3].","PeriodicalId":405953,"journal":{"name":"2017 IEEE 28th International Conference on Application-specific Systems, Architectures and Processors (ASAP)","volume":"30 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132944435","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Hardware design and analysis of efficient loop coarsening and border handling for image processing","authors":"M. A. Ozkan, Oliver Reiche, Frank Hannig, J. Teich","doi":"10.1109/ASAP.2017.7995273","DOIUrl":"https://doi.org/10.1109/ASAP.2017.7995273","url":null,"abstract":"Field Programmable Gate Arrays (FPGAs) excel at the implementation of local operators in terms of throughput per energy since the off-chip communication can be reduced with an application-specific on-chip memory configuration. Furthermore, data-level parallelism can efficiently be exploited through so-called loop coarsening, which processes multiple horizontal pixels simultaneously. Moreover, existing solutions for proper border handling in hardware show considerable resource overheads. In this paper, we first propose novel architectures for image border handling and loop coarsening, which can significantly reduce area. Second, we present a systematic analysis of these architectures including the formulation of analytical models for their area usage. Based on these models, we provide an algorithm for suggesting the most efficient hardware architecture for a given specification. Finally, we evaluate several implementations of our proposed architectures obtained through Vivado High-Level Synthesis (HLS). The synthesis results show that the proposed coarsening architecture uses 32% less registers for a 5-by-5 convolution with a 64 coarsening factor compared to previous works, whereas the proposed border handling architectures facilitate a decrease in the Look-up Table (LUT) usage by 36 %.","PeriodicalId":405953,"journal":{"name":"2017 IEEE 28th International Conference on Application-specific Systems, Architectures and Processors (ASAP)","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114182345","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}