Saehanseul Yi, Illo Yoon, Chanyoung Oh, Youngmin Yi
{"title":"Real-time integrated face detection and recognition on embedded GPGPUs","authors":"Saehanseul Yi, Illo Yoon, Chanyoung Oh, Youngmin Yi","doi":"10.1109/ESTIMedia.2014.6962350","DOIUrl":"https://doi.org/10.1109/ESTIMedia.2014.6962350","url":null,"abstract":"Face detection and face recognition are now widely used in applications such as biometrics, surveillance, security, advertising, and entertainment. The ever-increasing input image size in face detection and the large input database in face recognition demand ever more computational power for real-time processing. Recently, embedded GPUs have started to support OpenCL, so many applications can be accelerated on them as they have been on server GPUs. In this paper, we propose several optimization techniques for Local Binary Pattern (LBP) based integrated face detection and recognition algorithms, and accelerate them to 22 fps using OpenCL on an ARM Mali GPU and 38 fps using CUDA on a Tegra K1 GPU for HD inputs. This corresponds to speedups of 2.9 times and 3.7 times, respectively. To the best of our knowledge, this is the first paper to present the acceleration of face detection on embedded GPGPUs, and the first to present the performance of the Tegra K1 GPU.","PeriodicalId":265392,"journal":{"name":"2014 IEEE 12th Symposium on Embedded Systems for Real-time Multimedia (ESTIMedia)","volume":"37 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-11-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127630349","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Software platform for hybrid resource management of a many-core accelerator for multimedia applications","authors":"Sungchan Kim, Chanhee Lee, Taeyoung Kim, S. Ha","doi":"10.1109/ESTIMedia.2014.6962341","DOIUrl":"https://doi.org/10.1109/ESTIMedia.2014.6962341","url":null,"abstract":"As the incessant demand for higher computing capability makes the many-core accelerator a major computing resource in a System-on-Chip, a variety of many-core architectures and resource management techniques have been proposed recently. They usually assume a specific hardware architecture and a specific resource management scheme. In this paper, we propose a generic software platform that implements a hybrid resource management technique, targeting a wide range of many-core architectures. To evaluate the system performance more accurately before SoC fabrication, we run it on a virtual prototyping system. The actual implementation enables us to investigate the overheads involved in the proposed software platform. Preliminary experimental results confirm that the proposed software platform adapts effectively to dynamic workload variation through dynamic mapping of tasks and tolerates unexpected core failures through checkpointing. We give our perspective on future research issues to make the generic software platform a reality.","PeriodicalId":265392,"journal":{"name":"2014 IEEE 12th Symposium on Embedded Systems for Real-time Multimedia (ESTIMedia)","volume":"325 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-11-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134070923","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Youngsub Ko, Saehanseul Yi, Youngmin Yi, Myungsun Kim, S. Ha
{"title":"Hardware-in-the-loop simulation of Android GPGPU applications","authors":"Youngsub Ko, Saehanseul Yi, Youngmin Yi, Myungsun Kim, S. Ha","doi":"10.1109/ESTIMedia.2014.6962351","DOIUrl":"https://doi.org/10.1109/ESTIMedia.2014.6962351","url":null,"abstract":"Emerging mobile devices are likely to adopt a CPU-GPU heterogeneous architecture where an embedded GPU executes offloaded computations from the CPU as well as rendering tasks. For design space exploration of such a CPU-GPU heterogeneous architecture at the early design stage or for monitoring the dynamic behavior of a system, it is very desirable to run the same application software on a full system simulation platform without modification. Since simulations will be performed repetitively, a compromise must be made between simulation speed and timing accuracy. Since all known GPU simulators are very slow, in this paper, we propose a hardware-in-the-loop (HIL) simulation framework that integrates the CPU simulator with existing GPU hardware. A novel interfacing mechanism between the CPU simulator and the GPU hardware is devised to guarantee functional correctness. The proposed technique maintains the timing accuracy of computation workload as much as possible with an unavoidable penalty on the timing accuracy of CPU-GPU communication overhead. The proposed simulation framework is implemented with a gem5 full-system simulator and various kinds of GPGPU hardware. For a real-life scenario, we ported the Android platform to the proposed simulation framework and ran a face detection application that calls a native function via JNI. The native function can be written in CUDA or OpenCL if it will be offloaded to the GPU, or in Pthreads if it will be run on the CPU. Preliminary experiments show some use cases of the proposed simulation framework for design space exploration and dynamic behavior monitoring.","PeriodicalId":265392,"journal":{"name":"2014 IEEE 12th Symposium on Embedded Systems for Real-time Multimedia (ESTIMedia)","volume":"21 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-11-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124885854","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Andreas Tretter, Harshavardhan Pandit, Pratyush Kumar, L. Thiele
{"title":"Deterministic memory sharing in Kahn process networks: Ultrasound imaging as a case study","authors":"Andreas Tretter, Harshavardhan Pandit, Pratyush Kumar, L. Thiele","doi":"10.1109/ESTIMedia.2014.6962348","DOIUrl":"https://doi.org/10.1109/ESTIMedia.2014.6962348","url":null,"abstract":"Kahn process networks are a popular programming model for multi-core systems. They ensure determinacy of applications by restricting processes to separate memory regions, only allowing communication over FIFO channels. However, many modern multi-core platforms concentrate on shared memory as a means of communication and data exchange. In this work, we present a concept for deterministic memory sharing in Kahn process networks. It makes it possible to take advantage of shared memory data exchange mechanisms on such platforms while still preserving determinacy. We show how any Kahn process network can be transformed to use deterministic memory sharing by giving a set of transformations that can be applied selectively, only looking at one process at a time. We demonstrate how these techniques can be applied to an ultrasound image reconstruction algorithm. For an implementation on a test system, our technique yields significantly better performance combined with a drastically smaller memory footprint.","PeriodicalId":265392,"journal":{"name":"2014 IEEE 12th Symposium on Embedded Systems for Real-time Multimedia (ESTIMedia)","volume":"26 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-11-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126640257","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Alok Lele, Orlando Moreira, J. Bastos, Ricardo Almeida, P. Pedreiras, K. V. Berkel
{"title":"Analyzing preemptive fixed priority scheduling of data flow graphs","authors":"Alok Lele, Orlando Moreira, J. Bastos, Ricardo Almeida, P. Pedreiras, K. V. Berkel","doi":"10.1109/ESTIMedia.2014.6962345","DOIUrl":"https://doi.org/10.1109/ESTIMedia.2014.6962345","url":null,"abstract":"Data flow graphs can conveniently model embedded streaming applications (ESAs) that are typically implemented as networks of concurrent tasks having an iterative pipelined execution, where the activation of each task may be conditioned by intra- and inter-iteration data dependencies. We propose a novel analysis approach for preemptive Fixed Priority Scheduling (FPS) of multiple ESAs assuming a fixed mapping of tasks onto the processors of the underlying Heterogeneous Multi-Processor System-on-Chip (HMPSoC). The tasks of an ESA are event activated, have varying execution times, and participate in cyclic dependency chains such that they may not have an activation pattern that can be depicted using traditional periodic / sporadic event models. Instead we propose to characterize the data flow graphs of ESAs to upper bound the load they impose on a processor and use it to compute the worst-case response time of an actor executing on that processor at a lower priority. We show that ours is a generic approach for analyzing FPS of data flow graphs. We also propose a refinement of our technique for graphs with a dominant periodic source. We demonstrate our improvement over the state-of-the-art FPS analysis for data flow in our experiments.","PeriodicalId":265392,"journal":{"name":"2014 IEEE 12th Symposium on Embedded Systems for Real-time Multimedia (ESTIMedia)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-11-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129646956","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
C.C.-H. Hsu, Cheng-Yen Lin, Shin-Kai Chen, Chih-Wei Liu, Jenq-Kuen Lee
{"title":"Optimized memory access support for data layout conversion on heterogeneous multi-core systems","authors":"C.C.-H. Hsu, Cheng-Yen Lin, Shin-Kai Chen, Chih-Wei Liu, Jenq-Kuen Lee","doi":"10.1109/ESTIMedia.2014.6962353","DOIUrl":"https://doi.org/10.1109/ESTIMedia.2014.6962353","url":null,"abstract":"Heterogeneous multi-core systems that contain multiple CPUs and GPUs are gaining momentum, as they provide different computation power to meet the performance demands of modern applications. On such systems, developers try to fully utilize the computation power of both the CPU and the GPU by using emerging programming models such as CUDA and OpenCL. To achieve maximal performance, developers must carefully offload the appropriate workload to the compute devices according to the characteristics of the target architecture. In such a scenario, seamless data movement between different processors becomes crucial. Additionally, reorganizing the data layout to fit the target architecture, such as array-of-structure (AOS) for the CPU, structure-of-array (SOA) for the GPU, and coordinate (COO) to ELLPACK (ELL) format for sparse computation, addresses this concern. In this paper, we propose a hardware memory manager, which efficiently optimizes the conversion of data layouts for heterogeneous multi-core systems on-the-fly. We address coalescing and sparse format conversion issues in our design. A novel ping-pong transpose architecture is devised to reorganize non-coalescing access patterns, and a histogram unit and sparse address generator are presented to process sparse storage format transformation. Our design reduces the overhead of data transfer and layout transformation between the CPU and the GPU. In our experiments, our design achieves a speedup of 68.5 to 2.19 times compared with a software-based library, depending on data size.","PeriodicalId":265392,"journal":{"name":"2014 IEEE 12th Symposium on Embedded Systems for Real-time Multimedia (ESTIMedia)","volume":"23 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-11-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125696414","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}