M. Aguilar, Juan Fernando Eusse Giraldo, Projjol Ray, R. Leupers, G. Ascheid, Weihua Sheng, Prashant Sharma
{"title":"Parallelism extraction in embedded software for android devices","authors":"M. Aguilar, Juan Fernando Eusse Giraldo, Projjol Ray, R. Leupers, G. Ascheid, Weihua Sheng, Prashant Sharma","doi":"10.1109/SAMOS.2015.7363654","DOIUrl":"https://doi.org/10.1109/SAMOS.2015.7363654","url":null,"abstract":"In the last years the presence of embedded devices in everyday life has grown exponentially. The market of these devices imposes conflicting requirements such as cost, performance and energy. The use of Multiprocessor Systems on Chip (MPSoCs) is a widely accepted solution to provide a trade-off between these demands. However, programming MPSoCs is still a cumbersome task. Several research efforts have addressed this challenge in two complementary directions: paradigms for parallel programming and tools for parallelism extraction. However, most of these efforts are focused on the high performance domain and they do not consider the characteristics of the underlying platform. In this paper, we present an approach to extract multiple forms of parallelism from sequential C code, which is applied to widespread Android mobile devices. We show the effectiveness of our work by parallelizing relevant embedded benchmarks on a quad-core Nexus 7 tablet.","PeriodicalId":346802,"journal":{"name":"2015 International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation (SAMOS)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-07-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129472216","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"GPU implementation of an anisotropic Huber-L1 dense optical flow algorithm using OpenCL","authors":"Duygu Buyukaydin, Toygar Akgün","doi":"10.1109/SAMOS.2015.7363693","DOIUrl":"https://doi.org/10.1109/SAMOS.2015.7363693","url":null,"abstract":"Optical flow estimation aims at inferring a dense pixel-wise correspondence field between two images or video frames. It is commonly used in video processing and computer vision applications, including motion-compensated frame processing, extracting temporal features, computing stereo disparity, understanding scene context/dynamics and understanding behavior. Dense optical flow estimation is a computationally complex problem. Fortunately, a wide range of optical flow estimation algorithms are embarrassingly parallel and can efficiently be accelerated on GPUs. In this work we discuss a massively multi-threaded GPU implementation of the anisotropic Huber-L1 optical flow estimation algorithm using OpenCL framework, which achieves per frame execution time speed-up factors up to almost 300×. Overall algorithm flow, GPU specific implementation details and performance results are presented.","PeriodicalId":346802,"journal":{"name":"2015 International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation (SAMOS)","volume":"39 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-07-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115292213","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Software fault tolerance for FPUs via vectorization","authors":"Zhi Chen, R. Inagaki, A. Nicolau, A. Veidenbaum","doi":"10.1109/SAMOS.2015.7363677","DOIUrl":"https://doi.org/10.1109/SAMOS.2015.7363677","url":null,"abstract":"Future generation processors are expected to have high soft error rates and will require increased fault detection and fault tolerance. This work focuses on errors in execution units. Hardware or software duplication or triplication, parity, or residue codes could be used to detect errors in execution units. However, hardware duplication/triplication have significant area overhead and, in applications with high utilization of floating point units (FPU), very high energy cost. Software duplication/triplication of instructions also increases both execution time and energy consumption. This paper proposes to reduce the cost of redundant instruction execution in FPUs through vectorization. Duplicated or triplicated instructions and result comparisons can be packed by a compiler into vector instructions, such as SSE or AVX. Experimental results using hand vectorization on a variety of benchmarks show that, compared to error detection through scalar instruction duplication, vector mode redundant execution achieves 1.78× and 2.73× average speedup for SSE and AVX instructions, respectively. It also significantly reduces the energy consumption, by an average of 40% and 53%, respectively, for SSE and AVX. Thus the proposed technique enables error detection with no hardware cost and reduced time and energy overhead compared to brute-force scalar instruction duplication.","PeriodicalId":346802,"journal":{"name":"2015 International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation (SAMOS)","volume":"3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-07-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117327099","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
C. Aoun, Liliana Andrade, Torsten Mähne, F. Pêcheux, M. Louërat, A. Vachoux
{"title":"Pre-simulation elaboration of heterogeneous systems: The SystemC multi-disciplinary virtual prototyping approach","authors":"C. Aoun, Liliana Andrade, Torsten Mähne, F. Pêcheux, M. Louërat, A. Vachoux","doi":"10.1109/SAMOS.2015.7363686","DOIUrl":"https://doi.org/10.1109/SAMOS.2015.7363686","url":null,"abstract":"Designers of the upcoming digital-centric More-than-Moore systems are lacking a common design and simulation environment able to efficiently manage all the multi-disciplinary aspects of its components of various nature that closely interact with each other. A key to successful design and verification lies in a SystemC-based virtual prototyping environment that is able to simulate a complex heterogeneous system as a whole, for which each component is described and solved using the most appropriate Model of Computation (MoC). In this paper, we present a new generic MoC-independent elaboration scheme that aims at preparing a Virtual Prototype (VP) for simulation. It requires to check the correct composition of the system model through dimensional analysis, to explore the model structure to identify involved MoC and interfaces between MoCs, and to detect the underlying dependencies. Eventually, information extracted from the exploration allow the instantiation of MoC-specific solvers. To soundly handle the global model execution with a Discrete Event (DE) kernel as the main solver, synchronization mechanisms with master-slave semantics within the model structure are implicitly deduced.","PeriodicalId":346802,"journal":{"name":"2015 International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation (SAMOS)","volume":"150 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-07-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134102493","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Chip-independent Error Correction in main memories","authors":"Mehrtash Manoochehri, M. Dubois","doi":"10.1109/SAMOS.2015.7363674","DOIUrl":"https://doi.org/10.1109/SAMOS.2015.7363674","url":null,"abstract":"Main memory reliability is an important concern in today's computer systems. Error Correction Codes (ECCs) improve memory reliability but have high area and energy overheads. Furthermore, ECCs cannot be easily applied to memories with wide chips such as stacked memories. In this paper, we introduce a new low-overhead error correction scheme, which can easily be applied to DRAM memories with wide devices. The scheme is called Chip-Independent Error Correction (CIEC) because it is independent of the memory chip width. Our simulation results in the context of transient faults show that CIEC has only 4.5% energy overhead, 0.5% performance overhead, and 0.7% area overhead on the processor chip as compared to a non-ECC DIMM while its reliability is much higher than the reliability of non-ECC DIMMs.","PeriodicalId":346802,"journal":{"name":"2015 International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation (SAMOS)","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-07-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133602144","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
M. Duric, Milan Stanic, Ivan Ratković, Oscar Palomar, O. Unsal, A. Cristal, M. Valero, Aaron Smith
{"title":"Imposing coarse-grained reconfiguration to general purpose processors","authors":"M. Duric, Milan Stanic, Ivan Ratković, Oscar Palomar, O. Unsal, A. Cristal, M. Valero, Aaron Smith","doi":"10.1109/SAMOS.2015.7363658","DOIUrl":"https://doi.org/10.1109/SAMOS.2015.7363658","url":null,"abstract":"Mobile devices execute applications with diverse compute and performance demands. This paper proposes a general purpose processor that adapts the underlying hardware to a given workload. Existing mobile processors need to utilize more complex heterogeneous substrates to deliver the demanded performance. They incorporate different cores and specialized accelerators. On the contrary, our processor utilizes only modest homogeneous cores and dynamically provides an execution substrate suitable to accelerate a particular workload. Instead of incorporating accelerators, the processor reconfigures one or more cores into accelerators on-the-fly. It improves performance with minimal hardware additions. The accelerators are made of general purpose ALUs reconfigured into a compute fabric and the general purpose pipeline that streams data through the fabric. To enable reconfiguration of ALUs into the fabric, the floorplan of a 4-core processor is changed to place the ALUs in close proximity on the chip. A configurable switched network is added to couple and dynamically reconfigure the ALUs to perform computation of frequently repeated regions, instead of executing general purpose instructions. Through this reconfiguration, the mobile processor specializes its substrate for a given workload and maximizes performance of the existing resources. Our results show that reconfiguration accelerates a set of selected compute intensive workloads by 1.56×, 2.39×, and 3.51× when configuring the accelerator of 1, 2, or 4 cores, respectively.","PeriodicalId":346802,"journal":{"name":"2015 International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation (SAMOS)","volume":"31 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-07-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131551818","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Miguel Bordallo López, A. Nieto, O. Silvén, J. Boutellier, D. L. Vilariño
{"title":"Reconfigurable computing for future vision-capable devices","authors":"Miguel Bordallo López, A. Nieto, O. Silvén, J. Boutellier, D. L. Vilariño","doi":"10.1109/SAMOS.2015.7363657","DOIUrl":"https://doi.org/10.1109/SAMOS.2015.7363657","url":null,"abstract":"Mobile devices have been identified as promising platforms for interactive vision-based applications. However, this type of applications still pose significant challenges in terms of latency, throughput and energy-efficiency. In this context, the integration of reconfigurable architectures on mobile devices allows dynamic reconfiguration to match the computation and data flow of interactive applications, demonstrating significant performance benefits compared to general purpose architectures. This paper presents concepts laying on platform level adaptability, exploring the acceleration of vision-based interactive applications through the utilization of three reconfigurable architectures: A low-power EnCore processor with a Configurable Flow Accelerator co-processor, a hybrid reconfigurable SIMD/MIMD platform and Transport-Triggered Architecture-based processors. The architectures are evaluated and compared with current processors, analyzing their advantages and weaknesses in terms of performance and energy-efficiency when implementing highly interactive vision-based applications. The results show that the inclusion of reconfigurable platforms on mobile devices can enable the computation of several computationally heavy tasks with high performance and small energy consumption while providing enough flexibility.","PeriodicalId":346802,"journal":{"name":"2015 International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation (SAMOS)","volume":"29 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-07-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127791102","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Tervel: A unification of descriptor-based techniques for non-blocking programming","authors":"S. Feldman, P. Laborde, D. Dechev","doi":"10.1109/SAMOS.2015.7363668","DOIUrl":"https://doi.org/10.1109/SAMOS.2015.7363668","url":null,"abstract":"The development of non-blocking code is difficult; developers must ensure the progress of an operation on shared memory despite conflicting operations. Managing this shared memory in a non-blocking fashion is even more problematic. The non-blocking property guarantees that progress is made toward the desired operation in a finite amount of time. We present a framework that implements memory reclamation and progress assurance for code that follows the semantics of our framework. This reduces the effort required to implement non-blocking, and more specifically wait-free, algorithms. We also present a library that demonstrates the ease with which wait-free algorithms can be implemented using our framework.","PeriodicalId":346802,"journal":{"name":"2015 International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation (SAMOS)","volume":"2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-07-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129128148","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Hardware task migration module for improved fault tolerance and predictability","authors":"Shyamsundar Venkataraman, Rui Santos, Akash Kumar, Jasper Kuijsten","doi":"10.1109/SAMOS.2015.7363676","DOIUrl":"https://doi.org/10.1109/SAMOS.2015.7363676","url":null,"abstract":"Task migration has been applied as an efficient mechanism to handle faulty processing elements (PEs) in Multi-processor Systems-on-Chip (MPSoCs). However, current task migration solutions are either implemented or emulated in software, compromising intrinsically the predictability and degrading the system robustness. Moreover, the initial placement and mapping of the tasks in the MPSoC plays an important role in minimising the task migration overhead and overall system energy. This paper proposes a hardware-based task migration scheme for MPSoC systems, offering better predictability as well as an improved method of fault tolerance. The proposed scheme intelligently generates an initial placement for the tasks with improved fault tolerance and stores these mappings on a hash map, which is looked up at run-time as and when faults occur. Compared with the state-of-the-art, our scheme performs up to 1500× faster task migration without any significant overheads.","PeriodicalId":346802,"journal":{"name":"2015 International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation (SAMOS)","volume":"82 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-07-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126278755","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An FPGA-based systolic array to accelerate the BWA-MEM genomic mapping algorithm","authors":"Ernst Houtgast, V. Sima, K. Bertels, Z. Al-Ars","doi":"10.1109/SAMOS.2015.7363679","DOIUrl":"https://doi.org/10.1109/SAMOS.2015.7363679","url":null,"abstract":"We present the first accelerated implementation of BWA-MEM, a popular genome sequence alignment algorithm widely used in next generation sequencing genomics pipelines. The Smith-Waterman-like sequence alignment kernel requires a significant portion of overall execution time. We propose and evaluate a number of FPGA-based systolic array architectures, presenting optimizations generally applicable to variable length Smith-Waterman execution. Our kernel implementation is up to 3× faster, compared to software-only execution. This translates into an overall application speedup of up to 45%, which is 96% of the theoretically maximum achievable speedup when accelerating only this kernel.","PeriodicalId":346802,"journal":{"name":"2015 International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation (SAMOS)","volume":"37 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-07-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129637952","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}