PARMA-DITAM@HiPEAC: Latest Publications

Multithread Accelerators on FPGAs: A Dataflow-Based Approach

PARMA-DITAM@HiPEAC Pub Date : 1900-01-01 DOI: 10.4230/OASIcs.PARMA-DITAM.2022.6

Francesco Ratto, Stefano Esposito, Carlo Sau, L. Raffo, F. Palumbo

Citations: 1

HPC Application Cloudification: The StreamFlow Toolkit (Invited Paper)

PARMA-DITAM@HiPEAC Pub Date : 1900-01-01 DOI: 10.4230/OASIcs.PARMA-DITAM.2021.5

Iacopo Colonnelli, B. Cantalupo, Roberto Esposito, M. Pennisi, C. Spampinato, Marco Aldinucci

Citations: 1

Just-In-Time Composition of Reconfigurable Overlays (Invited Talk)

PARMA-DITAM@HiPEAC Pub Date : 1900-01-01 DOI: 10.4230/OASIcs.PARMA-DITAM.2022.2

Rafael Zamacola, A. Otero, Alfonso Rodríguez, E. D. L. Torre

Citations: 0

BifurKTM: Approximately Consistent Distributed Transactional Memory for GPUs

PARMA-DITAM@HiPEAC Pub Date : 1900-01-01 DOI: 10.4230/OASIcs.PARMA-DITAM.2021.2

Samuel Irving, Lu Peng, C. Busch, J. Peir

Abstract: We present BifurKTM, the first read-optimized Distributed Transactional Memory system for GPU clusters. The BifurKTM design includes: GPU KoSTM, a new software transactional memory conflict detection scheme that exploits relaxed consistency to increase throughput; and KoDTM, a Distributed Transactional Memory model that combines the Data- and Control-flow models to greatly reduce communication overheads. Despite the allure of huge speedups, GPUs are limited in use due to their programmability and extreme sensitivity to workload characteristics. These become daunting concerns when considering a distributed GPU cluster, wherein a programmer must design algorithms to hide communication latency by exploiting data regularity, high compute intensity, etc. The BifurKTM design allows GPU programmers to exploit a new workload characteristic: the percentage of the workload that is Read-Only (i.e., reads but does not modify shared memory), even when this percentage is not known in advance. Programmers designate transactions that are suitable for Approximate Consistency, in which transactions "appear" to execute at the most convenient time for preventing conflicts. By leveraging Approximate Consistency for Read-Only transactions, the BifurKTM runtime system offers improved performance, application flexibility, and programmability without introducing any errors into shared memory. Our experiments show that Approximate Consistency can improve BkTM performance by up to 34x in applications with moderate network communication utilization and a read-intensive workload. Using Approximate Consistency, BkTM can reduce GPU-to-GPU network communication by 99%, reduce the number of aborts by up to 100%, and achieve an average speedup of 18x over a similarly sized CPU cluster while requiring minimal effort from the programmer.

2012 ACM Subject Classification: Computer systems organization → Heterogeneous (hybrid) systems

Citations: 1

Towards Adaptive Multi-Alternative Process Network

PARMA-DITAM@HiPEAC Pub Date : 1900-01-01 DOI: 10.4230/OASIcs.PARMA-DITAM.2021.1

Hasna Bouraoui, Chadlia Jerad, J. Castrillón

Abstract: With the increase of voice-controlled systems, speech-based recognition applications are gaining more attention. Such applications need to adapt to hardware platforms to offer the required performance. Given the streaming nature of these applications, dataflow models are a common choice for model-based design and execution on parallel embedded platforms. However, most of today's models are built on top of classical static dataflow with adaptivity extensions to express data parallelism. In this paper, we define and describe an approach for algorithmic adaptivity to express richer sets of variants and trade-offs. For this, we introduce the multi-Alternative Process Network (mAPN), a high-level abstract representation where several process networks of the same application coexist. We describe an algorithm for automatic generation of all possible alternatives. The mAPN is enriched with metadata that annotates each alternative in terms of a specific metric, helping to extract the most suitable alternative depending on the available computational resources and application/user constraints. We motivate the approach using the automatic subtitling application (ASA) as a use case and run the experiments on an mAPN sample consisting of 12 randomly selected possible variants.

2012 ACM Subject Classification: Theory of computation → Streaming models

Citations: 1
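The mAPN abstract above describes picking the most suitable alternative from metadata, given the available resources and user constraints. A minimal sketch of that selection step follows. The `Alternative` fields, the example numbers, and the "fewest cores first" policy are all assumptions made for illustration; the abstract does not specify the metric or the selection rule.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Alternative:
    """One process-network variant with hypothetical metadata annotations."""
    name: str
    cores_needed: int
    latency_ms: float


# Hypothetical alternatives of the same application (numbers are made up).
ALTERNATIVES = [
    Alternative("serial", cores_needed=1, latency_ms=900.0),
    Alternative("split-2", cores_needed=2, latency_ms=500.0),
    Alternative("split-4", cores_needed=4, latency_ms=300.0),
]


def select_alternative(
    alternatives: Sequence[Alternative],
    available_cores: int,
    max_latency_ms: float,
) -> Optional[Alternative]:
    """Return the cheapest alternative that fits the platform and the constraint.

    Assumed policy: keep alternatives that fit the core budget and the latency
    bound, then prefer fewer cores, breaking ties by lower latency.
    """
    feasible = [
        a for a in alternatives
        if a.cores_needed <= available_cores and a.latency_ms <= max_latency_ms
    ]
    if not feasible:
        return None
    return min(feasible, key=lambda a: (a.cores_needed, a.latency_ms))
```

With four cores and a 600 ms bound, this policy returns the two-core variant; with a single core and the same bound, no alternative is feasible and the function returns `None`.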

Energy-Aware HEVC Software Decoding On Mobile Heterogeneous Multi-Cores Architectures

PARMA-DITAM@HiPEAC Pub Date : 1900-01-01 DOI: 10.4230/OASIcs.PARMA-DITAM.2022.4

Mohammed Bey Ahmed Khernache, Jalil Boukhobza, Yahia Benmoussa, D. Ménard

Abstract: Video content is becoming increasingly omnipresent on mobile platforms thanks to advances in mobile heterogeneous architectures. These platforms typically include limited rechargeable batteries which do not improve as fast as video content. Most state-of-the-art studies proposed solutions based on parallelism to exploit the GPP heterogeneity and DVFS to scale up/down the GPP frequency based on the video workload. However, some studies assume that information about the workload is available before decoding starts. Others do not exploit the asymmetric character of recent mobile architectures. To address these two challenges, we propose a solution based on classification and frequency scaling. First, a model to classify frames based on their type and size is built at design time. Second, this model is applied to each frame to decide which GPP cores will decode it. Third, the frequency of the chosen GPP cores is dynamically adjusted based on the output buffer size. Experiments on real-world mobile platforms show that the proposed solution can save more than 20% of energy (mJ/Frame) compared to the Ondemand Linux governor with a miss rate of less than 5%. Moreover, it needs less than one second of decoding to enter the stable state, and the overhead represents less than 1% of the frame decoding time.

Citations: 0
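The three steps in the HEVC decoding abstract above (classify each frame, pick the cores, scale frequency from the output buffer level) can be sketched as follows. This is an illustration under stated assumptions, not the authors' model: the threshold rule standing in for the design-time classifier, the frequency table, and the buffer watermarks are all hypothetical.

```python
from dataclasses import dataclass

# Hypothetical frequency steps and thresholds; real values are platform-specific.
FREQS_MHZ = [600, 1000, 1400, 1800]
SIZE_THRESHOLD_BYTES = 20_000
LOW_WATERMARK = 2    # fewer buffered frames than this: risk of missed deadlines
HIGH_WATERMARK = 6   # more buffered frames than this: decoding is ahead


@dataclass(frozen=True)
class Frame:
    kind: str        # "I", "P", or "B"
    size_bytes: int


def choose_cluster(frame: Frame) -> str:
    """Stand-in for the design-time classifier: heavy frames go to big cores."""
    if frame.kind == "I" or frame.size_bytes > SIZE_THRESHOLD_BYTES:
        return "big"
    return "little"


def next_freq_index(current: int, buffered_frames: int) -> int:
    """Step the frequency index up or down based on the output buffer level."""
    if buffered_frames < LOW_WATERMARK:
        return min(current + 1, len(FREQS_MHZ) - 1)   # speed up, clamp at max
    if buffered_frames > HIGH_WATERMARK:
        return max(current - 1, 0)                    # slow down, clamp at min
    return current                                    # buffer is comfortable
```

In a decoder loop, `choose_cluster` would run once per frame before dispatch, and `next_freq_index` would run after each decoded frame is pushed to the output buffer.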

Efficient Memory Management for Modelica Simulations

PARMA-DITAM@HiPEAC Pub Date : 1900-01-01 DOI: 10.4230/OASIcs.PARMA-DITAM.2022.7

Michele Scuttari, Nicola Camillucci, Daniele Cattaneo, F. Terraneo, G. Agosta

Abstract: The ever-increasing usage of simulations to produce digital twins of physical systems led to the creation of specialized equation-based modeling languages such as Modelica. However, compilers of such languages often generate code that exploits the garbage collection memory management paradigm, which introduces significant runtime overhead. In this paper we explain how to improve the memory management approach of the automatically generated simulation code. This is achieved by addressing two different aspects. One regards the reduction of the heap memory usage, which is obtained by modifying functions whose resulting arrays could instead be allocated on the stack by the caller. The other aspect regards the possibility of avoiding garbage collection altogether by performing all memory lifetime tracking statically. We implement our approach in a prototype Modelica compiler, achieving an improvement of the memory management overhead of over 10 times compared to a garbage-collected solution, and an improvement of 56 times compared to the production-grade compiler OpenModelica.

2012 ACM Subject Classification: Software and its engineering → Compilers; Computing methodologies → and

Citations: 0

Precision Tuning in Parallel Applications

PARMA-DITAM@HiPEAC Pub Date : 1900-01-01 DOI: 10.4230/OASIcs.PARMA-DITAM.2022.5

Gabriele Magnani, Lev Denisov, Daniele Cattaneo, G. Agosta

Citations: 0

SO(DA)2: End-to-end Generation of Specialized Reconfigurable Architectures (Invited Talk)

PARMA-DITAM@HiPEAC Pub Date : 1900-01-01 DOI: 10.4230/OASIcs.PARMA-DITAM.2022.1

Antonino Tumeo, Nicolas Bohm Agostini, S. Curzel, Ankur Limaye, Cheng Tan, Vinay C. Amatya, Marco Minutoli, Vito Giovanni Castellana, Ang Li, J. Manzano

Abstract: Modern data analysis applications are complex workflows composed of algorithms with diverse behaviors. They may include digital signal processing, data filtering, reduction, compression, graph algorithms, and machine learning. Their performance is highly dependent on the volume, the velocity, and the structure of the data. They are used in many different domains (from small, embedded devices, to large-scale, high-performance computing systems) but in all cases they need to provide answers with very low latency to enable real-time decision making and autonomy. Coarse-grained reconfigurable arrays (CGRAs), i.e., architectures composed of functional units able to perform complex operations interconnected through a network-on-chip and configure the datapath to map complex kernels, are a promising platform to accelerate these applications thanks to their adaptability. They provide higher flexibility than application-specific integrated circuits (ASICs) while offering increased energy efficiency and faster reconfiguration speed with respect to field-programmable gate arrays (FPGAs). However, designing and specializing CGRAs requires significant effort. The inherent flexibility of these devices makes the application mapping process as important as the hardware design generation. To obtain efficient systems, approaches that simultaneously consider software and hardware optimizations are necessary. In this paper, we discuss the Software Defined Architectures for Data Analytics (SO(DA)2) toolchain, an end-to-end hardware/software codesign framework to generate custom reconfigurable architectures for data analytics applications. SO(DA)2 is composed of a high-level compiler (SODA-OPT) and a hardware generator (OpenCGRA) and can automatically explore and generate optimal CGRA designs starting from high-level programming frameworks. SO(DA)2 considers partial dynamic reconfiguration as a key element of the system design. We discuss the various elements of the framework and demonstrate the flow on the case study of a partial dynamic reconfigurable CGRA design for data streaming applications.

Acknowledgements: The research described in this paper is part of the Data-Model Convergence (DMC) Initiative at Pacific Northwest National Laboratory. It was conducted under the Laboratory Directed Research and Development Program at PNNL, a multiprogram national laboratory operated by Battelle for the U.S. Department of Energy.

Citations: 1

The Impact of Precision Tuning on Embedded Systems Performance: A Case Study on Field-Oriented Control

PARMA-DITAM@HiPEAC Pub Date : 1900-01-01 DOI: 10.4230/OASIcs.PARMA-DITAM.2021.3

Gabriele Magnani, Daniele Cattaneo, M. Chiari, G. Agosta

Abstract: Field Oriented Control (FOC) is an industry-standard strategy for controlling induction motors and other kinds of AC-based motors. This control scheme has a very high arithmetic intensity when implemented digitally; in particular, it requires the use of trigonometric functions. This requirement contrasts with the necessity of increasing the control step frequency when required, and the minimization of power consumption in applications where conserving battery life is paramount, such as drones. However, it also makes FOC well suited for optimization using precision tuning techniques. Therefore, we exploit the state-of-the-art FixM methodology to optimize a miniapp simulating a typical FOC application by applying precision tuning of trigonometric functions. The FixM approach itself was extended in order to implement additional algorithm choices to enable a trade-off between execution time and code size. With the application of FixM on the miniapp, we achieved a speedup of up to 278%, at the cost of an output error of less than 0.1%.

2012 ACM Subject Classification: Hardware → Power estimation and optimization; Software and its engineering → Compilers; Applied computing → Consumer health

Citations: 4
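To make the idea of precision-tuned trigonometric functions in the FOC abstract above concrete, here is a generic fixed-point sine sketch. It is not the FixM methodology, whose algorithms the abstract does not detail; the Q16 format, the degree-7 Taylor polynomial, and the restriction to inputs in [-π/2, π/2] are assumptions chosen for illustration.

```python
import math

FRAC_BITS = 16            # assumed Q16 format: 16 fractional bits
ONE = 1 << FRAC_BITS      # fixed-point representation of 1.0


def to_fixed(x: float) -> int:
    """Convert a float to Q16 by scaling and rounding."""
    return int(round(x * ONE))


def to_float(q: int) -> float:
    """Convert a Q16 value back to float."""
    return q / ONE


def fx_mul(a: int, b: int) -> int:
    """Multiply two Q16 values, rescaling the product back to Q16."""
    return (a * b) >> FRAC_BITS


def fx_sin(q: int) -> int:
    """Integer-only sine for |x| <= pi/2.

    Evaluates x - x^3/6 + x^5/120 - x^7/5040 in Horner form:
    x * (1 - x^2/6 * (1 - x^2/20 * (1 - x^2/42))).
    """
    x2 = fx_mul(q, q)
    t = ONE - x2 // 42
    t = ONE - fx_mul(x2 // 20, t)
    t = ONE - fx_mul(x2 // 6, t)
    return fx_mul(q, t)


def max_abs_error(samples: int = 201) -> float:
    """Largest deviation from math.sin over evenly spaced points in [-pi/2, pi/2]."""
    worst = 0.0
    for i in range(samples):
        x = -math.pi / 2 + math.pi * i / (samples - 1)
        worst = max(worst, abs(to_float(fx_sin(to_fixed(x))) - math.sin(x)))
    return worst
```

The design trade-off is the one the abstract names: a lower polynomial degree or fewer fractional bits cuts execution time and code size at the price of a larger output error, which `max_abs_error` makes measurable.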