On the efficiency of reductions in μ-SIMD media extensions

Proceedings 2001 International Conference on Parallel Architectures and Compilation Techniques. Pub Date: 2001-09-08. DOI: 10.1109/PACT.2001.953290

J. Corbal, R. Espasa, M. Valero

Citations: 8

Abstract

Many important multimedia applications contain a significant fraction of reduction operations. Although multimedia applications are generally characterized by large amounts of data-level parallelism, reductions and accumulations are difficult to parallelize and tolerate increases in instruction latency poorly. This is especially significant for μ-SIMD extensions such as MMX or AltiVec. To overcome the problem of reductions in μ-SIMD ISAs, designers tend to include increasingly complex instructions that handle the most common forms of reduction in multimedia. As the number of processor pipeline stages grows, the number of cycles needed to execute these multimedia instructions increases with every processor generation, severely compromising performance. The paper presents an in-depth discussion of how reductions and accumulations are performed in current μ-SIMD architectures and evaluates the performance trade-offs for near-future, highly aggressive superscalar processors with three different styles of μ-SIMD extensions. We compare an MMX-like alternative to an MDMX-like extension that has packed accumulators to attack the reduction problem, and we also compare it to MOM, a matrix register ISA. We show that while packed accumulators present several advantages, they introduce artificial recurrences that severely degrade performance for processors with a high number of registers and long-latency operations. On the other hand, the paper demonstrates that longer SIMD media extensions such as MOM can take great advantage of accumulators by exploiting the associative parallelism implicit in reductions.
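The recurrence problem and the associativity argument in the abstract can be illustrated with a small scalar sketch. This is not code from the paper and does not model MMX, MDMX, or MOM; the function names, the `lanes` parameter, and the chain-length model are assumptions chosen only to show why a single accumulator serializes a reduction while several partial accumulators shorten the dependence chain.

```python
# Illustrative sketch only: names, the lane count, and the cost model are
# assumptions, not taken from the paper or from any real ISA.

def dot_single_accumulator(a, b):
    """Dot product with one accumulator: each add depends on the previous one."""
    acc = 0
    for x, y in zip(a, b):
        acc += x * y  # serial recurrence through acc
    return acc


def dot_partial_accumulators(a, b, lanes=4):
    """Dot product with `lanes` independent partial sums, combined at the end.

    Integer addition is associative and commutative, so regrouping the sum
    does not change the result.
    """
    partial = [0] * lanes
    for i, (x, y) in enumerate(zip(a, b)):
        partial[i % lanes] += x * y  # each lane forms its own, shorter chain
    return sum(partial)  # final cross-lane reduction


def longest_add_chain(n, lanes):
    """Dependent additions on the longest chain under a simplified model:
    ceil(n / lanes) accumulations in a lane, then lanes - 1 adds to combine."""
    per_lane = -(-n // lanes)  # ceiling division
    return per_lane + (lanes - 1)
```

Under this simplified model, reducing 16 elements takes a chain of 16 dependent additions with one accumulator and 7 with four, so the cost of a longer add latency is paid fewer times in sequence. The regrouping is exact for integers; with floating-point values the two orderings may round differently.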
