An Adaptive Dynamic Scheduling Scheme for H.264/AVC Decoding on Multicore Architecture

2012 IEEE International Conference on Multimedia and Expo. Pub Date: 2012-07-09. DOI: 10.1109/ICME.2012.9

Dung Vu, Jilong Kuang, L. Bhuyan


Citations: 4

Abstract

Parallelizing H.264/AVC decoding on multicore architectures is challenged by inherent structural and functional dependencies at both the frame and macro-block levels, because macro-blocks and certain frame types must be decoded in sequential order. The dynamic scheduling scheme with recursive tail submit, one of the best existing algorithms, provides good throughput by exploiting macro-block level parallelism and mitigating global queue contention. Nevertheless, it falls short of optimal performance for two reasons: 1) it uses a global queue, which incurs substantial synchronization overhead as the number of cores increases, and 2) it is unaware of cache locality with respect to the underlying hierarchical core/cache topology, which results in unnecessary latency, communication cost and load imbalance. In this paper, we propose an adaptive dynamic scheduling scheme that employs multiple local queues to reduce lock contention and assigns tasks in a cache-locality-aware, load-balancing fashion so that neighboring macro-blocks are preferably dispatched to nearby cores. We design, implement and evaluate our scheme on a 32-core cc-NUMA SGI server. Running real benchmark applications, we observe that our scheme produces higher throughput and lower latency than existing alternatives, with a more balanced workload and less communication cost.
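The abstract does not spell out the assignment policy, so the following is only a minimal sketch of the general idea: one local queue per core, with each new macro-block task placed on the nearest core whose queue is not overloaded. The class name `LocalQueueScheduler`, the toy three-level `distance` function, the `cores_per_group` grouping and the `imbalance_limit` threshold are all illustrative assumptions, not the authors' design. The sketch is single-threaded; in a real decoder each local queue would carry its own lock in place of a single global one.

```python
from collections import deque


class LocalQueueScheduler:
    """Hypothetical sketch: one task queue per core, cores grouped by shared cache."""

    def __init__(self, num_cores, cores_per_group, imbalance_limit):
        # One local queue per core instead of a single global queue.
        self.queues = [deque() for _ in range(num_cores)]
        # Assumption: consecutive core ids share a cache.
        self.cores_per_group = cores_per_group
        # Assumption: a queue is "overloaded" once it holds this many
        # more tasks than the shortest queue.
        self.imbalance_limit = imbalance_limit

    def distance(self, a, b):
        """Toy topology distance: 0 same core, 1 same cache group, 2 otherwise."""
        if a == b:
            return 0
        if a // self.cores_per_group == b // self.cores_per_group:
            return 1
        return 2

    def pick_core(self, producer_core):
        """Pick the nearest core to the producer whose queue is not overloaded."""
        shortest = min(len(q) for q in self.queues)
        candidates = [
            core for core, q in enumerate(self.queues)
            if len(q) - shortest < self.imbalance_limit
        ]
        # Prefer locality first, then the shorter queue, then the lower core id.
        return min(
            candidates,
            key=lambda c: (self.distance(producer_core, c), len(self.queues[c]), c),
        )

    def submit(self, macroblock, producer_core):
        """Enqueue a ready macro-block and return the core it was assigned to."""
        core = self.pick_core(producer_core)
        self.queues[core].append(macroblock)
        return core


if __name__ == "__main__":
    sched = LocalQueueScheduler(num_cores=4, cores_per_group=2, imbalance_limit=2)
    # Core 0 produces five ready macro-blocks in a row.
    print([sched.submit((0, x), producer_core=0) for x in range(5)])
```

In this toy setting, tasks stay on the producing core until its queue exceeds the imbalance limit, then spill to the core sharing its cache, and only then to a remote core. That ordering is one way to trade locality against load balance; the paper's actual policy may differ.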

