An Adaptive Dynamic Scheduling Scheme for H.264/AVC Decoding on Multicore Architecture

Dung Vu, Jilong Kuang, L. Bhuyan
{"title":"An Adaptive Dynamic Scheduling Scheme for H.264/AVC Decoding on Multicore Architecture","authors":"Dung Vu, Jilong Kuang, L. Bhuyan","doi":"10.1109/ICME.2012.9","DOIUrl":null,"url":null,"abstract":"Parallelizing H.264/AVC decoding on multicore architectures is challenged by its inherent structural and functional dependencies at both frame and macro-block levels, as macro-blocks and certain frame types must be decoded in a sequential order. So far, dynamic scheduling scheme with recursive tail submit, as one of the best existing algorithms, provides a good throughput performance by exploiting macro-block level parallelism and mitigating global queue contention. Nevertheless, it fails to achieve an optimal performance due to 1) the use of global queue, which incurs substantial synchronization overhead when the number of cores increases and 2) the unawareness of cache locality with respect to the underlying hierarchical core/cache topology that results in unnecessary latency, communication cost and load imbalance. In this paper, we propose an adaptive dynamic scheduling scheme that employs multiple local queues to reduce lock contention, and assigns tasks in a cache locality aware and load-balancing fashion so that neighboring macro-blocks are preferably dispatched to nearby cores. We design, implement and evaluate our scheme on a 32-core cc-NUMA SGI server. Compared to existing alternatives by running real benchmark applications, we observe that our scheme produces higher throughput and lower latency with more balanced workload and less communication cost.","PeriodicalId":273567,"journal":{"name":"2012 IEEE International Conference on Multimedia and Expo","volume":"14 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2012-07-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"4","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2012 IEEE International Conference on Multimedia and Expo","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICME.2012.9","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 4

Abstract

Parallelizing H.264/AVC decoding on multicore architectures is challenged by its inherent structural and functional dependencies at both frame and macro-block levels, as macro-blocks and certain frame types must be decoded in a sequential order. So far, dynamic scheduling scheme with recursive tail submit, as one of the best existing algorithms, provides a good throughput performance by exploiting macro-block level parallelism and mitigating global queue contention. Nevertheless, it fails to achieve an optimal performance due to 1) the use of global queue, which incurs substantial synchronization overhead when the number of cores increases and 2) the unawareness of cache locality with respect to the underlying hierarchical core/cache topology that results in unnecessary latency, communication cost and load imbalance. In this paper, we propose an adaptive dynamic scheduling scheme that employs multiple local queues to reduce lock contention, and assigns tasks in a cache locality aware and load-balancing fashion so that neighboring macro-blocks are preferably dispatched to nearby cores. We design, implement and evaluate our scheme on a 32-core cc-NUMA SGI server. Compared to existing alternatives by running real benchmark applications, we observe that our scheme produces higher throughput and lower latency with more balanced workload and less communication cost.
一种多核H.264/AVC解码的自适应动态调度方案
在多核架构上并行进行H.264/AVC解码受到帧和宏块级别的固有结构和功能依赖的挑战,因为宏块和某些帧类型必须按顺序解码。具有递归尾提交的动态调度方案利用宏块级并行性和减轻全局队列争用,具有良好的吞吐量性能,是目前已有的最佳调度算法之一。然而,由于以下原因,它无法实现最佳性能:1)使用全局队列,当内核数量增加时,会产生大量同步开销;2)相对于底层分层核心/缓存拓扑,无法感知缓存位置,导致不必要的延迟、通信成本和负载不平衡。在本文中,我们提出了一种自适应动态调度方案,该方案使用多个本地队列来减少锁争用,并以缓存位置感知和负载均衡的方式分配任务,从而使相邻的宏块更好地分配到附近的核心。我们在一个32核cc-NUMA SGI服务器上设计、实现和评估了我们的方案。与运行真实基准测试应用程序的现有替代方案相比,我们观察到我们的方案产生更高的吞吐量和更低的延迟,同时工作负载更平衡,通信成本更低。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 求助全文
来源期刊
自引率
0.00%
发文量
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术官方微信