CPU Microarchitectural Performance Characterization of Cloud Video Transcoding

Yuhan Chen, Jingyuan Zhu, Tanvir Ahmed Khan, Baris Kasikci

2020 IEEE International Symposium on Workload Characterization (IISWC), October 2020. DOI: 10.1109/IISWC50251.2020.00016
Video streaming accounts for more than 75% of all Internet traffic. Videos streamed to end users are encoded to reduce their size and use network bandwidth efficiently, and are decoded when played on end users' devices. Videos must be transcoded, i.e., converted from one encoding format to another, to fit users' differing needs for resolution, frame rate, and encoding format. Global streaming service providers (e.g., YouTube, Netflix, and Facebook) perform a large number of transcoding operations, so optimizing transcoding performance to gain a speedup of even a few percent can save millions of dollars in computational and energy costs. While prior work identified microarchitectural characteristics of the transcoding operation for different classes of videos, the other parameters of video transcoding and their impact on CPU performance have yet to be studied.

In this work, we investigate the microarchitectural performance of video transcoding with all videos from vbench, a publicly available cloud video benchmark suite. We profile the leading multimedia transcoding software, FFmpeg, across all of its major configurable parameters and across videos of different complexity (e.g., videos with high motion and frequent scene transitions are more complex). Based on our profiling results, we find key bottlenecks in the instruction cache, data cache, and branch prediction unit for video transcoding workloads, and we observe that these bottlenecks vary widely with the transcoding parameters.

We leverage several state-of-the-art compiler approaches to mitigate these bottlenecks. We apply AutoFDO, a feedback-directed optimization (FDO) tool, to improve instruction cache and branch prediction performance, and we leverage Graphite, a polyhedral optimizer, to improve data cache performance. Across all videos, AutoFDO and Graphite provide average speedups of 4.66% and 4.42%, respectively. We also simulate different microarchitecture configurations and explore the potential improvement from a smart scheduler that assigns each transcoding task to the best-fit configuration based on its transcoding parameter values. The smart scheduler performs 3.72% better than a random scheduler and matches the performance of the best scheduler 75% of the time.
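To make the kind of parameter sweep described above concrete, the sketch below drives FFmpeg over a few transcoding parameters (output resolution, frame rate, encoder preset) from Python. It is a minimal illustration, assuming an ffmpeg binary with libx264 support on PATH and a local input.mp4 test clip; the specific parameter values are examples, not the exact configurations profiled in the paper.

```python
import itertools
import subprocess

# Illustrative transcoding parameters; the paper sweeps FFmpeg's major
# configurable parameters, but these particular values are only examples.
resolutions = ["1280x720", "1920x1080"]
framerates = ["30", "60"]
presets = ["fast", "medium", "slow"]

for res, fps, preset in itertools.product(resolutions, framerates, presets):
    out = f"out_{res}_{fps}_{preset}.mp4"
    cmd = [
        "ffmpeg", "-y",
        "-i", "input.mp4",     # assumed local test clip
        "-c:v", "libx264",     # H.264 software encoder
        "-preset", preset,     # speed/quality trade-off
        "-s", res,             # output resolution
        "-r", fps,             # output frame rate
        out,
    ]
    # Each run is one transcoding job; in a characterization study each
    # job would be profiled (e.g., under `perf stat`) to collect
    # microarchitectural counters per parameter combination.
    subprocess.run(cmd, check=True)
```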
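The compiler optimizations mentioned in the abstract follow standard workflows: AutoFDO rebuilds a binary using a sample-based profile collected with Linux perf, and Graphite is enabled through GCC's polyhedral loop-optimization flags. The sketch below outlines that flow; the tool names (create_gcov) and flags shown are the commonly documented ones and vary by toolchain version, and the single-file compile at the end is a placeholder (FFmpeg itself would take these flags through its configure/make build), so treat this as an outline rather than the authors' exact build recipe.

```python
import subprocess

def run(cmd):
    """Echo and run one illustrative build/profiling step."""
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

# 1. Collect a branch-sampled profile of a representative transcode run
#    (perf's -b option records last-branch-record samples, which AutoFDO
#    consumes; requires hardware/perf support).
run(["perf", "record", "-b", "--",
     "./ffmpeg", "-i", "input.mp4", "-c:v", "libx264",
     "-preset", "medium", "out.mp4"])

# 2. Convert the perf profile into AutoFDO's format. create_gcov comes
#    from the AutoFDO tool suite; exact options differ across versions.
run(["create_gcov", "--binary=./ffmpeg",
     "--profile=perf.data", "--gcov=ffmpeg.afdo"])

# 3. Rebuild with the sample profile (AutoFDO) and with Graphite's
#    polyhedral loop optimizations enabled. hot_file.c is a placeholder
#    for whatever source the build system recompiles with these flags.
run(["gcc", "-O3", "-fauto-profile=ffmpeg.afdo",
     "-fgraphite-identity", "-floop-nest-optimize",
     "-c", "hot_file.c", "-o", "hot_file.o"])
```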
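The smart scheduler described above assigns each transcoding job to the microarchitecture configuration that best fits its parameters. The sketch below captures that idea in Python; the configuration names and the parameter-to-configuration table are hypothetical placeholders, not the simulated configurations or decision rules from the paper.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TranscodeJob:
    resolution: str   # e.g. "1920x1080"
    framerate: int    # e.g. 60
    preset: str       # e.g. "slow"

# Hypothetical lookup from a job's dominant parameter to a best-fit
# microarchitecture configuration (names are placeholders).
BEST_FIT = {
    "slow": "large-icache-config",
    "fast": "deep-branch-predictor-config",
}
DEFAULT_CONFIG = "baseline-config"

def schedule(job: TranscodeJob) -> str:
    """Pick a configuration from the job's transcoding parameters."""
    return BEST_FIT.get(job.preset, DEFAULT_CONFIG)

if __name__ == "__main__":
    jobs = [
        TranscodeJob("1920x1080", 60, "slow"),
        TranscodeJob("1280x720", 30, "fast"),
    ]
    for job in jobs:
        print(job, "->", schedule(job))
```

In the paper's evaluation such a parameter-aware policy is compared against a random assignment and an oracle that always picks the best configuration, which is where the 3.72% and 75% figures come from.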