A Structured Analysis of the Video Degradation Effects on the Performance of a Machine Learning-enabled Pedestrian Detector

C. Berger
2021 47th Euromicro Conference on Software Engineering and Advanced Applications (SEAA)
DOI: 10.1109/SEAA53835.2021.00053
Published: 2021-06-30
Citations: 0

Abstract

Machine Learning (ML)-enabled software systems have been incorporated into many public demonstrations of automated driving (AD) systems. Such solutions are also considered a crucial approach for reaching SAE Level 5 systems, in which passengers no longer have to interact with the vehicle at all. As early as 2016, Nvidia demonstrated an end-to-end approach for training the complete software stack covering perception, planning and decision making, and the actual vehicle control. While such approaches show the great potential of ML-enabled systems, there have also been demonstrations in which changes to single pixels in a video frame can lead to completely different decisions, with dangerous consequences in the worst case. In this paper, a structured analysis is conducted to explore the effects of video degradation on the performance of an ML-enabled pedestrian detector. First, a baseline is established by applying “You only look once” (YOLO) to 1,026 frames with pedestrian annotations from the KITTI Vision Benchmark Suite. Next, video degradation candidates for each of these frames are generated using the leading video compression codecs libx264, libx265, Nvidia HEVC, and AV1: 52 frames across the various compression presets for color frames and 52 frames for gray-scale frames, yielding 104 degradation candidates per original KITTI frame and codec, and 426,816 images in total. YOLO is applied to each image, and the intersection-over-union (IoU) metric is computed to compare the performance against the original baseline. While aggressively lossy compression settings result in significant performance drops, as expected, it is also observed that some configurations actually yield slightly better IoU results than the baseline.
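The comparison metric used throughout, intersection-over-union, can be sketched as below for axis-aligned bounding boxes. This is an illustrative stand-in, not the paper's code: the `(x1, y1, x2, y2)` box format is an assumption, and KITTI annotations and YOLO detections would first need converting into it.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned boxes in (x1, y1, x2, y2) form.

    Returns a value in [0, 1]: 1.0 for identical boxes, 0.0 for disjoint ones.
    """
    # Corners of the intersection rectangle.
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    # Clamp to zero when the boxes do not overlap.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    if inter == 0.0:
        return 0.0
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)
```

A baseline detection and its counterpart on a degraded frame can then be compared directly, e.g. `iou(baseline_box, degraded_box)`; a drop toward 0 indicates the compression has shifted or destroyed the detection.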
Hence, while related work in the literature has demonstrated the potentially negative consequences of even simple modifications to video data for ML-enabled systems, the findings from this work show that carefully chosen lossy video configurations preserve decent performance for particular ML-enabled systems while allowing for substantial savings when storing or transmitting data. Such aspects are of crucial importance when, for example, video data needs to be collected wirelessly from multiple vehicles and lossy video codecs are required to cope with bandwidth limitations.