Benchmarking deep neural networks for gesture recognition on embedded devices

Stefano Bini, Antonio Greco, Alessia Saggese, M. Vento

2022 31st IEEE International Conference on Robot and Human Interactive Communication (RO-MAN)
DOI: 10.1109/RO-MAN53752.2022.9900705 · Published 2022-08-29
Gestures are among the most widely used forms of human communication; in recent years, as factories adapt to the Industry 4.0 paradigm, the scientific community has shown growing interest in the design of Gesture Recognition (GR) algorithms for Human-Robot Interaction (HRI) applications. In this context, a GR algorithm must run in real time on embedded platforms with limited resources. However, in the available scientific literature, the proposed neural networks (2D and 3D) and the modalities used to feed them (RGB, RGB-D, optical flow) are typically designed to maximize accuracy, with little attention paid to their feasibility on low-power hardware. An analysis of the trade-off between accuracy and computational burden, for both networks and modalities, is therefore essential if GR algorithms are to work in industrial robotics applications. In this paper, we perform a wide benchmark focusing not only on accuracy but also on computational burden, covering two architectures (2D and 3D), two backbones (MobileNet, ResNeXt), four input modalities (RGB, Depth, Optical Flow, Motion History Image), and their combinations.
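Of the four modalities, the Motion History Image is the only one that is not a raw sensor stream: it folds a clip into a single grayscale frame whose pixel intensity encodes how recently motion occurred at each location. Below is a minimal NumPy sketch of one common discrete variant (frame differencing with a linear per-frame decay); the threshold and duration values are illustrative assumptions, not the paper's settings.

import numpy as np

def update_mhi(mhi, prev_gray, curr_gray, tau=30, threshold=25):
    """Fold one new grayscale frame into the Motion History Image.

    Pixels whose inter-frame difference exceeds `threshold` are set to
    `tau`; everywhere else the history decays by 1, so old motion fades.
    """
    diff = np.abs(curr_gray.astype(np.int16) - prev_gray.astype(np.int16))
    decayed = np.maximum(mhi.astype(np.int16) - 1, 0)  # avoid uint8 underflow
    return np.where(diff > threshold, tau, decayed).astype(np.uint8)

# Usage: collapse a dummy 16-frame clip into one single-channel image,
# which can then be fed to a 2D network like any grayscale input.
frames = np.random.randint(0, 256, size=(16, 112, 112), dtype=np.uint8)
mhi = np.zeros((112, 112), dtype=np.uint8)
for prev_f, curr_f in zip(frames, frames[1:]):
    mhi = update_mhi(mhi, prev_f, curr_f)

Because the MHI compresses temporal information into a single frame, it lets a cheap 2D backbone reason about motion without the cost of a 3D convolution over the whole clip.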
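To compare computational burden across architectures, a standard approach is to measure average forward-pass latency on the target device. The sketch below, assuming PyTorch and torchvision, times a 2D network on single frames against a 3D network on 16-frame clips; measure_latency is a hypothetical helper, r3d_18 stands in for the paper's 3D ResNeXt backbone (which torchvision does not ship), and the 25-class output is an illustrative assumption.

import time
import torch
from torchvision.models import mobilenet_v2
from torchvision.models.video import r3d_18

def measure_latency(model, dummy_input, warmup=5, runs=30):
    """Average forward-pass time in milliseconds."""
    model.eval()
    with torch.no_grad():
        for _ in range(warmup):   # warm-up passes to amortize one-off costs
            model(dummy_input)
        start = time.perf_counter()
        for _ in range(runs):
            model(dummy_input)
        elapsed = time.perf_counter() - start
    return 1000.0 * elapsed / runs

# 2D network: one RGB frame per inference.
net_2d = mobilenet_v2(num_classes=25)
frame = torch.randn(1, 3, 224, 224)

# 3D network: a 16-frame RGB clip per inference.
net_3d = r3d_18(num_classes=25)
clip = torch.randn(1, 3, 16, 112, 112)

print(f"2D MobileNetV2: {measure_latency(net_2d, frame):.1f} ms/frame")
print(f"3D R3D-18:      {measure_latency(net_3d, clip):.1f} ms/clip")

On an actual embedded board one would run the same harness after deploying the model through the board's runtime (e.g., ONNX Runtime or TensorRT), since desktop timings only indicate relative cost between architectures and modalities.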