A. Kalukin, Wade Leonard, Joan Green, L. Burgwardt
{"title":"使用视频源自动生成卷积神经网络训练数据","authors":"A. Kalukin, Wade Leonard, Joan Green, L. Burgwardt","doi":"10.1109/AIPR.2017.8457936","DOIUrl":null,"url":null,"abstract":"One of the challenges of using techniques such as convolutional neural networks and deep learning for automated object recognition in images and video is to be able to generate sufficient quantities of labeled training image data in a cost-effective way. It is generally preferred to tag hundreds of thousands of frames for each category or label, and a human being tagging images frame by frame might expect to spend hundreds of hours creating such a training set. One alternative is to use video as a source of training images. A human tagger notes the start and stop time in each clip for the appearance of objects of interest. The video is broken down into component frames using software such as ffmpeg. The frames that fall within the time intervals for objects of interest are labeled as “targets,” and the remaining frames are labeled as “non-targets.” This separation of categories can be automated. The time required by a human viewer using this method would be around ten hours, at least 1–2 orders of magnitude lower than a human tagger labeling frame by frame. The false alarm rate and target detection rate can by optimized by providing the system unambiguous training examples.","PeriodicalId":128779,"journal":{"name":"2017 IEEE Applied Imagery Pattern Recognition Workshop (AIPR)","volume":"65 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2017-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Automated generation of convolutional neural network training data using video sources\",\"authors\":\"A. Kalukin, Wade Leonard, Joan Green, L. Burgwardt\",\"doi\":\"10.1109/AIPR.2017.8457936\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"One of the challenges of using techniques such as convolutional neural networks and deep learning for automated object recognition in images and video is to be able to generate sufficient quantities of labeled training image data in a cost-effective way. It is generally preferred to tag hundreds of thousands of frames for each category or label, and a human being tagging images frame by frame might expect to spend hundreds of hours creating such a training set. One alternative is to use video as a source of training images. A human tagger notes the start and stop time in each clip for the appearance of objects of interest. The video is broken down into component frames using software such as ffmpeg. The frames that fall within the time intervals for objects of interest are labeled as “targets,” and the remaining frames are labeled as “non-targets.” This separation of categories can be automated. The time required by a human viewer using this method would be around ten hours, at least 1–2 orders of magnitude lower than a human tagger labeling frame by frame. 
The false alarm rate and target detection rate can by optimized by providing the system unambiguous training examples.\",\"PeriodicalId\":128779,\"journal\":{\"name\":\"2017 IEEE Applied Imagery Pattern Recognition Workshop (AIPR)\",\"volume\":\"65 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2017-10-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2017 IEEE Applied Imagery Pattern Recognition Workshop (AIPR)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/AIPR.2017.8457936\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2017 IEEE Applied Imagery Pattern Recognition Workshop (AIPR)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/AIPR.2017.8457936","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Automated generation of convolutional neural network training data using video sources
One of the challenges of using techniques such as convolutional neural networks and deep learning for automated object recognition in images and video is generating sufficient quantities of labeled training image data cost-effectively. Hundreds of thousands of tagged frames are generally needed for each category or label, and a human tagging images frame by frame could expect to spend hundreds of hours creating such a training set. One alternative is to use video as a source of training images. A human tagger notes the start and stop times in each clip at which objects of interest appear. The video is broken down into component frames using software such as ffmpeg. The frames that fall within the time intervals for objects of interest are labeled as “targets,” and the remaining frames are labeled as “non-targets.” This separation of categories can be automated. The time required of a human viewer using this method is around ten hours, at least 1–2 orders of magnitude less than labeling frame by frame. The false alarm rate and target detection rate can be optimized by providing the system with unambiguous training examples.
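The abstract gives no code, but the interval-based labeling it describes is simple to sketch. The following Python sketch (not from the paper) assumes ffmpeg is available on the PATH, a hypothetical input file clip.mp4, an assumed extraction rate of 10 frames per second, and tagger-supplied (start, stop) intervals in seconds; it extracts the component frames and then assigns each frame a target/non-target label from its timestamp.

import csv
import subprocess
from pathlib import Path

FPS = 10  # assumed frame-extraction rate; frame timestamps are derived from it


def extract_frames(video_path, out_dir, fps=FPS):
    """Break the video into component frames with ffmpeg (must be on PATH)."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    subprocess.run(
        ["ffmpeg", "-i", str(video_path), "-vf", f"fps={fps}",
         str(out_dir / "frame_%06d.png")],
        check=True,
    )


def label_frames(out_dir, intervals, fps=FPS):
    """Label each extracted frame 'target' or 'non-target' by its timestamp.

    intervals: list of (start_sec, stop_sec) pairs noted by the human tagger.
    """
    labels = {}
    for frame in sorted(Path(out_dir).glob("frame_*.png")):
        idx = int(frame.stem.split("_")[1])   # ffmpeg numbers frames from 1
        t = (idx - 1) / fps                   # approximate timestamp in seconds
        is_target = any(start <= t <= stop for start, stop in intervals)
        labels[frame.name] = "target" if is_target else "non-target"
    return labels


if __name__ == "__main__":
    extract_frames("clip.mp4", "frames")
    # Start/stop times (seconds) for the object of interest, as noted by the tagger.
    labels = label_frames("frames", intervals=[(12.0, 47.5), (130.0, 142.0)])
    with open("labels.csv", "w", newline="") as f:
        csv.writer(f).writerows(labels.items())

The only manual input in this sketch is the list of start/stop intervals per clip; everything downstream of that annotation is automated, which is what reduces the human effort from hundreds of hours of frame-by-frame tagging to roughly ten hours of viewing.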