{"title":"CheckSORT: Refined Synthetic Data Combination and Optimized SORT for Automatic Retail Checkout","authors":"Ziqiang Shi, Zhongling Liu, Liu Liu, Rujie Liu, Takuma Yamamoto, Xiaoyue Mi, Daisuke Uchida","doi":"10.1109/CVPRW59228.2023.00569","DOIUrl":"https://doi.org/10.1109/CVPRW59228.2023.00569","url":null,"abstract":"In this paper, we propose a method called CheckSORT for automatic retail checkout. We demonstrate CheckSORT on the multi-class product counting and recognition task in Track 4 of AI CITY CHALLENGE 2023. This task aims to count and identify products as they move along a retail checkout white tray, which is challenging due to occlusion, similar appearance, or blur. Based on the constraints and training data provided by the sponsor, we propose two new ideas to solve this task. The first idea is to design a controllable synthetic training data generation paradigm to bridge the gap between training data and real test videos as much as possible. The second innovation is to improve the efficiency of existing SORT tracking algorithms by proposing decomposed Kalman filter and dynamic tracklet feature sequence. Our experiments resulted in state-of-the-art (when compared with DeepSORT and StrongSORT) F1-scores of 70.3% and 62.1% on the TestA data of AI CITY CHALLENGE 2022 and 2023 respectively in the estimation of the time (in seconds) for the product to appear on the tray. Training and testing code will be available soon on github.","PeriodicalId":355438,"journal":{"name":"2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)","volume":"281 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133044417","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Single Residual Network with ESA Modules and Distillation","authors":"Yucong Wang, Minjie Cai","doi":"10.1109/CVPRW59228.2023.00191","DOIUrl":"https://doi.org/10.1109/CVPRW59228.2023.00191","url":null,"abstract":"Although there are many methods based on deep learning that have superior performance on single image super-resolution (SISR), it is difficult to run in real time on devices with limited computing power. Some recent studies have found that simply relying on reducing parameters or reducing the theoretical FLOPs of the model does not speed up the inference time of the network in a practical sense. Actual speed on the device is probably a better measure than FLOPs. In this work, we propose a new single residual network (SRN). On the one hand, we try to introduce and optimize an attention mechanism module to improve the performance of the network with a relatively small speed loss. On the other hand, we find that residuals in residual blocks do not have a positive impact on networks with adjusted ESA. Therefore, the residual of the network residual block is removed, which not only improves the speed of the network, but also improves the performance of the network. Finally, we reduced the number of channels and the number of residual blocks of the classic model EDSR, and removed the last convolution before the long residual. We set this tuned EDSR as the teacher model and our newly proposed SRN as the student model. Under the joint effect of the original loss and the distillation loss, the performance of the network can be improved without losing the inference time. Combining the above strategies, our proposed model runs much faster than similarly performing models. As an example, we built a Fast and Efficient Network (SRN) and its small version SRN-S, which run 30%-37% faster than the state-of-the-art EISR model: a paper champion RLFN. Furthermore, the shallow version of SRN-S achieves the second-shortest inference time as well as the second-smallest number of activations in the NTIRE2023 challenge. Code will be available at https://github.com/wnxbwyc/SRN.","PeriodicalId":355438,"journal":{"name":"2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)","volume":"44 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114414428","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Weakly Supervised Visual Question Answer Generation","authors":"Charani Alampalle, Shamanthak Hegde, Soumya Jahagirdar, Shankar Gangisetty","doi":"10.1109/CVPRW59228.2023.00591","DOIUrl":"https://doi.org/10.1109/CVPRW59228.2023.00591","url":null,"abstract":"Growing interest in conversational agents promote two-way human-computer communications involving asking and answering visual questions have become an active area of research in AI. Thus, generation of visual question-answer pair(s) becomes an important and challenging task. To address this issue, we propose a weakly-supervised visual question answer generation method that generates a relevant question-answer pairs for a given input image and associated caption. Most of the prior works are supervised and depend on the annotated question-answer datasets. In our work, we present a weakly supervised method that synthetically generates question-answer pairs procedurally from visual information and captions. The proposed method initially extracts list of answer words, then does nearest question generation that uses the caption and answer word to generate synthetic question. Next, the relevant question generator converts the nearest question to relevant language question by dependency parsing and in-order tree traversal, finally, fine-tune a ViLBERT model with the question-answer pair(s) generated at end. We perform an exhaustive experimental analysis on VQA dataset and see that our model significantly outperform SOTA methods on BLEU scores. We also show the results wrt baseline models and ablation study.","PeriodicalId":355438,"journal":{"name":"2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)","volume":"30 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114743755","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"NTIRE 2023 HR NonHomogeneous Dehazing Challenge Report","authors":"C. Ancuti, Cosmin Ancuti, Florin-Alexandru Vasluianu, R. Timofte, Han Zhou, Wei Dong, Yangyi Liu, Jun Chen, Huan Liu, Liangyan Li, Zijun Wu, Yubo Dong, Yuyang Li, Tian Qiu, Yuying He, Yonghong Lu, Yinwei Wu, Zhenxiang Jiang, Songhua Liu, Xingyi Yang, Yongcheng Jing, Bilel Benjdira, Anas M. Ali, A. Koubâa, Hao-Hsiang Yang, I-Hsiang Chen, Wei-Ting Chen, Zhi-Kai Huang, Yi-Chung Chen, Chia-Hsuan Hsieh, Hua-En Chang, Yuan Chiang, Sy-Yen Kuo, Yu Guo, Yuan Gao, R. Liu, Yuxu Lu, Jingxiang Qu, Shengfeng He, Wenqi Ren, Trung Hoang, Haichuan Zhang, Amirsaeed Yazdani, V. Monga, Lehan Yang, Alex Wu, Tiancheng Mai, Xiaofeng Cong, Xuemeng Yin, Xuefei Yin, Hazim Emad, Ahmed Abdallah, Y. Yasser, Dalia Elshahat, Esraa Elbaz, Zhan-ying Li, Wenqing Kuang, Ziwei Luo, Fredrik K. Gustafsson, Zheng Zhao, Jens Sjölund, Thomas Bo Schön, Zhao Zhang, Yanyan Wei, Junhu Wang, Suiyi Zhao, Huan Zheng, Jinkang Guo, Ya Sun, T. Liu, D. Hao, Kui Jiang, Anjali Sarvaiya, Kalpesh P. Prajapati, R. Patra, Pragnesh Barik, C. Rathod, Kishor P. Upl","doi":"10.1109/CVPRW59228.2023.00180","DOIUrl":"https://doi.org/10.1109/CVPRW59228.2023.00180","url":null,"abstract":"This study assesses the outcomes of the NTIRE 2023 Challenge on Non-Homogeneous Dehazing, wherein novel techniques were proposed and evaluated on new image dataset called HD-NH-HAZE. The HD-NH-HAZE dataset contains 50 high resolution pairs of real-life outdoor images featuring nonhomogeneous hazy images and corresponding haze-free images of the same scene. The nonhomogeneous haze was simulated using a professional setup that replicated real-world conditions of hazy scenarios. The competition had 246 participants and 17 teams submitted solutions for the final testing phase. The proposed solutions demonstrated the cutting-edge in image dehazing technology.","PeriodicalId":355438,"journal":{"name":"2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115734350","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Improving Deep Learning-based Automatic Checkout System Using Image Enhancement Techniques","authors":"L. Pham, Duong Nguyen-Ngoc Tran, Huy-Hung Nguyen, Hyung-Joon Jeon, T. H. Tran, Hyung-Min Jeon, Jaewook Jeon","doi":"10.1109/CVPRW59228.2023.00562","DOIUrl":"https://doi.org/10.1109/CVPRW59228.2023.00562","url":null,"abstract":"The retail sector has experienced significant growth in artificial intelligence and computer vision applications, particularly with the emergence of automatic checkout (ACO) systems in stores and supermarkets. ACO systems encounter challenges such as object occlusion, motion blur, and similarity between scanned items while acquiring accurate training images for realistic checkout scenarios is difficult due to constant product updates. This paper improves existing deep learning-based ACO solutions by incorporating several image enhancement techniques in the data pre-processing step. The proposed ACO system employs a detect-and-track strategy, which involves: (1) detecting objects in areas of interest; (2) tracking objects in consecutive frames; and (3) counting objects using a track management pipeline. Several data generation techniques—including copy-and-paste, random placement, and augmentation—are employed to create diverse training data. Additionally, the proposed solution is designed as an open-ended framework that can be easily expanded to accommodate multiple tasks. The system has been evaluated on the AI City Challenge 2023 Track 4 dataset, showcasing outstanding performance by achieving a top-1 ranking on test-set A with an F1 score of 0.9792.","PeriodicalId":355438,"journal":{"name":"2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)","volume":"272 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116057528","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Deep Dehazing Powered by Image Processing Network","authors":"Guisik Kim, Jin-Hyeong Park, Junseok Kwon","doi":"10.1109/CVPRW59228.2023.00128","DOIUrl":"https://doi.org/10.1109/CVPRW59228.2023.00128","url":null,"abstract":"Image processing is a very fundamental technique in the field of low-level vision. However, with the development of deep learning over the past five years, most low-level vision methods tend to ignore this technique. Recent dehazing methods also refrain from using conventional image processing techniques, whereas only focusing on the development of new deep neural network (DNN) architectures. Unlike this recent trend, we show that image processing techniques are still competitive, if they are incorporated into DNNs. In this paper, we utilize conventional image processing techniques (i.e. curve adjustment, retinex decomposition, and multiple image fusion) for accurate dehazing. Moreover, we employ direct learning for stable dehazing performance. The proposed method can perform with low computational cost and easy to learn. The experimental results demonstrate that the proposed method produces accurate dehazing results compared to recent algorithms.","PeriodicalId":355438,"journal":{"name":"2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)","volume":"44 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123367190","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Detail-Preserving Self-Supervised Monocular Depth with Self-Supervised Structural Sharpening","authors":"J. Bello, Jaeho Moon, Munchurl Kim","doi":"10.1109/CVPRW59228.2023.00031","DOIUrl":"https://doi.org/10.1109/CVPRW59228.2023.00031","url":null,"abstract":"We propose to further close the gap between self-supervised and fully-supervised methods for the single view depth estimation (SVDE) task in terms of the levels of detail and sharpness in the estimated depth maps. Detailed SVDE is challenging as even fully-supervised methods struggle to obtain detail-preserving depth estimates. While recent works have proposed exploiting semantic masks to improve the structural information in the estimated depth maps, our proposed method yields detail-preserving depth estimates from a single forward pass without increasing the computational cost or requiring additional data. We achieve this by exploiting a missing component in SVDE, Self-Supervised Structural Sharpening, referred to as S4. S4 is a mechanism that encourages a similar level of detail between the RGB input and the depth/disparity output. To this extent, we propose a novel DispNet-S4 network for detail-preserving SVDE. Our network exploits un-blurring and un-noising tasks of clean input images for learning S4 without the need for either additional data (e.g., segmentation masks, matting maps, etc.) or advanced network blocks (attention, transformers, etc.). The recovered structural details in the un-blurring and un-noising operations are transferred to the estimated depth maps via adaptive convolutions to yield structurally sharpened depths that are selectively used for self-supervision. We provide extensive experimental results and ablation studies that show our proposed DispNetS4 network can yield fine details in the depth maps while achieving quantitative metrics comparable to the state-of-the-art for the challenging KITTI dataset.","PeriodicalId":355438,"journal":{"name":"2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)","volume":"88 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123252635","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"GAN-based Vision Transformer for High-Quality Thermal Image Enhancement","authors":"M. Marnissi, A. Fathallah","doi":"10.1109/CVPRW59228.2023.00089","DOIUrl":"https://doi.org/10.1109/CVPRW59228.2023.00089","url":null,"abstract":"Generative Adversarial Networks (GANs) have shown an outstanding ability to generate high-quality images with visual realism and similarity to real images. This paper presents a new architecture for thermal image enhancement. Precisely, the strengths of architecture-based vision transformers and generative adversarial networks are exploited. The thermal loss feature introduced in our approach is specifically used to produce high-quality images. Thermal image enhancement also relies on fine-tuning based on visible images, resulting in an overall improvement in image quality. A visual quality metric was used to evaluate the performance of the proposed architecture. Significant improvements were found over the original thermal images and other enhancement methods established on a subset of the KAIST dataset. The performance of the proposed enhancement architecture is also verified on the detection results by obtaining better performance with a considerable margin regarding different versions of the YOLO detector.","PeriodicalId":355438,"journal":{"name":"2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)","volume":"36 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123644432","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Triplet Temporal-based Video Recognition with Multiview for Temporal Action Localization","authors":"Huy Duong Le, Minh Quan Vu, Manh Tung Tran, Nguyen Van Phuc","doi":"10.1109/CVPRW59228.2023.00573","DOIUrl":"https://doi.org/10.1109/CVPRW59228.2023.00573","url":null,"abstract":"Temporal action localization (TAL) in untrimmed videos recently emerged as a crucial research topic, which has been applied in various applications such as surveillance, crowd monitoring, and driver distraction recognition. Most modern approaches in TAL divide this problem into two parts: i) feature extraction for action recognition; and ii) temporal boundary for action localization. In this study, we focus on improving the performance of the TAL task by exploiting the feature extraction effectively. Specifically, we present a temporal triplet algorithm in order to enhance temporal density-dependence information for the input video clips. Moreover, the multiview fusion framework is taken into account for enriching action representation. For the evaluation, we conduct the proposed method on the 2023 AI City Challenge Dataset. Accordingly, our method achieves competitive results and belongs to the top public leaderboard in Track 3 of the Challenge.","PeriodicalId":355438,"journal":{"name":"2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)","volume":"28 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124280823","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Difficulty Estimation with Action Scores for Computer Vision Tasks","authors":"Octavio Arriaga, Sebastián M. Palacio, Matias Valdenegro-Toro","doi":"10.1109/CVPRW59228.2023.00030","DOIUrl":"https://doi.org/10.1109/CVPRW59228.2023.00030","url":null,"abstract":"As more machine learning models are now being applied in real world scenarios it has become crucial to evaluate their difficulties and biases. In this paper we present an unsupervised method for calculating a difficulty score based on the accumulated loss per epoch. Our proposed method does not require any modification to the model, neither any external supervision, and it can be easily applied to a wide range of machine learning tasks. We provide results for the tasks of image classification, image segmentation, and object detection. We compare our score against similar metrics and provide theoretical and empirical evidence of their difference. Furthermore, we show applications of our proposed score for detecting incorrect labels, and test for possible biases.","PeriodicalId":355438,"journal":{"name":"2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)","volume":"22 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129816305","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}