Who danced better? Ranked TikTok dance video dataset and pairwise action quality assessment method

International Journal of Advances in Intelligent Informatics. Pub Date: 2023-03-15. DOI: 10.26555/ijain.v9i1.919

I. Hipiny, Hamimah Ujir, A. Alias, M. Shanat, Mohamad Khairi Ishak


Abstract

Video-based action quality assessment (AQA) is a non-trivial task due to the subtle visual differences between data produced by experts and non-experts. Current methods are extended from the action recognition domain where most are based on temporal pattern matching. AQA has additional requirements where order and tempo matter for rating the quality of an action. We present a novel dataset of ranked TikTok dance videos, and a pairwise AQA method for predicting which video of a same-label pair was sourced from the better dancer. Exhaustive pairings of same-label videos were randomly assigned to 100 human annotators, ultimately producing a ranked list per label category. Our method relies on a successful detection of the subject’s 2D pose inside successive query frames where the order and tempo of actions are encoded inside a produced String sequence. The detected 2D pose returns a top-matching Visual word from a Codebook to represent the current frame. Given a same-label pair, we generate a String value of concatenated Visual words for each video. By computing the edit distance score between each String value and the Gold Standard’s (i.e., the top-ranked video(s) for that label category), we declare the video with the lower score as the winner. The pairwise AQA method is implemented using two schemes, i.e., with and without text compression. Although the average precision for both schemes over 12 label categories is low, at 0.45 with text compression and 0.48 without, precision values for several label categories are comparable to past methods’ (median: 0.47, max: 0.66).
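The pipeline described in the abstract (pose to Visual word, Visual words to String, edit distance against the Gold Standard) can be illustrated with a minimal Python sketch. It is not the authors' implementation. The abstract does not specify the pose descriptor, the codebook matching metric, the compression scheme, or how ties are broken, so the following are all assumptions made for illustration: the function names, the squared Euclidean nearest-neighbour lookup, the single-character word symbols, the run-collapsing stand-in for "text compression", and the tie rule.

```python
from typing import Dict, Sequence


def nearest_word(pose: Sequence[float], codebook: Dict[str, Sequence[float]]) -> str:
    """Return the codebook symbol closest to the pose.

    Assumption: squared Euclidean distance over a flat pose vector;
    the paper's actual matching criterion is not given in the abstract.
    """
    return min(
        codebook,
        key=lambda w: sum((a - b) ** 2 for a, b in zip(pose, codebook[w])),
    )


def encode(poses: Sequence[Sequence[float]], codebook: Dict[str, Sequence[float]]) -> str:
    """Concatenate one symbol per frame, preserving order and tempo."""
    return "".join(nearest_word(p, codebook) for p in poses)


def collapse_runs(s: str) -> str:
    """Hypothetical stand-in for the 'text compression' scheme:
    collapse consecutive repeats of the same symbol."""
    out = []
    for ch in s:
        if not out or out[-1] != ch:
            out.append(ch)
    return "".join(out)


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with two rolling rows."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution or match
            ))
        prev = curr
    return prev[-1]


def better_dancer(s1: str, s2: str, gold: str) -> int:
    """Return 0 if the first video is closer to the gold string, else 1.

    Assumption: ties go to the first video.
    """
    return 0 if edit_distance(s1, gold) <= edit_distance(s2, gold) else 1
```

For example, with a toy codebook `{"A": [1, 0], "B": [0, 1]}`, `encode` maps each frame's pose vector to `"A"` or `"B"`, and `better_dancer` compares the two resulting Strings against the Gold Standard's String. Applying `collapse_runs` to all three Strings first would correspond to the "with compression" scheme under the assumption above. Note that collapsing runs discards how long a pose is held, which is one plausible reason a compressed variant could lose tempo information.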

