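Extracting Coarse Body Movements from Video in Music Performance: A Comparison of Automated Computer Vision Techniques with Motion Capture Data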

Frontiers in Digital Humanities. Published: 2017-04-06. DOI: 10.3389/fdigh.2017.00009

Kelly Jakubowski, T. Eerola, Paolo Alborno, G. Volpe, A. Camurri, M. Clayton

Citations: 19

Abstract

The measurement and tracking of body movement within musical performances can provide valuable data for studying interpersonal interaction and coordination between musicians. Continued development of tools that extract such data from video recordings will open new opportunities to study musical movement across a diverse range of settings, including field research and other ecological contexts in which complex motion capture systems are not feasible or affordable. Such work can also draw on the many video recordings of musical performances that are already available to researchers. The present study used existing data of this kind: three video datasets of ensemble performances that differ in genre, setting, and instrumentation (a pop piano duo, three jazz duos, and a string quartet). Three computer vision techniques were applied to these datasets to quantify and track the movements of individual performers: frame differencing, optical flow, and kernelized correlation filters (KCF). All three techniques correlated highly with motion capture data collected from the same performances, with median correlations (Pearson's r) of .75 to .94. The techniques that track movement in two dimensions (optical flow and KCF) gave more accurate measures of movement than frame differencing, which provides only a single estimate of overall movement change per frame for each performer. Measurements of performers' movements were also more accurate when the techniques were applied to narrowly defined regions of interest (the head) than when they were applied to larger regions (the entire upper body, above the chest or waist). Tracking accuracy differed somewhat between the three video datasets, possibly because instrument-specific motions occluded the body part of interest (e.g. a violinist's right hand occluding the head whilst head movement was being tracked). These results indicate that computer vision techniques can effectively quantify body movement from videos of musical performances, while also highlighting constraints that must be addressed when applying such techniques in ensemble coordination research.
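The abstract describes frame differencing as producing a single estimate of overall movement change per frame for each performer, which is then compared with motion capture data using Pearson's r. The sketch below illustrates that pipeline in plain Python. It is not the authors' implementation. It assumes grayscale frames stored as nested lists, a rectangular region of interest, and a single motion capture marker whose frame-to-frame displacement serves as the reference series. The function names (`render_square`, `frame_difference_series`, `displacement_series`, `pearson_r`) and the synthetic "moving square" data are hypothetical and exist only for illustration.

```python
import math
from typing import List, Sequence, Tuple

# A grayscale frame stored as rows of pixel intensities (illustrative format).
Frame = List[List[float]]


def render_square(width: int, height: int, left: int, top: int,
                  size: int, intensity: float = 255.0) -> Frame:
    """Hypothetical synthetic frame: a bright square (a stand-in for a
    performer's head) on a dark background."""
    frame = [[0.0] * width for _ in range(height)]
    for r in range(top, top + size):
        for c in range(left, left + size):
            frame[r][c] = intensity
    return frame


def frame_difference_series(frames: Sequence[Frame],
                            roi: Tuple[int, int, int, int]) -> List[float]:
    """Frame differencing: for each pair of consecutive frames, sum the
    absolute pixel differences inside the region of interest.
    roi = (row_start, row_end, col_start, col_end), end-exclusive.
    Produces one unsigned scalar per frame transition (no direction)."""
    r0, r1, c0, c1 = roi
    series = []
    for prev, curr in zip(frames, frames[1:]):
        total = 0.0
        for r in range(r0, r1):
            for c in range(c0, c1):
                total += abs(curr[r][c] - prev[r][c])
        series.append(total)
    return series


def displacement_series(positions: Sequence[Sequence[float]]) -> List[float]:
    """Frame-to-frame displacement magnitude of one (assumed) mocap marker."""
    return [math.dist(p, q) for p, q in zip(positions, positions[1:])]


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson's correlation coefficient between two equal-length series."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("series must have equal length >= 2")
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Synthetic example: a 4x4 square moves right by a varying number of pixels.
steps = [1, 2, 0, 3, 1, 2]
lefts = [2]
for s in steps:
    lefts.append(lefts[-1] + s)

frames = [render_square(20, 10, left, 3, 4) for left in lefts]
# The square's centre (columns left..left+3, rows 3..6) plays the role of a
# head marker in the motion capture data.
positions = [(left + 1.5, 4.5) for left in lefts]

video_series = frame_difference_series(frames, roi=(0, 10, 0, 20))
mocap_series = displacement_series(positions)
r = pearson_r(video_series, mocap_series)
print(video_series)
print(mocap_series)
print(f"Pearson's r = {r:.3f}")
```

With a clean, unoccluded synthetic square the two series are exactly proportional, so r comes out at 1. Real footage with camera noise, lighting changes, and occlusions (such as a violinist's bow arm crossing the head region) would lower it. Because frame differencing collapses each transition into one unsigned number, it cannot distinguish horizontal from vertical motion. That is one plausible reason why the two-dimensional techniques (optical flow and KCF) performed better in the study.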
