Human-Directed Optical Music Recognition

Liang Chen, C. Raphael
DOI: 10.2352/ISSN.2470-1173.2016.17.DRR-053
Published in: Document Recognition and Retrieval, 2016-02-17
Citations: 11

Abstract

We propose a human-in-the-loop scheme for optical music recognition. Starting from the results of our recognition engine, we pose the problem as one of constrained optimization, in which the human can specify various pixel labels while our recognition engine seeks an optimal explanation subject to the human-supplied constraints. In this way we enable an interactive approach with a uniform communication channel from human to machine, where both iterate their roles until the desired end is achieved. Pixel constraints may be added at various stages, including staff finding, system identification, and measure recognition. Results on a test set show a significant speed-up compared to purely human-driven correction.

Introduction

Optical Music Recognition (OMR) holds the potential to transform score images into symbolic music libraries, thus enabling search, categorization, and retrieval by symbolic content, as we now take for granted with text. Such symbolic libraries would serve as the foundation for the emerging field of computational musicology, and provide data for a wide variety of fusions between music, computer science, and statistics. Equally exciting are applications such as the digital music stand, and systems that support practice and learning through objective analysis of rhythm and pitch. In spite of this promise, progress in OMR has been slow; even the best systems, both commercial and academic, leave much to be desired [7]. In many cases the effort needed to correct OMR output may exceed that of entering the music data from scratch [8]; in such cases OMR systems fail to make any meaningful contribution at all. The reason for these disappointing results is simply that OMR is hard. Bainbridge [17] discusses some of the challenges of OMR that impede its development.
One central problem is that music notation contains a large variety of somewhat-rare musical symbols and conventions [4], such as articulations, bowings, tremolos, fingerings, accents, harmonics, stops, repeat marks, 1st and 2nd endings, dal segno and da capo markings, trills, mordents, turns, breath marks, etc. While one can easily build recognizers that accommodate these somewhat-unusual symbols and special notational cases, the false positives they introduce often outweigh the additional correct detections they produce. Under some circumstances even not-so-rare symbols fall into this better-not-to-recognize category, such as augmentation dots, double sharps, and partial beams. Another issue arises from the difficulty of describing the high-level structure of music notation. Objects such as chords, beamed groups, and clef-key signatures are highly structured and lend themselves naturally to grammatical representation; however, the overall organization of symbols within a measure is far less constrained. The OMR literature contains several efforts to formulate a unified grammar for music notation [10, 11]. These approaches represent grammars of primitive symbols (beams, flags, note heads, stems, etc.) and begin by assuming a collection of segmented primitives. While our grammars overlap significantly with these approaches, one of our primary uses for the grammar is the segmentation of the symbols into primitives; we do not believe it is realistic to identify the primitives without understanding the larger structures that contain them. Kopec [12] describes a compelling Markov Source Model for music recognition that segments and recognizes simultaneously. However, the approach addresses only a small subset of music notation and does not generalize in any obvious way. In particular, our primary focus is the International Music Score Library Project (IMSLP), while Kopec's model covers only a small minority of the examples encountered there.
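The grammatical view of composite symbols can be illustrated with a small sketch. This is a hypothetical toy, not the authors' actual grammar: the names `Primitive` and `Chord`, and the particular well-formedness rule, are illustrative assumptions. The point is that a grammar constrains which assemblies of primitives count as valid symbols at all.

```python
from dataclasses import dataclass, field
from typing import List

# Toy illustration (not the actual grammar from the paper): composite
# symbols such as chords are structured groupings of primitives, and the
# grammar rules out nonsense assemblies like a stem with no note head.

@dataclass
class Primitive:
    kind: str   # e.g. "notehead", "stem", "accidental", "ledger"
    x: int      # horizontal pixel position
    y: int      # vertical pixel position

@dataclass
class Chord:
    stem: Primitive
    heads: List[Primitive]
    accidentals: List[Primitive] = field(default_factory=list)

    def is_well_formed(self) -> bool:
        # A chord needs a stem and at least one note head; this is the
        # kind of structural constraint a grammar enforces implicitly.
        return (self.stem.kind == "stem"
                and len(self.heads) >= 1
                and all(h.kind == "notehead" for h in self.heads))

chord = Chord(Primitive("stem", 100, 40),
              [Primitive("notehead", 96, 60), Primitive("notehead", 96, 52)])
assert chord.is_well_formed()
assert not Chord(Primitive("stem", 100, 40), []).is_well_formed()
```

A recognizer built over such a grammar can only emit well-formed structures, which is why segmentation and recognition are best done jointly.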
Other difficulties stem from the kinds of image degradation encountered, including poor or variable contrast; skew and warping caused when the document is not aligned or flat on the scanner bed; hand-written marks; damage to pages; and other sources. Some recent research has been dedicated to improving fully automated OMR systems in a post-process fashion, or in other ways that leave the core recognition engine intact. These efforts either create systems that adapt automatically [16, 24], add musically meaningful constraints for recognition [1, 5], or combine multiple recognizers to achieve better accuracy [9, 7]. However, OMR research is still a long way from our shared goal of creating large-scale symbolic music databases. Hankinson et al. [15] created a prototype system for distributed large-scale OMR, which converts a collection of Gregorian chant scores into symbolic files to facilitate their in situ content-based retrieval, though the approach still requires a large amount of careful proofreading and correction. In light of these many obstacles and our collective history, it seems unwise to bet on fully automated OMR systems producing high-quality results with any consistency. Instead we favor casting the problem as an interactive one, thus putting the human in the computational loop. In this case the essential challenge becomes one of minimizing the user's effort, placing as much of the burden as possible on the computer (but no more). There are many creative ways to integrate a person into the recognition pipeline, allowing her to correct, give hints, or direct the computation. This work constitutes an effort in this direction. Our first attempt to bring the human into the OMR pipeline built a user interface allowing the correction of individual primitives: stem, beam, note head, single flag, sharp, augmentation dot, etc. Thus the user's task was simply to cover the image ink by adding and deleting appropriate primitives.
A benefit of this approach is that it presents the user with a clearly defined task that doesn't require knowledge of the system's inner workings. There are, however, several weaknesses: the human tagging process is laborious; it fails to capture important syntactic relations between primitives; it requires the person to register each primitive precisely with the image; and it allows the person to create uninterpretable configurations of primitives (say, a stem with no note head), creating havoc further down the OMR pipeline. Our aim here is to address all of these weaknesses while still presenting a simple task to the user. Our current approach first presents the user with the original recognition results, obtained through fully automatic means. The user may then label any individual pixel according to the recognition task at hand. For instance, during system recognition the user may label a pixel as white space or bar line, while during measure recognition we use a richer collection of labels including closed/half/whole note head, stem, ledger line, beam, sharp, single flag, etc. The system then re-recognizes subject to the user-imposed constraints. Since our recognizers embed highly restrictive assumptions about the primitives they assemble, a single correction often fixes a number of problems at once. Human and machine then iterate this process of providing and synthesizing human-supplied constraints into recognized results. This approach leaves the registration problem (the precise location of primitives) in the hands of the machine, where we believe it belongs. Furthermore, since our system can only recognize meaningful configurations of symbols, we avoid the problem of trying to assemble human-tagged composite symbols that may not make sense. While the resulting process may still be laborious, our results indicate that this strategy reduces the human burden considerably.
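The constrained re-recognition step described above can be sketched abstractly. This is a toy model under stated assumptions, not the actual engine: `hypotheses` stands in for the recognizer's candidate interpretations, `score` for its model score, and the pixel labels are invented. It shows the core mechanism: the engine returns the best-scoring interpretation consistent with every user-supplied pixel label.

```python
# Hypothetical sketch of re-recognition as constrained optimization: each
# hypothesis assigns a label to every pixel, and the engine maximizes the
# score over only those hypotheses that agree with the user's labels.

def rerecognize(hypotheses, score, constraints):
    """constraints: {pixel_index: required_label}."""
    feasible = [h for h in hypotheses
                if all(h[px] == lab for px, lab in constraints.items())]
    if not feasible:
        return None  # no interpretation satisfies the user's labels
    return max(feasible, key=score)

# Toy example: three candidate labelings of a 4-pixel column.
hyps = [
    ("bg", "bg", "stem", "stem"),
    ("notehead", "notehead", "stem", "stem"),
    ("bg", "notehead", "stem", "stem"),
]
score = lambda h: sum(lab != "bg" for lab in h)  # stand-in for a real model score

# Unconstrained, the engine prefers the densest explanation; pinning
# pixel 0 to background changes the optimum.
print(rerecognize(hyps, score, {}))          # ('notehead', 'notehead', 'stem', 'stem')
print(rerecognize(hyps, score, {0: "bg"}))   # ('bg', 'notehead', 'stem', 'stem')
```

Because the feasible set is re-optimized globally rather than patched locally, one constraint can change the labels of many other pixels at once, which is exactly the behavior described above.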
Furthermore, there are many other ways of introducing human-specified constraints into the recognition process; thus the current effort constitutes an initial exploration of a longer-term goal.

Interactive OMR

Various authors, such as Rebelo [13], suggest that an interactive OMR system could be a realistic solution to the problem, though the central challenge of fusing the human and machine contributions remains open. Human-in-the-loop computation has received considerable attention recently [23]. It has been applied to a wide variety of areas, such as retrieval systems [19], object classification [20], character recognition [18], document indexing [25], image labeling [22], and fine-grained visual categorization [21]. Romero [26] proposed a Hidden Markov Model (HMM) for computer-assisted text transcription, in which a user-imposed prefix constrains both the sequence decoding and the language priors. The potential of all these applications is summarized in von Ahn's statement [18]: "Human processing power can be harnessed to solve problems that computers cannot yet solve." There have already been several OMR systems that take human-in-the-loop computation into account. For instance, Fujinaga [4] proposed an adaptive system that incrementally improves its symbol classifiers based on human feedback. Church [6] implemented an interface that accepts user feedback to guide misrecognized measures toward similar correct measures found elsewhere in the score. Our system uses human feedback in an entirely different manner: as a means of constraining the recognition process in a user-specified way, thus leveraging the user's input at the heart of the system. It is worth noting that our approach constitutes a generic framework that poses human-in-the-loop recognition as constrained optimization, applicable beyond the specific confines of OMR.

Human-Directed Recognition

As motivation, consider the example given in Figure 1.
Suppose our recognition misses the upper note head of the chord (Figure 1b). Then suppose the user labels a single pixel belonging to the missing note head as solid head (Figure 1c). When the system re-recognizes subject to this constraint, the note head, its associated ledger line, accidental, and stem portion may all be recovered at once.
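This kind of constrained decoding, in the spirit of Romero's prefix-constrained HMM mentioned above, can be made concrete in a sequence model. The sketch below is an assumption-laden toy, not Romero's system or the paper's engine: it forces a standard Viterbi decoder through a user-fixed state by masking out all other states at the constrained time step, so the surrounding path re-optimizes around the constraint.

```python
import numpy as np

# Toy constrained Viterbi decoder (illustrative only): user constraints
# {t: state} set the log-probability of every other state at time t to
# -inf, so the globally optimal path must pass through the fixed label.

def viterbi(logA, logB, constraints):
    """logA: (S, S) transition log-probs; logB: (T, S) emission log-probs;
    constraints: {time_step: forced_state}. Returns the best state path."""
    T, S = logB.shape
    def masked(d, t):
        if t in constraints:
            m = np.full(S, -np.inf)
            m[constraints[t]] = 0.0
            return d + m
        return d
    delta = masked(logB[0].copy(), 0)
    back = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + logA       # scores[i, j]: best via prev=i
        back[t] = scores.argmax(axis=0)
        delta = masked(scores.max(axis=0) + logB[t], t)
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

# Two sticky states; emissions favor state 0 at every step.
logA = np.log(np.array([[0.9, 0.1], [0.1, 0.9]]))
logB = np.log(np.array([[0.8, 0.2]] * 3))
print(viterbi(logA, logB, {}))      # [0, 0, 0]
print(viterbi(logA, logB, {1: 1}))  # [1, 1, 1]: the constraint propagates
```

As in the Figure 1 example, pinning one observation does not merely flip one label: the sticky transitions pull the neighboring steps along with it, so a single constraint repairs the whole path.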