CoNLL-Merge: Efficient Harmonization of Concurrent Tokenization and Textual Variation

C. Chiarcos, Niko Schenk
{"title":"CoNLL-Merge: Efficient Harmonization of Concurrent Tokenization and Textual Variation","authors":"C. Chiarcos, Niko Schenk","doi":"10.4230/OASIcs.LDK.2019.7","DOIUrl":null,"url":null,"abstract":"The proper detection of tokens in of running text represents the initial processing step in modular NLP pipelines. But strategies for defining these minimal units can differ, and conflicting analyses of the same text seriously limit the integration of subsequent linguistic annotations into a shared representation. As a solution, we introduce CoNLL Merge, a practical tool for harmonizing TSVrelated data models, as they occur, e.g., in multi-layer corpora with non-sequential, concurrent tokenizations, but also in ensemble combinations in Natural Language Processing. CoNLL Merge works unsupervised, requires no manual intervention or external data sources, and comes with a flexible API for fully automated merging routines, validity and sanity checks. Users can chose from several merging strategies, and either preserve a reference tokenization (with possible losses of annotation granularity), create a common tokenization layer consisting of minimal shared subtokens (loss-less in terms of annotation granularity, destructive against a reference tokenization), or present tokenization clashes (loss-less and non-destructive, but introducing empty tokens as place-holders for unaligned elements). We demonstrate the applicability of the tool on two use cases from natural language processing and computational philology. 
2012 ACM Subject Classification Applied computing → Format and notation; Applied computing → Document management and text processing; Applied computing → Annotation; Software and its engineering → Interoperability","PeriodicalId":377119,"journal":{"name":"International Conference on Language, Data, and Knowledge","volume":"1 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"International Conference on Language, Data, and Knowledge","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.4230/OASIcs.LDK.2019.7","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

Abstract

The proper detection of tokens in running text represents the initial processing step in modular NLP pipelines. But strategies for defining these minimal units can differ, and conflicting analyses of the same text seriously limit the integration of subsequent linguistic annotations into a shared representation. As a solution, we introduce CoNLL Merge, a practical tool for harmonizing TSV-related data models, as they occur, e.g., in multi-layer corpora with non-sequential, concurrent tokenizations, but also in ensemble combinations in Natural Language Processing. CoNLL Merge works unsupervised, requires no manual intervention or external data sources, and comes with a flexible API for fully automated merging routines, validity and sanity checks. Users can choose from several merging strategies, and either preserve a reference tokenization (with possible losses of annotation granularity), create a common tokenization layer consisting of minimal shared subtokens (loss-less in terms of annotation granularity, destructive against a reference tokenization), or represent tokenization clashes (loss-less and non-destructive, but introducing empty tokens as place-holders for unaligned elements). We demonstrate the applicability of the tool on two use cases from natural language processing and computational philology.

2012 ACM Subject Classification: Applied computing → Format and notation; Applied computing → Document management and text processing; Applied computing → Annotation; Software and its engineering → Interoperability
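The second strategy described above, building a common layer of minimal shared subtokens, can be illustrated with a small sketch. This is not the tool's actual implementation; `subtoken_merge` is a hypothetical helper that assumes both tokenizations cover the same (whitespace-free) character sequence and splits at the union of their token boundaries:

```python
def subtoken_merge(tokens_a, tokens_b):
    """Merge two tokenizations of the same text into minimal shared subtokens.

    Both input tokenizations must concatenate to the same character sequence.
    The result splits the text at the union of both sets of token boundaries,
    so every output subtoken is contained in exactly one token of each input.
    """
    text = "".join(tokens_a)
    if text != "".join(tokens_b):
        raise ValueError("tokenizations must cover the same character sequence")

    def boundaries(tokens):
        # Collect end offsets of every token.
        cuts, pos = set(), 0
        for tok in tokens:
            pos += len(tok)
            cuts.add(pos)
        return cuts

    out, start = [], 0
    for end in sorted(boundaries(tokens_a) | boundaries(tokens_b)):
        out.append(text[start:end])
        start = end
    return out

# Two conflicting tokenizations of the same string:
print(subtoken_merge(["don't"], ["do", "n't"]))   # ['do', "n't"]
print(subtoken_merge(["ab", "cd"], ["a", "bcd"]))  # ['a', 'b', 'cd']
```

Annotations attached to the original tokens could then be projected onto these subtokens, which is loss-less with respect to annotation granularity but destructive against either reference tokenization, exactly the trade-off the abstract describes.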