Latest publications in J. Lang. Technol. Comput. Linguistics

Krill: KorAP search and analysis engine
J. Lang. Technol. Comput. Linguistics Pub Date : 2016-07-01 DOI: 10.21248/jlcl.31.2016.202
Nils Diewald, Eliza Margaretha
{"title":"Krill: KorAP search and analysis engine","authors":"Nils Diewald, Eliza Margaretha","doi":"10.21248/jlcl.31.2016.202","DOIUrl":"https://doi.org/10.21248/jlcl.31.2016.202","url":null,"abstract":"KorAP1 (Korpusanalyseplattform) is a corpus search and analysis platform for handling very large corpora with multiple annotation layers, multiple query languages, and complex licensing models (Bański et al., 2013a). It is intended to succeed the COSMAS II system (Bodmer, 1996) in providing DEREKO, the German reference corpus (Kupietz and Lüngen, 2014), hosted by the Institute for the German Language (IDS).2 The corpus consists of a wide range of texts such as fiction, newspaper articles and scripted speech, annotated on multiple linguistic levels, for instance part-of-speech and syntactic dependency structures. It was reported to contain approximately 30 billion words in September 2016 and still grows continually. Krill3 (Corpus-data Retrieval Index using Lucene for Look-ups) is a corpus search engine that serves as a search component in KorAP. It is based on Apache Lucene,4 a popular and well-established information retrieval engine. Lucene’s lightweight memory requirements and scalable indexing are suitable for handling large corpora whose size increases rapidly. It supports full-text search for many query types including phrase and wildcard queries, and allows custom implementations to cope with complex linguistic queries. In this paper, we describe Krill and how its index is designed to handle full-text and complex annotation search combining different annotation layers and sources of very large corpora. The paper is structured as follows. Section 2 describes how a search works in KorAP (starting from receiving a search request until returning the search results). Section 3 explains how corpus data are represented and indexed in Krill. Section 4 describes various kinds of queries handled by Krill and how they are processed for the actual search on the index. The Krill response format containing search results is described in Section 5. We present related and further work in Section 6 and 7 respectively. The paper ends with a summary.","PeriodicalId":402489,"journal":{"name":"J. Lang. Technol. Comput. Linguistics","volume":"96 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127392809","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 9
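The multi-layer annotation search described in the Krill abstract can be illustrated with a small, self-contained sketch: annotations from different layers are indexed as layer-prefixed terms on shared token positions, and a query combining layers is answered by intersecting posting lists. This is only an illustration of the general indexing idea, not Krill's actual Lucene-based implementation; the toy documents and identifiers are invented for the example.

```python
from collections import defaultdict

# Toy index: one posting list per layer-prefixed term, holding (doc_id, token_position)
# pairs. Krill delegates indexing to Apache Lucene; this sketch only shows the idea of
# combining annotation layers ("surface", "lemma", "pos") on shared token positions.
postings = defaultdict(set)

documents = {
    "doc1": [("Häuser", "Haus", "NN"), ("stehen", "stehen", "VVFIN")],
    "doc2": [("Das", "die", "ART"), ("Haus", "Haus", "NN")],
}

for doc_id, tokens in documents.items():
    for position, (surface, lemma, pos_tag) in enumerate(tokens):
        postings[f"surface:{surface}"].add((doc_id, position))
        postings[f"lemma:{lemma}"].add((doc_id, position))
        postings[f"pos:{pos_tag}"].add((doc_id, position))

def search(*terms):
    """Return token positions where all layer-prefixed terms co-occur."""
    result = postings[terms[0]].copy()
    for term in terms[1:]:
        result &= postings[term]
    return sorted(result)

# Query across two annotation layers: lemma "Haus" tagged as a noun.
print(search("lemma:Haus", "pos:NN"))   # [('doc1', 0), ('doc2', 1)]
```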
Automatisierter Abgleich des Lautstandes althochdeutscher Wortformen (Automated comparison of the phonological state of Old High German word forms)
J. Lang. Technol. Comput. Linguistics Pub Date : 2016-07-01 DOI: 10.21248/jlcl.31.2016.209
Roland Mittmann
{"title":"Automatisierter Abgleich des Lautstandes althochdeut-scher Wortformen","authors":"Roland Mittmann","doi":"10.21248/jlcl.31.2016.209","DOIUrl":"https://doi.org/10.21248/jlcl.31.2016.209","url":null,"abstract":"Um Texte einer Sprache automatisiert auf ihren möglichen Entstehungszeitraum und ihre dialektale Zugehörigkeit hin zu untersuchen, werden für jedes erwartete Graphem und jede Flexionsendung zunächst Entsprechungsregeln zwischen einer idealisierten Sprachform und den Sprachformen in Grammatiken beschriebener Zeit-Dialekt-Räume erfasst. Anschließend werden mithilfe eines Computerprogramms unter Anwendung dieser Regeln die belegten Wortformen mit ihren Entsprechungen in der idealisierten Sprachform abgeglichen und für jeden Text die Übereinstimmungsgrade mit den einzelnen Zeit-Dialekt-Räumen angegeben. Exemplarisch wird dieser Abgleich für eine althochdeutsche Wortform beschrieben und das Ergebnis der Analyse des zugehörigen Gesamttextes dargestellt. 1 Untersuchungsthema Seit jeher verändern sich Sprachen im Laufe der zeitlichen Entwicklung. Sobald ihre Sprechergemeinschaften in verschiedene Gruppen zerfallen, die nicht mehr dauerhaft miteinander in Kontakt stehen, entwickeln sie zudem verschiedene Varietäten. Solange die Normierung einer Sprache nicht erfolgt ist, bleibt die textliche Überlieferung daher sprachlich uneinheitlich. Auch innerhalb eines Textes können Schwankungen auftreten, etwa wenn Sprecher verschiedener Dialekte am selben Text arbeiten oder einen bestehenden Text korrigieren (vgl. etwa BRAUNE/REIFFENSTEIN 2004, § 3 und Anm. 1). Ein einzelner Autor kann ebenfalls verschiedenen dialektalen Einflüssen unterworfen sein oder die im Laufe seines Lebens erfolgte sprachliche Veränderung in seinen Niederschriften wiedergeben. Da vor der Erfindung des Buchdrucks Texte allein durch Abschrift vervielfältigt wurden, kam es schließlich auch seitens der Kopisten – bewusst oder unbewusst – zu sprachlichen Anpassungen bei dialektalen Formen bzw. infolge der zeitlichen Entwicklung. Sind zu einem Teil der textlichen Überlieferung einer Sprache keine genaueren zeitlichen und örtlichen Angaben bekannt, erscheint es denkbar, diese automatisiert auf ihre Übereinstimmung mit verschiedenen Zeit-Dialekt-Räumen – also Zeitabschnitten mit Bezug auf die verschiedenen örtlichen Varietäten – zu untersuchen. Diese Untersuchung wird im Folgenden beschrieben. Voraussetzung dafür ist, dass Angaben zu den üblichen Entsprechungen der verschiedenen Phonem-GraphEntsprechungen (Lautverschriftungen, vgl. MITTMANN 2015b, 248) und der Flexionsendungen in den einzelnen Zeit-Dialekt-Räumen vorliegen.","PeriodicalId":402489,"journal":{"name":"J. Lang. Technol. Comput. Linguistics","volume":"136 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134190084","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 1
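The matching idea from the abstract, applying grapheme correspondence rules per time-dialect space and scoring the agreement of an attested form, can be sketched in a few lines. The rule sets, word forms and space names below are invented for illustration and do not reproduce the paper's actual rules or data.

```python
# Minimal sketch of the matching idea: for each time/dialect space, grapheme
# correspondence rules map an idealized form to the spelling expected in that
# space; an attested form is then scored by how closely it matches the expected
# spelling. Rules and forms are invented examples, not the paper's data.
RULES = {
    "alemannic_9th": {"d": "t", "b": "p"},   # e.g. idealized <d> expected as <t>
    "franconian_9th": {"th": "d"},
}

def expected_form(idealized, rules):
    """Apply grapheme correspondences to an idealized form (longest keys first)."""
    out = idealized
    for source, target in sorted(rules.items(), key=lambda kv: -len(kv[0])):
        out = out.replace(source, target)
    return out

def agreement(attested, idealized, rules):
    """Share of character positions where the attested form matches the expected spelling."""
    expected = expected_form(idealized, rules)
    if not expected:
        return 0.0
    matches = sum(a == e for a, e in zip(attested, expected))
    return matches / max(len(attested), len(expected))

idealized = "dag"   # invented idealized form
attested = "tac"    # invented attested spelling
for space, rules in RULES.items():
    print(space, round(agreement(attested, idealized, rules), 2))
```

Running this prints a higher agreement score for the space whose correspondence rules predict the attested spelling, which is the per-text ranking the paper describes.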
Gepi: An Epigraphic Corpus for Old Georgian and a Tool Sketch for Aiding Reconstruction
J. Lang. Technol. Comput. Linguistics Pub Date : 2016-07-01 DOI: 10.21248/jlcl.31.2016.210
Armin Hoenen, Lela Samushia
{"title":"Gepi: An Epigraphic Corpus for Old Georgian and a Tool Sketchfor Aiding Reconstruction","authors":"Armin Hoenen, Lela Samushia","doi":"10.21248/jlcl.31.2016.210","DOIUrl":"https://doi.org/10.21248/jlcl.31.2016.210","url":null,"abstract":"In the current paper, an annotated corpus of Old Georgian inscriptions is introduced. The corpus contains 91 inscriptions which have been annotated in the standard epigraphic XML format EpiDoc, part of the TEI. Secondly, a prototype tool for helping epigraphic reconstruction is designed based on the inherent needs of epigraphy. The prototype backend uses word embeddings and frequencies generated from a corpus of Old Georgian to determine possible gap fillers. The method is applied to the gaps in the corpus and generates promising results. A sketch of a front end is being designed.","PeriodicalId":402489,"journal":{"name":"J. Lang. Technol. Comput. Linguistics","volume":"403 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123094344","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 1
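The gap-filling backend described in the abstract, ranking candidates by embedding similarity to the surrounding context weighted by corpus frequency, can be sketched as follows. The embeddings, frequencies, English words and the length-tolerance heuristic are all invented for the example; the actual Gepi backend and its Old Georgian data are not reproduced here.

```python
import numpy as np

# Toy word embeddings and corpus frequencies; in the paper these come from a
# corpus of Old Georgian, here they are invented for illustration only.
embeddings = {
    "king":   np.array([0.9, 0.1, 0.0]),
    "church": np.array([0.1, 0.9, 0.2]),
    "built":  np.array([0.2, 0.8, 0.3]),
    "stone":  np.array([0.1, 0.7, 0.4]),
}
frequencies = {"king": 120, "church": 300, "built": 80, "stone": 150}

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def rank_gap_fillers(context_words, gap_length, tolerance=1):
    """Rank candidate fillers by context similarity weighted by log frequency,
    keeping only candidates whose length roughly fits the gap."""
    context_vec = np.mean([embeddings[w] for w in context_words], axis=0)
    scored = []
    for word, vec in embeddings.items():
        if abs(len(word) - gap_length) > tolerance:
            continue
        score = cosine(context_vec, vec) * np.log1p(frequencies[word])
        scored.append((score, word))
    return sorted(scored, reverse=True)

# A gap of roughly six letters surrounded by the context words "built" and "stone":
for score, word in rank_gap_fillers(["built", "stone"], gap_length=6):
    print(f"{word}: {score:.2f}")
```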
ReM: A reference corpus of Middle High German - corpus compilation, annotation, and access
J. Lang. Technol. Comput. Linguistics Pub Date : 2016-07-01 DOI: 10.21248/jlcl.31.2016.208
Florian Petran, Marcel Bollmann, Stefanie Dipper, Thomas Klein
{"title":"ReM: A reference corpus of Middle High German - corpus compilation, annotation, and access","authors":"Florian Petran, Marcel Bollmann, Stefanie Dipper, Thomas Klein","doi":"10.21248/jlcl.31.2016.208","DOIUrl":"https://doi.org/10.21248/jlcl.31.2016.208","url":null,"abstract":"This paper describes ReM and the results of the ReM project and its predecessors. All projects closely collaborate in developing common annotation standards to allow for diachronic investigations. ReA has already been published and made available via the corpus search tool ANNIS2 (Krause and Zeldes, 2016), while ReF and ReN are still in the annotation process. The ReM project builds on several earlier annotation efforts, such as the corpus of the new Middle High German Grammar (MiGraKo, Klein et al. (2009)), expanding them and adding further texts, to produce a reference corpus for Middle High German, which we will also call “ReM” for short. The combined corpus, which consists of around two million tokens, provides a mostly complete collection of written records from Early Middle High German (1050–1200) as well as a selection of Middle High German texts from 1200 to 1350. Texts have been digitized and annotated with parts of speech and morphology (using the HiTS tagset, cf. Dipper et al. (2013)) as well as lemma information. Release 1.0 of ReM has been published in December 2016 and is also accessible via the ANNIS tool. The project website at https://www.linguistics.ruhr-uni-bochum. de/rem/ offers extensive documentation of the project and the corpus. The corpus","PeriodicalId":402489,"journal":{"name":"J. Lang. Technol. Comput. Linguistics","volume":"92 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122043735","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 5
Merging and validating heterogenous, multi-layered corpora with discoursegraphs
J. Lang. Technol. Comput. Linguistics Pub Date : 2016-07-01 DOI: 10.21248/jlcl.31.2016.204
Arne Neumann
{"title":"Merging and validating heterogenous, multi-layered corpora with discoursegraphs","authors":"Arne Neumann","doi":"10.21248/jlcl.31.2016.204","DOIUrl":"https://doi.org/10.21248/jlcl.31.2016.204","url":null,"abstract":"We present discoursegraphs, a library and command-line application for the conversion and merging of linguistic annotations written in Python. The software reads and writes numerous formats for syntactic and discourse-related annotations, but also supports generic interchange formats. discoursegraphs models primary data and its annotations as a graph and is therefore able to merge multiple independent, possibly conflicting annotation layers into a unified representation. We show how this approach is beneficial for the revision and validation of a corpus with multiple conflicting, independently annotated layers.","PeriodicalId":402489,"journal":{"name":"J. Lang. Technol. Comput. Linguistics","volume":"42 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121963152","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 1
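The merge-and-validate idea from the abstract, representing independently produced annotation layers as graphs over the same token IDs, detecting attribute conflicts, and then composing them into one graph, can be sketched generically with networkx. This is not the discoursegraphs API; the node names, attributes and the conflict check are invented for the example.

```python
import networkx as nx

# Two independently produced annotation layers over the same token IDs, modelled
# as directed graphs. Illustrates the merge-and-validate idea only; it does not
# use the actual discoursegraphs library.
syntax = nx.DiGraph()
syntax.add_node("tok1", word="Berlin", pos="NE")
syntax.add_node("tok2", word="wächst", pos="VVFIN")
syntax.add_edge("tok2", "tok1", label="subj")

coreference = nx.DiGraph()
coreference.add_node("tok1", word="Berlin", entity="LOCATION")
coreference.add_node("tok2", word="waechst")      # conflicting surface form

def attribute_conflicts(g, h):
    """List nodes whose shared attributes disagree between the two layers."""
    conflicts = []
    for node in set(g) & set(h):
        for key in set(g.nodes[node]) & set(h.nodes[node]):
            if g.nodes[node][key] != h.nodes[node][key]:
                conflicts.append((node, key, g.nodes[node][key], h.nodes[node][key]))
    return conflicts

# Report conflicts before merging; nx.compose lets the second graph's attributes win.
print(attribute_conflicts(syntax, coreference))
# [('tok2', 'word', 'wächst', 'waechst')]
merged = nx.compose(syntax, coreference)
print(merged.nodes(data=True))
```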
SpoCo - a simple and adaptable web interface for dialect corpora
J. Lang. Technol. Comput. Linguistics Pub Date : 2016-07-01 DOI: 10.21248/jlcl.31.2016.206
R. Waldenfels, Michal Wozniak
{"title":"SpoCo - a simple and adaptable web interface for dialect corpora","authors":"R. Waldenfels, Michal Wozniak","doi":"10.21248/jlcl.31.2016.206","DOIUrl":"https://doi.org/10.21248/jlcl.31.2016.206","url":null,"abstract":"We present SpoCo, a simple, yet effective system for the web-based query of dialect corpora encoded in ELAN that provides users with advanced concordancing functions, as well as the the possibility to edit and correct transcriptions if needed. SpoCo is easy to use and maintain, and can be adapted to different spoken corpora in a straightforward way. Simplicity is emphasized to facilitate use by a wide range of users and research groups, including those with limited technical and financial resources, and encourage collaboration and data exchange across such groups. Relying on existing technology and pursuing a modular architecture, SpoCo is developed bottom-up: it was initially devised for a specific dialect project and is being continually adapted for use in other projects in a network of Slavic dialect projects that cooperate in tool development and data sharing. SpoCo thus takes a middle position between systems that are developed for the purposes of a specific dialect corpus, on the one hand, and general-use systems designed for a wide range of data and usage cases, on the other.","PeriodicalId":402489,"journal":{"name":"J. Lang. Technol. Comput. Linguistics","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130125241","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 4
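A minimal flavour of concordancing over ELAN data can be given with the standard library: ELAN's .eaf files are XML, so a simple keyword-in-tier search only needs to walk the annotation values. This sketch assumes the usual EAF layout in which TIER elements contain annotations whose text sits in ANNOTATION_VALUE elements; the file name, tier ID and search term are hypothetical, and SpoCo's own backend is not used.

```python
import xml.etree.ElementTree as ET

# Minimal concordance over an ELAN transcription file (.eaf). Assumes the standard
# EAF layout (TIER elements containing ANNOTATION_VALUE elements); file name, tier
# name and keyword below are invented examples.
def concordance(eaf_path, tier_id, keyword):
    tree = ET.parse(eaf_path)
    hits = []
    for tier in tree.iter("TIER"):
        if tier.get("TIER_ID") != tier_id:
            continue
        for value in tier.iter("ANNOTATION_VALUE"):
            text = value.text or ""
            if keyword in text.split():
                hits.append(text)
    return hits

if __name__ == "__main__":
    for line in concordance("recording01.eaf", tier_id="speaker_A", keyword="dom"):
        print(line)
```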
graphANNIS: A Fast Query Engine for Deeply Annotated Linguistic Corpora
J. Lang. Technol. Comput. Linguistics Pub Date : 2016-07-01 DOI: 10.21248/jlcl.31.2016.199
Thomas Krause, U. Leser, Anke Lüdeling
{"title":"graphANNIS: A Fast Query Engine for Deeply Annotated Linguistic Corpora","authors":"Thomas Krause, U. Leser, Anke Lüdeling","doi":"10.21248/jlcl.31.2016.199","DOIUrl":"https://doi.org/10.21248/jlcl.31.2016.199","url":null,"abstract":"We present graphANNIS, a fast implementation of the established query language AQL for dealing with deeply annotated linguistic corpora. AQL builds on a graph-based abstraction for modeling and exchanging linguistic data, yet all its current implementations use relational databases as storage layer. In contrast, graphANNIS directly implements the ANNIS graph data model in main memory. We show that the vast majority of the AQL functionality can be mapped to the basic operation of finding paths in a graph and present efficient implementations and index structures for this and all other required operations. We compare the performance of graphANNIS with that of the standard SQL-based implementation of AQL, using a workload of more than 3000 real-life queries on a set of 17 open corpora each with a size up to 3 Million tokens, whose annotations range from simple and linear part-of-speech tagging to deeply nested discourse structures. For the entire workload, graphANNIS is more than 40 times faster, and slower in less than 3% of the queries. graphANNIS as well as the workload and corpora used for evaluation are freely available at GitHub and the Zenodo Open Access archive.","PeriodicalId":402489,"journal":{"name":"J. Lang. Technol. Comput. Linguistics","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131466472","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 10
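The central reduction mentioned in the abstract, mapping AQL operators to path finding over an in-memory graph, can be illustrated with a toy dominance query (roughly `cat="NP" & pos="NN" & #1 >* #2`, an NP node indirectly dominating a noun token). The graph, labels and traversal below are invented for the example and are not the graphANNIS implementation.

```python
from collections import deque

# Toy syntax graph: dominance edges from parent to child, plus node labels.
# The query asks for NP nodes that (indirectly) dominate an NN token, and is
# answered by breadth-first search from each NP node.
dominance = {
    "S": ["NP1", "VP"],
    "NP1": ["DET1", "NN1"],
    "VP": ["V1", "NP2"],
    "NP2": ["NN2"],
}
labels = {
    "NP1": {"cat": "NP"}, "NP2": {"cat": "NP"},
    "NN1": {"pos": "NN"}, "NN2": {"pos": "NN"},
    "DET1": {"pos": "ART"}, "V1": {"pos": "VVFIN"},
    "S": {"cat": "S"}, "VP": {"cat": "VP"},
}

def dominated_nodes(start):
    """All nodes reachable from `start` via one or more dominance edges."""
    seen, queue = set(), deque(dominance.get(start, []))
    while queue:
        node = queue.popleft()
        if node not in seen:
            seen.add(node)
            queue.extend(dominance.get(node, []))
    return seen

matches = [
    (np_node, nn_node)
    for np_node, attrs in labels.items() if attrs.get("cat") == "NP"
    for nn_node in dominated_nodes(np_node) if labels[nn_node].get("pos") == "NN"
]
print(matches)   # [('NP1', 'NN1'), ('NP2', 'NN2')] (order may vary)
```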
PAL, a tool for Pre-annotation and Active Learning
J. Lang. Technol. Comput. Linguistics Pub Date : 2016-07-01 DOI: 10.21248/jlcl.31.2016.203
Maria Skeppstedt, C. Paradis, A. Kerren
{"title":"PAL, a tool for Pre-annotation and Active Learning","authors":"Maria Skeppstedt, C. Paradis, A. Kerren","doi":"10.21248/jlcl.31.2016.203","DOIUrl":"https://doi.org/10.21248/jlcl.31.2016.203","url":null,"abstract":"Many natural language processing systems rely on machine learning models that are trained on large amounts of manually annotated text data. The lack of sufficient amounts of annotated data is, however, a common obstacle for such systems, since manual annotation of text is often expensive and time-consuming. The aim of “PAL\", a tool for Pre-annotation and Active Learning” is to provide a ready-made package that can be used to simplify annotation and to reduce the amount of annotated data required to train a machine learning classifier. The package provides support for two techniques that have been shown to be successful in previous studies, namely active learning and pre-annotation. The output of the pre-annotation is provided in the annotation format of the annotation tool BRAT, but PAL is a stand-alone package that can be adapted to other formats. (Less)","PeriodicalId":402489,"journal":{"name":"J. Lang. Technol. Comput. Linguistics","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133925824","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 17
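The two techniques named in the abstract can be sketched generically with scikit-learn: a classifier trained on the labelled pool suggests labels for unlabelled data (pre-annotation), and the least confident instances are routed to the human annotator first (uncertainty-based active learning). The toy sentences and labels are invented; PAL itself targets BRAT-style annotation output and is not reproduced here.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Generic uncertainty-sampling loop in the spirit of the abstract: the model
# pre-annotates the unlabelled pool and the least confident instances are asked
# of the human annotator first. Data below is a toy example.
labelled_texts = ["great movie", "terrible film", "wonderful acting", "awful plot"]
labels = [1, 0, 1, 0]
unlabelled_texts = ["boring but pretty", "absolutely wonderful", "quite terrible ending"]

vectorizer = TfidfVectorizer()
X_labelled = vectorizer.fit_transform(labelled_texts)
X_unlabelled = vectorizer.transform(unlabelled_texts)

model = LogisticRegression().fit(X_labelled, labels)

probabilities = model.predict_proba(X_unlabelled)
confidence = probabilities.max(axis=1)           # confidence of the predicted label
pre_annotations = model.predict(X_unlabelled)    # suggested labels (pre-annotation)

# Ask the annotator about the least confident instances first.
for idx in np.argsort(confidence):
    print(f"{unlabelled_texts[idx]!r}: suggested={pre_annotations[idx]}, "
          f"confidence={confidence[idx]:.2f}")
```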
Data Mining Software for Corpus Linguistics with Application in Diachronic Linguistics
J. Lang. Technol. Comput. Linguistics Pub Date : 2016-07-01 DOI: 10.21248/jlcl.31.2016.201
Christian Pölitz
{"title":"Data Mining Software for Corpus Linguistics with Application in Diachronic Linguistics","authors":"Christian Pölitz","doi":"10.21248/jlcl.31.2016.201","DOIUrl":"https://doi.org/10.21248/jlcl.31.2016.201","url":null,"abstract":"Large digital copora have become a valuable resource for linguistic research. We introduce a software tool to efficiently perform Data Mining tasks for diachronic linguistics to investigate linguistic phenomena with respect to time. As a running example, we show a topic model that extracts different meanings from large digital copora over time.","PeriodicalId":402489,"journal":{"name":"J. Lang. Technol. Comput. Linguistics","volume":"19 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133306849","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
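The diachronic idea in the running example, fitting a topic model per time slice and inspecting how the dominant word contexts shift, can be sketched with scikit-learn. The two tiny "corpora" and the time slices are invented for illustration; the paper's own data mining software is not used here.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Sketch: fit a small topic model per time slice and print the top words per
# topic, illustrating how the contexts of a word ("mouse") shift over time.
slices = {
    "1900-1950": [
        "the mouse ate the grain in the barn",
        "a mouse ran across the farm yard",
    ],
    "1990-2020": [
        "click the mouse to open the file",
        "the wireless mouse connects to the computer",
    ],
}

for period, documents in slices.items():
    vectorizer = CountVectorizer(stop_words="english")
    doc_term = vectorizer.fit_transform(documents)
    lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(doc_term)
    vocabulary = vectorizer.get_feature_names_out()
    for topic_idx, weights in enumerate(lda.components_):
        top = weights.argsort()[::-1][:3]
        print(period, f"topic {topic_idx}:", [vocabulary[i] for i in top])
```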
Construction and Dissemination of a Corpus of Spoken Interaction - Tools and Workflows in the FOLK project
J. Lang. Technol. Comput. Linguistics Pub Date : 2016-07-01 DOI: 10.21248/jlcl.31.2016.205
Thomas C. Schmidt
{"title":"Construction and Dissemination of a Corpus of Spoken Interaction - Tools and Workflows in the FOLK project","authors":"Thomas C. Schmidt","doi":"10.21248/jlcl.31.2016.205","DOIUrl":"https://doi.org/10.21248/jlcl.31.2016.205","url":null,"abstract":"This paper is about the workflow for construction and dissemination of FOLK (Forschungs - und Lehrkorpus Gesprochenes Deutsch – Research and Teaching Corpus of Spoken German), a large corpus of authentic spoken interaction data, recorded on audio and video. Section 2 describes in detail the tools used in the individual steps of transcription, anonymization, orthographic normalization, lemmatization and POS tagging of the data, as well as some utilities used for corpus management. Section 3 deals with the DGD (Datenbank fur Gesprochenes Deutsch - Database of Spoken German) as a tool for distributing completed data sets and making them available for qualitative and quantitative analysis. In section 4, some plans for further development are sketched.","PeriodicalId":402489,"journal":{"name":"J. Lang. Technol. Comput. Linguistics","volume":"4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117140671","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 11