Latest publications in J. Lang. Technol. Comput. Linguistics

Krill: KorAP search and analysis engine
J. Lang. Technol. Comput. Linguistics Pub Date : 2016-07-01 DOI: 10.21248/jlcl.31.2016.202
Nils Diewald, Eliza Margaretha
{"title":"Krill: KorAP search and analysis engine","authors":"Nils Diewald, Eliza Margaretha","doi":"10.21248/jlcl.31.2016.202","DOIUrl":"https://doi.org/10.21248/jlcl.31.2016.202","url":null,"abstract":"KorAP1 (Korpusanalyseplattform) is a corpus search and analysis platform for handling very large corpora with multiple annotation layers, multiple query languages, and complex licensing models (Bański et al., 2013a). It is intended to succeed the COSMAS II system (Bodmer, 1996) in providing DEREKO, the German reference corpus (Kupietz and Lüngen, 2014), hosted by the Institute for the German Language (IDS).2 The corpus consists of a wide range of texts such as fiction, newspaper articles and scripted speech, annotated on multiple linguistic levels, for instance part-of-speech and syntactic dependency structures. It was reported to contain approximately 30 billion words in September 2016 and still grows continually. Krill3 (Corpus-data Retrieval Index using Lucene for Look-ups) is a corpus search engine that serves as a search component in KorAP. It is based on Apache Lucene,4 a popular and well-established information retrieval engine. Lucene’s lightweight memory requirements and scalable indexing are suitable for handling large corpora whose size increases rapidly. It supports full-text search for many query types including phrase and wildcard queries, and allows custom implementations to cope with complex linguistic queries. In this paper, we describe Krill and how its index is designed to handle full-text and complex annotation search combining different annotation layers and sources of very large corpora. The paper is structured as follows. Section 2 describes how a search works in KorAP (starting from receiving a search request until returning the search results). Section 3 explains how corpus data are represented and indexed in Krill. Section 4 describes various kinds of queries handled by Krill and how they are processed for the actual search on the index. The Krill response format containing search results is described in Section 5. We present related and further work in Section 6 and 7 respectively. The paper ends with a summary.","PeriodicalId":402489,"journal":{"name":"J. Lang. Technol. Comput. Linguistics","volume":"96 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127392809","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 9
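The multi-layer annotation search described in the Krill abstract can be illustrated with a small, self-contained sketch: annotations from different layers are indexed as layer-prefixed terms on shared token positions, and a query combining layers is answered by intersecting posting lists. This is only an illustration of the general indexing idea, not Krill's actual Lucene-based implementation; the toy documents and identifiers are invented for the example.

```python
from collections import defaultdict

# Toy index: one posting list per layer-prefixed term, holding (doc_id, token_position)
# pairs. Krill delegates indexing to Apache Lucene; this sketch only shows the idea of
# combining annotation layers ("surface", "lemma", "pos") on shared token positions.
postings = defaultdict(set)

documents = {
    "doc1": [("Häuser", "Haus", "NN"), ("stehen", "stehen", "VVFIN")],
    "doc2": [("Das", "die", "ART"), ("Haus", "Haus", "NN")],
}

for doc_id, tokens in documents.items():
    for position, (surface, lemma, pos_tag) in enumerate(tokens):
        postings[f"surface:{surface}"].add((doc_id, position))
        postings[f"lemma:{lemma}"].add((doc_id, position))
        postings[f"pos:{pos_tag}"].add((doc_id, position))

def search(*terms):
    """Return token positions where all layer-prefixed terms co-occur."""
    result = postings[terms[0]].copy()
    for term in terms[1:]:
        result &= postings[term]
    return sorted(result)

# Query across two annotation layers: lemma "Haus" tagged as a noun.
print(search("lemma:Haus", "pos:NN"))   # [('doc1', 0), ('doc2', 1)]
```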
Automatisierter Abgleich des Lautstandes althochdeutscher Wortformen (Automated comparison of the phonological state of Old High German word forms)
J. Lang. Technol. Comput. Linguistics Pub Date : 2016-07-01 DOI: 10.21248/jlcl.31.2016.209
Roland Mittmann
{"title":"Automatisierter Abgleich des Lautstandes althochdeut-scher Wortformen","authors":"Roland Mittmann","doi":"10.21248/jlcl.31.2016.209","DOIUrl":"https://doi.org/10.21248/jlcl.31.2016.209","url":null,"abstract":"Um Texte einer Sprache automatisiert auf ihren möglichen Entstehungszeitraum und ihre dialektale Zugehörigkeit hin zu untersuchen, werden für jedes erwartete Graphem und jede Flexionsendung zunächst Entsprechungsregeln zwischen einer idealisierten Sprachform und den Sprachformen in Grammatiken beschriebener Zeit-Dialekt-Räume erfasst. Anschließend werden mithilfe eines Computerprogramms unter Anwendung dieser Regeln die belegten Wortformen mit ihren Entsprechungen in der idealisierten Sprachform abgeglichen und für jeden Text die Übereinstimmungsgrade mit den einzelnen Zeit-Dialekt-Räumen angegeben. Exemplarisch wird dieser Abgleich für eine althochdeutsche Wortform beschrieben und das Ergebnis der Analyse des zugehörigen Gesamttextes dargestellt. 1 Untersuchungsthema Seit jeher verändern sich Sprachen im Laufe der zeitlichen Entwicklung. Sobald ihre Sprechergemeinschaften in verschiedene Gruppen zerfallen, die nicht mehr dauerhaft miteinander in Kontakt stehen, entwickeln sie zudem verschiedene Varietäten. Solange die Normierung einer Sprache nicht erfolgt ist, bleibt die textliche Überlieferung daher sprachlich uneinheitlich. Auch innerhalb eines Textes können Schwankungen auftreten, etwa wenn Sprecher verschiedener Dialekte am selben Text arbeiten oder einen bestehenden Text korrigieren (vgl. etwa BRAUNE/REIFFENSTEIN 2004, § 3 und Anm. 1). Ein einzelner Autor kann ebenfalls verschiedenen dialektalen Einflüssen unterworfen sein oder die im Laufe seines Lebens erfolgte sprachliche Veränderung in seinen Niederschriften wiedergeben. Da vor der Erfindung des Buchdrucks Texte allein durch Abschrift vervielfältigt wurden, kam es schließlich auch seitens der Kopisten – bewusst oder unbewusst – zu sprachlichen Anpassungen bei dialektalen Formen bzw. infolge der zeitlichen Entwicklung. Sind zu einem Teil der textlichen Überlieferung einer Sprache keine genaueren zeitlichen und örtlichen Angaben bekannt, erscheint es denkbar, diese automatisiert auf ihre Übereinstimmung mit verschiedenen Zeit-Dialekt-Räumen – also Zeitabschnitten mit Bezug auf die verschiedenen örtlichen Varietäten – zu untersuchen. Diese Untersuchung wird im Folgenden beschrieben. Voraussetzung dafür ist, dass Angaben zu den üblichen Entsprechungen der verschiedenen Phonem-GraphEntsprechungen (Lautverschriftungen, vgl. MITTMANN 2015b, 248) und der Flexionsendungen in den einzelnen Zeit-Dialekt-Räumen vorliegen.","PeriodicalId":402489,"journal":{"name":"J. Lang. Technol. Comput. Linguistics","volume":"136 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134190084","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 1
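The matching idea from the abstract, applying grapheme correspondence rules per time-dialect space and scoring the agreement of an attested form, can be sketched in a few lines. The rule sets, word forms and space names below are invented for illustration and do not reproduce the paper's actual rules or data.

```python
# Minimal sketch of the matching idea: for each time/dialect space, grapheme
# correspondence rules map an idealized form to the spelling expected in that
# space; an attested form is then scored by how closely it matches the expected
# spelling. Rules and forms are invented examples, not the paper's data.
RULES = {
    "alemannic_9th": {"d": "t", "b": "p"},   # e.g. idealized <d> expected as <t>
    "franconian_9th": {"th": "d"},
}

def expected_form(idealized, rules):
    """Apply grapheme correspondences to an idealized form (longest keys first)."""
    out = idealized
    for source, target in sorted(rules.items(), key=lambda kv: -len(kv[0])):
        out = out.replace(source, target)
    return out

def agreement(attested, idealized, rules):
    """Share of character positions where the attested form matches the expected spelling."""
    expected = expected_form(idealized, rules)
    if not expected:
        return 0.0
    matches = sum(a == e for a, e in zip(attested, expected))
    return matches / max(len(attested), len(expected))

idealized = "dag"   # invented idealized form
attested = "tac"    # invented attested spelling
for space, rules in RULES.items():
    print(space, round(agreement(attested, idealized, rules), 2))
```

Running this prints a higher agreement score for the space whose correspondence rules predict the attested spelling, which is the per-text ranking the paper describes.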
Gepi: An Epigraphic Corpus for Old Georgian and a Tool Sketch for Aiding Reconstruction
J. Lang. Technol. Comput. Linguistics Pub Date : 2016-07-01 DOI: 10.21248/jlcl.31.2016.210
Armin Hoenen, Lela Samushia
{"title":"Gepi: An Epigraphic Corpus for Old Georgian and a Tool Sketchfor Aiding Reconstruction","authors":"Armin Hoenen, Lela Samushia","doi":"10.21248/jlcl.31.2016.210","DOIUrl":"https://doi.org/10.21248/jlcl.31.2016.210","url":null,"abstract":"In the current paper, an annotated corpus of Old Georgian inscriptions is introduced. The corpus contains 91 inscriptions which have been annotated in the standard epigraphic XML format EpiDoc, part of the TEI. Secondly, a prototype tool for helping epigraphic reconstruction is designed based on the inherent needs of epigraphy. The prototype backend uses word embeddings and frequencies generated from a corpus of Old Georgian to determine possible gap fillers. The method is applied to the gaps in the corpus and generates promising results. A sketch of a front end is being designed.","PeriodicalId":402489,"journal":{"name":"J. Lang. Technol. Comput. Linguistics","volume":"403 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123094344","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 1
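The gap-filling backend described in the abstract, ranking candidates by embedding similarity to the surrounding context weighted by corpus frequency, can be sketched as follows. The embeddings, frequencies, English words and the length-tolerance heuristic are all invented for the example; the actual Gepi backend and its Old Georgian data are not reproduced here.

```python
import numpy as np

# Toy word embeddings and corpus frequencies; in the paper these come from a
# corpus of Old Georgian, here they are invented for illustration only.
embeddings = {
    "king":   np.array([0.9, 0.1, 0.0]),
    "church": np.array([0.1, 0.9, 0.2]),
    "built":  np.array([0.2, 0.8, 0.3]),
    "stone":  np.array([0.1, 0.7, 0.4]),
}
frequencies = {"king": 120, "church": 300, "built": 80, "stone": 150}

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def rank_gap_fillers(context_words, gap_length, tolerance=1):
    """Rank candidate fillers by context similarity weighted by log frequency,
    keeping only candidates whose length roughly fits the gap."""
    context_vec = np.mean([embeddings[w] for w in context_words], axis=0)
    scored = []
    for word, vec in embeddings.items():
        if abs(len(word) - gap_length) > tolerance:
            continue
        score = cosine(context_vec, vec) * np.log1p(frequencies[word])
        scored.append((score, word))
    return sorted(scored, reverse=True)

# A gap of roughly six letters surrounded by the context words "built" and "stone":
for score, word in rank_gap_fillers(["built", "stone"], gap_length=6):
    print(f"{word}: {score:.2f}")
```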
ReM: A reference corpus of Middle High German - corpus compilation, annotation, and access
J. Lang. Technol. Comput. Linguistics Pub Date : 2016-07-01 DOI: 10.21248/jlcl.31.2016.208
Florian Petran, Marcel Bollmann, Stefanie Dipper, Thomas Klein
{"title":"ReM: A reference corpus of Middle High German - corpus compilation, annotation, and access","authors":"Florian Petran, Marcel Bollmann, Stefanie Dipper, Thomas Klein","doi":"10.21248/jlcl.31.2016.208","DOIUrl":"https://doi.org/10.21248/jlcl.31.2016.208","url":null,"abstract":"This paper describes ReM and the results of the ReM project and its predecessors. All projects closely collaborate in developing common annotation standards to allow for diachronic investigations. ReA has already been published and made available via the corpus search tool ANNIS2 (Krause and Zeldes, 2016), while ReF and ReN are still in the annotation process. The ReM project builds on several earlier annotation efforts, such as the corpus of the new Middle High German Grammar (MiGraKo, Klein et al. (2009)), expanding them and adding further texts, to produce a reference corpus for Middle High German, which we will also call “ReM” for short. The combined corpus, which consists of around two million tokens, provides a mostly complete collection of written records from Early Middle High German (1050–1200) as well as a selection of Middle High German texts from 1200 to 1350. Texts have been digitized and annotated with parts of speech and morphology (using the HiTS tagset, cf. Dipper et al. (2013)) as well as lemma information. Release 1.0 of ReM has been published in December 2016 and is also accessible via the ANNIS tool. The project website at https://www.linguistics.ruhr-uni-bochum. de/rem/ offers extensive documentation of the project and the corpus. The corpus","PeriodicalId":402489,"journal":{"name":"J. Lang. Technol. Comput. Linguistics","volume":"92 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122043735","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 5
Merging and validating heterogenous, multi-layered corpora with discoursegraphs
J. Lang. Technol. Comput. Linguistics Pub Date : 2016-07-01 DOI: 10.21248/jlcl.31.2016.204
Arne Neumann
{"title":"Merging and validating heterogenous, multi-layered corpora with discoursegraphs","authors":"Arne Neumann","doi":"10.21248/jlcl.31.2016.204","DOIUrl":"https://doi.org/10.21248/jlcl.31.2016.204","url":null,"abstract":"We present discoursegraphs, a library and command-line application for the conversion and merging of linguistic annotations written in Python. The software reads and writes numerous formats for syntactic and discourse-related annotations, but also supports generic interchange formats. discoursegraphs models primary data and its annotations as a graph and is therefore able to merge multiple independent, possibly conflicting annotation layers into a unified representation. We show how this approach is beneficial for the revision and validation of a corpus with multiple conflicting, independently annotated layers.","PeriodicalId":402489,"journal":{"name":"J. Lang. Technol. Comput. Linguistics","volume":"42 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121963152","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 1
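The merge-and-validate idea from the abstract, representing independently produced annotation layers as graphs over the same token IDs, detecting attribute conflicts, and then composing them into one graph, can be sketched generically with networkx. This is not the discoursegraphs API; the node names, attributes and the conflict check are invented for the example.

```python
import networkx as nx

# Two independently produced annotation layers over the same token IDs, modelled
# as directed graphs. Illustrates the merge-and-validate idea only; it does not
# use the actual discoursegraphs library.
syntax = nx.DiGraph()
syntax.add_node("tok1", word="Berlin", pos="NE")
syntax.add_node("tok2", word="wächst", pos="VVFIN")
syntax.add_edge("tok2", "tok1", label="subj")

coreference = nx.DiGraph()
coreference.add_node("tok1", word="Berlin", entity="LOCATION")
coreference.add_node("tok2", word="waechst")      # conflicting surface form

def attribute_conflicts(g, h):
    """List nodes whose shared attributes disagree between the two layers."""
    conflicts = []
    for node in set(g) & set(h):
        for key in set(g.nodes[node]) & set(h.nodes[node]):
            if g.nodes[node][key] != h.nodes[node][key]:
                conflicts.append((node, key, g.nodes[node][key], h.nodes[node][key]))
    return conflicts

# Report conflicts before merging; nx.compose lets the second graph's attributes win.
print(attribute_conflicts(syntax, coreference))
# [('tok2', 'word', 'wächst', 'waechst')]
merged = nx.compose(syntax, coreference)
print(merged.nodes(data=True))
```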
SpoCo - a simple and adaptable web interface for dialect corpora
J. Lang. Technol. Comput. Linguistics Pub Date : 2016-07-01 DOI: 10.21248/jlcl.31.2016.206
R. Waldenfels, Michal Wozniak
{"title":"SpoCo - a simple and adaptable web interface for dialect corpora","authors":"R. Waldenfels, Michal Wozniak","doi":"10.21248/jlcl.31.2016.206","DOIUrl":"https://doi.org/10.21248/jlcl.31.2016.206","url":null,"abstract":"We present SpoCo, a simple, yet effective system for the web-based query of dialect corpora encoded in ELAN that provides users with advanced concordancing functions, as well as the the possibility to edit and correct transcriptions if needed. SpoCo is easy to use and maintain, and can be adapted to different spoken corpora in a straightforward way. Simplicity is emphasized to facilitate use by a wide range of users and research groups, including those with limited technical and financial resources, and encourage collaboration and data exchange across such groups. Relying on existing technology and pursuing a modular architecture, SpoCo is developed bottom-up: it was initially devised for a specific dialect project and is being continually adapted for use in other projects in a network of Slavic dialect projects that cooperate in tool development and data sharing. SpoCo thus takes a middle position between systems that are developed for the purposes of a specific dialect corpus, on the one hand, and general-use systems designed for a wide range of data and usage cases, on the other.","PeriodicalId":402489,"journal":{"name":"J. Lang. Technol. Comput. Linguistics","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130125241","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 4
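A minimal flavour of concordancing over ELAN data can be given with the standard library: ELAN's .eaf files are XML, so a simple keyword-in-tier search only needs to walk the annotation values. This sketch assumes the usual EAF layout in which TIER elements contain annotations whose text sits in ANNOTATION_VALUE elements; the file name, tier ID and search term are hypothetical, and SpoCo's own backend is not used.

```python
import xml.etree.ElementTree as ET

# Minimal concordance over an ELAN transcription file (.eaf). Assumes the standard
# EAF layout (TIER elements containing ANNOTATION_VALUE elements); file name, tier
# name and keyword below are invented examples.
def concordance(eaf_path, tier_id, keyword):
    tree = ET.parse(eaf_path)
    hits = []
    for tier in tree.iter("TIER"):
        if tier.get("TIER_ID") != tier_id:
            continue
        for value in tier.iter("ANNOTATION_VALUE"):
            text = value.text or ""
            if keyword in text.split():
                hits.append(text)
    return hits

if __name__ == "__main__":
    for line in concordance("recording01.eaf", tier_id="speaker_A", keyword="dom"):
        print(line)
```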
graphANNIS: A Fast Query Engine for Deeply Annotated Linguistic Corpora
J. Lang. Technol. Comput. Linguistics Pub Date : 2016-07-01 DOI: 10.21248/jlcl.31.2016.199
Thomas Krause, U. Leser, Anke Lüdeling
{"title":"graphANNIS: A Fast Query Engine for Deeply Annotated Linguistic Corpora","authors":"Thomas Krause, U. Leser, Anke Lüdeling","doi":"10.21248/jlcl.31.2016.199","DOIUrl":"https://doi.org/10.21248/jlcl.31.2016.199","url":null,"abstract":"We present graphANNIS, a fast implementation of the established query language AQL for dealing with deeply annotated linguistic corpora. AQL builds on a graph-based abstraction for modeling and exchanging linguistic data, yet all its current implementations use relational databases as storage layer. In contrast, graphANNIS directly implements the ANNIS graph data model in main memory. We show that the vast majority of the AQL functionality can be mapped to the basic operation of finding paths in a graph and present efficient implementations and index structures for this and all other required operations. We compare the performance of graphANNIS with that of the standard SQL-based implementation of AQL, using a workload of more than 3000 real-life queries on a set of 17 open corpora each with a size up to 3 Million tokens, whose annotations range from simple and linear part-of-speech tagging to deeply nested discourse structures. For the entire workload, graphANNIS is more than 40 times faster, and slower in less than 3% of the queries. graphANNIS as well as the workload and corpora used for evaluation are freely available at GitHub and the Zenodo Open Access archive.","PeriodicalId":402489,"journal":{"name":"J. Lang. Technol. Comput. Linguistics","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131466472","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 10
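The central reduction mentioned in the abstract, mapping AQL operators to path finding over an in-memory graph, can be illustrated with a toy dominance query (roughly `cat="NP" & pos="NN" & #1 >* #2`, an NP node indirectly dominating a noun token). The graph, labels and traversal below are invented for the example and are not the graphANNIS implementation.

```python
from collections import deque

# Toy syntax graph: dominance edges from parent to child, plus node labels.
# The query asks for NP nodes that (indirectly) dominate an NN token, and is
# answered by breadth-first search from each NP node.
dominance = {
    "S": ["NP1", "VP"],
    "NP1": ["DET1", "NN1"],
    "VP": ["V1", "NP2"],
    "NP2": ["NN2"],
}
labels = {
    "NP1": {"cat": "NP"}, "NP2": {"cat": "NP"},
    "NN1": {"pos": "NN"}, "NN2": {"pos": "NN"},
    "DET1": {"pos": "ART"}, "V1": {"pos": "VVFIN"},
    "S": {"cat": "S"}, "VP": {"cat": "VP"},
}

def dominated_nodes(start):
    """All nodes reachable from `start` via one or more dominance edges."""
    seen, queue = set(), deque(dominance.get(start, []))
    while queue:
        node = queue.popleft()
        if node not in seen:
            seen.add(node)
            queue.extend(dominance.get(node, []))
    return seen

matches = [
    (np_node, nn_node)
    for np_node, attrs in labels.items() if attrs.get("cat") == "NP"
    for nn_node in dominated_nodes(np_node) if labels[nn_node].get("pos") == "NN"
]
print(matches)   # [('NP1', 'NN1'), ('NP2', 'NN2')] (order may vary)
```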
PAL, a tool for Pre-annotation and Active Learning
J. Lang. Technol. Comput. Linguistics Pub Date : 2016-07-01 DOI: 10.21248/jlcl.31.2016.203
Maria Skeppstedt, C. Paradis, A. Kerren
{"title":"PAL, a tool for Pre-annotation and Active Learning","authors":"Maria Skeppstedt, C. Paradis, A. Kerren","doi":"10.21248/jlcl.31.2016.203","DOIUrl":"https://doi.org/10.21248/jlcl.31.2016.203","url":null,"abstract":"Many natural language processing systems rely on machine learning models that are trained on large amounts of manually annotated text data. The lack of sufficient amounts of annotated data is, however, a common obstacle for such systems, since manual annotation of text is often expensive and time-consuming. The aim of “PAL\", a tool for Pre-annotation and Active Learning” is to provide a ready-made package that can be used to simplify annotation and to reduce the amount of annotated data required to train a machine learning classifier. The package provides support for two techniques that have been shown to be successful in previous studies, namely active learning and pre-annotation. The output of the pre-annotation is provided in the annotation format of the annotation tool BRAT, but PAL is a stand-alone package that can be adapted to other formats. (Less)","PeriodicalId":402489,"journal":{"name":"J. Lang. Technol. Comput. Linguistics","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133925824","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 17
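The two techniques named in the abstract can be sketched generically with scikit-learn: a classifier trained on the labelled pool suggests labels for unlabelled data (pre-annotation), and the least confident instances are routed to the human annotator first (uncertainty-based active learning). The toy sentences and labels are invented; PAL itself targets BRAT-style annotation output and is not reproduced here.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Generic uncertainty-sampling loop in the spirit of the abstract: the model
# pre-annotates the unlabelled pool and the least confident instances are asked
# of the human annotator first. Data below is a toy example.
labelled_texts = ["great movie", "terrible film", "wonderful acting", "awful plot"]
labels = [1, 0, 1, 0]
unlabelled_texts = ["boring but pretty", "absolutely wonderful", "quite terrible ending"]

vectorizer = TfidfVectorizer()
X_labelled = vectorizer.fit_transform(labelled_texts)
X_unlabelled = vectorizer.transform(unlabelled_texts)

model = LogisticRegression().fit(X_labelled, labels)

probabilities = model.predict_proba(X_unlabelled)
confidence = probabilities.max(axis=1)           # confidence of the predicted label
pre_annotations = model.predict(X_unlabelled)    # suggested labels (pre-annotation)

# Ask the annotator about the least confident instances first.
for idx in np.argsort(confidence):
    print(f"{unlabelled_texts[idx]!r}: suggested={pre_annotations[idx]}, "
          f"confidence={confidence[idx]:.2f}")
```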
Data Mining Software for Corpus Linguistics with Application in Diachronic Linguistics
J. Lang. Technol. Comput. Linguistics Pub Date : 2016-07-01 DOI: 10.21248/jlcl.31.2016.201
Christian Pölitz
{"title":"Data Mining Software for Corpus Linguistics with Application in Diachronic Linguistics","authors":"Christian Pölitz","doi":"10.21248/jlcl.31.2016.201","DOIUrl":"https://doi.org/10.21248/jlcl.31.2016.201","url":null,"abstract":"Large digital copora have become a valuable resource for linguistic research. We introduce a software tool to efficiently perform Data Mining tasks for diachronic linguistics to investigate linguistic phenomena with respect to time. As a running example, we show a topic model that extracts different meanings from large digital copora over time.","PeriodicalId":402489,"journal":{"name":"J. Lang. Technol. Comput. Linguistics","volume":"19 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133306849","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
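The diachronic idea in the running example, fitting a topic model per time slice and inspecting how the dominant word contexts shift, can be sketched with scikit-learn. The two tiny "corpora" and the time slices are invented for illustration; the paper's own data mining software is not used here.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Sketch: fit a small topic model per time slice and print the top words per
# topic, illustrating how the contexts of a word ("mouse") shift over time.
slices = {
    "1900-1950": [
        "the mouse ate the grain in the barn",
        "a mouse ran across the farm yard",
    ],
    "1990-2020": [
        "click the mouse to open the file",
        "the wireless mouse connects to the computer",
    ],
}

for period, documents in slices.items():
    vectorizer = CountVectorizer(stop_words="english")
    doc_term = vectorizer.fit_transform(documents)
    lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(doc_term)
    vocabulary = vectorizer.get_feature_names_out()
    for topic_idx, weights in enumerate(lda.components_):
        top = weights.argsort()[::-1][:3]
        print(period, f"topic {topic_idx}:", [vocabulary[i] for i in top])
```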
Construction and Dissemination of a Corpus of Spoken Interaction - Tools and Workflows in the FOLK project
J. Lang. Technol. Comput. Linguistics Pub Date : 2016-07-01 DOI: 10.21248/jlcl.31.2016.205
Thomas C. Schmidt
{"title":"Construction and Dissemination of a Corpus of Spoken Interaction - Tools and Workflows in the FOLK project","authors":"Thomas C. Schmidt","doi":"10.21248/jlcl.31.2016.205","DOIUrl":"https://doi.org/10.21248/jlcl.31.2016.205","url":null,"abstract":"This paper is about the workflow for construction and dissemination of FOLK (Forschungs - und Lehrkorpus Gesprochenes Deutsch – Research and Teaching Corpus of Spoken German), a large corpus of authentic spoken interaction data, recorded on audio and video. Section 2 describes in detail the tools used in the individual steps of transcription, anonymization, orthographic normalization, lemmatization and POS tagging of the data, as well as some utilities used for corpus management. Section 3 deals with the DGD (Datenbank fur Gesprochenes Deutsch - Database of Spoken German) as a tool for distributing completed data sets and making them available for qualitative and quantitative analysis. In section 4, some plans for further development are sketched.","PeriodicalId":402489,"journal":{"name":"J. Lang. Technol. Comput. Linguistics","volume":"4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117140671","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 11