Creating a large-scale diachronic corpus resource: Automated parsing in the Greek papyri (and beyond)

IF 1.9 | Region 3 (Computer Science) | Q3 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Natural Language Engineering | Pub Date: 2023-08-15 | DOI: 10.1017/s1351324923000384

Alek Keersmaekers, Toon van Hal


Citations: 0

Abstract

This paper explores how to syntactically parse Ancient Greek texts automatically and maps ways of fruitfully employing the results of such an automated analysis. Special attention is given to documentary papyrus texts, a large diachronic corpus of non-literary Greek, which presents a unique set of challenges. Using the Stanford Graph-Based Neural Dependency Parser, we show that through careful curation of the parsing data and several manipulation strategies, it is possible to achieve a Labeled Attachment Score of about 0.85 for this corpus. We also explain how the data can be converted back to its original (Ancient Greek Dependency Treebanks) format. We describe the results of several tests we have carried out to improve parsing results, with special attention paid to the impact of the annotation format on parser performance. In addition, we offer a detailed qualitative analysis of the remaining errors, including possible ways to solve them. Moreover, the paper gives an overview of the valorisation possibilities of an automatically annotated corpus of Ancient Greek texts in the fields of linguistics, language education and humanities studies in general. The concluding section critically analyses the remaining difficulties and outlines avenues to further improve the parsing quality and the ensuing practical applications.
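For readers unfamiliar with the metric cited in the abstract, the following is a minimal sketch of how a Labeled Attachment Score (LAS) is conventionally computed: the fraction of tokens whose predicted head and relation label both match the gold annotation. The function name `attachment_scores`, the toy sentence, and the relation labels are illustrative assumptions, not taken from the paper, and the authors' own evaluation setup (for example, how punctuation is treated) may differ.

```python
from typing import List, Tuple

# Each token is represented as (head_index, relation_label).
# A head index of 0 denotes the artificial root node.
Token = Tuple[int, str]


def attachment_scores(gold: List[Token], pred: List[Token]) -> Tuple[float, float]:
    """Return (UAS, LAS) for one aligned gold/predicted token sequence."""
    if len(gold) != len(pred):
        raise ValueError("gold and predicted sequences must have the same length")
    if not gold:
        raise ValueError("cannot score an empty sequence")

    # Unlabeled score: only the head has to match.
    head_hits = sum(g[0] == p[0] for g, p in zip(gold, pred))
    # Labeled score: head and relation label both have to match.
    labeled_hits = sum(g == p for g, p in zip(gold, pred))

    n = len(gold)
    return head_hits / n, labeled_hits / n


# Hypothetical four-token sentence; the labels are placeholders for illustration.
gold = [(2, "SBJ"), (0, "PRED"), (4, "ATR"), (2, "OBJ")]
pred = [(2, "SBJ"), (0, "PRED"), (2, "ATR"), (2, "ADV")]

uas, las = attachment_scores(gold, pred)
print(f"UAS = {uas:.2f}, LAS = {las:.2f}")
```

In this toy case the third token has the wrong head and the fourth has the right head but the wrong label, so three of four heads match while only two of four tokens are fully correct. A score of about 0.85, as reported in the abstract, would mean roughly 85 out of every 100 tokens receive both the correct head and the correct label.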


Source journal

Natural Language Engineering (COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE)

CiteScore

5.90

Self-citation rate

12.00%

Articles published

Review time

>12 weeks

About the journal: Natural Language Engineering meets the needs of professionals and researchers working in all areas of computerised language processing, whether from the perspective of theoretical or descriptive linguistics, lexicology, computer science or engineering. Its aim is to bridge the gap between traditional computational linguistics research and the implementation of practical applications with potential real-world use. As well as publishing research articles on a broad range of topics, from text analysis, machine translation, information retrieval and speech analysis and generation to integrated systems and multimodal interfaces, it also publishes special issues on specific areas and technologies within these topics, an industry watch column and book reviews.