Data-Driven Syllabification for Middle Dutch

Digital Medievalist. Pub Date: 2019-11-04. DOI: 10.16995/dm.83

Wouter Haverals, Folgert Karsdorp, M. Kestemont


Citations: 2

Abstract


Automatically separating Middle Dutch words into syllables is a challenging task. A first method was presented by Bouma and Hermans (2012), who combined a rule-based finite-state component with data-driven error correction. With an average word accuracy of 96.5%, their system is satisfactory, although it leaves room for improvement. Generally speaking, rule-based methods are less attractive for a medieval language like Middle Dutch, where each dialect has its own spelling preferences and where there is also much idiosyncratic variation among scribes. This paper presents a different method for automatically syllabifying Middle Dutch words, one that does not rely on pre-defined linguistic information. Using a Recurrent Neural Network (RNN) with Long Short-Term Memory (LSTM) cells, we obtain a system that outperforms the rule-based method in both robustness and effort.
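The abstract does not describe the network itself, so the following is only a minimal sketch of how a data-driven syllabifier might frame its input and how a word accuracy figure such as the 96.5% above could be computed. It assumes syllabification is treated as per-character boundary labelling, which is a common framing but not confirmed by this abstract. The function names (`to_labels`, `from_labels`, `word_accuracy`) and the example words and their splits are hypothetical illustrations, not taken from the paper.

```python
from typing import List, Tuple


def to_labels(syllabified: str, sep: str = "-") -> Tuple[str, List[int]]:
    """Convert 'co-ninc' into ('coninc', [0, 1, 0, 0, 0, 0]).

    Assumed encoding: label 1 means a syllable boundary follows
    this character, label 0 means no boundary follows.
    """
    chars: List[str] = []
    labels: List[int] = []
    for ch in syllabified:
        if ch == sep:
            if labels:  # ignore a separator at the very start
                labels[-1] = 1
        else:
            chars.append(ch)
            labels.append(0)
    return "".join(chars), labels


def from_labels(word: str, labels: List[int], sep: str = "-") -> str:
    """Rebuild a syllabified string from a word and its boundary labels."""
    out: List[str] = []
    for ch, lab in zip(word, labels):
        out.append(ch)
        if lab == 1:
            out.append(sep)
    return "".join(out)


def word_accuracy(gold: List[str], predicted: List[str]) -> float:
    """Fraction of words whose entire syllabification matches the gold one."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        return 0.0
    correct = sum(g == p for g, p in zip(gold, predicted))
    return correct / len(gold)


# Illustrative usage with made-up examples (not data from the paper).
word, labels = to_labels("co-ninc")
restored = from_labels(word, labels)
score = word_accuracy(["co-ninc", "rid-der"], ["co-ninc", "ri-dder"])
```

Under this framing, a sequence model such as an LSTM would be trained to predict the label list from the character sequence. Word accuracy is a strict measure, because a single misplaced boundary counts the whole word as wrong.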

