Automating Transliteration of Cuneiform from Parallel Lines with Sparse Data

2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR) Pub Date : 2017-11-01 DOI:10.1109/ICDAR.2017.106

B. Bogacz, Maximilian Klingmann, H. Mara

{"title":"Automating Transliteration of Cuneiform from Parallel Lines with Sparse Data","authors":"B. Bogacz, Maximilian Klingmann, H. Mara","doi":"10.1109/ICDAR.2017.106","DOIUrl":null,"url":null,"abstract":"Cuneiform tablets appertain to the oldest textual artifacts and are in extent comparable to texts written in Latin or ancient Greek. The Cuneiform Commentaries Project (CPP) from Yale University provides tracings of cuneiform tablets with annotated transliterations and translations. As a part of our work analyzing cuneiform script computationally with 3D-acquisition and word-spotting, we present a first approach for automatized learning of transliterations of cuneiform tablets based on a corpus of parallel lines. These consist of manually drawn cuneiform characters and their transliteration into an alphanumeric code. Since the Cuneiform script is only available as raster-data, we segment lines with a projection profile, extract Histogram of oriented Gradients (HoG) features, detect outliers caused by tablet damage, and align those features with the transliteration. We apply methods from part-of-speech tagging to learn a correspondence between features and transliteration tokens. We evaluate point-wise classification with K-Nearest Neighbors (KNN) and a Support Vector Machine (SVM); sequence classification with a Hidden Markov Model (HMM) and a Structured Support Vector Machine (SVM-HMM). Analyzing our findings, we reach the conclusion that the sparsity of data, inconsistent labeling and the variety of tracing styles do currently not allow for fully automatized transliterations with the presented approach. However, the pursuit of automated learning of transliterations is of great relevance as manual annotation in larger quantities is not viable, given the few experts capable of transcribing cuneiform tablets.","PeriodicalId":433676,"journal":{"name":"2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR)","volume":"34 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2017-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"5","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICDAR.2017.106","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 5

Abstract

Cuneiform tablets appertain to the oldest textual artifacts and are in extent comparable to texts written in Latin or ancient Greek. The Cuneiform Commentaries Project (CPP) from Yale University provides tracings of cuneiform tablets with annotated transliterations and translations. As a part of our work analyzing cuneiform script computationally with 3D-acquisition and word-spotting, we present a first approach for automatized learning of transliterations of cuneiform tablets based on a corpus of parallel lines. These consist of manually drawn cuneiform characters and their transliteration into an alphanumeric code. Since the Cuneiform script is only available as raster-data, we segment lines with a projection profile, extract Histogram of oriented Gradients (HoG) features, detect outliers caused by tablet damage, and align those features with the transliteration. We apply methods from part-of-speech tagging to learn a correspondence between features and transliteration tokens. We evaluate point-wise classification with K-Nearest Neighbors (KNN) and a Support Vector Machine (SVM); sequence classification with a Hidden Markov Model (HMM) and a Structured Support Vector Machine (SVM-HMM). Analyzing our findings, we reach the conclusion that the sparsity of data, inconsistent labeling and the variety of tracing styles do currently not allow for fully automatized transliterations with the presented approach. However, the pursuit of automated learning of transliterations is of great relevance as manual annotation in larger quantities is not viable, given the few experts capable of transcribing cuneiform tablets.

查看原文本刊更多论文

稀疏数据下平行线中楔形文字的自动音译

楔形文字碑属于最古老的文字制品，在一定程度上可与拉丁语或古希腊语的文字相媲美。耶鲁大学的楔形文字评论项目(CPP)提供了带有注释音译和翻译的楔形文字碑文的追踪。作为我们使用3d获取和单词识别技术分析楔形文字的一部分，我们提出了基于平行线语料库的自动学习楔形文字转写的第一种方法。这些文字由手工绘制的楔形文字及其音译成字母数字代码组成。由于楔形文字只能作为栅格数据，我们使用投影轮廓分割线条，提取定向梯度直方图(HoG)特征，检测由平板损坏引起的异常值，并将这些特征与音译对齐。我们应用词性标注的方法来学习特征和音译标记之间的对应关系。我们用k近邻(KNN)和支持向量机(SVM)来评估逐点分类;基于隐马尔可夫模型(HMM)和结构化支持向量机(SVM-HMM)的序列分类。分析我们的发现，我们得出的结论是，数据的稀疏性、不一致的标签和各种各样的跟踪风格目前不允许使用所提出的方法实现完全自动化的音译。然而，追求自动的音译学习是非常重要的，因为大量的人工注释是不可行的，因为有能力抄写楔形文字的专家很少。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文求助全文

来源期刊

2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR)

自引率

0.00%

发文量