破解破碎的平板电脑上的代码:标准化2D和3D数据集中注释楔形文字的学习挑战

2019 International Conference on Document Analysis and Recognition (ICDAR) Pub Date : 2019-09-01 DOI:10.1109/ICDAR.2019.00032

H. Mara, B. Bogacz

{"title":"破解破碎的平板电脑上的代码:标准化2D和3D数据集中注释楔形文字的学习挑战","authors":"H. Mara, B. Bogacz","doi":"10.1109/ICDAR.2019.00032","DOIUrl":null,"url":null,"abstract":"The number of known cuneiform tablets is assumed to be in the hundreds of thousands. The Hilprecht Archive Online contains 1977 high-resolution 3D scans of tablets. The online cuneiform database CDLI catalogs metadata for more than 100.000 tablets. While both are accessible publicly, large-scale machine learning and pattern recognition on cuneiform tablets remain elusive. The data is only accessible by searching web pages, the tablet identifiers between collections are inconsistent, and the 3D data is unprepared and challenging for automated processing. We pave the way for large-scale analyses of cuneiform tablets by assembling a cross-referenced benchmark dataset of processed cuneiform tablets: (i) frontally aligned 3D tablets with pre-computed high-dimensional surface features, (ii) six-views raster images for off-the-shelf image processing, and (iii) metadata, transcriptions, and transliterations, for a subset of 707 tablets, for learning alignment between 3D data, image and linguistic expression. This is the first dataset of its kind and of its size in cuneiform research. This benchmark dataset is prepared for ease-of-use and immediate availability for computational researches, lowering the barrier to experiment and apply standard methods of analysis, at https://doi.org/10.11588/data/IE8CCN.","PeriodicalId":325437,"journal":{"name":"2019 International Conference on Document Analysis and Recognition (ICDAR)","volume":"49 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2019-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"7","resultStr":"{\"title\":\"Breaking the Code on Broken Tablets: The Learning Challenge for Annotated Cuneiform Script in Normalized 2D and 3D Datasets\",\"authors\":\"H. Mara, B. Bogacz\",\"doi\":\"10.1109/ICDAR.2019.00032\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"The number of known cuneiform tablets is assumed to be in the hundreds of thousands. The Hilprecht Archive Online contains 1977 high-resolution 3D scans of tablets. The online cuneiform database CDLI catalogs metadata for more than 100.000 tablets. While both are accessible publicly, large-scale machine learning and pattern recognition on cuneiform tablets remain elusive. The data is only accessible by searching web pages, the tablet identifiers between collections are inconsistent, and the 3D data is unprepared and challenging for automated processing. We pave the way for large-scale analyses of cuneiform tablets by assembling a cross-referenced benchmark dataset of processed cuneiform tablets: (i) frontally aligned 3D tablets with pre-computed high-dimensional surface features, (ii) six-views raster images for off-the-shelf image processing, and (iii) metadata, transcriptions, and transliterations, for a subset of 707 tablets, for learning alignment between 3D data, image and linguistic expression. This is the first dataset of its kind and of its size in cuneiform research. This benchmark dataset is prepared for ease-of-use and immediate availability for computational researches, lowering the barrier to experiment and apply standard methods of analysis, at https://doi.org/10.11588/data/IE8CCN.\",\"PeriodicalId\":325437,\"journal\":{\"name\":\"2019 International Conference on Document Analysis and Recognition (ICDAR)\",\"volume\":\"49 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2019-09-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"7\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2019 International Conference on Document Analysis and Recognition (ICDAR)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/ICDAR.2019.00032\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2019 International Conference on Document Analysis and Recognition (ICDAR)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICDAR.2019.00032","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 7

摘要

据推测，已知的楔形文字碑的数量有数十万块。希尔普雷希特在线档案包含1977年高分辨率3D扫描的平板电脑。在线楔形文字数据库CDLI收录了超过10万个平板电脑的元数据。虽然两者都可以公开访问，但楔形文字平板电脑的大规模机器学习和模式识别仍然难以捉摸。这些数据只能通过搜索网页来获取，收集之间的平板电脑标识符不一致，3D数据还没有准备好，难以进行自动化处理。我们通过组装一个经过处理的楔形文字平板的交叉参考基准数据集，为大规模分析楔形文字平板铺平了道路:(i)具有预先计算的高维表面特征的正面对齐3D平板，(ii)用于现成图像处理的六视图光栅图像，以及(iii)用于707块平板子集的元数据，转录和转写，用于学习3D数据，图像和语言表达之间的对齐。这是楔形文字研究中第一个这种类型和规模的数据集。这个基准数据集是为易于使用和立即可用的计算研究而准备的，降低了实验和应用标准分析方法的障碍，网址为https://doi.org/10.11588/data/IE8CCN。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

Breaking the Code on Broken Tablets: The Learning Challenge for Annotated Cuneiform Script in Normalized 2D and 3D Datasets

The number of known cuneiform tablets is assumed to be in the hundreds of thousands. The Hilprecht Archive Online contains 1977 high-resolution 3D scans of tablets. The online cuneiform database CDLI catalogs metadata for more than 100.000 tablets. While both are accessible publicly, large-scale machine learning and pattern recognition on cuneiform tablets remain elusive. The data is only accessible by searching web pages, the tablet identifiers between collections are inconsistent, and the 3D data is unprepared and challenging for automated processing. We pave the way for large-scale analyses of cuneiform tablets by assembling a cross-referenced benchmark dataset of processed cuneiform tablets: (i) frontally aligned 3D tablets with pre-computed high-dimensional surface features, (ii) six-views raster images for off-the-shelf image processing, and (iii) metadata, transcriptions, and transliterations, for a subset of 707 tablets, for learning alignment between 3D data, image and linguistic expression. This is the first dataset of its kind and of its size in cuneiform research. This benchmark dataset is prepared for ease-of-use and immediate availability for computational researches, lowering the barrier to experiment and apply standard methods of analysis, at https://doi.org/10.11588/data/IE8CCN.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

2019 International Conference on Document Analysis and Recognition (ICDAR)

自引率

0.00%

发文量