HANA: A handwritten name database for offline handwritten text recognition

IF 2.6, Zone 1 (History), Q1 (ECONOMICS)

Explorations in Economic History, Volume 87, Article 101473. Pub Date: 2023-01-01. DOI: 10.1016/j.eeh.2022.101473

Christian M. Dahl, Torben S.D. Johansen, Emil N. Sørensen, Simon Wittrock


Citations: 0

Abstract



Methods for linking individuals across historical data sets, typically in combination with AI-based transcription models, are developing rapidly. Personal names are perhaps the single most important identifier for linking. However, personal names are prone to enumeration and transcription errors, and although modern linking methods are designed to handle such challenges, these sources of error are critical and should be minimized. For this purpose, improved transcription methods and large-scale databases are crucial components. This paper describes and provides documentation for HANA, a newly constructed large-scale database that consists of more than 3.3 million names. The database contains more than 105 thousand unique names with a total of more than 1.1 million images of personal names, which proves useful for transfer learning to other settings. We provide three examples of this, obtaining significantly improved transcription accuracy on both Danish and US census data. In addition, we present benchmark results for deep learning models that automatically transcribe the personal names from the scanned documents. By making more challenging large-scale databases publicly available, we hope to foster more sophisticated, accurate, and robust models for handwritten text recognition.


Source journal

Explorations in Economic History Multiple-

CiteScore

2.50

Self-citation rate

8.70%

Number of publications

Journal description: Explorations in Economic History provides broad coverage of the application of economic analysis to historical episodes. The journal has a tradition of innovative applications of theory and quantitative techniques, and it explores all aspects of economic change, all historical periods, all geographical locations, and all political and social systems. The journal includes papers by economists, economic historians, demographers, geographers, and sociologists. Explorations in Economic History is the only journal where you will find "Essays in Exploration." This unique department alerts economic historians to the potential in a new area of research, surveying the recent literature and then identifying the most promising issues to pursue.