From Tribunal Archive to Digital Research Facility (TRIADO): Exploring ways to make archives accessible and useable

Arnoud Gorter, Rutger van Koert, I. Tames, Edwin Klijn, M. Scherer
DOI: 10.1145/3322905.3322906 (https://doi.org/10.1145/3322905.3322906)
Published in: Proceedings of the 3rd International Conference on Digital Access to Textual Cultural Heritage, 2019-05-08
Citations: 1

Abstract

The TRIADO project (2016-2019) is a cooperation between Netwerk Oorlogsbronnen (coordinator), NIOD Institute for War, Holocaust and Genocide Studies, Huygens ING/KNAW Humanities Cluster and the National Archives of the Netherlands (Nationaal Archief). TRIADO explores technological strategies to transform analogue text-based archival collections into digital data that can be used for research. The first part of the project tries out new techniques to open up collections; the second part is a 'reality check' that explores the research potential of the data created.

Archives, libraries and museums (ALMs) increasingly digitize their analogue historical collections. Yet in 2017 it was estimated that only about one tenth of all heritage collections in Europe had been digitized. A large gap remains between the specific needs of the digital humanities community and the digital 'raw materials' supplied by the ALMs. Text-based historical collections are potentially interesting to a wide range of scientific disciplines, but so far, in the case of the Netherlands, only a few digitized archives are equipped for digital research.

The main aim of TRIADO is to bridge this gap by performing a 'laboratory to reality' check with the most frequently consulted WWII archive in the Netherlands: the Central Archive of Special Jurisdiction (CABR). The CABR, held by the Nationaal Archief (National Archives of the Netherlands), consists of the legal case files of some 300,000 persons accused of collaborating with the German occupier. It contains approximately 4 kilometers of analogue documents (shelf space), ranging from minutes and verdicts to membership cards, forms and summonses. Most documents are typed or hybrid (typed/handwritten).

The experimental pilot project TRIADO focuses on two complementary research questions:
1. Which digital methods are best suited (in terms of quality, efficiency, etc.) to make large corpora of unstructured, imperfect data, based on analogue collections, usable as a research facility?
2. Is it possible to answer specific, mainly quantitative statistical research questions on the basis of the digital data created under 1?

A sample of 13.8 meters from the CABR was digitized to test technologies and perform experiments. In addition, a workflow for mass digitization was devised and a demonstrator was built to showcase the results of the experiments. This paper discusses the main findings of the research done in part 1. It reports on processes for mass digitization, OCR quality and improvement, auto-classification of document types, named entity recognition, date extraction and matching of existing name lists to OCR'd data.
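The abstract names "matching of existing name lists to OCR'd data" as one of the processes studied but does not say how it was done. As a purely illustrative sketch, and not the authors' method, the following Python shows one common way to approach the problem: fuzzy string matching that tolerates OCR character errors. The function names, the 0.85 threshold and the sample text are all hypothetical choices made for this example; only the standard library is used.

```python
import re
from difflib import SequenceMatcher


def normalize(text):
    """Lowercase the text and collapse every run of non-letters into one space."""
    return re.sub(r"[\W\d_]+", " ", text.lower()).strip()


def best_match(name, ocr_text):
    """Slide a window with the name's word count over the OCR text and
    return the (similarity, window) pair with the highest ratio."""
    target = normalize(name)
    words = normalize(ocr_text).split()
    size = len(target.split())
    best = (0.0, "")
    for i in range(max(len(words) - size + 1, 0)):
        window = " ".join(words[i:i + size])
        score = SequenceMatcher(None, target, window).ratio()
        if score > best[0]:
            best = (score, window)
    return best


def match_names(names, ocr_text, threshold=0.85):
    """Return {name: (score, matched_text)} for names scoring at or above
    the (hypothetical) similarity threshold."""
    hits = {}
    for name in names:
        score, window = best_match(name, ocr_text)
        if score >= threshold:
            hits[name] = (round(score, 2), window)
    return hits


if __name__ == "__main__":
    # Invented sample: the OCR engine misread "Vries" as "Vrles".
    sample = "Verdachte: Jan de Vrles, geboren te Utrecht"
    print(match_names(["Jan de Vries", "Pieter Bakker"], sample))
```

A character-level similarity ratio is chosen here because OCR errors typically substitute single characters; on a real corpus the threshold would need tuning against manually checked samples, and a window-by-window scan like this one would likely be too slow without indexing.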