Computer vision and machine learning approaches for metadata enrichment to improve searchability of historical newspaper collections

Impact factor: 2; Zone 3 (Management); Q2, Information Science & Library Science

Journal of Documentation Pub Date : 2023-02-27 DOI:10.1108/jd-01-2022-0029

Dilawar Ali, Kenzo Milleville, S. Verstockt, N. van de Weghe, Sally Chambers, Julie M. Birkholz


Citations: 0

Abstract

Purpose: Historical newspaper collections provide a wealth of information about the past. Although the digitization of these collections significantly improves their accessibility, a large portion of digitized historical newspaper collections, such as those of KBR, the Royal Library of Belgium, are not yet searchable at article-level. However, recent developments in AI-based research methods, such as document layout analysis, have the potential for further enriching the metadata to improve the searchability of these historical newspaper collections. This paper aims to discuss the aforementioned issue.

Design/methodology/approach: In this paper, the authors explore how existing computer vision and machine learning approaches can be used to improve access to digitized historical newspapers. To do this, the authors propose a workflow, using computer vision and machine learning approaches to (1) provide article-level access to digitized historical newspaper collections using document layout analysis, (2) extract specific types of articles (e.g. feuilletons – literary supplements from Le Peuple from 1938), (3) conduct image similarity analysis using (un)supervised classification methods and (4) perform named entity recognition (NER) to link the extracted information to open data.

Findings: The results show that the proposed workflow improves the accessibility and searchability of digitized historical newspapers, and also contributes to the building of corpora for digital humanities research. The AI-based methods enable automatic extraction of feuilletons, clustering of similar images and dynamic linking of related articles.

Originality/value: The proposed workflow enables automatic extraction of articles, including detection of a specific type of article, such as a feuilleton or literary supplement. This is particularly valuable for humanities researchers as it improves the searchability of these collections and enables corpora to be built around specific themes. Article-level access to, and improved searchability of, KBR's digitized newspapers are demonstrated through the online tool (https://tw06v072.ugent.be/kbr/).
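As a rough illustration of what "clustering of similar images" (step 3 of the workflow) can mean in practice, the sketch below groups tiny grayscale images by an average hash and Hamming distance. This is a deliberately simple stand-in technique, not the authors' method, which the abstract does not specify. All names (`average_hash`, `hamming`, `group_similar`, `max_distance`) are hypothetical, and the images are assumed to be already resized to the same small dimensions.

```python
from typing import Dict, List, Tuple

# Assumption: each image is a small grayscale grid (0-255), all the same size.
Image = List[List[int]]


def average_hash(img: Image) -> int:
    """Encode an image as bits: 1 where a pixel is brighter than the mean."""
    pixels = [p for row in img for p in row]
    mean = sum(pixels) / len(pixels)
    bits = 0
    for p in pixels:
        bits = (bits << 1) | (1 if p > mean else 0)
    return bits


def hamming(a: int, b: int) -> int:
    """Count the bit positions in which two hashes differ."""
    return bin(a ^ b).count("1")


def group_similar(images: Dict[str, Image], max_distance: int = 1) -> List[List[str]]:
    """Greedy grouping: an image joins the first group whose representative
    hash lies within max_distance, otherwise it starts a new group."""
    groups: List[Tuple[int, List[str]]] = []
    for name, img in images.items():
        h = average_hash(img)
        for rep, members in groups:
            if hamming(h, rep) <= max_distance:
                members.append(name)
                break
        else:
            groups.append((h, [name]))
    return [members for _, members in groups]


# Toy data: "a" and "b" share a dark-left/bright-right pattern, "c" is mirrored.
a = [[10, 200], [10, 200]]
b = [[12, 190], [15, 210]]
c = [[200, 10], [200, 10]]
groups = group_similar({"a": a, "b": b, "c": c})
```

With the toy data, `a` and `b` end up in one group and `c` in another. A real pipeline would more plausibly compare learned image features, but the grouping step has the same shape: embed each image, then cluster by distance.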


Source journal

Journal of Documentation (Information Science & Library Science)

CiteScore: 4.20

Self-citation rate: 14.30%

Journal description: The scope of the Journal of Documentation is broadly information sciences, encompassing all of the academic and professional disciplines which deal with recorded information. These include, but are certainly not limited to:
■ Information science, librarianship and related disciplines
■ Information and knowledge management
■ Information and knowledge organisation
■ Information seeking and retrieval, and human information behaviour
■ Information and digital literacies