欧洲档案门户中的跨语言主题检测和信息检索

IF 2.1 3区 计算机科学 Q3 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS
Marta Musso, Kerstin Arnold, Federico Nanni, Beatrice Cannelli
{"title":"欧洲档案门户中的跨语言主题检测和信息检索","authors":"Marta Musso, Kerstin Arnold, Federico Nanni, Beatrice Cannelli","doi":"10.1145/3494572","DOIUrl":null,"url":null,"abstract":"<p>Archives Portal Europe (APE, www.archivesportaleurope.net) is the portal of European archives, an aggregator that connects on a single research point the catalogues and digitised archival material of all archives in and about Europe. It currently hosts material from more than 30 countries, and from a variety of archival institutions (such as State archives, city archives, university and parish archives, private institutions, and more). It is maintained by the Archives Portal Europe Foundation, an international consortium of State archives and other archival institutions that aim to connect the archival material of single institutions into one digital repository, in order to allow universal access to the archival heritage of Europe, promoting new forms of archival research beyond national or local boundaries. One of the research tools made available by Archives Portal Europe is by topics; however, these are currently maintained manually by the archivists, and the vast amount of archival material ingested in the portal makes it impossible to have a comprehensive body of topics that describe the whole of the APE repository. Archives are traditionally not organised by their subject content, but around the entity (person, organization, body) that created and/or collected the documents in the course of their activities. While this is an undisputed pillar of archival management, the availability of online digital repositories for archival research requires new tools for digital archival research, particularly when different archival traditions from different countries and different types of institutions are merged into a unique research portal. Topic detection becomes a fundamental tool to guide archival research and to allow archives to be accessible to potentially world-wide users, in a situation where national and linguistics barriers blur, or are re-defined. This paper presents the preliminary results and plan for future iterations of an AI tool for automated topic detection in a multi- lingual environment, where human-created taxonomies act as bases for the algorithms to aggregate relevant material around a specific topic. The development is based on supervised machine learning, with a combination of human inputs in different languages, and of the usage of Wikipedia pages to model the relevant vocabulary and entities.</p>","PeriodicalId":54310,"journal":{"name":"ACM Journal on Computing and Cultural Heritage","volume":"261 1","pages":""},"PeriodicalIF":2.1000,"publicationDate":"2024-01-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"What's in a ? Cross-Lingual Topic Detection & Information Retrieval in Archives Portal Europe\",\"authors\":\"Marta Musso, Kerstin Arnold, Federico Nanni, Beatrice Cannelli\",\"doi\":\"10.1145/3494572\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p>Archives Portal Europe (APE, www.archivesportaleurope.net) is the portal of European archives, an aggregator that connects on a single research point the catalogues and digitised archival material of all archives in and about Europe. It currently hosts material from more than 30 countries, and from a variety of archival institutions (such as State archives, city archives, university and parish archives, private institutions, and more). It is maintained by the Archives Portal Europe Foundation, an international consortium of State archives and other archival institutions that aim to connect the archival material of single institutions into one digital repository, in order to allow universal access to the archival heritage of Europe, promoting new forms of archival research beyond national or local boundaries. One of the research tools made available by Archives Portal Europe is by topics; however, these are currently maintained manually by the archivists, and the vast amount of archival material ingested in the portal makes it impossible to have a comprehensive body of topics that describe the whole of the APE repository. Archives are traditionally not organised by their subject content, but around the entity (person, organization, body) that created and/or collected the documents in the course of their activities. While this is an undisputed pillar of archival management, the availability of online digital repositories for archival research requires new tools for digital archival research, particularly when different archival traditions from different countries and different types of institutions are merged into a unique research portal. Topic detection becomes a fundamental tool to guide archival research and to allow archives to be accessible to potentially world-wide users, in a situation where national and linguistics barriers blur, or are re-defined. This paper presents the preliminary results and plan for future iterations of an AI tool for automated topic detection in a multi- lingual environment, where human-created taxonomies act as bases for the algorithms to aggregate relevant material around a specific topic. The development is based on supervised machine learning, with a combination of human inputs in different languages, and of the usage of Wikipedia pages to model the relevant vocabulary and entities.</p>\",\"PeriodicalId\":54310,\"journal\":{\"name\":\"ACM Journal on Computing and Cultural Heritage\",\"volume\":\"261 1\",\"pages\":\"\"},\"PeriodicalIF\":2.1000,\"publicationDate\":\"2024-01-17\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"ACM Journal on Computing and Cultural Heritage\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://doi.org/10.1145/3494572\",\"RegionNum\":3,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q3\",\"JCRName\":\"COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"ACM Journal on Computing and Cultural Heritage","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.1145/3494572","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS","Score":null,"Total":0}
引用次数: 0

摘要

欧洲档案馆门户网站(APE,www.archivesportaleurope.net)是欧洲档案馆的门户网站,是一个将欧洲及欧洲周边所有档案馆的目录和数字化档案资料连接在一个研究点上的聚合器。它目前收录了来自 30 多个国家和各种档案机构(如国家档案馆、城市档案馆、大学和教区档案馆、私人机构等)的资料。该网站由欧洲档案门户基金会维护,该基金会是一个由国家档案馆和其他档案机构组成的国际联盟,旨在将单个机构的档案资料连接到一个数字存储库中,以实现对欧洲档案遗产的普遍访问,促进超越国家或地方界限的新型档案研究。欧洲档案门户网站(Archives Portal Europe)提供的研究工具之一是按主题分类;然而,这些主题目前是由档案管理员手工维护的,而且门户网站收录的档案资料数量庞大,不可能有一套全面的主题来描述整个 APE 存储库。传统上,档案不是按其主题内容组织的,而是围绕在其活动过程中创建和/或收集文件的实体(个人、组织、机构)组织的。虽然这是档案管理无可争议的支柱,但在线数字档案库用于档案研究的可用性需要新的数字档案研究工具,特别是当来自不同国家和不同类型机构的不同档案传统被合并到一个独特的研究门户网站时。在国家和语言障碍变得模糊或被重新定义的情况下,主题检测成为指导档案研究的基本工具,并使档案能够被潜在的全球用户访问。本文介绍了在多语言环境中自动检测主题的人工智能工具的初步成果和未来迭代计划,其中人类创建的分类法是算法围绕特定主题汇总相关材料的基础。该工具的开发以监督机器学习为基础,结合不同语言的人类输入以及维基百科页面的使用,为相关词汇和实体建模。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
What's in a ? Cross-Lingual Topic Detection & Information Retrieval in Archives Portal Europe

Archives Portal Europe (APE, www.archivesportaleurope.net) is the portal of European archives, an aggregator that connects on a single research point the catalogues and digitised archival material of all archives in and about Europe. It currently hosts material from more than 30 countries, and from a variety of archival institutions (such as State archives, city archives, university and parish archives, private institutions, and more). It is maintained by the Archives Portal Europe Foundation, an international consortium of State archives and other archival institutions that aim to connect the archival material of single institutions into one digital repository, in order to allow universal access to the archival heritage of Europe, promoting new forms of archival research beyond national or local boundaries. One of the research tools made available by Archives Portal Europe is by topics; however, these are currently maintained manually by the archivists, and the vast amount of archival material ingested in the portal makes it impossible to have a comprehensive body of topics that describe the whole of the APE repository. Archives are traditionally not organised by their subject content, but around the entity (person, organization, body) that created and/or collected the documents in the course of their activities. While this is an undisputed pillar of archival management, the availability of online digital repositories for archival research requires new tools for digital archival research, particularly when different archival traditions from different countries and different types of institutions are merged into a unique research portal. Topic detection becomes a fundamental tool to guide archival research and to allow archives to be accessible to potentially world-wide users, in a situation where national and linguistics barriers blur, or are re-defined. This paper presents the preliminary results and plan for future iterations of an AI tool for automated topic detection in a multi- lingual environment, where human-created taxonomies act as bases for the algorithms to aggregate relevant material around a specific topic. The development is based on supervised machine learning, with a combination of human inputs in different languages, and of the usage of Wikipedia pages to model the relevant vocabulary and entities.

求助全文
通过发布文献求助,成功后即可免费获取论文全文。 去求助
来源期刊
ACM Journal on Computing and Cultural Heritage
ACM Journal on Computing and Cultural Heritage Arts and Humanities-Conservation
CiteScore
4.60
自引率
8.30%
发文量
90
期刊介绍: ACM Journal on Computing and Cultural Heritage (JOCCH) publishes papers of significant and lasting value in all areas relating to the use of information and communication technologies (ICT) in support of Cultural Heritage. The journal encourages the submission of manuscripts that demonstrate innovative use of technology for the discovery, analysis, interpretation and presentation of cultural material, as well as manuscripts that illustrate applications in the Cultural Heritage sector that challenge the computational technologies and suggest new research opportunities in computer science.
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术官方微信