NEAT—Named Entities in Archaeological Texts: A semantic approach to term extraction and classification

IF 0.7 Zone 3 (Literature) 0 HUMANITIES, MULTIDISCIPLINARY

Digital Scholarship in the Humanities, Pub Date: 2023-04-13, DOI: 10.1093/llc/fqad017

Maria Pia di Buono, Gennaro Nolano, J. Monti


Citations: 0

Abstract



The lack of annotated datasets affects the development of Natural Language Processing applications and heavily impacts access to textual data, in particular for specific domains and specific languages. In this paper, we propose a methodology to annotate texts concerning domain-specific knowledge, to provide a reliable source of data for the task of Named Entity Recognition (NER) in the domain of archaeology for the Italian language. This method integrates syntactic and semantic information from several structured sources to annotate entities’ mentions in unstructured texts. Furthermore, we make use of an ontology to label entities with the specific type they refer to. Using a corpus made up of item descriptions from Europeana’s Archaeology Collection, we first test our proposed methodology on a mock dataset composed of 1,000 texts. After several rounds of improvement, we use the final process to create a complete dataset composed of 5,000 descriptions. The resulting dataset, Named Entities in Archaeological Texts (NEAT), has a total of 41,002 spans of text annotated with their domain-specific entity classification according to the CIDOC Conceptual Reference Model.
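To make the idea of annotating entity mentions with ontology classes concrete, here is a minimal sketch of lexicon-based span annotation. It is not the authors' pipeline: the abstract does not specify the structured sources or the matching procedure, so the toy lexicon, the `Span` type, the `annotate` function, and the sample sentence are all hypothetical. Only the class labels (E22, E53, E57) are taken from the CIDOC Conceptual Reference Model.

```python
import re
from typing import NamedTuple


class Span(NamedTuple):
    start: int   # character offset where the mention begins
    end: int     # character offset just past the mention
    text: str    # surface form as it appears in the description
    label: str   # CIDOC CRM class assigned to the mention


# Hypothetical toy lexicon standing in for the "structured sources"
# mentioned in the abstract; the real resources are not named there.
LEXICON = {
    "anfora": "E22 Human-Made Object",
    "terracotta": "E57 Material",
    "Pompei": "E53 Place",
}


def annotate(text: str, lexicon: dict[str, str] = LEXICON) -> list[Span]:
    """Return every whole-word, case-insensitive lexicon match as a labeled span."""
    spans = []
    for term, label in lexicon.items():
        pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
        for match in pattern.finditer(text):
            spans.append(Span(match.start(), match.end(), match.group(0), label))
    # NamedTuples compare field by field, so this orders spans by start offset.
    return sorted(spans)


if __name__ == "__main__":
    description = "Anfora in terracotta rinvenuta a Pompei"
    for span in annotate(description):
        print(span.start, span.end, span.text, span.label)
```

A plain lexicon lookup like this ignores syntax and ambiguity entirely; the method described in the abstract additionally draws on syntactic and semantic information, which this sketch does not attempt to model.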


Source journal

Digital Scholarship in the Humanities Multiple-

CiteScore

1.80

Self-citation rate

25.00%

Articles published

About the journal: DSH, or Digital Scholarship in the Humanities, is an international, peer-reviewed journal that publishes original contributions on all aspects of digital scholarship in the Humanities including, but not limited to, the field currently called the Digital Humanities. Long and short papers report on theoretical, methodological, experimental, and applied research and include results of research projects, descriptions and evaluations of tools, techniques, and methodologies, and reports on work in progress. DSH also publishes reviews of books and resources. Digital Scholarship in the Humanities was previously known as Literary and Linguistic Computing.