Understanding Cultural Similarities of Archaeological Sites from Excavation Reports Using Natural Language Processing Technique

IF 0.8 Q4 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Journal of Advanced Computational Intelligence and Intelligent Informatics, pp. 394-403. Pub Date: 2023-05-20. DOI: 10.20965/jaciii.2023.p0394

Fumihiro Sakahira, Yuji Yamaguchi, Takao Terano


Citations: 1

Abstract

In this study, we applied natural language processing (NLP) techniques to the texts of excavation reports on buried cultural properties, calculating the degree of similarity between reports in order to identify archaeological sites that are highly similar to one another. Specifically, we validated whether the similarity of sentence embeddings of the excavation reports is consistent with an existing classification of the sites. Four archaeological sites classified in existing archaeological research papers were used. For validation, 128 excavation reports from the four sites were used; sentence embeddings were obtained using Doc2Vec. We obtained the following results: 1) When applying NLP to excavation reports to determine the similarities of archaeological sites, merging the texts for each site into a single document before processing was preferable to processing each volume of the excavation report separately. 2) The similarity based on Doc2Vec sentence embeddings of the excavation reports was more consistent with the classification of the characteristics of the archaeological sites than the similarity based on term frequency–inverse document frequency (TF-IDF). 3) When targeting a specific period, the sentence embedding computed exclusively from the text on that period is consistent with the classification of the characteristics of the archaeological sites based on the artifacts and structural remains of that period. 4) When a specific period is targeted, the period-specific sentence embeddings obtained through the additive compositionality of sentence embeddings can be used to classify the characteristics of archaeological sites based on the artifacts and structural remains of that period. Consequently, the similarities of texts based on NLP can reflect the similarities of archaeological sites. This holds true even for excavation reports that include spelling inconsistencies, optical character reader misrecognition, and garbled words.
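
To make results 1) and 2) more concrete, the following is a minimal sketch of the TF-IDF baseline with per-site merging and cosine similarity. It is not the authors' implementation: the site names, tokens, and the helper functions `merge_by_site`, `tfidf_vectors`, and `cosine` are hypothetical, and the smoothed IDF formula is one common variant chosen here as an assumption. In the Doc2Vec setting described in the abstract, the sparse TF-IDF vectors would be replaced by learned dense embeddings, with the same cosine comparison applied afterwards.

```python
import math
from collections import Counter


def merge_by_site(reports):
    """Concatenate the tokens of all report volumes belonging to one site."""
    merged = {}
    for site, tokens in reports:
        merged.setdefault(site, []).extend(tokens)
    return merged


def tfidf_vectors(docs):
    """Return {name: {term: weight}} using a smoothed IDF (an assumed variant)."""
    n_docs = len(docs)
    doc_freq = Counter()
    for tokens in docs.values():
        doc_freq.update(set(tokens))  # count each term once per document
    vectors = {}
    for name, tokens in docs.items():
        counts = Counter(tokens)
        vectors[name] = {
            term: (count / len(tokens))
            * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0)
            for term, count in counts.items()
        }
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical toy data: (site, tokens of one report volume).
reports = [
    ("site_A", ["pit", "dwelling", "pottery", "moat"]),
    ("site_A", ["moat", "pottery", "bronze"]),
    ("site_B", ["moat", "pottery", "dwelling", "bronze"]),
    ("site_C", ["tomb", "mound", "haniwa", "tomb"]),
]

merged = merge_by_site(reports)   # one document per site, as in result 1)
vectors = tfidf_vectors(merged)
sim_ab = cosine(vectors["site_A"], vectors["site_B"])
sim_ac = cosine(vectors["site_A"], vectors["site_C"])
```

In this toy data, `site_A` and `site_B` share vocabulary while `site_A` and `site_C` share none, so `sim_ab` comes out larger than `sim_ac`. Real excavation reports would first need tokenization (for Japanese text, morphological analysis), which is omitted here.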


Source journal

Journal of Advanced Computational Intelligence and Intelligent Informatics (COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE)

CiteScore

1.50

Self-citation rate

14.30%

Number of articles published

Journal description: JACIII focuses on advanced computational intelligence and intelligent informatics. The topics include, but are not limited to: fuzzy logic, fuzzy control, neural networks, GA and evolutionary computation, hybrid systems, adaptation and learning systems, distributed intelligent systems, network systems, multimedia, human interface, biologically inspired evolutionary systems, artificial life, chaos, complex systems, fractals, robotics, medical applications, pattern recognition, virtual reality, wavelet analysis, scientific applications, industrial applications, and artistic applications.