Title: Understanding Cultural Similarities of Archaeological Sites from Excavation Reports Using Natural Language Processing Technique
Authors: Fumihiro Sakahira, Yuji Yamaguchi, Takao Terano
Journal: Journal of Advanced Computational Intelligence and Intelligent Informatics, pp. 394-403
Published: 2023-05-20
DOI: https://doi.org/10.20965/jaciii.2023.p0394
Citations: 1
Abstract
In this study, we applied natural language processing (NLP) techniques to the texts of excavation reports on buried cultural properties, calculating the degree of similarity between reports in order to identify archaeological sites that closely resemble one another. Specifically, we validated whether the similarity of sentence embeddings of the excavation reports is consistent with an existing classification of the sites. Four archaeological sites classified in existing archaeological research papers were used. For validation, 128 excavation reports from the four sites were used; sentence embeddings were obtained using Doc2Vec. We obtained the following results: 1) When applying NLP to excavation reports to determine the similarities of archaeological sites, merging the texts for each site into a single document before processing was preferable to processing each volume of the excavation report separately. 2) Similarity based on Doc2Vec sentence embeddings of the excavation reports was more consistent with the classification of the characteristics of the archaeological sites than similarity based on term frequency–inverse document frequency (TF-IDF). 3) When targeting a specific period, the sentence embedding computed exclusively from the text of that period is consistent with the classification of the site's characteristics based on the artifacts and structural remains of that period. 4) When a specific period is targeted, the period-exclusive sentence embeddings obtained through the additive compositionality of sentence embeddings can be used to classify the characteristics of archaeological sites based on the artifacts and structural remains of that period. Consequently, NLP-based text similarity can reflect the similarities of archaeological sites. This holds true even for excavation reports that include spelling inconsistencies, optical character recognition (OCR) errors, and garbled words.
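
As a rough illustration of result 1 and of the TF-IDF baseline mentioned in result 2, the sketch below merges report volumes by site and compares the merged documents with cosine similarity. It is not the authors' pipeline: it uses plain TF-IDF rather than Doc2Vec so that it runs with the standard library alone, and the site names, tokens, smoothed IDF formula, and helper functions (`merge_by_site`, `tfidf_vectors`, `cosine`) are all hypothetical choices made for this example.

```python
import math
from collections import Counter, defaultdict


def merge_by_site(reports):
    """Concatenate the token lists of all report volumes belonging to each site."""
    merged = defaultdict(list)
    for site, tokens in reports:
        merged[site].extend(tokens)
    return dict(merged)


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (term -> weight dict) per document."""
    n_docs = len(docs)
    doc_freq = Counter()
    for tokens in docs.values():
        doc_freq.update(set(tokens))  # count each term once per document
    vectors = {}
    for name, tokens in docs.items():
        term_freq = Counter(tokens)
        vectors[name] = {
            # Relative term frequency times a smoothed IDF (an assumed variant).
            term: (count / len(tokens))
            * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in term_freq.items()
        }
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical toy data: (site, tokens of one report volume).
reports = [
    ("site_A", ["pit", "dwelling", "pottery", "jomon"]),
    ("site_A", ["pottery", "stone", "tool"]),
    ("site_B", ["pit", "dwelling", "pottery", "stone"]),
    ("site_C", ["moat", "bronze", "mirror", "burial"]),
]

merged = merge_by_site(reports)   # one document per site
vectors = tfidf_vectors(merged)   # one TF-IDF vector per site

sim_ab = cosine(vectors["site_A"], vectors["site_B"])
sim_ac = cosine(vectors["site_A"], vectors["site_C"])
print(f"A vs B: {sim_ab:.3f}, A vs C: {sim_ac:.3f}")
```

In this toy setup the two sites that share vocabulary score higher than the pair that shares none. Replacing `tfidf_vectors` with dense document embeddings (for example, from a Doc2Vec implementation) would leave the merge-then-compare structure unchanged, although the cosine helper would then need to operate on dense vectors.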