IMPROVEMENT OF CLUSTERING ALGORITHMS BY IMPLEMENTATION OF SPELLING BASED RANKING

IF 0.4 Q4 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS

IADIS-International Journal on Computer Science and Information Systems. Pub Date: 2021-11-01. DOI: 10.33965/ijcsis_2021160204

Eva Bryer, Theppatorn Rhujittawiwat, J. Rose, Colin Wilder



Abstract

The goal of this paper is to modify an existing clustering algorithm with the Hunspell spell checker in order to specialize it for cleaning early modern European book title data. Duplicate and corrupted data are a constant concern in data analysis, and clustering has been identified as a robust tool for normalizing and cleaning data such as ours. Our data set comprises over 5 million books published in European languages between 1500 and 1800, recorded in the Machine-Readable Cataloging (MARC) format by 17,983 libraries in 123 countries. Because each library catalogued its records independently, the data set contains many duplicate and inaccurate records. In addition, each language evolved over the 300-year period under study, so the spellings of many words changed. Without cleaning and normalization, coherent trends would be difficult to find, because a query may miss much of the relevant data. In previous research, we found that Prediction by Partial Matching provided the largest increase in base accuracy when applied to dirty data similar in structure to our data set. However, in many cases the correct book title is not the most common one in its cluster, either because the cluster contains only two values or because the dirty title appears in more records. In these cases, a language-agnostic clustering algorithm would normalize to the incorrect title and lower the overall accuracy of the data set. By integrating the Hunspell spell checker into the clustering algorithm and using it to rank clusters by the number of words not found in its dictionary, we can drastically reduce how often this occurs. Indeed, this ranking algorithm increased the overall accuracy of the clustered data by as much as 25% over the unmodified Prediction by Partial Matching algorithm.
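The ranking idea described in the abstract can be illustrated with a minimal Python sketch. This is one reading of the idea, not the authors' implementation: the small `KNOWN_WORDS` set is a hypothetical stand-in for a Hunspell dictionary, the helper names are invented for illustration, the cluster is assumed to have been formed already (for example by Prediction by Partial Matching), and the tie-break by frequency is an assumption.

```python
from collections import Counter

# Toy word list standing in for a Hunspell dictionary (assumption for illustration).
KNOWN_WORDS = {"the", "history", "of", "world"}


def unknown_word_count(title, known_words=KNOWN_WORDS):
    """Count the words in a title that are absent from the dictionary."""
    return sum(1 for word in title.lower().split() if word not in known_words)


def pick_by_frequency(cluster):
    """Language-agnostic baseline: the most frequent title in the cluster wins."""
    return Counter(cluster).most_common(1)[0][0]


def pick_by_spelling(cluster, known_words=KNOWN_WORDS):
    """Prefer the title with the fewest unknown words; break ties by frequency."""
    counts = Counter(cluster)
    return min(
        counts,
        key=lambda title: (unknown_word_count(title, known_words), -counts[title]),
    )


# A cluster in which the dirty title appears in more records than the clean one.
cluster = [
    "The Histry of the Wrld",
    "The Histry of the Wrld",
    "The History of the World",
]

print(pick_by_frequency(cluster))  # the dirty title, because it is more common
print(pick_by_spelling(cluster))   # the clean title, because it has no unknown words
```

In this toy cluster the frequency-only choice selects the corrupted title, while the spelling-based ranking selects the clean one. This is the failure case the abstract describes for clusters where the dirty title exists in more records.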
