Similarity Algorithms for Fuzzy Join Computation in Big Data Processing Environment

Journal of Computer Science and Cybernetics, Pub Date: 2022-06-20, DOI: 10.15625/1813-9663/17589

Anh-Cang Phan, Thuong-Cang Phan


Citations: 0

Abstract


SIMILARITY ALGORITHMS FOR FUZZY JOIN COMPUTATION IN BIG DATA PROCESSING ENVIRONMENT

Big data processing is attracting the interest of many researchers, who seek to process large-scale datasets and extract useful information to support decision-making. One of the biggest challenges is querying large datasets, and the problem becomes even more complicated with similarity queries than with exact-match queries. The fuzzy join is a typical operation frequently used in similarity queries and big data analysis. Currently, there is very little research on this issue, which poses a significant barrier to efforts to make query operations on big data more efficient. This study therefore gives an overview of similarity algorithms for fuzzy joins, in which the data in the join key attributes may differ slightly, within a fuzzy threshold. We analyze six similarity algorithms (Hamming, Levenshtein, LCS, Jaccard, Jaro, and Jaro-Winkler) and compare them on three criteria: output enrichment, false positives/negatives, and processing time. The fuzzy join experiments are implemented in Spark, a popular big data processing platform. The algorithms are divided into two groups for evaluation: group 1 (Hamming, Levenshtein, and LCS) and group 2 (Jaccard, Jaro, and Jaro-Winkler). In the former, Levenshtein has an advantage over the other two algorithms in terms of output enrichment, accuracy of the result set (false positives/negatives), and acceptable processing time. In the latter, Jaccard is considered the worst algorithm on all three criteria, while Jaro-Winkler has greater output enrichment and higher accuracy in the result set. This overview of similarity algorithms will help users choose the most suitable algorithm for their problems.
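
To make the six measures named in the abstract concrete, the following is a minimal single-machine sketch in Python based on their textbook definitions. It is not the authors' Spark implementation. Several details are assumptions made for illustration: "LCS" is read as longest common subsequence, Jaccard is computed over character bigrams, Jaro-Winkler uses the conventional prefix scale of 0.1 with a prefix cap of 4, and the hypothetical `fuzzy_join` helper is a naive nested-loop join with a Levenshtein threshold.

```python
from itertools import product


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length strings")
    return sum(x != y for x, y in zip(a, b))


def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,               # delete x
                            curr[j - 1] + 1,           # insert y
                            prev[j - 1] + (x != y)))   # substitute if different
        prev = curr
    return prev[-1]


def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence (assumed meaning of LCS)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def jaccard(a: str, b: str, n: int = 2) -> float:
    """Jaccard similarity over character n-gram sets (tokenization is an assumption)."""
    grams_a = {a[i:i + n] for i in range(len(a) - n + 1)}
    grams_b = {b[i:i + n] for i in range(len(b) - n + 1)}
    union = grams_a | grams_b
    return len(grams_a & grams_b) / len(union) if union else 1.0


def jaro(a: str, b: str) -> float:
    """Jaro similarity in [0, 1]; 1.0 means identical strings."""
    if a == b:
        return 1.0
    if not a or not b:
        return 0.0
    window = max(max(len(a), len(b)) // 2 - 1, 0)
    matched_b = [False] * len(b)
    matches_a = []
    for i, x in enumerate(a):
        # Look for an unused equal character within the match window.
        for j in range(max(0, i - window), min(len(b), i + window + 1)):
            if not matched_b[j] and b[j] == x:
                matched_b[j] = True
                matches_a.append(x)
                break
    m = len(matches_a)
    if m == 0:
        return 0.0
    matches_b = [b[j] for j in range(len(b)) if matched_b[j]]
    # Transpositions: half the matched characters that are out of order.
    t = sum(x != y for x, y in zip(matches_a, matches_b)) / 2
    return (m / len(a) + m / len(b) + (m - t) / m) / 3


def jaro_winkler(a: str, b: str, p: float = 0.1) -> float:
    """Jaro similarity boosted for a shared prefix of up to 4 characters."""
    base = jaro(a, b)
    prefix = 0
    for x, y in zip(a[:4], b[:4]):
        if x != y:
            break
        prefix += 1
    return base + prefix * p * (1 - base)


def fuzzy_join(left, right, max_distance=1):
    """Hypothetical nested-loop fuzzy join: keep pairs within the threshold."""
    return [(l, r) for l, r in product(left, right)
            if levenshtein(l, r) <= max_distance]
```

For example, `fuzzy_join(["spark", "hadoop"], ["sparc", "hadop", "flink"])` pairs `spark` with `sparc` and `hadoop` with `hadop`, since each pair is one edit apart. The nested loop compares every pair of keys, so it only illustrates the semantics; a distributed setting such as the one the abstract describes would need a partitioning or filtering strategy that this sketch does not attempt.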


Source journal

Journal of Computer Science and Cybernetics

Self-citation rate

0.00%

Number of articles published