Two Probabilistic Models for Quick Dissimilarity Detection of Big Binary Data

WSEAS Transactions on Mathematics archive. Pub Date: 2021-05-19. DOI: 10.37394/23206.2021.20.25

Adnan A. Y. Mustafa, Safat Kuwait

Abstract

The task of data matching arises frequently in many areas of science. It can become a time-consuming process when the data are matched against a huge database consisting of thousands of possible candidates and the goal is to find the best match. It can be even more time-consuming if the data are big (> 100 MB). One approach to reducing the time complexity of the matching process is to reduce the search space by introducing a pre-matching stage, where very dissimilar data are quickly removed. In this paper we focus on matching big binary data. We present two probabilistic models for the quick dissimilarity detection of big binary data: the Probabilistic Model for Quick Dissimilarity Detection of Binary vectors (PMQDD) and the Inverse-equality Probabilistic Model for Quick Dissimilarity Detection of Binary vectors (IPMQDD). Dissimilarity detection between binary vectors can be accomplished quickly by random element mapping. The detection technique is not a function of data size, and hence dissimilarity detection is performed quickly. We treat binary data as binary vectors, and hence binary data of any size and dimension is treated as a binary vector. PMQDD is based on a binary similarity distance that does not recognize data and its exact inverse as containing the same pattern and hence considers them to be different. However, in some applications data and its inverse are regarded as the same pattern and thus should be identified as being the same; IPMQDD is able to identify such cases, as it is based on a similarity distance that does not treat data and its inverse as dissimilar. We present a comparative analysis of PMQDD and IPMQDD, as well as of their similarity distances. We also present an application of the models to a set of object models that shows the effectiveness and power of these models.
Key-Words: Big data, binary data, binary vector, matching, size invariance, probabilistic model, dissimilarity detection, pattern recognition, model matching

Received: March 27, 2021. Revised: April 30, 2021. Accepted: May 6, 2021. Published: May 19, 2021.
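The idea of detecting dissimilarity by random element mapping can be illustrated with a minimal Python sketch. It is not the paper's PMQDD or IPMQDD formulation: the function names, the sample count, the threshold, and the `min(d, 1 - d)` folding used to treat a vector and its inverse as the same pattern are all assumptions made for illustration. The sketch compares a fixed number of randomly chosen positions, so its cost depends on the number of samples rather than on the length of the vectors.

```python
import random


def sampled_mismatch_fraction(a, b, n_samples=64, seed=None):
    """Estimate the fraction of differing elements between two equal-length
    binary sequences by comparing n_samples randomly chosen positions.

    Hypothetical helper for illustration; not taken from the paper.
    """
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    if len(a) == 0:
        raise ValueError("sequences must be non-empty")
    rng = random.Random(seed)
    mismatches = 0
    for _ in range(n_samples):
        i = rng.randrange(len(a))  # random element position
        if a[i] != b[i]:
            mismatches += 1
    return mismatches / n_samples


def is_probably_dissimilar(a, b, threshold=0.25, n_samples=64,
                           inverse_equal=False, seed=None):
    """Flag a pair as dissimilar when the sampled mismatch fraction exceeds
    threshold (an assumed, illustrative cut-off).

    With inverse_equal=True the estimate d is folded to min(d, 1 - d), so a
    vector and its exact inverse (d = 1) count as the same pattern. This
    folding is an assumption, not the paper's similarity distance.
    """
    d = sampled_mismatch_fraction(a, b, n_samples, seed)
    if inverse_equal:
        d = min(d, 1.0 - d)
    return d > threshold


if __name__ == "__main__":
    vector = [0, 1] * 500
    inverse = [1 - x for x in vector]
    print(is_probably_dissimilar(vector, inverse))
    print(is_probably_dissimilar(vector, inverse, inverse_equal=True))
```

In a pre-matching stage, a check of this kind would be run against each database candidate, and only candidates not flagged as dissimilar would be passed on to the full, more expensive matching step.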
