Two Probabilistic Models for Quick Dissimilarity Detection of Big Binary Data
Adnan A. Y. Mustafa, Safat, Kuwait
WSEAS Transactions on Mathematics archive
DOI: 10.37394/23206.2021.20.25 (https://doi.org/10.37394/23206.2021.20.25)
Abstract
The task of data matching arises frequently in many areas of science. It becomes time-consuming when the data must be matched against a huge database of thousands of candidates with the goal of finding the best match, and more so when the data are big (> 100 MB). One approach to reducing the time complexity of the matching process is to shrink the search space with a pre-matching stage in which very dissimilar data are quickly removed. In this paper we focus on matching big binary data and present two probabilistic models for the quick dissimilarity detection of such data: the Probabilistic Model for Quick Dissimilarity Detection of Binary vectors (PMQDD) and the Inverse-equality Probabilistic Model for Quick Dissimilarity Detection of Binary vectors (IPMQDD). Dissimilarity between binary vectors is detected by random element mapping. Because the detection technique is not a function of data size, detection is fast even for big data. We treat binary data as binary vectors, so binary data of any size and dimension can be handled. PMQDD is based on a binary similarity distance that does not recognize data and its exact inverse as containing the same pattern, and hence considers them to be different. In some applications, however, data and its inverse are regarded as the same pattern and should be identified as such; IPMQDD handles these cases, as it is based on a similarity distance that treats data and its inverse as similar. We present a comparative analysis of PMQDD and IPMQDD, as well as of their similarity distances, and an application of the models to a set of object models that shows their effectiveness and power.
Key-Words: Big data, binary data, binary vector, matching, size invariance, probabilistic model, dissimilarity detection, pattern recognition, model matching
Received: March 27, 2021. Revised: April 30, 2021. Accepted: May 6, 2021. Published: May 19, 2021.
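The abstract does not give the PMQDD or IPMQDD formulas, so the following is only a minimal sketch of the general idea of random element sampling, not the authors' models. It assumes a plain mismatch fraction over randomly chosen positions as the distance, and an inverse-aware variant that takes min(d, 1 - d) so that a vector and its exact inverse score as identical. The function names, the sample count, and both distance definitions are hypothetical choices made for illustration.

```python
import random


def sampled_mismatch_fraction(a, b, n_samples=64, seed=None):
    """Estimate the fraction of differing positions by sampling random indices.

    Assumption: a and b are equal-length sequences of 0/1 values.
    The work done depends on n_samples, not on len(a).
    """
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    if len(a) == 0:
        raise ValueError("vectors must be non-empty")
    rng = random.Random(seed)
    n = len(a)
    mismatches = 0
    for _ in range(n_samples):
        i = rng.randrange(n)  # random element position (with replacement)
        if a[i] != b[i]:
            mismatches += 1
    return mismatches / n_samples


def inverse_aware_fraction(a, b, n_samples=64, seed=None):
    """Hypothetical inverse-aware variant: a vector and its inverse score 0."""
    d = sampled_mismatch_fraction(a, b, n_samples, seed)
    return min(d, 1.0 - d)


if __name__ == "__main__":
    vec = [0, 1] * 500
    inverse = [1 - bit for bit in vec]
    print(sampled_mismatch_fraction(vec, vec))      # identical vectors
    print(sampled_mismatch_fraction(vec, inverse))  # every sampled position differs
    print(inverse_aware_fraction(vec, inverse))     # inverse treated as the same pattern
```

In a pre-matching stage, a candidate whose sampled score exceeds a chosen threshold would be discarded before any full comparison; the threshold and the number of samples trade speed against the risk of discarding a true match.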