Accelerating similarity-based model matching using dual hashing

IF 3.2 · Zone 3, Computer Science · Q3 COMPUTER SCIENCE, SOFTWARE ENGINEERING

Software and Systems Modeling Pub Date : 2024-04-29 DOI:10.1007/s10270-024-01173-1

Xiao He, Yi Liu, Huihong He


Citations: 0

Abstract

Similarity-based model matching is the cornerstone of model versioning. It pairs model elements based on a distance metric (e.g., edit distance). However, calculating the distances between elements is computationally expensive, so a similarity-based matcher typically suffers from performance issues as the model size increases. We observe two main causes of the high computation cost: (1) when matching an element p, the matcher calculates the distance between p and every candidate element q, even when p and q are obviously dissimilar; (2) the matcher always calculates the distance between p and \(q'\), even though q and \(q'\) are very similar and the distance between p and q is already known. This paper proposes a dual-hash-based approach, which employs two entirely different hashing techniques (similarity-preserving hashing and integrity-based hashing) to accelerate similarity-based model matching. With similarity-preserving hashing, our approach quickly filters out dissimilar candidate elements according to their similarity hashes, which are computed using our similarity-preserving hash function; this function maps an element to a 64-bit binary hash. With integrity-based hashing, our approach caches and reuses computed distance values by associating them with the checksums of model elements. We also propose an index structure to facilitate hash-based model matching. Our approach has been implemented and integrated into EMF Compare. We evaluate our approach using open-source Ecore and UML models. The results show that our hash function is effective in preserving the similarity between model elements and that our matching approach reduces time costs by 20–88% while keeping the matching results consistent with those of EMF Compare.
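The two hashing roles described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the hash function, the index structure, or the distance metric. The sketch assumes a SimHash-style 64-bit fingerprint as the similarity-preserving hash, a SHA-256 digest as the checksum, a Jaccard distance over tokens as a stand-in for the real distance metric, and a hypothetical threshold `max_bits`. All names (`similarity_hash`, `CachedMatcher`, and so on) are invented for illustration, and model elements are reduced to lists of string tokens.

```python
import hashlib
from typing import Callable, Dict, List, Optional, Tuple

Element = List[str]  # assumption: an element is represented by its tokens


def similarity_hash(tokens: Element) -> int:
    """SimHash-style 64-bit fingerprint (assumed stand-in for the paper's hash)."""
    counts = [0] * 64
    for tok in tokens:
        digest = hashlib.blake2b(tok.encode("utf-8"), digest_size=8).digest()
        h = int.from_bytes(digest, "big")
        for bit in range(64):
            counts[bit] += 1 if (h >> bit) & 1 else -1
    # A bit is set when more tokens voted 1 than 0 at that position.
    return sum(1 << bit for bit in range(64) if counts[bit] > 0)


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two 64-bit hashes."""
    return bin(a ^ b).count("1")


def checksum(tokens: Element) -> str:
    """Integrity hash: equal content always yields an equal checksum."""
    return hashlib.sha256("\x00".join(tokens).encode("utf-8")).hexdigest()


def jaccard_distance(p: Element, q: Element) -> float:
    """Placeholder for the expensive distance metric (e.g., edit distance)."""
    a, b = set(p), set(q)
    return 1.0 - len(a & b) / len(a | b) if (a | b) else 0.0


class CachedMatcher:
    def __init__(self, distance: Callable[[Element, Element], float],
                 max_bits: int = 16) -> None:
        self.distance = distance
        self.max_bits = max_bits  # hypothetical filter threshold
        self.cache: Dict[Tuple[str, str], float] = {}
        self.distance_calls = 0   # counts real (uncached) computations

    def cached_distance(self, p: Element, q: Element) -> float:
        key = (checksum(p), checksum(q))
        if key not in self.cache:
            self.distance_calls += 1
            self.cache[key] = self.distance(p, q)
        return self.cache[key]

    def best_match(self, p: Element,
                   candidates: List[Element]) -> Optional[Element]:
        hp = similarity_hash(p)
        best, best_d = None, float("inf")
        for q in candidates:
            # Cheap filter: skip candidates whose hashes differ in too many bits.
            if hamming(hp, similarity_hash(q)) > self.max_bits:
                continue
            d = self.cached_distance(p, q)
            if d < best_d:
                best, best_d = q, d
        return best
```

In this sketch, the Hamming filter addresses cause (1) by skipping the distance computation for candidates whose fingerprints are far apart, and the checksum-keyed cache addresses cause (2) only for candidates with identical content. How the paper reuses distances more generally, and how its index structure avoids scanning every candidate, is not stated in the abstract.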




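Source journal

Software and Systems Modeling (Engineering & Technology, Computer Science: Software Engineering)

CiteScore: 6.00

Self-citation rate: 20.00%

Articles published: 104

Review time: >12 weeks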

About the journal: We invite authors to submit papers that discuss and analyze research challenges and experiences pertaining to software and system modeling languages, techniques, tools, practices and other facets. The following are some of the topic areas that are of special interest, but the journal publishes on a wide range of software and systems modeling concerns: Domain-specific models and modeling standards; Model-based testing techniques; Model-based simulation techniques; Formal syntax and semantics of modeling languages such as the UML; Rigorous model-based analysis; Model composition, refinement and transformation; Software Language Engineering; Modeling Languages in Science and Engineering; Language Adaptation and Composition; Metamodeling techniques; Measuring quality of models and languages; Ontological approaches to model engineering; Generating test and code artifacts from models; Model synthesis; Methodology; Model development tool environments; Modeling Cyberphysical Systems; Data intensive modeling; Derivation of explicit models from data; Case studies and experience reports with significant modeling lessons learned; Comparative analyses of modeling languages and techniques; Scientific assessment of modeling practices.