Reham I. Abdel Monem, Ehab E. Hassanein, Ali Z. El Qutaany
Temporal record linkage for heterogeneous big data records
Egyptian Informatics Journal, vol. 30, Article 100642. Published 2025-04-14. Journal Article.
DOI: 10.1016/j.eij.2025.100642
URL: https://www.sciencedirect.com/science/article/pii/S1110866525000350
Impact factor: 4.3. JCR: Q1 (Computer Science, Artificial Intelligence).
Citations: 0
Abstract
Temporal Record Linkage (TRL), also called Temporal Entity Matching (TEM), is the process of identifying records or entities that refer to the same real-world object in different states of its lifetime. TRL is a well-known problem in data engineering contexts such as data analysis, data warehousing, data mining, and machine learning, where entities denoting the same real-world object must be identified over time. Unlike traditional record linkage, which treats differences between records of the same entity as contradictions, temporal record linkage treats such differences as normal entity evolution over time. Existing temporal record linkage frameworks, namely No model, Decay, Disprob, Mixed, and Agreement First Dynamic Second (AFDS), achieve high accuracy but at high computational cost. They also require the time dimension to be present in order to detect similar entities that refer to the same real-world object. In this research, we present a framework called Tracking Similar Entities in Heterogeneous Temporal Records (TSE-HTR), which tracks similar entities in heterogeneous, big, low-quality, temporal data regardless of whether the time dimension is present. It introduces data cleansing and state ranking modules that detect anomalies within similar entities, determine the final, accurate set of those entities, and explain the anomalies to users or domain experts in a comprehensible manner, which both increases business intelligence and opens opportunities for improved solutions. It presents to the user the records of the different states of the same real-world object, ranked according to quality measures such as completeness, validity, and accuracy. Performance evaluation of the proposed framework against existing frameworks on large real-world data shows a marked improvement in both effectiveness and efficiency.
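The abstract does not specify how the state ranking module computes its quality measures, so the following is only a minimal sketch of the general idea of ranking the records of one entity by completeness. The function names (`completeness`, `rank_states`), the field list, and the sample records are all hypothetical and are not taken from TSE-HTR; completeness is assumed here to be the fraction of expected fields that hold a non-empty value.

```python
from typing import Any

def completeness(record: dict[str, Any], fields: list[str]) -> float:
    """Fraction of the expected fields that hold a non-empty value (assumed definition)."""
    if not fields:
        return 0.0
    filled = sum(1 for f in fields if record.get(f) not in (None, ""))
    return filled / len(fields)

def rank_states(records: list[dict[str, Any]], fields: list[str]) -> list[dict[str, Any]]:
    """Order the records of one entity from most to least complete."""
    return sorted(records, key=lambda r: completeness(r, fields), reverse=True)

# Hypothetical records describing the same person at different points in time.
FIELDS = ["name", "affiliation", "email"]
states = [
    {"name": "R. Smith", "affiliation": "", "email": None},
    {"name": "Rita Smith", "affiliation": "Univ. A", "email": "rs@example.org"},
    {"name": "Rita Smith", "affiliation": "Univ. B", "email": ""},
]

ranked = rank_states(states, FIELDS)
for r in ranked:
    print(round(completeness(r, FIELDS), 2), r["name"], r["affiliation"])
```

Other measures named in the abstract, such as validity and accuracy, would need domain rules or reference data, so they are left out of this sketch.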
About the journal:
The Egyptian Informatics Journal is published by the Faculty of Computers and Artificial Intelligence, Cairo University. The journal provides a forum for state-of-the-art research and development in computing, including computer science, information technology, information systems, operations research, and decision support. Submissions of innovative, previously unpublished work in the subjects covered by the journal are encouraged, whether from academic, research, or commercial sources.