Temporal record linkage for heterogeneous big data records

IF 4.3 | CAS Zone 3, Computer Science | JCR Q1, Computer Science, Artificial Intelligence
Reham I. Abdel Monem, Ehab E. Hassanein, Ali Z. El Qutaany
DOI: 10.1016/j.eij.2025.100642
Journal: Egyptian Informatics Journal, volume 30, Article 100642
Publication date: 2025-04-14
Publication type: Journal Article
URL: https://www.sciencedirect.com/science/article/pii/S1110866525000350
Citations: 0

Abstract

Temporal Record Linkage (TRL), also called Temporal Entity Matching (TEM), is the process of identifying records or entities that refer to the same real-world object in different lifetime states. TRL is a well-known problem in data engineering contexts such as data analysis, data warehousing, data mining, and machine learning, where entities denoting the same real-world object must be identified over time. Unlike traditional record linkage, which treats differences between records of the same entity as contradictions, temporal record linkage treats such differences as normal entity evolution over time. Existing frameworks for temporal record linkage, namely No model, Decay, Disprob, Mixed, and Agreement First Dynamic Second (AFDS), achieve high accuracy but at high computational cost. They also require the time dimension to be present in order to detect similar entities that refer to the same real-world object. In this research, we present a framework called Tracking Similar Entities in Heterogeneous Temporal Records (TSE-HTR) that tracks similar entities in heterogeneous, big, low-quality, temporal data regardless of whether the time dimension is present. It introduces data cleansing and state ranking modules that detect anomalies within similar entities, determine the final, accurate set of those entities, and explain the anomalies to users or domain experts in a comprehensible manner, which both increases business intelligence and opens opportunities for improved solutions. It presents to the user the records of the different states of the same real-world object, ranked by quality measures such as completeness, validity, and accuracy. Performance evaluation of the proposed framework against existing frameworks on large real-world data shows a substantial improvement in both effectiveness and efficiency.
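
The abstract mentions ranking the records of different states of the same object by quality measures such as completeness. The sketch below illustrates that idea only. It is not the TSE-HTR implementation: the record fields, the sample values, and the function names `completeness` and `rank_states` are hypothetical, and completeness is assumed here to mean the fraction of fields holding a non-empty value.

```python
from typing import Optional

# Hypothetical record type: field name -> value (None or "" means missing).
Record = dict[str, Optional[str]]


def completeness(record: Record) -> float:
    """Fraction of fields that hold a non-empty value (assumed definition)."""
    if not record:
        return 0.0
    filled = sum(1 for v in record.values() if v is not None and v.strip() != "")
    return filled / len(record)


def rank_states(records: list[Record]) -> list[Record]:
    """Order records from most to least complete; ties keep input order."""
    return sorted(records, key=completeness, reverse=True)


# Illustrative states of one real-world person; values are made up.
states = [
    {"name": "R. Smith", "affiliation": None, "city": ""},
    {"name": "Rebecca Smith", "affiliation": "Univ. A", "city": "Cairo"},
    {"name": "Rebecca Smith", "affiliation": "Univ. B", "city": None},
]
ranked = rank_states(states)
for record in ranked:
    print(round(completeness(record), 2), record)
```

A fuller ranking would presumably combine several measures (the abstract also names validity and accuracy); how the paper weights them is not stated here, so this sketch uses completeness alone.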
Source journal
Egyptian Informatics Journal (Decision Sciences: Management Science and Operations Research)
CiteScore: 11.10
Self-citation rate: 1.90%
Articles published: 59
Review time: 110 days
Journal description: The Egyptian Informatics Journal is published by the Faculty of Computers and Artificial Intelligence, Cairo University. The journal provides a forum for state-of-the-art research and development in the fields of computing, including computer science, information technology, information systems, operations research, and decision support. It encourages submissions of innovative, previously unpublished work in these subjects, whether from academic, research, or commercial sources.