ERABQS: entity resolution based on active machine learning and balancing query strategy

IF 3.4 | Zone 3, Computer Science | Q3 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Journal of Intelligent Information Systems | Pub Date: 2024-03-26 | DOI: 10.1007/s10844-024-00853-0

Jabrane Mourad, Tabbaa Hiba, Rochd Yassir, Hafidi Imad


Citations: 0

Abstract

Entity Resolution (ER) is a crucial process in data management and integration. Its primary goal is to identify different profiles (or records) that refer to the same real-world entity across databases. The main challenge is that labeling a large sample of profiles can be very expensive and time-consuming. Active Machine Learning (ActiveML) addresses this issue by selecting the most representative or informative profile pairs to be labeled. Informativeness is determined by a pair's capacity to reduce the uncertainty of the model. Representativeness, by contrast, evaluates whether a selected instance effectively reflects the overall input patterns of the unlabeled data. Traditional ActiveML techniques typically rely on a single strategy, which may severely restrict the performance of the ActiveML process and lead to slow convergence, especially in ER problems that lack initial training data. In this paper, we overcome this issue by proposing an approach for balancing the two strategies above. The implemented solution, named EBEES (Epsilon-based Balancing Exploration and Exploitation Strategy), has two variants: Adaptive-\(\epsilon\) and \(\epsilon\)-decreasing. We evaluated EBEES on twelve datasets. Compared with state-of-the-art methods, and without any initial training data, EBEES showed improved performance in terms of F1-score, model stability, and rapid convergence.
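The abstract does not spell out the EBEES algorithm, so the following is only a minimal sketch of a generic epsilon-based query strategy in that spirit, not the authors' implementation. It rests on several assumptions: that "exploration" means picking the most representative pair and "exploitation" means picking the most uncertain one, that uncertainty is measured by binary entropy, that representativeness is measured by mean distance to the pool, and that epsilon decays geometrically. All function names and parameters (`select_pair`, `epsilon_decreasing`, `eps0`, `decay`) are hypothetical. The Adaptive-\(\epsilon\) variant is not sketched because the abstract gives no details about it.

```python
import math
import random
from typing import Sequence


def uncertainty(p_match: float) -> float:
    """Binary entropy of a predicted match probability (assumed informativeness score)."""
    if p_match <= 0.0 or p_match >= 1.0:
        return 0.0
    return -(p_match * math.log2(p_match) + (1 - p_match) * math.log2(1 - p_match))


def representativeness(i: int, features: Sequence[Sequence[float]]) -> float:
    """Negative mean Euclidean distance from pair i to the pool (assumed score)."""
    dists = [math.dist(features[i], f) for f in features]
    return -sum(dists) / len(dists)


def epsilon_decreasing(round_idx: int, eps0: float = 1.0, decay: float = 0.9) -> float:
    """Hypothetical schedule: epsilon shrinks geometrically with each labeling round."""
    return eps0 * decay ** round_idx


def select_pair(
    probs: Sequence[float],
    features: Sequence[Sequence[float]],
    epsilon: float,
    rng: random.Random,
) -> int:
    """Return the index of the next unlabeled pair to send to the annotator.

    With probability epsilon, explore (most representative pair);
    otherwise exploit (most uncertain pair).
    """
    candidates = range(len(probs))
    if rng.random() < epsilon:
        return max(candidates, key=lambda i: representativeness(i, features))
    return max(candidates, key=lambda i: uncertainty(probs[i]))


# Toy pool: similarity feature vectors of four candidate pairs and the
# current model's predicted match probabilities for them (made-up values).
features = [[0.0, 0.0], [0.1, 0.1], [0.2, 0.0], [5.0, 5.0]]
probs = [0.95, 0.90, 0.99, 0.50]
rng = random.Random(0)

for round_idx in (0, 10, 50):
    eps = epsilon_decreasing(round_idx)
    chosen = select_pair(probs, features, eps, rng)
    print(f"round={round_idx} epsilon={eps:.3f} chosen_pair={chosen}")
```

With no initial training data the model's probabilities are unreliable, so a high early epsilon favors representative pairs. As labels accumulate and epsilon shrinks, selection shifts toward the pairs the model is least sure about.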




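Source journal

Journal of Intelligent Information Systems (Engineering & Technology, Computer Science: Artificial Intelligence)

CiteScore

7.20

Self-citation rate

11.80%

Number of publications

Review time

6-12 weeks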

About the journal: The mission of the Journal of Intelligent Information Systems: Integrating Artificial Intelligence and Database Technologies is to foster and present research and development results focused on the integration of artificial intelligence and database technologies to create next-generation information systems: Intelligent Information Systems. These new information systems embody knowledge that allows them to exhibit intelligent behavior, cooperate with users and other systems in problem solving, discovery, access, retrieval and manipulation of a wide variety of multimedia data and knowledge, and reason under uncertainty. Increasingly, knowledge-directed inference processes are being used to: discover knowledge from large data collections, provide cooperative support to users in complex query formulation and refinement, access, retrieve, store and manage large collections of multimedia data and knowledge, integrate information from multiple heterogeneous data and knowledge sources, and reason about information under uncertain conditions. Multimedia and hypermedia information systems now operate on a global scale over the Internet, and new tools and techniques are needed to manage these dynamic and evolving information spaces. The Journal of Intelligent Information Systems provides a forum wherein academics, researchers and practitioners may publish high-quality, original and state-of-the-art papers describing theoretical aspects, systems architectures, analysis and design tools and techniques, and implementation experiences in intelligent information systems. The categories of papers published by JIIS include: research papers, invited papers, meetings, workshop and conference announcements and reports, survey and tutorial articles, and book reviews. Short articles describing open problems or their solutions are also welcome.