TREATS: Fairness-aware entity resolution over streaming data

IF 3.4 | Zone 2, Computer Science | JCR Q2, COMPUTER SCIENCE, INFORMATION SYSTEMS

Information Systems Pub Date : 2024-12-12 DOI:10.1016/j.is.2024.102506

Tiago Brasileiro Araújo, Vasilis Efthymiou, Vassilis Christophides, Evaggelia Pitoura, Kostas Stefanidis

Information Systems, volume 129, Article 102506. Journal Article. https://www.sciencedirect.com/science/article/pii/S0306437924001649

Citations: 0

Abstract

Information systems continuously generate large volumes of data from a variety of sources, such as web platforms, social networks, and multiple devices. These data often lack a defined schema and must be consolidated and cleansed before analysis and knowledge extraction can occur. In this context, Entity Resolution (ER) plays a crucial role: it facilitates the integration of knowledge bases and identifies similarities among entities from different sources. However, the traditional ER process is computationally expensive, and it becomes more complicated in a streaming context, where data arrive continuously. Moreover, few studies address fairness, understood as the absence of discrimination or bias, in ER. Fairness criteria aim to mitigate the effects of data bias in ER systems, which requires more than optimizing accuracy, as is traditionally done. This work presents TREATS, a schema-agnostic, fairness-aware ER workflow for managing streaming data incrementally. The proposed framework tackles constraints across various groups of interest, offering a resilient and equitable solution to the related challenges. In an experimental evaluation, the proposed techniques and heuristics are compared against state-of-the-art approaches on five pairs of real-world data sources; the results show significant improvements in fairness without degrading effectiveness or efficiency in the streaming environment. In summary, our contributions aim to advance the ER field by providing a workflow that addresses both technical challenges and ethical concerns.
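
To make the terms "schema-agnostic" and "incremental" concrete, the following is a minimal sketch and not the TREATS algorithm itself, whose details the abstract does not give. It assumes plain token blocking (records sharing any value token become candidates, attribute names are ignored) and a simple alternating selection between a protected and a non-protected group under a comparison budget. All names (`tokens`, `IncrementalTokenBlocker`, `fair_top_k`, `is_protected`) are hypothetical.

```python
from collections import defaultdict
from itertools import zip_longest


def tokens(record):
    """Schema-agnostic tokenization: ignore attribute names, use only values."""
    return {tok for value in record.values() for tok in str(value).lower().split()}


class IncrementalTokenBlocker:
    """Token index that is updated one record at a time as the stream arrives."""

    def __init__(self):
        self.blocks = defaultdict(set)  # token -> ids of records seen so far

    def add(self, rec_id, record):
        """Index one arriving record; return (new_id, old_id, shared_tokens) candidates."""
        shared = defaultdict(int)
        for tok in tokens(record):
            for other in self.blocks[tok]:
                shared[other] += 1
            self.blocks[tok].add(rec_id)
        return [(rec_id, other, n) for other, n in shared.items()]


def fair_top_k(candidates, is_protected, k):
    """Assumed heuristic: pick up to k candidates, alternating between groups.

    Each group is ranked by shared-token count, so a protected-group candidate
    is not crowded out by higher-scoring candidates from the other group.
    """
    ranked = sorted(candidates, key=lambda c: (-c[2], c[1]))
    protected = [c for c in ranked if is_protected(c)]
    others = [c for c in ranked if not is_protected(c)]
    picked = []
    for pair in zip_longest(protected, others):
        for cand in pair:
            if cand is not None and len(picked) < k:
                picked.append(cand)
    return picked
```

A short usage example: the first two records use different attribute names but still become candidates, because only the values are tokenized.

```python
blocker = IncrementalTokenBlocker()
blocker.add("r1", {"name": "Ana Silva", "city": "Tampere"})
print(blocker.add("r2", {"title": "ana silva", "location": "Helsinki"}))
# [('r2', 'r1', 2)]
```

With a budget of `k`, a plain top-k by score can exclude every protected-group candidate; the alternating selection above is one simple way to avoid that, at the cost of sometimes choosing a lower-scoring pair.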


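Source journal

Information Systems (Engineering & Technology, Computer Science: Information Systems)

CiteScore: 9.40

Self-citation rate: 2.70%

Articles published: 112

Review time: 53 days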

Journal description: Information systems are the software and hardware systems that support data-intensive applications. The journal Information Systems publishes articles concerning the design and implementation of languages, data models, process models, algorithms, software and hardware for information systems. Subject areas include data management issues as presented in the principal international database conferences (e.g., ACM SIGMOD/PODS, VLDB, ICDE and ICDT/EDBT) as well as data-related issues from the fields of data mining/machine learning, information retrieval coordinated with structured data, internet and cloud data management, business process management, web semantics, visual and audio information systems, scientific computing, and data science. Implementation papers having to do with massively parallel data management, fault tolerance in practice, and special purpose hardware for data-intensive systems are also welcome. Manuscripts from application domains, such as urban informatics, social and natural science, and Internet of Things, are also welcome. All papers should highlight innovative solutions to data management problems, such as new data models or performance enhancements, and show how those innovations contribute to the goals of the application.