The Impact of Entity Resolution on Observed Social Network Structure

Proceedings of the 3rd International Conference on Statistics: Theory and Applications. Pub Date: 2021-08-01. DOI: 10.11159/icsta21.136

Abby M. Smith


Citations: 0

Abstract



Extended Abstract

Deduplication, also referred to as "entity resolution", is a common and crucial pre-processing step in the construction of social networks [1]. Citation network studies have indicated that false "splitting" and "lumping" of nodes can have dramatic downstream network impacts, and that the choice of deduplication method is important for network analysis [2] [3]. Traditional deduplication methods compare the attributes (such as name and age) of potential matching pairs to estimate a match probability for each pair. Fellegi and Sunter (1969) [4] introduced an optimal decision threshold: pairs with a matching score above the threshold are declared a match, and pairs below it are considered a non-match. Recent research has used clustering techniques for entity resolution, where each cluster represents a unique underlying entity. Collective clustering techniques, pioneered by Bhattacharya and Getoor (2007) [5], relax unrealistic assumptions made by earlier probabilistic entity resolution techniques and allow matching decisions to depend on each other. In social network datasets, we can also use relational information (e.g., a person's network ties) in deduplication as further evidence for the matching status of a pair. Entity resolution is inherently an imperfect process and is an outcome of existing measurement error, particularly when there is no manually reviewed "ground-truth" dataset to rely on for parameter tuning in a chosen technique [6]. I focus on two tuning parameters: the match decision threshold (t) in Fellegi-Sunter, and the alpha trade-off parameter between attributional and relational similarity in Bhattacharya-Getoor.
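
To make the two tuning parameters concrete, the following is a minimal Python sketch and not the paper's implementation. It assumes a Fellegi-Sunter style score built from per-field log-likelihood ratios, with hypothetical m- and u-probabilities and field names, and the single cutoff t described above. For the trade-off parameter it assumes a linear blend of attribute and relational similarity weighted by alpha, with Jaccard overlap of neighbor sets standing in for relational similarity. All names and numbers are illustrative assumptions.

```python
import math

# Hypothetical per-field probabilities (assumed for illustration only):
# m = P(field agrees | true match), u = P(field agrees | true non-match).
M_U = {"name": (0.95, 0.05), "age": (0.90, 0.20)}


def fs_score(rec_a, rec_b, m_u=M_U):
    """Fellegi-Sunter style weight: sum of log2 likelihood ratios over fields."""
    score = 0.0
    for field, (m, u) in m_u.items():
        if rec_a[field] == rec_b[field]:
            score += math.log2(m / u)              # agreement raises the score
        else:
            score += math.log2((1 - m) / (1 - u))  # disagreement lowers it
    return score


def is_match(rec_a, rec_b, t):
    """Declare a match when the score reaches the decision threshold t."""
    return fs_score(rec_a, rec_b) >= t


def jaccard(neighbors_a, neighbors_b):
    """Relational similarity stand-in: overlap of the two records' network ties."""
    a, b = set(neighbors_a), set(neighbors_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def combined_similarity(sim_attr, sim_rel, alpha):
    """Blend attribute and relational similarity; alpha=0 ignores network ties."""
    return (1 - alpha) * sim_attr + alpha * sim_rel
```

With these assumed probabilities, two records that agree on name but disagree on age score about 1.25, so they are a match at t = 1 and a non-match at t = 2. The same pair can therefore be lumped or split depending only on the chosen threshold.
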
My work focuses on methods for evaluating entity resolution in a network setting, measuring the sensitivity of entity resolution results to the choice of tuning parameters (alpha and t), and assessing the downstream impacts these choices can have on network metrics and topologies such as degree, closeness, and connectivity. I apply the evaluation methods to two real-world ego-centric network studies: (i) Care2Hope, a respondent-driven sample of rural people who use drugs (PWUD) in Appalachian Kentucky [1], and (ii) RADAR, a longitudinal network study of young men who have sex with men in Chicago. I consider evaluation scenarios in both the presence [7] and absence [8] of "ground truth" data. I discuss the implications these findings could have for drug use and HIV policy, and make reporting recommendations for network analysts.
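
The downstream effect of a resolution decision on network metrics can be illustrated with a small, dependency-free Python sketch. It is an assumption-laden toy example and not the evaluation method of the paper: the record identifiers, the `entity_of` mapping, and the helper names are all hypothetical, and only degree and the number of connected components are computed.

```python
from collections import defaultdict


def resolve_edges(edges, entity_of):
    """Relabel record-level ties by resolved entity, dropping self-loops and duplicates."""
    resolved = set()
    for u, v in edges:
        a, b = entity_of.get(u, u), entity_of.get(v, v)
        if a != b:
            resolved.add(frozenset((a, b)))
    return resolved


def degrees(edges):
    """Degree of every node that appears in at least one tie."""
    deg = defaultdict(int)
    for edge in edges:
        for node in edge:
            deg[node] += 1
    return dict(deg)


def n_components(edges):
    """Number of connected components among nodes with at least one tie."""
    adj = defaultdict(set)
    for edge in edges:
        a, b = tuple(edge)
        adj[a].add(b)
        adj[b].add(a)
    seen, count = set(), 0
    for start in adj:
        if start in seen:
            continue
        count += 1
        stack = [start]
        while stack:  # depth-first traversal of one component
            node = stack.pop()
            if node not in seen:
                seen.add(node)
                stack.extend(adj[node] - seen)
    return count


# Hypothetical record-level ties reported in two separate interviews.
record_edges = [("r1", "r2"), ("r3", "r4")]

# Strict setting: no records merged. Lenient setting: r2 and r3 lumped together.
split_net = resolve_edges(record_edges, {})
lumped_net = resolve_edges(record_edges, {"r2": "e23", "r3": "e23"})
```

In this toy case the strict setting yields two disconnected dyads, while lumping r2 and r3 joins them into a single component in which the merged node has degree 2. Repeating such a comparison over a grid of t or alpha values is one simple way to report how sensitive a network summary is to the resolution step.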
