The Impact of Entity Resolution on Observed Social Network Structure
Abby M. Smith
Proceedings of the 3rd International Conference on Statistics: Theory and Applications, published 2021-08-01
DOI: 10.11159/icsta21.136 (https://doi.org/10.11159/icsta21.136)
Extended Abstract
Deduplication, also referred to as "entity resolution", is a common and crucial pre-processing step in the construction of social networks [1]. Citation network studies have indicated that false "splitting" and "lumping" of nodes can have dramatic downstream network impacts, and that choices in deduplication methods are important for network analysis [2], [3]. Traditional deduplication methods compare the attributes (such as name and age) of potential matching pairs to estimate a match probability for each pair. Fellegi and Sunter (1969) [4] introduced an optimal decision threshold: pairs with a matching score above the threshold are declared a match, and pairs below it are considered a non-match. Recent research has used clustering techniques for entity resolution, where each cluster represents a unique underlying entity. Collective clustering techniques, pioneered by Bhattacharya and Getoor (2007) [5], relax unrealistic assumptions made by earlier probabilistic entity resolution techniques and allow matching decisions to depend on each other. In social network datasets, we can also use relational information (e.g., a person's network ties) in deduplication as further evidence for the matching status of a pair. Entity resolution is inherently an imperfect process and is an outcome of existing measurement error, particularly when there is no manually reviewed, "ground-truth" dataset to rely on for parameter tuning in a chosen technique [6]. I focus on two tuning parameters: the match decision threshold (t) in Fellegi-Sunter, and the alpha trade-off parameter between attributional and relational similarity in Bhattacharya-Getoor.
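To make the two tuning parameters concrete, the following is a minimal Python sketch, not the implementation used in this work. It assumes a Fellegi-Sunter style score built from per-field log-likelihood-ratio weights compared against a threshold t, and a combined similarity of the form (1 - alpha) * attribute similarity + alpha * relational similarity. The m/u probabilities, the field names, and the use of Jaccard overlap of neighbor sets as the relational similarity are all illustrative assumptions.

```python
import math

# Hypothetical per-field probabilities (illustrative values, not estimated from data):
# m = P(field agrees | pair is a true match), u = P(field agrees | pair is a non-match).
M_U = {"name": (0.95, 0.01), "age": (0.90, 0.10)}


def fs_weight(agreement):
    """Sum per-field log2 likelihood-ratio weights for one candidate pair.

    `agreement` maps a field name to True (values agree) or False (disagree).
    """
    total = 0.0
    for field, agrees in agreement.items():
        m, u = M_U[field]
        if agrees:
            total += math.log2(m / u)
        else:
            total += math.log2((1 - m) / (1 - u))
    return total


def is_match(agreement, t):
    """Declare a match when the pair's total weight reaches the threshold t."""
    return fs_weight(agreement) >= t


def jaccard(a, b):
    """Jaccard overlap of two neighbor sets; defined as 0.0 when both are empty."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def combined_similarity(attr_sim, neighbors_i, neighbors_j, alpha):
    """Trade off attribute and relational similarity with weight alpha in [0, 1]."""
    rel_sim = jaccard(neighbors_i, neighbors_j)
    return (1 - alpha) * attr_sim + alpha * rel_sim


if __name__ == "__main__":
    pair = {"name": True, "age": False}
    print(fs_weight(pair), is_match(pair, t=5.0))
    print(combined_similarity(0.8, {"a", "b"}, {"b", "c"}, alpha=0.5))
```

With alpha = 0 the combined score reduces to attribute similarity alone, and with alpha = 1 it relies only on shared network ties, which is why the choice of alpha can change which records are merged.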
My work focuses on methods for evaluating entity resolution in a network setting, measuring the sensitivity of entity resolution results to choices in tuning parameters (alpha and t), and the downstream impacts these parameter choices can have on network metrics and topologies such as degree, closeness, and connectivity. I apply the evaluation methods to two real-world ego-centric network studies: (i) Care2Hope, a respondent-driven sample of rural people who use drugs (PWUD) in Appalachian Kentucky [1], and (ii) RADAR, a longitudinal network study of young men in Chicago who have sex with men. I consider evaluation scenarios in both the presence [7] and absence [8] of "ground truth" data. I discuss the implications these findings could have for drug use and HIV policy, and make reporting recommendations for network analysts.
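The kind of sensitivity check described above can be sketched on toy data. The Python example below is an illustration under stated assumptions, not the evaluation procedure applied to Care2Hope or RADAR: the records, edges, and pair scores are invented, and merging every pair whose score reaches t through union-find (transitive closure) is an assumed resolution rule. It reports how node count, edge count, number of connected components, and mean degree change as t varies.

```python
# Toy inputs (all hypothetical): raw record ids, observed ties between records,
# and match scores for candidate duplicate pairs.
RECORDS = ["r1", "r2", "r3", "r4", "r5"]
EDGES = [("r1", "r2"), ("r3", "r4"), ("r4", "r5")]
PAIR_SCORES = {("r2", "r3"): 0.7, ("r1", "r5"): 0.4}


def resolve(records, pair_scores, t):
    """Map each record to an entity label by merging pairs with score >= t."""
    parent = {r: r for r in records}

    def find(x):
        # Follow parent pointers to the root, compressing the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (a, b), score in pair_scores.items():
        if score >= t:
            parent[find(a)] = find(b)
    return {r: find(r) for r in records}


def network_summary(records, edges, pair_scores, t):
    """Collapse the network under threshold t and compute simple metrics."""
    entity = resolve(records, pair_scores, t)
    nodes = set(entity.values())
    # Relabel edges by entity; drop self-loops and duplicate edges created by merging.
    merged = {
        frozenset((entity[a], entity[b]))
        for a, b in edges
        if entity[a] != entity[b]
    }
    adjacency = {n: set() for n in nodes}
    for edge in merged:
        a, b = tuple(edge)
        adjacency[a].add(b)
        adjacency[b].add(a)
    # Count connected components with an iterative depth-first search.
    seen, components = set(), 0
    for start in nodes:
        if start in seen:
            continue
        components += 1
        stack = [start]
        while stack:
            x = stack.pop()
            if x in seen:
                continue
            seen.add(x)
            stack.extend(adjacency[x] - seen)
    return {
        "nodes": len(nodes),
        "edges": len(merged),
        "components": components,
        "mean_degree": 2 * len(merged) / len(nodes),
    }


if __name__ == "__main__":
    for t in (0.9, 0.6, 0.3):
        print(t, network_summary(RECORDS, EDGES, PAIR_SCORES, t))
```

In this toy case, lowering t from 0.9 to 0.6 merges one pair of records and joins two components into one, showing how a single parameter choice can alter observed connectivity.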