Augmenting Surveys with Social Media Data: A Probabilistic Framework for LinkedIn Data Linkage

Paulo Matos Serodio, Tarek Al Baghal, Luke Sloan, Shujun Liu, C. Jessop
International Journal of Population Data Science. Published 2024-06-10. DOI: 10.23889/ijpds.v9i4.2433 (https://doi.org/10.23889/ijpds.v9i4.2433)
Abstract

Introduction & Background
LinkedIn, with its extensive global network of over 900 million members across more than 200 countries, presents a unique repository for examining labour market dynamics, professional development, and the impact of social networking on employment opportunities. Despite its potential, LinkedIn's wealth of data on professional trajectories, skills, and labour market outcomes remains largely untapped in survey research due to challenges in data collection.

Objectives & Approach
This paper introduces a novel methodology for integrating LinkedIn data with survey responses using data from the fourteenth wave of the Innovation Panel (IP14) of Understanding Society: The UK Household Longitudinal Study (UKHLS), conducted in 2021. In IP14, we probed the extent of LinkedIn usage among the UK population and assessed users' willingness to link their LinkedIn profiles with their survey responses. Those consenting to link their accounts were asked for specific details (namely their first and last names, employer, and job title) to enable profile identification on LinkedIn. Faced with the unavailability of a unique platform identifier and the cessation of LinkedIn's API, this information was crucial for matching profiles accurately.

We crafted a framework using PhantomBuster for ethical data extraction and a probabilistic string-matching technique to ensure precise linkage between survey responses and LinkedIn profiles. PhantomBuster, a cloud-based tool, efficiently scrapes dynamic content using JavaScript in a headless browser environment, sidestepping IP-related restrictions while adhering to website terms of service. It streamlines the data collection process. Identified profiles were subjected to iterative probabilistic string matching, using respondent-provided metadata alongside supplementary data, to maximize the accuracy of matching the profiles to our survey participants.

Relevance to Digital Footprints
The described method advances digital footprint research in data collection and linkage. It automates the retrieval of vast online data sets; compiles information efficiently in an organized format; saves time and labour by mechanizing monotonous tasks; circumvents platform-imposed IP restrictions; and imposes fewer barriers to entry as it requires less technical skill than other scraping tools like Selenium.

Conclusions & Implications
This approach not only facilitates the precise identification and collection of LinkedIn profile data but also sets a precedent for ethical considerations in web scraping practices. By documenting this methodology, we aim to equip researchers with a scalable and replicable tool for future studies, enriching the analysis of labour market outcomes and the interplay between formal education, informal training, and professional success through the integration of LinkedIn and survey data.
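
The abstract does not spell out the matching procedure, so the following is only an illustrative sketch of how respondent-provided fields (first name, last name, employer, job title) might be compared against candidate profiles. It uses a weighted fuzzy-similarity score built on Python's standard `difflib`, which is a simplification and not the authors' iterative probabilistic method. The field weights, the 0.85 threshold, and all function names are assumptions introduced for illustration.

```python
from difflib import SequenceMatcher

# Hypothetical field weights; the abstract does not say how fields were weighted.
WEIGHTS = {"first_name": 0.3, "last_name": 0.3, "employer": 0.2, "job_title": 0.2}


def normalize(text):
    """Lowercase and collapse whitespace so trivial formatting differences are ignored."""
    return " ".join(text.lower().split())


def field_similarity(a, b):
    """Similarity in [0, 1] between two strings; an empty value on either side scores 0."""
    a, b = normalize(a), normalize(b)
    if not a or not b:
        return 0.0
    return SequenceMatcher(None, a, b).ratio()


def match_score(respondent, profile, weights=WEIGHTS):
    """Weighted average of per-field similarities between a respondent and a profile."""
    total = sum(weights.values())
    weighted = sum(
        w * field_similarity(respondent.get(f, ""), profile.get(f, ""))
        for f, w in weights.items()
    )
    return weighted / total


def best_match(respondent, candidates, threshold=0.85):
    """Return (profile, score) for the top-scoring candidate, or None if below threshold."""
    scored = [(match_score(respondent, c), c) for c in candidates]
    if not scored:
        return None
    score, profile = max(scored, key=lambda pair: pair[0])
    return (profile, score) if score >= threshold else None


# Made-up example records, not data from the study.
respondent = {"first_name": "Jane", "last_name": "Doe",
              "employer": "Acme Ltd", "job_title": "Data Analyst"}
candidates = [
    {"first_name": "Jane", "last_name": "Doe",
     "employer": "ACME Ltd", "job_title": "Data  Analyst"},
    {"first_name": "John", "last_name": "Smith",
     "employer": "Other Co", "job_title": "Chef"},
]
result = best_match(respondent, candidates)
```

Returning `None` below the threshold leaves ambiguous cases unlinked rather than forcing a match; an iterative procedure of the kind the abstract mentions could then revisit those cases with supplementary fields.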