A Re-Identification Strategy Using Machine Learning that Exploits Better Side Data

Eina Hashimoto, Masatsugu Ichino, H. Yoshiura
2019 IEEE 10th International Conference on Awareness Science and Technology (iCAST), 2019-10-01
DOI: 10.1109/ICAwST.2019.8923378 (https://doi.org/10.1109/ICAwST.2019.8923378)
Citations: 1

Abstract

Data on people’s daily activities are being collected as big data and then mined for corporate and public purposes. However, concern about privacy is the major obstacle to using such data. Although data anonymisation can mitigate the privacy risk, researchers have shown that people can often be re-identified by linking the anonymised data with other data (i.e. side data). Though re-identification has been improved by the use of machine learning, two problems remain: the unavailability of data and the inappropriateness of the side data. We have developed a re-identification strategy using machine learning in which anonymised data are linked to a different type of side data that are easy to obtain and for which the represented person is easy to identify. We tested our strategy on the re-identification of the owners of 78 anonymous Twitter accounts by using information on résumés as side data. We linked the Twitter posts and résumés by estimating the profiles of the account owners and matching them with those described in the résumés. We were able to link roughly 50% of the accounts to their owners when all tweets were used and roughly 20% of the accounts when only the latest 63 tweets were used. The proposed strategy would help people be better aware of the privacy risks of using personal data and hopefully lead to the implementation of improved protection measures.
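
The abstract describes linking accounts to résumés by estimating the owners' profiles and matching them against the profiles described in the résumés. As a rough illustration only, and not the authors' implementation, the matching step might be sketched in Python as follows. The attribute names, the probabilities, the `match_score` and `link_accounts` helpers, and the log-probability scoring rule are all assumptions introduced here; the abstract does not say how profiles are estimated or how matches are scored.

```python
import math

# Hypothetical input: per-account attribute probabilities, as some upstream
# classifier might estimate them from tweets (values invented for illustration).
estimated_profiles = {
    "account_a": {
        "occupation": {"engineer": 0.7, "teacher": 0.3},
        "hobby": {"tennis": 0.6, "chess": 0.4},
    },
    "account_b": {
        "occupation": {"engineer": 0.2, "teacher": 0.8},
        "hobby": {"tennis": 0.1, "chess": 0.9},
    },
}

# Hypothetical résumés: one stated value per attribute, owner known.
resumes = {
    "person_1": {"occupation": "teacher", "hobby": "chess"},
    "person_2": {"occupation": "engineer", "hobby": "tennis"},
}


def match_score(profile, resume, floor=1e-6):
    """Sum the log-probabilities the estimated profile assigns to the résumé's values.

    Values the profile never predicted are floored to avoid log(0).
    This scoring rule is an assumption, not taken from the paper.
    """
    return sum(
        math.log(max(profile.get(attr, {}).get(value, 0.0), floor))
        for attr, value in resume.items()
    )


def link_accounts(profiles, resumes):
    """Link each account to the résumé with the highest match score."""
    return {
        account: max(resumes, key=lambda person: match_score(profile, resumes[person]))
        for account, profile in profiles.items()
    }


links = link_accounts(estimated_profiles, resumes)
print(links)
```

With these made-up numbers, `account_a` is linked to `person_2` and `account_b` to `person_1`. A real attack would need a trained profile estimator and a way to handle ties and low-confidence matches, neither of which is shown here.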