A Re-Identification Strategy Using Machine Learning that Exploits Better Side Data
Eina Hashimoto, Masatsugu Ichino, H. Yoshiura
2019 IEEE 10th International Conference on Awareness Science and Technology (iCAST), published 2019-10-01
DOI: 10.1109/ICAwST.2019.8923378 (https://doi.org/10.1109/ICAwST.2019.8923378)
Data on people’s daily activities are being collected as big data and then mined for corporate and public purposes. However, concern about privacy is the major obstacle to using such data. Although data anonymisation can mitigate the privacy risk, researchers have shown that people can often be re-identified by linking the anonymised data with other data (i.e. side data). Though re-identification has been improved by the use of machine learning, two problems remain: the unavailability of data and the inappropriateness of the side data. We have developed a re-identification strategy using machine learning in which anonymised data are linked to a different type of side data that are easy to obtain and for which the represented person is easy to identify. We tested our strategy on the re-identification of the owners of 78 anonymous Twitter accounts by using information on résumés as side data. We linked the Twitter posts and résumés by estimating the profiles of the account owners and matching them with those described in the résumés. We were able to link roughly 50% of the accounts to their owners when all tweets were used and roughly 20% of the accounts when only the latest 63 tweets were used. The proposed strategy would help people be better aware of the privacy risks of using personal data and hopefully lead to the implementation of improved protection measures.
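
The linking step described in the abstract (estimate a profile for each account owner, then match it against the profiles given in the résumés) can be illustrated with a minimal sketch. The abstract does not specify the classifiers, the profile attributes, or the matching score, so everything below is an assumption made for illustration: the attribute names, the probability values, the log-probability score, and the helper names `match_score` and `link_accounts` are hypothetical and are not taken from the paper.

```python
import math


def match_score(estimated, resume, floor=1e-6):
    """Sum the log-probabilities that an estimated profile assigns to a résumé's values.

    `estimated` maps attribute -> {value: probability}; in practice these would
    come from classifiers trained on tweets (assumed, not shown here).
    `resume` maps attribute -> the value stated in the résumé.
    `floor` avoids log(0) when a value was given no probability.
    """
    score = 0.0
    for attribute, value in resume.items():
        probs = estimated.get(attribute, {})
        score += math.log(max(probs.get(value, 0.0), floor))
    return score


def link_accounts(estimated_profiles, resumes):
    """Link each account to the résumé whose profile it matches best."""
    links = {}
    for account, estimated in estimated_profiles.items():
        links[account] = max(
            resumes,
            key=lambda person: match_score(estimated, resumes[person]),
        )
    return links


# Toy data, invented for illustration only.
estimated_profiles = {
    "account_a": {
        "occupation": {"engineer": 0.7, "teacher": 0.3},
        "hobby": {"music": 0.6, "sports": 0.4},
    },
    "account_b": {
        "occupation": {"engineer": 0.2, "teacher": 0.8},
        "hobby": {"music": 0.1, "sports": 0.9},
    },
}
resumes = {
    "person_1": {"occupation": "teacher", "hobby": "sports"},
    "person_2": {"occupation": "engineer", "hobby": "music"},
}

links = link_accounts(estimated_profiles, resumes)
print(links)
```

In this toy setting, each account is linked to the single best-scoring résumé. A re-identification rate like the ones reported in the abstract would then be the fraction of accounts whose best-scoring résumé belongs to the true owner.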