Using Transfer Learning to Identify Privacy Leaks in Tweets

Saul Ricardo Medrano Castillo, Zhiyuan Chen
Published in: 2016 IEEE 2nd International Conference on Collaboration and Internet Computing (CIC), 2016-11-01
DOI: 10.1109/CIC.2016.078 (https://doi.org/10.1109/CIC.2016.078)
Citations: 7

Abstract

Users of online social networks often disclose a great deal of sensitive information, intentionally or unintentionally, allowing organizations such as governments, advertising companies, or criminals to exploit it. In this paper, we focus on identifying privacy leaks, such as being pregnant or being drunk, in the content of tweets. This problem is nontrivial for two reasons. First, we need to differentiate tweets that actually contain privacy leaks from tweets that do not; for example, a tweet that talks about a celebrity getting pregnant or that sells products for pregnant women is not privacy sensitive. Second, most existing solutions build a supervised learning model for each type of privacy leak, but there can be many types of leaks, so such solutions require labeling a large number of tweets for each type, which is tedious and does not generalize easily. Our main contribution is to apply transfer learning techniques so that training data for one type of privacy leak can be used for another type that shares some common ground but is not exactly the same. This greatly reduces the labeling effort and makes our solution more generalizable. Experimental results validate the benefit of our approach: only 7% of the data for the new type of leak needs to be labeled to achieve results similar to those obtained with 100% labeled data.
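The abstract does not say which transfer learning technique the authors use, so the following is only an illustrative sketch of one common family of approaches (instance weighting), not the paper's method. It trains a small bag-of-words Naive Bayes classifier on labeled tweets for a source leak type (pregnancy) together with a handful of labeled tweets for a target leak type (drunkenness), giving the target examples a larger weight. The class `WeightedNaiveBayes`, the function `train_transfer_model`, the `target_weight` parameter, and all example tweets are hypothetical and made up for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenizer (an assumption; real tweets need more care)."""
    return text.lower().split()


class WeightedNaiveBayes:
    """Binary multinomial Naive Bayes with per-example weights (hypothetical helper)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha  # Laplace smoothing constant

    def fit(self, texts, labels, weights):
        self.class_weight = Counter()
        self.word_counts = {0: Counter(), 1: Counter()}
        self.vocab = set()
        for text, label, weight in zip(texts, labels, weights):
            self.class_weight[label] += weight
            for token in tokenize(text):
                self.word_counts[label][token] += weight
                self.vocab.add(token)
        return self

    def predict(self, text):
        total = sum(self.class_weight.values())
        scores = {}
        for label in (0, 1):
            # Smoothed log prior for the class.
            score = math.log(
                (self.class_weight[label] + self.alpha) / (total + 2 * self.alpha)
            )
            denom = sum(self.word_counts[label].values()) + self.alpha * len(self.vocab)
            for token in tokenize(text):
                if token in self.vocab:  # tokens never seen in training are skipped
                    score += math.log(
                        (self.word_counts[label][token] + self.alpha) / denom
                    )
            scores[label] = score
        return max(scores, key=scores.get)


def train_transfer_model(source, target, target_weight=3.0):
    """Pool source-task and target-task examples, upweighting the target ones."""
    examples = [(t, y, 1.0) for t, y in source]
    examples += [(t, y, target_weight) for t, y in target]
    texts, labels, weights = zip(*examples)
    return WeightedNaiveBayes().fit(texts, labels, weights)


# Made-up toy data: label 1 = the author leaks something about themselves, 0 = no leak.
SOURCE_PREGNANCY = [
    ("i am pregnant and so happy", 1),
    ("just found out i am pregnant", 1),
    ("celebrity is pregnant again says magazine", 0),
    ("buy vitamins for pregnant women sale", 0),
]
# Only a few labeled examples for the new (target) leak type.
TARGET_DRUNK = [
    ("i am so drunk right now", 1),
    ("drunk driver arrested says news", 0),
]

model = train_transfer_model(SOURCE_PREGNANCY, TARGET_DRUNK)
print(model.predict("just found out i am drunk"))          # first-person phrasing
print(model.predict("magazine says celebrity got drunk"))  # third-person report
```

In this toy setup, first-person cues such as "i am" and reporting cues such as "says" are learned mostly from the source task and reused for the target task, which is the kind of shared ground the abstract refers to. The weight of 3.0 is an arbitrary choice for the sketch and would need tuning on real data.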