A Large Publicly Available Corpus of Website Privacy Policies Based on DMOZ

Razieh Nokhbeh Zaeem, K. S. Barber
{"title":"A Large Publicly Available Corpus of Website Privacy Policies Based on DMOZ","authors":"Razieh Nokhbeh Zaeem, K. S. Barber","doi":"10.1145/3422337.3447827","DOIUrl":null,"url":null,"abstract":"Studies have shown website privacy policies are too long and hard to comprehend for their target audience. These studies and a more recent body of research that utilizes machine learning and natural language processing to automatically summarize privacy policies greatly benefit, if not rely on, corpora of privacy policies collected from the web. While there have been smaller annotated corpora of web privacy policies made public, we are not aware of any large publicly available corpus. We use DMOZ, a massive open-content directory of the web, and its manually categorized 1.5 million websites, to collect hundreds of thousands of privacy policies associated with their categories, enabling research on privacy policies across different categories/market sectors. We review the statistics of this corpus and make it available for research. We also obtain valuable insights about privacy policies, e.g., which websites post them less often. Our corpus of web privacy policies is a valuable tool at the researchers' disposal to investigate privacy policies. For example, it facilitates comparison among different methods of privacy policy summarization by providing a benchmark, and can be used in unsupervised machine learning to summarize privacy policies.","PeriodicalId":187272,"journal":{"name":"Proceedings of the Eleventh ACM Conference on Data and Application Security and Privacy","volume":"50 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2021-04-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"13","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the Eleventh ACM Conference on Data and Application Security and Privacy","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3422337.3447827","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 13

Abstract

Studies have shown website privacy policies are too long and hard to comprehend for their target audience. These studies and a more recent body of research that utilizes machine learning and natural language processing to automatically summarize privacy policies greatly benefit, if not rely on, corpora of privacy policies collected from the web. While there have been smaller annotated corpora of web privacy policies made public, we are not aware of any large publicly available corpus. We use DMOZ, a massive open-content directory of the web, and its manually categorized 1.5 million websites, to collect hundreds of thousands of privacy policies associated with their categories, enabling research on privacy policies across different categories/market sectors. We review the statistics of this corpus and make it available for research. We also obtain valuable insights about privacy policies, e.g., which websites post them less often. Our corpus of web privacy policies is a valuable tool at the researchers' disposal to investigate privacy policies. For example, it facilitates comparison among different methods of privacy policy summarization by providing a benchmark, and can be used in unsupervised machine learning to summarize privacy policies.
基于DMOZ的大型公开网站隐私政策语料库
研究表明,网站的隐私政策太长,很难让目标受众理解。这些研究以及最近利用机器学习和自然语言处理来自动总结隐私政策的研究,即使不依赖于从网络上收集的隐私政策语料库,也会极大地受益。虽然已经有一些较小的网络隐私政策注释语料库公开,但我们还没有发现任何大型的公开语料库。我们使用DMOZ,一个大型的网络开放内容目录,它手动分类了150万个网站,收集了数十万个与其类别相关的隐私政策,从而可以研究不同类别/市场部门的隐私政策。我们回顾了这个语料库的统计数据,并使其可供研究。我们还获得了有关隐私政策的宝贵见解,例如,哪些网站发布的频率较低。我们的网络隐私政策语料库是研究人员调查隐私政策的宝贵工具。例如,它通过提供基准来方便不同隐私策略总结方法之间的比较,并且可以在无监督机器学习中使用来总结隐私策略。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 求助全文
来源期刊
自引率
0.00%
发文量
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术官方微信