Ensuring privacy through synthetic data generation in education

Impact factor 6.7 · Zone 1 (Education) · JCR Q1, Education & Educational Research
Qinyi Liu, Ronas Shakya, Jelena Jovanovic, Mohammad Khalil, Javier de la Hoz-Ruiz
British Journal of Educational Technology, vol. 56, no. 3, pp. 1053-1073. Published 2025-02-19.
DOI: 10.1111/bjet.13576
https://onlinelibrary.wiley.com/doi/10.1111/bjet.13576
Citations: 0

Abstract

High-volume, high-quality and diverse datasets are crucial for advancing research in the education field. However, such datasets often contain sensitive information that poses significant privacy challenges. Traditional anonymisation techniques fail to meet the privacy standards required by regulations like GDPR, prompting the need for more robust solutions. Synthetic data have emerged as a promising privacy-preserving approach, allowing for the generation and sharing of datasets that mimic real data while ensuring privacy. Still, the application of synthetic data alone on educational datasets remains vulnerable to privacy threats such as linkage attacks. Therefore, this study explores for the first time the application of private synthetic data, which combines synthetic data with differential privacy mechanisms, in the education sector. By considering the dual needs of data utility and privacy, we investigate the performance of various synthetic data generation techniques in safeguarding sensitive educational information. Our research focuses on two key questions: the capability of these techniques to prevent privacy threats and their impact on the utility of synthetic educational datasets. Through this investigation, we aim to bridge the gap in understanding the balance between privacy and utility of advanced privacy-preserving techniques within educational contexts.

Practitioner notes

What is already known about this topic

  • Traditional privacy-preserving methods for educational datasets have not proven successful in ensuring a balance of data utility and privacy. Additionally, these methods often lack empirical evaluation and/or evidence of successful application in practice.
  • Synthetic data generation is a state-of-the-art privacy-preserving method that has been increasingly used as a substitute for real datasets for data publishing and sharing. However, recent research has demonstrated that even synthetic data are vulnerable to privacy threats.
  • Differential privacy (DP) is the gold standard for quantifying and mitigating privacy concerns. Its combination with synthetic data, often referred to as private synthetic data, is presently the best available approach to ensuring data privacy. However, private synthetic data have not been studied in the educational domain.
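
As a rough illustration of what "private synthetic data" means, the sketch below adds Laplace noise to the counts of a single categorical attribute and then samples synthetic records from the noisy counts. It is a minimal sketch under stated assumptions, not one of the generators evaluated in the study: the function names, the single-attribute setting, the grade categories and the choice of the Laplace mechanism are all illustrative, and the category list is assumed to be fixed in advance rather than derived from the data.

```python
import random
from collections import Counter


def laplace_noise(scale, rng):
    # The difference of two i.i.d. exponential draws with mean `scale`
    # is Laplace-distributed with location 0 and scale `scale`.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def dp_histogram(records, categories, epsilon, rng):
    """Noisy counts for one categorical attribute.

    Assumption: neighbouring datasets differ by adding or removing one
    record, so the histogram has L1 sensitivity 1 and Laplace noise with
    scale 1/epsilon gives epsilon-differential privacy.
    """
    counts = Counter(records)
    scale = 1.0 / epsilon
    noisy = {c: counts.get(c, 0) + laplace_noise(scale, rng) for c in categories}
    # Clipping negatives is post-processing and does not weaken the guarantee.
    return {c: max(v, 0.0) for c, v in noisy.items()}


def sample_synthetic(noisy_counts, n, rng):
    """Draw n synthetic values in proportion to the noisy counts."""
    cats = list(noisy_counts)
    weights = [noisy_counts[c] for c in cats]
    if sum(weights) <= 0:
        # Fall back to uniform sampling if noise wiped out every count.
        weights = [1.0] * len(cats)
    return rng.choices(cats, weights=weights, k=n)


# Hypothetical example: final grades of 100 students.
GRADES = ["A", "B", "C", "D", "F"]
rng = random.Random(0)
real_grades = ["A"] * 20 + ["B"] * 35 + ["C"] * 30 + ["D"] * 10 + ["F"] * 5
noisy_counts = dp_histogram(real_grades, GRADES, epsilon=1.0, rng=rng)
synthetic_grades = sample_synthetic(noisy_counts, n=len(real_grades), rng=rng)
```

Only the noisy counts touch the real records; the synthetic sample is derived from them alone, which is why the privacy guarantee carries over to the released data. Smaller values of epsilon add more noise and so give stronger privacy at the cost of fidelity.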

What this study contributes

  • The study has applied synthetic data generation methods with DP mechanisms to educational data for the first time, provided a comprehensive report on the utility and privacy of the resulting synthetic data, and explored factors affecting the performance of synthetic data generators in the context of educational datasets.
  • The experimental results of this study indicate that no synthetic data generator consistently outperforms others across all evaluation metrics in the examined educational datasets. Instead, different generators excel in their respective areas of proficiency, such as privacy or utility.
  • Highlighting the potential of synthetic data generation techniques in the education sector, this work paves the way for future developments in the use of synthetic data generation for privacy-preserving educational research.

Implications for practice and/or policy

  • Key takeaways for practical application include the importance of conducting case-specific evaluations, carefully balancing data privacy with utility and exercising caution when using private synthetic data generators for high-precision computational tasks, especially in resource-limited settings as highlighted in this study.
  • Educational researchers and practitioners can leverage synthetic data to release data without compromising student privacy, thereby promoting the development of open science and contributing to the advancement of education research.
  • The robust privacy performance of DP-synthetic data generators may help alleviate students' privacy concerns while fostering their trust in sharing personal information.
  • By improving the transparency and security of data sharing, DP-synthetic data generation technologies can promote student-centred data governance practices while providing a strong technical foundation for developing responsible data usage policies.
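
The recommendation to run case-specific evaluations and to balance privacy against utility can be made concrete with a small experiment. The sketch below is a hypothetical illustration, not the evaluation protocol of the study: it perturbs the histogram of one categorical attribute with Laplace noise at several privacy budgets (epsilon) and reports the total variation distance between the real and the noisy distribution as a simple utility measure. All names, the example data and the choice of metric are assumptions made for illustration.

```python
import random
from collections import Counter


def laplace_noise(scale, rng):
    # Difference of two exponential draws with mean `scale` is Laplace(0, scale).
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def true_distribution(records, categories):
    """Relative frequency of each category in the real data."""
    counts = Counter(records)
    n = len(records)
    return {c: counts.get(c, 0) / n for c in categories}


def noisy_distribution(records, categories, epsilon, rng):
    """Normalised histogram after adding Laplace noise with scale 1/epsilon."""
    counts = Counter(records)
    noisy = {
        c: max(counts.get(c, 0) + laplace_noise(1.0 / epsilon, rng), 0.0)
        for c in categories
    }
    total = sum(noisy.values())
    if total <= 0:
        # Degenerate case: all counts clipped to zero, fall back to uniform.
        return {c: 1.0 / len(categories) for c in categories}
    return {c: v / total for c, v in noisy.items()}


def total_variation(p, q):
    """Total variation distance: 0 means identical, 1 means disjoint."""
    return 0.5 * sum(abs(p[c] - q[c]) for c in p)


def sweep(records, categories, epsilons, seed=0):
    """Utility loss (total variation distance) for each privacy budget."""
    rng = random.Random(seed)
    real = true_distribution(records, categories)
    return {
        eps: total_variation(real, noisy_distribution(records, categories, eps, rng))
        for eps in epsilons
    }


# Hypothetical example: course outcomes for 1000 students.
OUTCOMES = ["pass", "fail", "withdrawn"]
outcomes = ["pass"] * 600 + ["fail"] * 300 + ["withdrawn"] * 100
utility_loss = sweep(outcomes, OUTCOMES, epsilons=[0.01, 0.1, 1.0, 10.0])
```

A single run is noisy, so in practice one would repeat the sweep over many seeds and over the downstream task of interest (for example, a prediction model) before choosing a privacy budget for a given dataset.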
Source journal: British Journal of Educational Technology (Education & Educational Research)
CiteScore: 15.60 · Self-citation rate: 4.50% · Articles published: 111
About the journal: BJET is a primary source for academics and professionals in the fields of digital educational and training technology throughout the world. The Journal is published by Wiley on behalf of The British Educational Research Association (BERA). It publishes theoretical perspectives, methodological developments and high-quality empirical research that demonstrate whether and how applications of instructional/educational technology systems, networks, tools and resources lead to improvements in formal and non-formal education at all levels, from early years through to higher, technical and vocational education, professional development and corporate training.