Qinyi Liu, Ronas Shakya, Jelena Jovanovic, Mohammad Khalil, Javier de la Hoz-Ruiz
{"title":"Ensuring privacy through synthetic data generation in education","authors":"Qinyi Liu, Ronas Shakya, Jelena Jovanovic, Mohammad Khalil, Javier de la Hoz-Ruiz","doi":"10.1111/bjet.13576","DOIUrl":null,"url":null,"abstract":"<div>\n \n \n <section>\n \n <p>High-volume, high-quality and diverse datasets are crucial for advancing research in the education field. However, such datasets often contain sensitive information that poses significant privacy challenges. Traditional anonymisation techniques fail to meet the privacy standards required by regulations like GDPR, prompting the need for more robust solutions. Synthetic data have emerged as a promising privacy-preserving approach, allowing for the generation and sharing of datasets that mimic real data while ensuring privacy. Still, the application of synthetic data alone on educational datasets remains vulnerable to privacy threats such as linkage attacks. Therefore, this study explores for the first time the application of <i>private synthetic data</i>, which combines synthetic data with differential privacy mechanisms, in the education sector. By considering the dual needs of data utility and privacy, we investigate the performance of various synthetic data generation techniques in safeguarding sensitive educational information. Our research focuses on two key questions: the capability of these techniques to prevent privacy threats and their impact on the utility of synthetic educational datasets. Through this investigation, we aim to bridge the gap in understanding the balance between privacy and utility of advanced privacy-preserving techniques within educational contexts.</p>\n </section>\n \n <section>\n \n <div>\n \n <div>\n \n <h3>Practitioner notes</h3>\n <p>What is already known about this topic\n </p><ul>\n \n <li>Traditional privacy-preserving methods for educational datasets have not proven successful in ensuring a balance of data utility and privacy. Additionally, these methods often lack empirical evaluation and/or evidence of successful application in practice.</li>\n \n <li>Synthetic data generation is a state-of-the-art privacy-preserving method that has been increasingly used as a substitute for real datasets for data publishing and sharing. However, recent research has demonstrated that even synthetic data are vulnerable to privacy threats.</li>\n \n <li>Differential privacy (DP) is the gold standard for quantifying and mitigating privacy concerns. Its combination with synthetic data, often referred to as <i>private synthetic data,</i> is presently the best available approach to ensuring data privacy. However, private synthetic data have not been studied in the educational domain.</li>\n </ul>\n \n <p>What this study contributes\n </p><ul>\n \n <li>The study has applied synthetic data generation methods with DP mechanisms to educational data for the first time, provided a comprehensive report on the utility and privacy of the resulting synthetic data, and explored factors affecting the performance of synthetic data generators in the context of educational datasets.</li>\n \n <li>The experimental results of this study indicate that no synthetic data generator consistently outperforms others across all evaluation metrics in the examined educational datasets. Instead, different generators excel in their respective areas of proficiency, such as privacy or utility.</li>\n \n <li>Highlighting the potential of synthetic data generation techniques in the education sector, this work paves the way for future developments in the use of synthetic data generation for privacy-preserving educational research.</li>\n </ul>\n \n <p>Implications for practice and/or policy\n </p><ul>\n \n <li>Key takeaways for practical application include the importance of conducting case-specific evaluations, carefully balancing data privacy with utility and exercising caution when using private synthetic data generators for high-precision computational tasks, especially in resource-limited settings as highlighted in this study.</li>\n \n <li>Educational researchers and practitioners can leverage synthetic data to release data without compromising student privacy, thereby promoting the development of open science and contributing to the advancement of education research.</li>\n \n <li>The robust privacy performance of DP-synthetic data generators may help alleviate students' privacy concerns while fostering their trust in sharing personal information.</li>\n \n <li>By improving the transparency and security of data sharing, DP-synthetic data generators technologies can promote student-centred data governance practices while providing a strong technical foundation for developing responsible data usage policies.</li>\n </ul>\n \n </div>\n </div>\n </section>\n </div>","PeriodicalId":48315,"journal":{"name":"British Journal of Educational Technology","volume":"56 3","pages":"1053-1073"},"PeriodicalIF":6.7000,"publicationDate":"2025-02-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"British Journal of Educational Technology","FirstCategoryId":"95","ListUrlMain":"https://onlinelibrary.wiley.com/doi/10.1111/bjet.13576","RegionNum":1,"RegionCategory":"教育学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"EDUCATION & EDUCATIONAL RESEARCH","Score":null,"Total":0}
引用次数: 0
Abstract
High-volume, high-quality and diverse datasets are crucial for advancing research in the education field. However, such datasets often contain sensitive information that poses significant privacy challenges. Traditional anonymisation techniques fail to meet the privacy standards required by regulations like GDPR, prompting the need for more robust solutions. Synthetic data have emerged as a promising privacy-preserving approach, allowing for the generation and sharing of datasets that mimic real data while ensuring privacy. Still, the application of synthetic data alone on educational datasets remains vulnerable to privacy threats such as linkage attacks. Therefore, this study explores for the first time the application of private synthetic data, which combines synthetic data with differential privacy mechanisms, in the education sector. By considering the dual needs of data utility and privacy, we investigate the performance of various synthetic data generation techniques in safeguarding sensitive educational information. Our research focuses on two key questions: the capability of these techniques to prevent privacy threats and their impact on the utility of synthetic educational datasets. Through this investigation, we aim to bridge the gap in understanding the balance between privacy and utility of advanced privacy-preserving techniques within educational contexts.
Practitioner notes
What is already known about this topic
Traditional privacy-preserving methods for educational datasets have not proven successful in ensuring a balance of data utility and privacy. Additionally, these methods often lack empirical evaluation and/or evidence of successful application in practice.
Synthetic data generation is a state-of-the-art privacy-preserving method that has been increasingly used as a substitute for real datasets for data publishing and sharing. However, recent research has demonstrated that even synthetic data are vulnerable to privacy threats.
Differential privacy (DP) is the gold standard for quantifying and mitigating privacy concerns. Its combination with synthetic data, often referred to as private synthetic data, is presently the best available approach to ensuring data privacy. However, private synthetic data have not been studied in the educational domain.
What this study contributes
The study has applied synthetic data generation methods with DP mechanisms to educational data for the first time, provided a comprehensive report on the utility and privacy of the resulting synthetic data, and explored factors affecting the performance of synthetic data generators in the context of educational datasets.
The experimental results of this study indicate that no synthetic data generator consistently outperforms others across all evaluation metrics in the examined educational datasets. Instead, different generators excel in their respective areas of proficiency, such as privacy or utility.
Highlighting the potential of synthetic data generation techniques in the education sector, this work paves the way for future developments in the use of synthetic data generation for privacy-preserving educational research.
Implications for practice and/or policy
Key takeaways for practical application include the importance of conducting case-specific evaluations, carefully balancing data privacy with utility and exercising caution when using private synthetic data generators for high-precision computational tasks, especially in resource-limited settings as highlighted in this study.
Educational researchers and practitioners can leverage synthetic data to release data without compromising student privacy, thereby promoting the development of open science and contributing to the advancement of education research.
The robust privacy performance of DP-synthetic data generators may help alleviate students' privacy concerns while fostering their trust in sharing personal information.
By improving the transparency and security of data sharing, DP-synthetic data generators technologies can promote student-centred data governance practices while providing a strong technical foundation for developing responsible data usage policies.
期刊介绍:
BJET is a primary source for academics and professionals in the fields of digital educational and training technology throughout the world. The Journal is published by Wiley on behalf of The British Educational Research Association (BERA). It publishes theoretical perspectives, methodological developments and high quality empirical research that demonstrate whether and how applications of instructional/educational technology systems, networks, tools and resources lead to improvements in formal and non-formal education at all levels, from early years through to higher, technical and vocational education, professional development and corporate training.