Ensuring privacy through synthetic data generation in education

IF 6.7 | Zone 1 (Education) | Q1 EDUCATION & EDUCATIONAL RESEARCH
Qinyi Liu, Ronas Shakya, Jelena Jovanovic, Mohammad Khalil, Javier de la Hoz-Ruiz
{"title":"通过在教育中生成综合数据来确保隐私","authors":"Qinyi Liu,&nbsp;Ronas Shakya,&nbsp;Jelena Jovanovic,&nbsp;Mohammad Khalil,&nbsp;Javier de la Hoz-Ruiz","doi":"10.1111/bjet.13576","DOIUrl":null,"url":null,"abstract":"<div>\n \n \n <section>\n \n <p>High-volume, high-quality and diverse datasets are crucial for advancing research in the education field. However, such datasets often contain sensitive information that poses significant privacy challenges. Traditional anonymisation techniques fail to meet the privacy standards required by regulations like GDPR, prompting the need for more robust solutions. Synthetic data have emerged as a promising privacy-preserving approach, allowing for the generation and sharing of datasets that mimic real data while ensuring privacy. Still, the application of synthetic data alone on educational datasets remains vulnerable to privacy threats such as linkage attacks. Therefore, this study explores for the first time the application of <i>private synthetic data</i>, which combines synthetic data with differential privacy mechanisms, in the education sector. By considering the dual needs of data utility and privacy, we investigate the performance of various synthetic data generation techniques in safeguarding sensitive educational information. Our research focuses on two key questions: the capability of these techniques to prevent privacy threats and their impact on the utility of synthetic educational datasets. Through this investigation, we aim to bridge the gap in understanding the balance between privacy and utility of advanced privacy-preserving techniques within educational contexts.</p>\n </section>\n \n <section>\n \n <div>\n \n <div>\n \n <h3>Practitioner notes</h3>\n <p>What is already known about this topic\n </p><ul>\n \n <li>Traditional privacy-preserving methods for educational datasets have not proven successful in ensuring a balance of data utility and privacy. Additionally, these methods often lack empirical evaluation and/or evidence of successful application in practice.</li>\n \n <li>Synthetic data generation is a state-of-the-art privacy-preserving method that has been increasingly used as a substitute for real datasets for data publishing and sharing. However, recent research has demonstrated that even synthetic data are vulnerable to privacy threats.</li>\n \n <li>Differential privacy (DP) is the gold standard for quantifying and mitigating privacy concerns. Its combination with synthetic data, often referred to as <i>private synthetic data,</i> is presently the best available approach to ensuring data privacy. However, private synthetic data have not been studied in the educational domain.</li>\n </ul>\n \n <p>What this study contributes\n </p><ul>\n \n <li>The study has applied synthetic data generation methods with DP mechanisms to educational data for the first time, provided a comprehensive report on the utility and privacy of the resulting synthetic data, and explored factors affecting the performance of synthetic data generators in the context of educational datasets.</li>\n \n <li>The experimental results of this study indicate that no synthetic data generator consistently outperforms others across all evaluation metrics in the examined educational datasets. 
Instead, different generators excel in their respective areas of proficiency, such as privacy or utility.</li>\n \n <li>Highlighting the potential of synthetic data generation techniques in the education sector, this work paves the way for future developments in the use of synthetic data generation for privacy-preserving educational research.</li>\n </ul>\n \n <p>Implications for practice and/or policy\n </p><ul>\n \n <li>Key takeaways for practical application include the importance of conducting case-specific evaluations, carefully balancing data privacy with utility and exercising caution when using private synthetic data generators for high-precision computational tasks, especially in resource-limited settings as highlighted in this study.</li>\n \n <li>Educational researchers and practitioners can leverage synthetic data to release data without compromising student privacy, thereby promoting the development of open science and contributing to the advancement of education research.</li>\n \n <li>The robust privacy performance of DP-synthetic data generators may help alleviate students' privacy concerns while fostering their trust in sharing personal information.</li>\n \n <li>By improving the transparency and security of data sharing, DP-synthetic data generators technologies can promote student-centred data governance practices while providing a strong technical foundation for developing responsible data usage policies.</li>\n </ul>\n \n </div>\n </div>\n </section>\n </div>","PeriodicalId":48315,"journal":{"name":"British Journal of Educational Technology","volume":"56 3","pages":"1053-1073"},"PeriodicalIF":6.7000,"publicationDate":"2025-02-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Ensuring privacy through synthetic data generation in education\",\"authors\":\"Qinyi Liu,&nbsp;Ronas Shakya,&nbsp;Jelena Jovanovic,&nbsp;Mohammad Khalil,&nbsp;Javier de la Hoz-Ruiz\",\"doi\":\"10.1111/bjet.13576\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div>\\n \\n \\n <section>\\n \\n <p>High-volume, high-quality and diverse datasets are crucial for advancing research in the education field. However, such datasets often contain sensitive information that poses significant privacy challenges. Traditional anonymisation techniques fail to meet the privacy standards required by regulations like GDPR, prompting the need for more robust solutions. Synthetic data have emerged as a promising privacy-preserving approach, allowing for the generation and sharing of datasets that mimic real data while ensuring privacy. Still, the application of synthetic data alone on educational datasets remains vulnerable to privacy threats such as linkage attacks. Therefore, this study explores for the first time the application of <i>private synthetic data</i>, which combines synthetic data with differential privacy mechanisms, in the education sector. By considering the dual needs of data utility and privacy, we investigate the performance of various synthetic data generation techniques in safeguarding sensitive educational information. Our research focuses on two key questions: the capability of these techniques to prevent privacy threats and their impact on the utility of synthetic educational datasets. 
Through this investigation, we aim to bridge the gap in understanding the balance between privacy and utility of advanced privacy-preserving techniques within educational contexts.</p>\\n </section>\\n \\n <section>\\n \\n <div>\\n \\n <div>\\n \\n <h3>Practitioner notes</h3>\\n <p>What is already known about this topic\\n </p><ul>\\n \\n <li>Traditional privacy-preserving methods for educational datasets have not proven successful in ensuring a balance of data utility and privacy. Additionally, these methods often lack empirical evaluation and/or evidence of successful application in practice.</li>\\n \\n <li>Synthetic data generation is a state-of-the-art privacy-preserving method that has been increasingly used as a substitute for real datasets for data publishing and sharing. However, recent research has demonstrated that even synthetic data are vulnerable to privacy threats.</li>\\n \\n <li>Differential privacy (DP) is the gold standard for quantifying and mitigating privacy concerns. Its combination with synthetic data, often referred to as <i>private synthetic data,</i> is presently the best available approach to ensuring data privacy. However, private synthetic data have not been studied in the educational domain.</li>\\n </ul>\\n \\n <p>What this study contributes\\n </p><ul>\\n \\n <li>The study has applied synthetic data generation methods with DP mechanisms to educational data for the first time, provided a comprehensive report on the utility and privacy of the resulting synthetic data, and explored factors affecting the performance of synthetic data generators in the context of educational datasets.</li>\\n \\n <li>The experimental results of this study indicate that no synthetic data generator consistently outperforms others across all evaluation metrics in the examined educational datasets. 
Instead, different generators excel in their respective areas of proficiency, such as privacy or utility.</li>\\n \\n <li>Highlighting the potential of synthetic data generation techniques in the education sector, this work paves the way for future developments in the use of synthetic data generation for privacy-preserving educational research.</li>\\n </ul>\\n \\n <p>Implications for practice and/or policy\\n </p><ul>\\n \\n <li>Key takeaways for practical application include the importance of conducting case-specific evaluations, carefully balancing data privacy with utility and exercising caution when using private synthetic data generators for high-precision computational tasks, especially in resource-limited settings as highlighted in this study.</li>\\n \\n <li>Educational researchers and practitioners can leverage synthetic data to release data without compromising student privacy, thereby promoting the development of open science and contributing to the advancement of education research.</li>\\n \\n <li>The robust privacy performance of DP-synthetic data generators may help alleviate students' privacy concerns while fostering their trust in sharing personal information.</li>\\n \\n <li>By improving the transparency and security of data sharing, DP-synthetic data generators technologies can promote student-centred data governance practices while providing a strong technical foundation for developing responsible data usage policies.</li>\\n </ul>\\n \\n </div>\\n </div>\\n </section>\\n </div>\",\"PeriodicalId\":48315,\"journal\":{\"name\":\"British Journal of Educational Technology\",\"volume\":\"56 3\",\"pages\":\"1053-1073\"},\"PeriodicalIF\":6.7000,\"publicationDate\":\"2025-02-19\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"British Journal of Educational Technology\",\"FirstCategoryId\":\"95\",\"ListUrlMain\":\"https://onlinelibrary.wiley.com/doi/10.1111/bjet.13576\",\"RegionNum\":1,\"RegionCategory\":\"教育学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"EDUCATION & EDUCATIONAL RESEARCH\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"British Journal of Educational Technology","FirstCategoryId":"95","ListUrlMain":"https://onlinelibrary.wiley.com/doi/10.1111/bjet.13576","RegionNum":1,"RegionCategory":"教育学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"EDUCATION & EDUCATIONAL RESEARCH","Score":null,"Total":0}
Citations: 0

Abstract


High-volume, high-quality and diverse datasets are crucial for advancing research in the education field. However, such datasets often contain sensitive information that poses significant privacy challenges. Traditional anonymisation techniques fail to meet the privacy standards required by regulations like GDPR, prompting the need for more robust solutions. Synthetic data have emerged as a promising privacy-preserving approach, allowing for the generation and sharing of datasets that mimic real data while ensuring privacy. Still, the application of synthetic data alone on educational datasets remains vulnerable to privacy threats such as linkage attacks. Therefore, this study explores for the first time the application of private synthetic data, which combines synthetic data with differential privacy mechanisms, in the education sector. By considering the dual needs of data utility and privacy, we investigate the performance of various synthetic data generation techniques in safeguarding sensitive educational information. Our research focuses on two key questions: the capability of these techniques to prevent privacy threats and their impact on the utility of synthetic educational datasets. Through this investigation, we aim to bridge the gap in understanding the balance between privacy and utility of advanced privacy-preserving techniques within educational contexts.

Practitioner notes

What is already known about this topic

  • Traditional privacy-preserving methods for educational datasets have not proven successful in ensuring a balance of data utility and privacy. Additionally, these methods often lack empirical evaluation and/or evidence of successful application in practice.
  • Synthetic data generation is a state-of-the-art privacy-preserving method that has been increasingly used as a substitute for real datasets for data publishing and sharing. However, recent research has demonstrated that even synthetic data are vulnerable to privacy threats (a toy linkage-style check is sketched after this list).
  • Differential privacy (DP) is the gold standard for quantifying and mitigating privacy concerns. Its combination with synthetic data, often referred to as private synthetic data, is presently the best available approach to ensuring data privacy. However, private synthetic data have not been studied in the educational domain.

What this study contributes

  • The study has applied synthetic data generation methods with DP mechanisms to educational data for the first time, provided a comprehensive report on the utility and privacy of the resulting synthetic data, and explored factors affecting the performance of synthetic data generators in the context of educational datasets.
  • The experimental results of this study indicate that no synthetic data generator consistently outperforms others across all evaluation metrics in the examined educational datasets. Instead, different generators excel in their respective areas of proficiency, such as privacy or utility (a simple per-metric utility check is sketched after this list).
  • Highlighting the potential of synthetic data generation techniques in the education sector, this work paves the way for future developments in the use of synthetic data generation for privacy-preserving educational research.

Implications for practice and/or policy

  • Key takeaways for practical application include the importance of conducting case-specific evaluations, carefully balancing data privacy with utility, and exercising caution when using private synthetic data generators for high-precision computational tasks, especially in resource-limited settings, as highlighted in this study.
  • Educational researchers and practitioners can leverage synthetic data to release data without compromising student privacy, thereby promoting the development of open science and contributing to the advancement of education research.
  • The robust privacy performance of DP-synthetic data generators may help alleviate students' privacy concerns while fostering their trust in sharing personal information.
  • By improving the transparency and security of data sharing, DP-synthetic data generation technologies can promote student-centred data governance practices while providing a strong technical foundation for developing responsible data usage policies.
Source journal

British Journal of Educational Technology (EDUCATION & EDUCATIONAL RESEARCH)
CiteScore: 15.60
Self-citation rate: 4.50%
Articles published: 111

Journal description: BJET is a primary source for academics and professionals in the fields of digital educational and training technology throughout the world. The Journal is published by Wiley on behalf of The British Educational Research Association (BERA). It publishes theoretical perspectives, methodological developments and high quality empirical research that demonstrates whether and how applications of instructional/educational technology systems, networks, tools and resources lead to improvements in formal and non-formal education at all levels, from early years through to higher, technical and vocational education, professional development and corporate training.