Auditing and Generating Synthetic Data With Controllable Trust Trade-Offs

IF 3.8 · CAS Zone 2 (Engineering & Technology) · JCR Q2 (Engineering, Electrical & Electronic)

IEEE Journal on Emerging and Selected Topics in Circuits and Systems · Pub Date: 2024-10-10 · DOI: 10.1109/JETCAS.2024.3477976

Brian Belgodere;Pierre Dognin;Adam Ivankay;Igor Melnyk;Youssef Mroueh;Aleksandra Mojsilović;Jiri Navratil;Apoorva Nitsure;Inkit Padhi;Mattia Rigotti;Jerret Ross;Yair Schiff;Radhika Vedpathak;Richard A. Young

Volume 14, Issue 4, pp. 773-788 · Full text: https://ieeexplore.ieee.org/document/10713321/

Citations: 0

Abstract


Auditing and Generating Synthetic Data With Controllable Trust Trade-Offs

Real-world data often exhibits bias, imbalance, and privacy risks. Synthetic datasets have emerged to address these issues by enabling a paradigm that relies on generative AI models to generate unbiased, privacy-preserving data while maintaining fidelity to the original data. However, assessing the trustworthiness of synthetic datasets and models is a critical challenge. We introduce a holistic auditing framework that comprehensively evaluates synthetic datasets and AI models. It focuses on preventing bias and discrimination, ensuring fidelity to the source data, and assessing utility, robustness, and privacy preservation. We demonstrate our framework’s effectiveness by auditing various generative models across diverse use cases like education, healthcare, banking, and human resources, spanning different data modalities such as tabular, time-series, vision, and natural language. This holistic assessment is essential for compliance with regulatory safeguards. We introduce a trustworthiness index to rank synthetic datasets based on their safeguards trade-offs. Furthermore, we present a trustworthiness-driven model selection and cross-validation process during training, exemplified with “TrustFormers” across various data types. This approach allows for controllable trustworthiness trade-offs in synthetic data creation. Our auditing framework fosters collaboration among stakeholders, including data scientists, governance experts, internal reviewers, external certifiers, and regulators. This transparent reporting should become a standard practice to prevent bias, discrimination, and privacy violations, ensuring compliance with policies and providing accountability, safety, and performance guarantees.
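The abstract describes ranking synthetic datasets by a trustworthiness index that reflects trade-offs between safeguards, but it does not give the index's formula. The sketch below is only an illustration of the general idea, assuming a plain weighted average over per-dimension audit scores in [0, 1]; the function names, dataset names, scores, and weights are all hypothetical and are not taken from the paper.

```python
from typing import Dict, List, Tuple

# Hypothetical audit result: one score in [0, 1] per trust dimension,
# higher is better. The dimension names follow the abstract; the numbers
# used further down are invented for illustration.
AuditScores = Dict[str, float]


def trust_index(scores: AuditScores, weights: Dict[str, float]) -> float:
    """Weighted average of per-dimension scores (an assumed aggregation)."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(weights[dim] * scores[dim] for dim in weights) / total


def rank_datasets(
    audits: Dict[str, AuditScores], weights: Dict[str, float]
) -> List[Tuple[str, float]]:
    """Return (dataset name, index) pairs sorted from most to least trusted."""
    scored = [(name, trust_index(s, weights)) for name, s in audits.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


audits = {
    "synthetic_a": {"fidelity": 0.9, "privacy": 0.5, "utility": 0.8,
                    "fairness": 0.6, "robustness": 0.7},
    "synthetic_b": {"fidelity": 0.7, "privacy": 0.9, "utility": 0.7,
                    "fairness": 0.8, "robustness": 0.7},
}

# Two hypothetical stakeholder profiles that weight the safeguards differently.
privacy_first = {"fidelity": 1, "privacy": 3, "utility": 1,
                 "fairness": 1, "robustness": 1}
fidelity_first = {"fidelity": 3, "privacy": 1, "utility": 1,
                  "fairness": 1, "robustness": 1}

for label, weights in [("privacy-first", privacy_first),
                       ("fidelity-first", fidelity_first)]:
    print(label, rank_datasets(audits, weights))
```

With these made-up numbers the two weight profiles put a different dataset on top, which is the kind of controllable trade-off the abstract refers to: the same audit results can yield different rankings depending on which safeguards a stakeholder prioritizes.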


Source journal

IEEE Journal on Emerging and Selected Topics in Circuits and Systems (Engineering, Electrical & Electronic)

CiteScore

8.50

Self-citation rate

2.20%

Articles published

About the journal: The IEEE Journal on Emerging and Selected Topics in Circuits and Systems is published quarterly and solicits, with particular emphasis on emerging areas, special issues on topics that cover the entire scope of the IEEE Circuits and Systems (CAS) Society, namely the theory, analysis, design, tools, and implementation of circuits and systems, spanning their theoretical foundations, applications, and architectures for signal and information processing.