{"title":"Catch Me if You Can: Detecting Unauthorized Data Use in Deep Learning Models","authors":"Zitao Chen, Karthik Pattabiraman","doi":"arxiv-2409.06280","DOIUrl":null,"url":null,"abstract":"The rise of deep learning (DL) has led to a surging demand for training data,\nwhich incentivizes the creators of DL models to trawl through the Internet for\ntraining materials. Meanwhile, users often have limited control over whether\ntheir data (e.g., facial images) are used to train DL models without their\nconsent, which has engendered pressing concerns. This work proposes MembershipTracker, a practical data provenance tool that\ncan empower ordinary users to take agency in detecting the unauthorized use of\ntheir data in training DL models. We view tracing data provenance through the\nlens of membership inference (MI). MembershipTracker consists of a lightweight\ndata marking component to mark the target data with small and targeted changes,\nwhich can be strongly memorized by the model trained on them; and a specialized\nMI-based verification process to audit whether the model exhibits strong\nmemorization on the target samples. Overall, MembershipTracker only requires the users to mark a small fraction\nof data (0.005% to 0.1% in proportion to the training set), and it enables the\nusers to reliably detect the unauthorized use of their data (average 0%\nFPR@100% TPR). We show that MembershipTracker is highly effective across\nvarious settings, including industry-scale training on the full-size\nImageNet-1k dataset. 
We finally evaluate MembershipTracker under multiple\nclasses of countermeasures.","PeriodicalId":501332,"journal":{"name":"arXiv - CS - Cryptography and Security","volume":"7 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-09-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Cryptography and Security","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.06280","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 0
Abstract
The rise of deep learning (DL) has led to a surging demand for training data, which incentivizes the creators of DL models to trawl the Internet for training material. Meanwhile, users often have little control over whether their data (e.g., facial images) are used to train DL models without their consent, which has raised pressing concerns.

This work proposes MembershipTracker, a practical data provenance tool that empowers ordinary users to detect the unauthorized use of their data in training DL models. We view tracing data provenance through the lens of membership inference (MI). MembershipTracker consists of a lightweight data marking component that marks the target data with small, targeted changes, which are strongly memorized by any model trained on them; and a specialized MI-based verification process that audits whether the model exhibits strong memorization of the target samples.

Overall, MembershipTracker requires users to mark only a small fraction of data (0.005% to 0.1% of the training set), and it enables them to reliably detect the unauthorized use of their data (average 0% FPR@100% TPR). We show that MembershipTracker is highly effective across various settings, including industry-scale training on the full-size ImageNet-1k dataset. Finally, we evaluate MembershipTracker under multiple classes of countermeasures.
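The abstract's two-component pipeline (marking target data so a trained model memorizes it, then auditing memorization with an MI-style test) can be illustrated with a minimal sketch. This is not the paper's actual method: the marking function, perturbation strength, and loss-ranking score below are hypothetical stand-ins for the real design, which the abstract does not specify.

```python
import math
import random

random.seed(0)

def mark_samples(samples, strength=0.02):
    """Hypothetical marking step: add one small, fixed random pattern
    to every sample the user owns (stand-in for the paper's 'small and
    targeted changes')."""
    dim = len(samples[0])
    pattern = [random.gauss(0, 1) for _ in range(dim)]
    norm = math.sqrt(sum(p * p for p in pattern))
    pattern = [p / norm for p in pattern]
    marked = [[x + strength * p for x, p in zip(s, pattern)]
              for s in samples]
    return marked, pattern

def membership_score(loss_fn, marked, references):
    """MI-style audit: a model that memorized the marked samples should
    assign them lower loss than typical non-member reference samples.
    Returns the fraction of reference losses exceeding the average
    loss on the marked samples (closer to 1.0 = stronger evidence)."""
    target = sum(loss_fn(s) for s in marked) / len(marked)
    return sum(1 for r in references if loss_fn(r) > target) / len(references)
```

In use, a score near 1.0 would suggest the suspect model was trained on the marked data, while a score near 0.5 is consistent with non-membership; the paper's reported 0% FPR@100% TPR corresponds to these two cases being perfectly separable in its evaluation.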