An Empirical Study of Sensitive Information in Logs
Roozbeh Aghili, Heng Li, Foutse Khomh
arXiv - CS - Software Engineering, 2024-09-17
DOI: arxiv-2409.11313
Abstract
Software logs, generated during the runtime of software systems, are
essential for various development and analysis activities, such as anomaly
detection and failure diagnosis. However, the presence of sensitive information
in these logs poses significant privacy concerns, particularly regarding
Personally Identifiable Information (PII) and quasi-identifiers that could lead
to re-identification risks. While general data privacy has been extensively
studied, the specific domain of privacy in software logs remains underexplored,
with inconsistent definitions of sensitivity and a lack of standardized
guidelines for anonymization. To address this gap, this study offers a
comprehensive analysis of privacy in software logs from multiple perspectives.
We start by performing an analysis of 25 publicly available log datasets to
identify potentially sensitive attributes. Based on the results of this step, we
focus on three perspectives: privacy regulations, research literature, and
industry practices. We first analyze key data privacy regulations, such as the
General Data Protection Regulation (GDPR) and the California Consumer Privacy
Act (CCPA), to understand the legal requirements concerning sensitive
information in logs. Second, we conduct a systematic literature review to
identify common privacy attributes and practices in log anonymization,
revealing gaps in existing approaches. Finally, we survey 45 industry
professionals to capture practical insights on log anonymization practices. Our
findings shed light on various perspectives of log privacy and reveal industry
challenges, such as technical and efficiency issues, while highlighting the need
for standardized guidelines. By combining insights from regulatory, academic,
and industry perspectives, our study aims to provide a clearer framework for
identifying and protecting sensitive information in software logs.