An Empirical Study of Sensitive Information in Logs

arXiv - CS - Software Engineering. Pub Date: 2024-09-17. DOI: arxiv-2409.11313

Roozbeh Aghili, Heng Li, Foutse Khomh



Abstract

Software logs, generated at runtime by software systems, are essential for development and analysis activities such as anomaly detection and failure diagnosis. However, sensitive information in these logs raises significant privacy concerns, particularly Personally Identifiable Information (PII) and quasi-identifiers that could enable re-identification. While general data privacy has been studied extensively, privacy in software logs remains underexplored, with inconsistent definitions of sensitivity and no standardized guidelines for anonymization. To address this gap, this study analyzes privacy in software logs from multiple perspectives. We start by analyzing 25 publicly available log datasets to identify potentially sensitive attributes. Based on the results of this step, we focus on three perspectives: privacy regulations, research literature, and industry practices. First, we analyze key data privacy regulations, such as the General Data Protection Regulation (GDPR) and the California Consumer Privacy Act (CCPA), to understand the legal requirements concerning sensitive information in logs. Second, we conduct a systematic literature review to identify common privacy attributes and practices in log anonymization, revealing gaps in existing approaches. Finally, we survey 45 industry professionals to capture practical insights into log anonymization practices. Our findings shed light on these perspectives on log privacy and reveal industry challenges, such as technical and efficiency issues, while highlighting the need for standardized guidelines. By combining insights from regulatory, academic, and industry perspectives, our study aims to provide a clearer framework for identifying and protecting sensitive information in software logs.


