Code stylometry vs formatting and minification

Impact factor 3.5; Zone 4, Computer Science; JCR Q2, Computer Science, Artificial Intelligence

PeerJ Computer Science Pub Date : 2024-09-06 DOI:10.7717/peerj-cs.2142

Stefano Balla, Maurizio Gabbrielli, Stefano Zacchiroli

Abstract

The automatic identification of code authors based on their programming styles, known as authorship attribution or code stylometry, has become possible in recent years thanks to improvements in machine learning-based techniques for author recognition. Once feasible at scale, code stylometry can be used for well-intended or malevolent activities, including: identifying the most expert coworker on a piece of code (if authorship information goes missing); fingerprinting open source developers to pitch them unsolicited job offers; de-anonymizing developers of illegal software to pursue them. Depending on their respective goals, stakeholders have an interest in making code stylometry either more or less effective. To inform these decisions we investigate how the accuracy of code stylometry is impacted by two common software development activities: code formatting and code minification. We perform code stylometry on Python code from the Google Code Jam dataset (59 authors) using a code2vec-based author classifier on concrete syntax tree (CST) representations of input source files. We conduct the experiment using both CSTs and ASTs (abstract syntax trees). We compare the respective classification accuracies on: (1) the original dataset, (2) the dataset formatted with Black, and (3) the dataset minified with Python Minifier. Our results show that: (1) CST-based stylometry performs better than AST-based stylometry (68% vs. 51% accuracy); (2) code formatting makes a significant dent of 15 percentage points in code stylometry accuracy (68%→53%); and (3) minification subtracts a further 3 percentage points, for a total drop of 18 points (68%→50%). While the accuracy reduction is significant for both code formatting and minification, neither is enough to make developers non-recognizable via code stylometry.
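To make the CST/AST distinction in the abstract concrete, here is a minimal sketch using only the Python standard library. It is not the paper's code2vec pipeline: the `tokenize` module is used as a rough stand-in for the layout detail a CST retains, and the two snippets and the function names `ast_fingerprint` and `layout_fingerprint` are hypothetical, chosen purely for illustration. The point is that two layouts of the same logic collapse to the same AST while remaining distinguishable at the token level, which is the kind of signal a formatter such as Black would be expected to erase.

```python
import ast
import io
import tokenize

# Two hypothetical snippets: same logic, different layout habits.
STYLE_A = "def add(a,b):\n    return a+b\n"
STYLE_B = "def add( a, b ):\n\n    return (a + b)\n"


def ast_fingerprint(source: str) -> str:
    """Dump the abstract syntax tree.

    Whitespace, blank lines and redundant parentheses are not
    represented in the AST, so they cannot influence this value.
    """
    return ast.dump(ast.parse(source))


def layout_fingerprint(source: str) -> list:
    """List (token text, start position) pairs.

    This is only a crude proxy for a concrete syntax tree: it keeps
    every token and where it sits in the file.
    """
    tokens = tokenize.generate_tokens(io.StringIO(source).readline)
    return [(tok.string, tok.start) for tok in tokens]


same_ast = ast_fingerprint(STYLE_A) == ast_fingerprint(STYLE_B)
same_layout = layout_fingerprint(STYLE_A) == layout_fingerprint(STYLE_B)
print(same_ast, same_layout)  # prints: True False
```

Under these assumptions, an AST-based classifier sees the two authors' files as identical, whereas a representation that preserves concrete syntax still has something to tell them apart. Running both files through the same formatter would shrink that difference, which is consistent with the accuracy drop the abstract reports.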

Source journal: PeerJ Computer Science (Computer Science, General Computer Science)

CiteScore: 6.10
Self-citation rate: 5.30%
Articles published: 332
Review time: 10 weeks

Journal description: PeerJ Computer Science is the new open access journal covering all subject areas in computer science, with the backing of a prestigious advisory board and more than 300 academic editors.