Code stylometry vs formatting and minification

IF 3.5 4区 计算机科学 Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE
Stefano Balla, Maurizio Gabbrielli, Stefano Zacchiroli
{"title":"Code stylometry vs formatting and minification","authors":"Stefano Balla, Maurizio Gabbrielli, Stefano Zacchiroli","doi":"10.7717/peerj-cs.2142","DOIUrl":null,"url":null,"abstract":"The automatic identification of code authors based on their programming styles—known as authorship attribution or code stylometry—has become possible in recent years thanks to improvements in machine learning-based techniques for author recognition. Once feasible at scale, code stylometry can be used for well-intended or malevolent activities, including: identifying the most expert coworker on a piece of code (if authorship information goes missing); fingerprinting open source developers to pitch them unsolicited job offers; de-anonymizing developers of illegal software to pursue them. Depending on their respective goals, stakeholders have an interest in making code stylometry either more or less effective. To inform these decisions we investigate how the accuracy of code stylometry is impacted by two common software development activities: code formatting and code minification. We perform code stylometry on Python code from the Google Code Jam dataset (59 authors) using a code2vec-based author classifier on concrete syntax tree (CST) representations of input source files. We conduct the experiment using both CSTs and ASTs (abstract syntax trees). We compare the respective classification accuracies on: (1) the original dataset, (2) the dataset formatted with Black, and (3) the dataset minified with Python Minifier. Our results show that: (1) CST-based stylometry performs better than AST-based (51.00%→68%), (2) code formatting makes a significant dent (15%) in code stylometry accuracy (68%→53%), with minification subtracting a further 3% (68%→50%). While the accuracy reduction is significant for both code formatting and minification, neither is enough to make developers non-recognizable via code stylometry.","PeriodicalId":54224,"journal":{"name":"PeerJ Computer Science","volume":"47 1","pages":""},"PeriodicalIF":3.5000,"publicationDate":"2024-09-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"PeerJ Computer Science","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.7717/peerj-cs.2142","RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
引用次数: 0

Abstract

The automatic identification of code authors based on their programming styles—known as authorship attribution or code stylometry—has become possible in recent years thanks to improvements in machine learning-based techniques for author recognition. Once feasible at scale, code stylometry can be used for well-intended or malevolent activities, including: identifying the most expert coworker on a piece of code (if authorship information goes missing); fingerprinting open source developers to pitch them unsolicited job offers; de-anonymizing developers of illegal software to pursue them. Depending on their respective goals, stakeholders have an interest in making code stylometry either more or less effective. To inform these decisions we investigate how the accuracy of code stylometry is impacted by two common software development activities: code formatting and code minification. We perform code stylometry on Python code from the Google Code Jam dataset (59 authors) using a code2vec-based author classifier on concrete syntax tree (CST) representations of input source files. We conduct the experiment using both CSTs and ASTs (abstract syntax trees). We compare the respective classification accuracies on: (1) the original dataset, (2) the dataset formatted with Black, and (3) the dataset minified with Python Minifier. Our results show that: (1) CST-based stylometry performs better than AST-based (51.00%→68%), (2) code formatting makes a significant dent (15%) in code stylometry accuracy (68%→53%), with minification subtracting a further 3% (68%→50%). While the accuracy reduction is significant for both code formatting and minification, neither is enough to make developers non-recognizable via code stylometry.
代码样式与格式化和最小化的比较
近年来,由于基于机器学习的作者识别技术不断进步,根据代码作者的编程风格对其进行自动识别(称为作者归属或代码风格测量)已成为可能。代码风格测量法一旦在规模上可行,就可用于善意或恶意的活动,包括:识别代码中最专业的同事(如果作者信息丢失);对开源开发人员进行指纹识别,以便向他们主动提供工作机会;对非法软件的开发人员进行去匿名化,以便对他们进行追捕。根据各自的目标,利益相关者都希望提高或降低代码风格测量的效率。为了给这些决策提供信息,我们研究了代码风格测量的准确性如何受到两种常见软件开发活动的影响:代码格式化和代码精简。我们使用基于 code2vec 的作者分类器,对输入源文件的具体语法树(CST)表示法,对来自 Google Code Jam 数据集(59 位作者)的 Python 代码进行了代码风格测量。我们同时使用 CST 和 AST(抽象语法树)进行实验。我们比较了各自的分类准确率:(1) 原始数据集;(2) 使用 Black 格式化的数据集;(3) 使用 Python Minifier 简化的数据集。我们的结果表明(1) 基于 CST 的文体测量法比基于 AST 的文体测量法表现更好(51.00%→68%),(2) 代码格式化使代码文体测量法的准确率大幅下降(15%)(68%→53%),而最小化又进一步降低了 3%(68%→50%)。虽然代码格式化和最小化的准确性都有显著下降,但都不足以使开发人员无法通过代码风格测量进行识别。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 求助全文
来源期刊
PeerJ Computer Science
PeerJ Computer Science Computer Science-General Computer Science
CiteScore
6.10
自引率
5.30%
发文量
332
审稿时长
10 weeks
期刊介绍: PeerJ Computer Science is the new open access journal covering all subject areas in computer science, with the backing of a prestigious advisory board and more than 300 academic editors.
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术官方微信