Content-Based Textual File Type Detection at Scale

Francesca Del Bonifro, M. Gabbrielli, Stefano Zacchiroli
Published in: 2021 13th International Conference on Machine Learning and Computing
Publication date: 2021-01-21
DOI: 10.1145/3457682.3457756 (https://doi.org/10.1145/3457682.3457756)
Citations: 2

Abstract

Programming language detection is a common need in the analysis of large source code bases. It is supported by a number of existing tools that rely on several features, most notably file extensions, to determine file types. We consider the problem of accurately detecting the type of files commonly found in software code bases, based solely on textual file content. Doing so is helpful to classify source code that lacks file extensions (e.g., code snippets posted on the Web or executable scripts), to avoid misclassifying source code that has been recorded with wrong or uncommon file extensions, and also to shed some light on the intrinsic recognizability of source code files. We propose a simple model that (a) uses a language-agnostic word tokenizer for textual files, (b) groups tokens in 1-/2-grams, (c) builds feature vectors based on N-gram frequencies, and (d) uses a simple fully connected neural network as classifier. As a training set we use textual files extracted from GitHub repositories with at least 1000 stars, using existing file extensions as ground truth. Despite its simplicity, the proposed model reaches ≈ 85% in our experiments for a relatively high number of recognized classes (more than 130 file types).
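
As an illustration only, steps (a) to (c) of the model described above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the authors' implementation: the regular-expression tokenizer, the vocabulary built from the most frequent N-grams, and the names `tokenize`, `ngrams`, `build_vocabulary` and `feature_vector` are all hypothetical. Step (d), the fully connected neural network, is not shown; it would take these vectors as input.

```python
import re
from collections import Counter

# Assumption: a token is a run of word characters or a single
# non-space symbol. The paper's actual tokenizer may differ.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def tokenize(text):
    """Split text into tokens without any language-specific rules."""
    return TOKEN_RE.findall(text)


def ngrams(tokens, max_n=2):
    """Return all 1-grams up to max_n-grams as tuples of tokens."""
    grams = []
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(tuple(tokens[i:i + n]))
    return grams


def build_vocabulary(documents, size):
    """Keep the `size` most frequent N-grams across all documents (assumed policy)."""
    counts = Counter()
    for text in documents:
        counts.update(ngrams(tokenize(text)))
    return [gram for gram, _ in counts.most_common(size)]


def feature_vector(text, vocabulary):
    """Relative frequency of each vocabulary N-gram in one file."""
    counts = Counter(ngrams(tokenize(text)))
    total = sum(counts.values()) or 1  # avoid division by zero on empty files
    return [counts[gram] / total for gram in vocabulary]


# Toy usage with two hypothetical file contents.
docs = ["def f(x): return x", "int main() { return 0; }"]
vocab = build_vocabulary(docs, size=50)
vec = feature_vector(docs[0], vocab)
```

Each file becomes a fixed-length vector regardless of its size, which is what allows a plain fully connected classifier to be used without any language-specific parsing.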