Content-Based Textual File Type Detection at Scale

2021 13th International Conference on Machine Learning and Computing. Pub Date: 2021-01-21. DOI: 10.1145/3457682.3457756

Francesca Del Bonifro, M. Gabbrielli, Stefano Zacchiroli

Citation count: 2

Abstract

Programming language detection is a common need in the analysis of large source code bases. A number of existing tools support it, relying on several features, most notably file extensions, to determine file types. We consider the problem of accurately detecting the type of files commonly found in software code bases, based solely on textual file content. Doing so helps to classify source code that lacks file extensions (e.g., code snippets posted on the Web or executable scripts), to avoid misclassifying source code that has been recorded with wrong or uncommon file extensions, and to shed some light on the intrinsic recognizability of source code files. We propose a simple model that (a) uses a language-agnostic word tokenizer for textual files, (b) groups tokens into 1-/2-grams, (c) builds feature vectors based on N-gram frequencies, and (d) uses a simple fully connected neural network as classifier. As a training set we use textual files extracted from GitHub repositories with at least 1000 stars, using existing file extensions as ground truth. Despite its simplicity, the proposed model reaches ≈ 85% accuracy in our experiments for a relatively high number of recognized classes (more than 130 file types).
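Steps (a) to (c) of the model can be illustrated with a minimal Python sketch. It is not the authors' implementation: the regular expression used as tokenizer, the function names, and the normalization of counts to relative frequencies are all assumptions made for illustration, and the classifier of step (d) is left out.

```python
import re
from collections import Counter

# Assumption: a token is a run of word characters or a single
# punctuation character. The paper's actual tokenizer may differ.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def tokenize(text):
    """Split text into tokens without any language-specific rules."""
    return TOKEN_RE.findall(text)


def ngrams(tokens, n):
    """Return all contiguous n-grams of the token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_counts(text):
    """Count the 1-grams and 2-grams that occur in the text."""
    tokens = tokenize(text)
    return Counter(ngrams(tokens, 1) + ngrams(tokens, 2))


def build_vocabulary(texts, max_size):
    """Keep the max_size most frequent n-grams across the given texts."""
    total = Counter()
    for text in texts:
        total.update(ngram_counts(text))
    return [gram for gram, _ in total.most_common(max_size)]


def feature_vector(text, vocabulary):
    """Map text to relative n-gram frequencies, one entry per vocabulary item."""
    counts = ngram_counts(text)
    total = sum(counts.values()) or 1  # avoid division by zero on empty input
    return [counts[gram] / total for gram in vocabulary]
```

In a setup like the one the abstract describes, a vector of this kind would be the input to the fully connected classifier, with the file extension serving as the training label.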
