Processing Large Datasets of Fined Grained Source Code Changes

2019 IEEE International Conference on Software Maintenance and Evolution (ICSME), Pub Date: 2019-09-01, DOI: 10.1109/ICSME.2019.00064

S. Levin, A. Yehudai


Abstract

In the era of Big Code, when researchers seek to study an increasingly large number of repositories to support their findings, the data processing stage may require manipulating millions of records or more. In this work we focus on studies involving fine-grained, AST-level source code changes. We present how we extended the CodeDistillery source code mining framework with data manipulation capabilities aimed at easing the processing of large datasets of fine-grained source code changes. The capabilities we have introduced allow researchers to automate much of their repository mining process and to streamline the data acquisition and processing phases. These capabilities have been successfully used to conduct a number of studies, in the course of which tens of millions of fine-grained source code changes have been processed.
