Syntax and Stack Overflow: A Methodology for Extracting a Corpus of Syntax Errors and Fixes

2019 IEEE International Conference on Software Maintenance and Evolution (ICSME). Pub Date: 2019-07-17. DOI: 10.1109/ICSME.2019.00048

A. W. Wong, Amir Salimi, S. Chowdhury, Abram Hindle


Citations: 8

Abstract



Syntax and Stack Overflow: A Methodology for Extracting a Corpus of Syntax Errors and Fixes

One problem when studying how to find and fix syntax errors is how to obtain natural and representative examples of syntax errors. Most syntax error datasets are not free, open, and public, or they are extracted from novice programmers and do not represent the syntax errors that the general population of developers would make. Programmers of all skill levels post questions and answers to Stack Overflow, which may contain snippets of source code along with corresponding text and tags. Many snippets do not parse, so they are ripe for forming a corpus of syntax errors and corrections. Our primary contribution is an approach for extracting natural syntax errors and their corresponding human-made fixes to help syntax error research. A Python abstract syntax tree parser is used to determine preliminary errors and corrections on code blocks extracted from the SOTorrent dataset. We further analyzed the code by executing the corrections in a Python interpreter. We applied our methodology to produce a public dataset of 62,965 Python Stack Overflow code snippets with corresponding tags, errors, and stack traces. We found that errors made by Stack Overflow users do not match errors made by student developers or random mutations, implying there is a serious representativeness risk within the field. Finally, we share our dataset openly so that future researchers can reuse and extend our syntax errors and fixes.
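The parsing step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that a snippet counts as a syntax error when Python's standard `ast.parse` rejects it, and that an error/fix pair is an earlier version of a code block that fails to parse followed by a later version that parses. The helper names `syntax_error_of` and `is_error_fix_pair` and the sample snippets are hypothetical, and the paper's actual pipeline over SOTorrent may differ in Python version and filtering.

```python
import ast
from typing import Optional, Tuple


def syntax_error_of(snippet: str) -> Optional[Tuple[str, Optional[int]]]:
    """Return (message, line number) if the snippet fails to parse, else None."""
    try:
        ast.parse(snippet)
    except SyntaxError as err:
        # IndentationError and TabError are subclasses of SyntaxError.
        return (err.msg, err.lineno)
    except ValueError as err:
        # Some Python versions raise ValueError for source with null bytes.
        return (str(err), None)
    return None


def is_error_fix_pair(before: str, after: str) -> bool:
    """True when the earlier version fails to parse and the later one parses."""
    return syntax_error_of(before) is not None and syntax_error_of(after) is None


# Hypothetical before/after versions of one code block (missing colon, then fixed).
broken = "def greet(name)\n    print('hi', name)\n"
fixed = "def greet(name):\n    print('hi', name)\n"

print(syntax_error_of(broken))           # message text varies by Python version
print(is_error_fix_pair(broken, fixed))  # True
```

A successful parse only shows that the code is syntactically valid. The abstract describes a separate step that executes the corrections in a Python interpreter, which this sketch does not attempt.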
