Splitting, Renaming, Removing: A Study of Common Cleaning Activities in Jupyter Notebooks

2021 36th IEEE/ACM International Conference on Automated Software Engineering Workshops (ASEW) Pub Date : 2021-11-01 DOI:10.1109/ASEW52652.2021.00032

Helen Dong, Shurui Zhou, Jin L. C. Guo, Christian Kästner


Citations: 6

Abstract

Data scientists commonly use computational notebooks because they provide a good environment for testing multiple models. However, once the scientist completes the code and finds the ideal model, he or she will have to dedicate time to clean up the code in order for others to easily understand it. In this paper, we perform a qualitative study on how scientists clean their code in hopes of being able to suggest a tool to automate this process. Our end goal is for tool builders to address possible gaps and provide additional aid to data scientists, who then can focus more on their actual work rather than the routine and tedious cleaning work. By sampling notebooks from GitHub and analyzing changes between subsequent commits, we identified common cleaning activities, such as changes to markdown (e.g., adding header sections or descriptions) or comments (both deleting dead code and adding descriptions) as well as reordering cells. We also find that common cleaning activities differ depending on the intended purpose of the notebook. Our results provide a valuable foundation for tool builders and notebook users, as many identified cleaning activities could benefit from codification of best practices and dedicated tool support, possibly tailored depending on intended use.
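To make the idea of "analyzing changes between subsequent commits" concrete, the following is a minimal sketch of how two versions of a notebook could be compared at the cell level. It is an illustration only and not the authors' method: the abstract describes a qualitative study, and nothing here is taken from the paper's tooling. The function names `cell_key` and `summarize_changes` and the change categories are hypothetical; the sketch assumes only that each version has been parsed from `.ipynb` JSON into a dict with a top-level `cells` list whose entries carry `cell_type` and `source`.

```python
import difflib


def cell_key(cell):
    """Return a hashable (cell_type, source) pair for one notebook cell."""
    source = cell.get("source", "")
    if isinstance(source, list):  # .ipynb files may store source as a list of lines
        source = "".join(source)
    return (cell.get("cell_type", "code"), source)


def summarize_changes(old_nb, new_nb):
    """Roughly classify cell-level changes between two notebook versions.

    Hypothetical helper for illustration: the categories loosely echo the
    activities named in the abstract (added markdown, deleted code,
    reordered cells) but are not the paper's coding scheme.
    """
    old = [cell_key(c) for c in old_nb.get("cells", [])]
    new = [cell_key(c) for c in new_nb.get("cells", [])]

    matcher = difflib.SequenceMatcher(a=old, b=new, autojunk=False)
    removed, added = [], []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag in ("delete", "replace"):
            removed.extend(old[i1:i2])
        if tag in ("insert", "replace"):
            added.extend(new[j1:j2])

    # A cell that disappears from one position and reappears unchanged
    # elsewhere is treated as moved rather than deleted and re-added.
    moved = [c for c in removed if c in added]
    return {
        "added_markdown": [c for c in added if c[0] == "markdown" and c not in moved],
        "removed_code": [c for c in removed if c[0] == "code" and c not in moved],
        "moved": moved,
    }


# Two tiny in-memory "notebook versions" standing in for consecutive commits.
before = {"cells": [
    {"cell_type": "code", "source": "import pandas as pd"},
    {"cell_type": "code", "source": ["df = pd.read_csv('data.csv')\n", "df.head()"]},
    {"cell_type": "code", "source": "# old_model.fit(X, y)"},
]}
after = {"cells": [
    {"cell_type": "markdown", "source": "# Load data"},
    {"cell_type": "code", "source": ["df = pd.read_csv('data.csv')\n", "df.head()"]},
    {"cell_type": "code", "source": "import pandas as pd"},
]}

summary = summarize_changes(before, after)
for category, cells in summary.items():
    print(category, len(cells))
```

In practice the two dicts would come from `json.load` applied to the notebook file as it exists at two consecutive commits. An exact-match comparison like this cannot recognize a cell that was both moved and edited, so any real analysis would still need manual inspection of the kind a qualitative study relies on.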
