The sampling threat when mining generalizable inter-library usage patterns

IF 1.4, Region 4 (Computer Science), JCR Q3 (COMPUTER SCIENCE, SOFTWARE ENGINEERING)

Science of Computer Programming, Pub Date: 2025-09-27, DOI: 10.1016/j.scico.2025.103393

Yunior Pacheco Correa, Coen De Roover, Johannes Härtel

Science of Computer Programming, volume 248, Article 103393. Journal Article. https://www.sciencedirect.com/science/article/pii/S0167642325001327

Citations: 0

Abstract

Tool support in software engineering often relies on relationships, regularities, patterns, or rules mined from other users’ code. Examples include approaches to bug prediction, code recommendation, and code autocompletion. Mining is typically performed on samples of code rather than the entirety of available software projects. While sampling is crucial for scaling data analysis, it can affect the generalization of the mined patterns.

This paper focuses on sampling software projects filtered for specific libraries and frameworks, and on mining patterns that connect different libraries. We call these inter-library patterns. We observe that limiting the sample to a specific library may hinder the generalization of inter-library patterns, posing a threat to their use or interpretation. Using a simulation and a real case study, we show this threat for different sampling methods. Our simulation shows that the implication of a pattern generalizes well only when sampling for the disjunction of both libraries involved in that implication. Additionally, we show that real empirical data sampled using the GitHub search API does not behave as expected from our simulation. This identifies a potential threat relevant for many studies that use the GitHub search API for studying inter-library patterns.
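The effect described in the abstract can be illustrated with a small sketch. This is not the authors' simulation: the synthetic population, the probabilities, the library names "A" and "B", and the choice of rule confidence as the measure are all assumptions made for illustration only. The sketch compares the confidence of the implications A ⇒ B and B ⇒ A on the whole population with the confidence measured on samples filtered for A, for B, and for the disjunction "A or B".

```python
import random


def confidence(projects, antecedent, consequent):
    """Share of projects using `antecedent` that also use `consequent`."""
    with_antecedent = [p for p in projects if antecedent in p]
    if not with_antecedent:
        return float("nan")
    return sum(consequent in p for p in with_antecedent) / len(with_antecedent)


def make_population(n, p_a, p_b_given_a, p_b_given_not_a, seed=0):
    """Build a hypothetical population; each project is a set of library names."""
    rng = random.Random(seed)
    population = []
    for _ in range(n):
        libs = set()
        if rng.random() < p_a:
            libs.add("A")
        p_b = p_b_given_a if "A" in libs else p_b_given_not_a
        if rng.random() < p_b:
            libs.add("B")
        population.append(frozenset(libs))
    return population


# Assumed parameters, chosen only to make the effect visible.
population = make_population(10_000, p_a=0.2, p_b_given_a=0.6, p_b_given_not_a=0.1)

# Each filter mimics searching for projects that mention certain libraries.
samples = {
    "all projects": population,
    "filter: uses A": [p for p in population if "A" in p],
    "filter: uses B": [p for p in population if "B" in p],
    "filter: uses A or B": [p for p in population if p & {"A", "B"}],
}

# For every sample, record (confidence of A => B, confidence of B => A).
results = {
    name: (confidence(s, "A", "B"), confidence(s, "B", "A"))
    for name, s in samples.items()
}

for name, (a_to_b, b_to_a) in results.items():
    print(f"{name:22s} conf(A=>B)={a_to_b:.3f}  conf(B=>A)={b_to_a:.3f}")
```

In this sketch, each single-library filter forces one direction of the implication to a confidence of 1.0, because every sampled project contains the filtered library by construction. The disjunction filter keeps every project that contains A or B, so both confidences match the population values. Other measures, such as support, depend on the sample size and may still differ from the population under any of these filters.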




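Source journal

Science of Computer Programming (Engineering and Technology, Computer Science: Software Engineering)

CiteScore: 3.80

Self-citation rate: 0.00%

Articles published:

Review time: 67 days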

Journal description: Science of Computer Programming is dedicated to the distribution of research results in the areas of software systems development, use and maintenance, including the software aspects of hardware design. The journal has a wide scope ranging from the many facets of methodological foundations to the details of technical issues and the aspects of industrial practice. The subjects of interest to SCP cover the entire spectrum of methods for the entire life cycle of software systems, including:

• Requirements, specification, design, validation, verification, coding, testing, maintenance, metrics and renovation of software;
• Design, implementation and evaluation of programming languages;
• Programming environments, development tools, visualisation and animation;
• Management of the development process;
• Human factors in software, software for social interaction, software for social computing;
• Cyber physical systems, and software for the interaction between the physical and the machine;
• Software aspects of infrastructure services, system administration, and network management.