{"title":"A historical dataset for the Gnome ecosystem","authors":"M. Goeminne, Maëlick Claes, T. Mens","doi":"10.1109/MSR.2013.6624032","DOIUrl":"https://doi.org/10.1109/MSR.2013.6624032","url":null,"abstract":"We present a dataset of the open source software ecosystem Gnome from a social point of view. We have collected historical data about the contributors to all Gnome projects stored on git.gnome.org, taking into account the problem of identity matching, and associating different activity types to the contributors. This type of information is very useful to complement the traditional, source-code related information one can obtain by mining and analyzing the actual source code. The dataset can be obtained at https://bitbucket.org/mgoeminne/sgl-flossmetric-dbmerge.","PeriodicalId":325271,"journal":{"name":"2013 10th Working Conference on Mining Software Repositories (MSR)","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-05-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129181869","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Will my patch make it? And how fast? Case study on the Linux kernel","authors":"Yujuan Jiang, Bram Adams, D. Germán","doi":"10.1109/MSR.2013.6624016","DOIUrl":"https://doi.org/10.1109/MSR.2013.6624016","url":null,"abstract":"The Linux kernel follows an extremely distributed reviewing and integration process supported by 130 developer mailing lists and a hierarchy of dozens of Git repositories for version control. Since not every patch can make it and of those that do, some patches require a lot more reviewing and integration effort than others, developers, reviewers and integrators need support for estimating which patches are worthwhile to spend effort on and which ones do not stand a chance. This paper crosslinks and analyzes eight years of patch reviews from the kernel mailing lists and committed patches from the Git repository to understand which patches are accepted and how long it takes those patches to get to the end user. We found that 33% of the patches makes it into a Linux release, and that most of them need 3 to 6 months for this. Furthermore, that patches developed by more experienced developers are more easily accepted and faster reviewed and integrated. Additionally, reviewing time is impacted by submission time, the number of affected subsystems by the patch and the number of requested reviewers.","PeriodicalId":325271,"journal":{"name":"2013 10th Working Conference on Mining Software Repositories (MSR)","volume":"13 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-05-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132414188","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Jue Wang, Yingnong Dang, Hongyu Zhang, Kai Chen, Tao Xie, D. Zhang
{"title":"Mining succinct and high-coverage API usage patterns from source code","authors":"Jue Wang, Yingnong Dang, Hongyu Zhang, Kai Chen, Tao Xie, D. Zhang","doi":"10.1109/MSR.2013.6624045","DOIUrl":"https://doi.org/10.1109/MSR.2013.6624045","url":null,"abstract":"During software development, a developer often needs to discover specific usage patterns of Application Programming Interface (API) methods. However, these usage patterns are often not well documented. To help developers to get such usage patterns, there are approaches proposed to mine client code of the API methods. However, they lack metrics to measure the quality of the mined usage patterns, and the API usage patterns mined by the existing approaches tend to be many and redundant, posing significant barriers for being practical adoption. To address these issues, in this paper, we propose two quality metrics (succinctness and coverage) for mined usage patterns, and further propose a novel approach called Usage Pattern Miner (UP-Miner) that mines succinct and high-coverage usage patterns of API methods from source code. We have evaluated our approach on a large-scale Microsoft codebase. The results show that our approach is effective and outperforms an existing representative approach MAPO. The user studies conducted with Microsoft developers confirm the usefulness of the proposed approach in practice.","PeriodicalId":325271,"journal":{"name":"2013 10th Working Conference on Mining Software Repositories (MSR)","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-05-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122205225","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"The GHTorent dataset and tool suite","authors":"Georgios Gousios","doi":"10.1109/MSR.2013.6624034","DOIUrl":"https://doi.org/10.1109/MSR.2013.6624034","url":null,"abstract":"During the last few years, GitHub has emerged as a popular project hosting, mirroring and collaboration platform. GitHub provides an extensive REST API, which enables researchers to retrieve high-quality, interconnected data. The GHTorent project has been collecting data for all public projects available on Github for more than a year. In this paper, we present the dataset details and construction process and outline the challenges and research opportunities emerging from it.","PeriodicalId":325271,"journal":{"name":"2013 10th Working Conference on Mining Software Repositories (MSR)","volume":"68 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-05-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130319727","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
H. Hemmati, Sarah Nadi, Olga Baysal, Oleksii Kononenko, Wei Wang, Reid Holmes, Michael W. Godfrey
{"title":"The MSR Cookbook: Mining a decade of research","authors":"H. Hemmati, Sarah Nadi, Olga Baysal, Oleksii Kononenko, Wei Wang, Reid Holmes, Michael W. Godfrey","doi":"10.1109/MSR.2013.6624048","DOIUrl":"https://doi.org/10.1109/MSR.2013.6624048","url":null,"abstract":"The Mining Software Repositories (MSR) research community has grown significantly since the first MSR workshop was held in 2004. As the community continues to broaden its scope and deepens its expertise, it is worthwhile to reflect on the best practices that our community has developed over the past decade of research. We identify these best practices by surveying past MSR conferences and workshops. To that end, we review all 117 full papers published in the MSR proceedings between 2004 and 2012. We extract 268 comments from these papers, and categorize them using a grounded theory methodology. From this evaluation, four high-level themes were identified: data acquisition and preparation, synthesis, analysis, and sharing/replication. Within each theme we identify several common recommendations, and also examine how these recommendations have evolved over the past decade. In an effort to make this survey a living artifact, we also provide a public forum that contains the extracted recommendations in the hopes that the MSR community can engage in a continuing discussion on our evolving best practices.","PeriodicalId":325271,"journal":{"name":"2013 10th Working Conference on Mining Software Repositories (MSR)","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-05-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133088959","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Assisting code search with automatic Query Reformulation for bug localization","authors":"Bunyamin Sisman, A. Kak","doi":"10.1109/MSR.2013.6624044","DOIUrl":"https://doi.org/10.1109/MSR.2013.6624044","url":null,"abstract":"Source code retrieval plays an important role in many software engineering tasks. However, designing a query that can accurately retrieve the relevant software artifacts can be challenging for developers as it requires a certain level of knowledge and experience regarding the code base. This paper demonstrates how the difficulty of designing a proper query can be alleviated through automatic Query Reformulation (QR) - an under-the-hood operation for reformulating a user's query with no additional input from the user. The proposed QR framework works by enriching a user's search query with certain specific additional terms drawn from the highest-ranked artifacts retrieved in response to the initial query. The important point here is that these additional terms injected into a query are those that are deemed to be “close” to the original query terms in the source code on the basis of positional proximity. This similarity metric is based on the notion that terms that deal with the same concepts in source code are usually proximal to one another in the same files. We demonstrate the superiority of our QR framework in relation to the QR frameworks well-known in the natural language document retrieval by showing significant improvements in bug localization performance for two large software projects using more than 4,000 queries.","PeriodicalId":325271,"journal":{"name":"2013 10th Working Conference on Mining Software Repositories (MSR)","volume":"50 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-05-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124336448","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Replicating mining studies with SOFAS","authors":"Giacomo Ghezzi, H. Gall","doi":"10.1109/MSR.2013.6624050","DOIUrl":"https://doi.org/10.1109/MSR.2013.6624050","url":null,"abstract":"The replication of studies in mining software repositories (MSR) is essential to compare different mining techniques or assess their findings across many projects. However, it has been shown that very few of these studies can be easily replicated. Their replication is just as fundamental as the studies themselves and is one of the main threats to validity that they suffer from. In this paper, we show how we can alleviate this problem with our SOFAS framework. SOFAS is a platform that enables a systematic and repeatable analysis of software projects by providing extensible and composable analysis workflows. These workflows can be applied on a multitude of software projects, facilitating the replication and scaling of mining studies. In this paper, we show how and to which degree replication can be achieved. We investigated the mining studies of MSR from 2004 to 2011 and found that from 88 studies published in the MSR proceedings so far, we can fully replicate 25 empirical studies. Additionally, we can replicate 27 additional studies to a large extent. These studies account for 30% and 32%, respectively, of the mining studies published. To support our claim we describe in detail one large study that we replicated and discuss how replication with SOFAS works for the other studies investigated. To discuss the potential of our platform we also characterise how studies can be easily enriched to deliver even more comprehensive answers by extending the analysis workflows provided by the platform.","PeriodicalId":325271,"journal":{"name":"2013 10th Working Conference on Mining Software Repositories (MSR)","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-05-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115263985","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A discriminative model approach for suggesting tags automatically for Stack Overflow questions","authors":"A. Saha, Ripon K. Saha, Kevin A. Schneider","doi":"10.1109/MSR.2013.6624009","DOIUrl":"https://doi.org/10.1109/MSR.2013.6624009","url":null,"abstract":"Annotating documents with keywords or `tags' is useful for categorizing documents and helping users find a document efficiently and quickly. Question and answer (Q&A) sites also use tags to categorize questions to help ensure that their users are aware of questions related to their areas of expertise or interest. However, someone asking a question may not necessarily know the best way to categorize or tag the question, and automatically tagging or categorizing a question is a challenging task. Since a Q&A site may host millions of questions with tags and other data, this information can be used as a training and test dataset for approaches that automatically suggest tags for new questions. In this paper, we mine data from millions of questions from the Q&A site Stack Overflow, and using a discriminative model approach, we automatically suggest question tags to help a questioner choose appropriate tags for eliciting a response.","PeriodicalId":325271,"journal":{"name":"2013 10th Working Conference on Mining Software Repositories (MSR)","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-05-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127073951","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Ripon K. Saha, C. Roy, Kevin A. Schneider, D. Perry
{"title":"Understanding the evolution of Type-3 clones: An exploratory study","authors":"Ripon K. Saha, C. Roy, Kevin A. Schneider, D. Perry","doi":"10.1109/MSR.2013.6624021","DOIUrl":"https://doi.org/10.1109/MSR.2013.6624021","url":null,"abstract":"Understanding the evolution of clones is important both for understanding the maintenance implications of clones and building a robust clone management system. To this end, researchers have already conducted a number of studies to analyze the evolution of clones, mostly focusing on Type-1 and Type-2 clones. However, although there are a significant number of Type-3 clones in software systems, we know a little how they actually evolve. In this paper, we perform an exploratory study on the evolution of Type-1, Type-2, and Type-3 clones in six open source software systems written in two different programming languages and compare the result with a previous study to better understand the evolution of Type-3 clones. Our results show that although Type-3 clones are more likely to change inconsistently, the absolute number of consistently changed Type-3 clone classes is higher than that of Type-1 and Type-2. Type-3 clone classes also have a lifespan similar to that of Type-1 and Type-2 clones. In addition, a considerable number of Type-1 and Type-2 clones convert into Type-3 clones during evolution. Therefore, it is important to manage type-3 clones properly to limit their negative impact. However, various automated clone management techniques such as notifying developers about clone changes or linked editing should be chosen carefully due to the inconsistent nature of Type-3 clones.","PeriodicalId":325271,"journal":{"name":"2013 10th Working Conference on Mining Software Repositories (MSR)","volume":"27 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-05-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125972097","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Do software categories impact coupling metrics?","authors":"L. B. L. Souza, M. Maia","doi":"10.1109/MSR.2013.6624030","DOIUrl":"https://doi.org/10.1109/MSR.2013.6624030","url":null,"abstract":"Software metrics is a valuable mechanism to assess the quality of software systems. Metrics can help the automated analysis of the growing data available in software repositories. Coupling metrics is a kind of software metrics that have been extensively used since the seventies to evaluate several software properties related to maintenance, evolution and reuse tasks. For example, several works have shown that we can use coupling metrics to assess the reusability of software artifacts available in repositories. However, thresholds for software metrics to indicate adequate coupling levels are still a matter of discussion. In this paper, we investigate the impact of software categories on the coupling level of software systems. We have found that different categories may have different levels of coupling, suggesting that we need special attention when comparing software systems in different categories and when using predefined thresholds already available in the literature.","PeriodicalId":325271,"journal":{"name":"2013 10th Working Conference on Mining Software Repositories (MSR)","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-05-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126383307","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}