{"title":"50K-C: A Dataset of Compilable, and Compiled, Java Projects","authors":"Pedro Martins, Rohan Achar, C. Lopes","doi":"10.1145/3196398.3196450","DOIUrl":"https://doi.org/10.1145/3196398.3196450","url":null,"abstract":"We provide a repository of 50,000 compilable Java projects. Each project in this dataset comes with references to all the dependencies required to compile it, the resulting bytecode, and the scripts with which the projects were built. The dependencies and the build scripts provide a mechanism to re-create compilation of the projects, if needed (to instruct source code for bytecode analysis, for example). The bytecode is ready for testing, execution, and dynamic analysis tools.","PeriodicalId":6639,"journal":{"name":"2018 IEEE/ACM 15th International Conference on Mining Software Repositories (MSR)","volume":"38 1","pages":"1-5"},"PeriodicalIF":0.0,"publicationDate":"2018-05-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"84978864","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Jinqiu Yang, Erik Wittern, Annie T. T. Ying, Julian T Dolby, Lin Tan
{"title":"Towards Extracting Web API Specifications from Documentation","authors":"Jinqiu Yang, Erik Wittern, Annie T. T. Ying, Julian T Dolby, Lin Tan","doi":"10.1145/3196398.3196411","DOIUrl":"https://doi.org/10.1145/3196398.3196411","url":null,"abstract":"Web API specifications are machine-readable descriptions of APIs. These specifications, in combination with related tooling, simplify and support the consumption of APIs. However, despite the increased distribution of web APIs, specifications are rare and their creation and maintenance heavily rely on manual efforts by third parties. In this paper, we propose an automatic approach and an associated tool called D2Spec for extracting significant parts of such specifications from web API documentation pages. Given a seed online documentation page of an API, D2Spec first crawls all documentation pages on the API, and then uses a set of machine-learning techniques to extract the base URL, path templates, and HTTP methods – collectively describing the endpoints of the API. We evaluate whether D2Spec can accurately extract endpoints from documentation on 116 web APIs. The results show that D2Spec achieves a precision of 87.1% in identifying base URLs, a precision of 80.3% and a recall of 80.9% in generating path templates, and a precision of 83.8% and a recall of 77.2% in extracting HTTP methods. In addition, in an evaluation on 64 APIs with pre-existing API specifications, D2Spec revealed many inconsistencies between web API documentation and their corresponding publicly available specifications. API consumers would benefit from D2Spec pointing them to, and allowing them thus to fix, such inconsistencies.","PeriodicalId":6639,"journal":{"name":"2018 IEEE/ACM 15th International Conference on Mining Software Repositories (MSR)","volume":"20 7-8 1","pages":"454-464"},"PeriodicalIF":0.0,"publicationDate":"2018-05-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"78173332","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Enriched Event Streams: A General Dataset for Empirical Studies on In-IDE Activities of Software Developers","authors":"Sebastian Proksch","doi":"10.1145/3196398.3196400","DOIUrl":"https://doi.org/10.1145/3196398.3196400","url":null,"abstract":"Developers have been the subject of many empirical studies over the years. To assist developers in their everyday work, an understanding of their activities is necessary, especially how they develop source code. Unfortunately, conducting such studies is very expensive and researchers often resort to studying artifacts after the fact. To pave the road for future empirical studies on developer activities, we built FeedBaG, a general-purpose interaction tracker for Visual Studio that monitors development activities. The observations are stored in enriched event streams that encode a holistic picture of the in-IDE development process. Enriched event streams capture all commands invoked in the IDE with additional context information, such as the test being run or the accompanying fine-grained code edits. We used FeedBaG to collect enriched event streams from 81 developers. Over 1,527 days, we collected more than 11M events that correspond to 15K hours of working time.","PeriodicalId":6639,"journal":{"name":"2018 IEEE/ACM 15th International Conference on Mining Software Repositories (MSR)","volume":"337 1","pages":"62-65"},"PeriodicalIF":0.0,"publicationDate":"2018-05-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"75073101","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Ridhi Jain, Sai Prathik, Venkatesh Vinayakarao, Rahul Purandare
{"title":"A Search System for Mathematical Expressions on Software Binaries","authors":"Ridhi Jain, Sai Prathik, Venkatesh Vinayakarao, Rahul Purandare","doi":"10.1145/3196398.3196413","DOIUrl":"https://doi.org/10.1145/3196398.3196413","url":null,"abstract":"Developers often ask for libraries that implement specific mathematical expressions. A fundamental bottleneck in building information retrieval (IR) systems to answer such mathematical queries is the inability to detect a given expression in software binaries. While we have a few math IR solutions such as EgoMath2 and Tangent-3 that work over text documents, none exist to search over software binaries. Our vision is to build a search system for binaries to answer queries containing mathematical expressions. A wide variety of compilers and differences in the way they optimize the code, pose difficult challenges to solve this problem. In this work, we discuss our preliminary results in detecting mathematical expressions in software binaries. We use a knowledge base assisted approach to solve this problem. We are able to search mathematical expressions with a precision of 80% and a recall of 53%. This work opens up interesting research opportunities in areas such as software security and performance, to help analysts in identifying and analyzing binaries for implementations of mathematical expressions.","PeriodicalId":6639,"journal":{"name":"2018 IEEE/ACM 15th International Conference on Mining Software Repositories (MSR)","volume":"64 1","pages":"487-491"},"PeriodicalIF":0.0,"publicationDate":"2018-05-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"75204872","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
J. Ott, Abigail Atchison, Paul Harnack, Adrienne Bergh, Erik J. Linstead
{"title":"A Deep Learning Approach to Identifying Source Code in Images and Video","authors":"J. Ott, Abigail Atchison, Paul Harnack, Adrienne Bergh, Erik J. Linstead","doi":"10.1145/3196398.3196402","DOIUrl":"https://doi.org/10.1145/3196398.3196402","url":null,"abstract":"While substantial progress has been made in mining code on an Internet scale, efforts to date have been overwhelmingly focused on data sets where source code is represented natively as text. Large volumes of source code available online and embedded in technical videos have remained largely unexplored, due in part to the complexity of extraction when code is represented with images. Existing approaches to code extraction and indexing in this environment rely heavily on computationally intense optical character recognition. To improve the ease and efficiency of identifying this embedded code, as well as identifying similar code examples, we develop a deep learning solution based on convolutional neural networks and autoencoders. Focusing on Java for proof of concept, our technique is able to identify the presence of typeset and handwritten source code in thousands of video images with 85.6%-98.6% accuracy based on syntactic and contextual features learned through deep architectures. When combined with traditional approaches, this provides a more scalable basis for video indexing that can be incorporated into existing software search and mining tools.","PeriodicalId":6639,"journal":{"name":"2018 IEEE/ACM 15th International Conference on Mining Software Repositories (MSR)","volume":"266 1","pages":"376-386"},"PeriodicalIF":0.0,"publicationDate":"2018-05-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"77159680","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Paolo Calciati, Konstantin Kuznetsov, Xue-Wei Bai, Alessandra Gorla
{"title":"What did Really Change with the New Release of the App?","authors":"Paolo Calciati, Konstantin Kuznetsov, Xue-Wei Bai, Alessandra Gorla","doi":"10.1145/3196398.3196449","DOIUrl":"https://doi.org/10.1145/3196398.3196449","url":null,"abstract":"The mobile app market is evolving at a very fast pace. In order to stay in the market and fulfill user's growing demands, developers have to continuously update their apps either to fix issues or to add new features. Users and market managers may have a hard time understanding what really changed in a new release though, and therefore may not make an informative guess of whether updating the app is recommendable, or whether it may pose new security and privacy threats for the user. We propose a ready-to-use framework to analyze the evolution of Android apps. Our framework extracts and visualizes various information -such as how an app uses sensitive data, which third-party libraries it relies on, which URLs it connects to, etc.- and combines it to create a comprehensive report on how the app evolved. Besides, we present the results of an empirical study on 235 applications with at least 50 releases using our framework. Our analysis reveals that Android apps tend to have more leaks of sensitive data over time, and that the majority of API calls relative to dangerous permissions are added to the code in releases posterior to the one where the corresponding permission was requested.","PeriodicalId":6639,"journal":{"name":"2018 IEEE/ACM 15th International Conference on Mining Software Repositories (MSR)","volume":"10 1","pages":"142-152"},"PeriodicalIF":0.0,"publicationDate":"2018-05-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"90114642","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Rahul Amlekar, Andrés Felipe Rincón Gamboa, Keheliya Gallaba, Shane McIntosh
{"title":"Do Software Engineers Use Autocompletion Features Differently than Other Developers?","authors":"Rahul Amlekar, Andrés Felipe Rincón Gamboa, Keheliya Gallaba, Shane McIntosh","doi":"10.1145/3196398.3196471","DOIUrl":"https://doi.org/10.1145/3196398.3196471","url":null,"abstract":"Autocomplete is a common workspace feature that is used to recommend code snippets as developers type in their IDEs. Users of autocomplete features no longer need to remember programming syntax and the names and details of the API methods that are needed to accomplish tasks. Moreover, autocompletion of code snippets may have an accelerating effect, lowering the number of keystrokes that are needed to type the code. However, like any tool, implicit tendencies of users may emerge. Knowledge of how developers in different roles use autocompletion features may help to guide future autocompletion development, research, and training material. In this paper, we set out to better understand how usage of autocompletion varies among software engineers and other developers (i.e., academic researchers, industry researchers, hobby programmers, and students). Analysis of autocompletion events in the Mining Software Repositories (MSR) challenge dataset reveals that: (1) rates of autocompletion usage among software engineers and other developers are not significantly different; and (2) although several non-negligible effect sizes of autocompletion targets (e.g., local variables, method names) are detected between the two groups, the rates at which these targets appear do not vary to a significant degree. These inconclusive results are likely due to the small sample size (n = 35); however, they do provide an interesting insight for future studies to build upon.","PeriodicalId":6639,"journal":{"name":"2018 IEEE/ACM 15th International Conference on Mining Software Repositories (MSR)","volume":"36 1","pages":"86-89"},"PeriodicalIF":0.0,"publicationDate":"2018-05-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"91008489","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Asher Trockman, Keenen Cates, Mark Mozina, T. Nguyen, Christian Kästner, Bogdan Vasilescu
{"title":"\"Automatically Assessing Code Understandability\" Reanalyzed: Combined Metrics Matter","authors":"Asher Trockman, Keenen Cates, Mark Mozina, T. Nguyen, Christian Kästner, Bogdan Vasilescu","doi":"10.1145/3196398.3196441","DOIUrl":"https://doi.org/10.1145/3196398.3196441","url":null,"abstract":"Previous research shows that developers spend most of their time understanding code. Despite the importance of code understandability for maintenance-related activities, an objective measure of it remains an elusive goal. Recently, Scalabrino et al. reported on an experiment with 46 Java developers designed to evaluate metrics for code understandability. The authors collected and analyzed data on more than a hundred features describing the code snippets, the developers' experience, and the developers' performance on a quiz designed to assess understanding. They concluded that none of the metrics considered can individually capture understandability. Expecting that understandability is better captured by a combination of multiple features, we present a reanalysis of the data from the Scalabrino et al. study, in which we use different statistical modeling techniques. Our models suggest that some computed features of code, such as those arising from syntactic structure and documentation, have a small but significant correlation with understandability. Further, we construct a binary classifier of understandability based on various interpretable code features, which has a small amount of discriminating power. Our encouraging results, based on a small data set, suggest that a useful metric of understandability could feasibly be created, but more data is needed.","PeriodicalId":6639,"journal":{"name":"2018 IEEE/ACM 15th International Conference on Mining Software Repositories (MSR)","volume":"1 1","pages":"314-318"},"PeriodicalIF":0.0,"publicationDate":"2018-05-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"83520538","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Guillermo De la Torre, R. Robbes, Alexandre Bergel
{"title":"Imprecisions Diagnostic in Source Code Deltas","authors":"Guillermo De la Torre, R. Robbes, Alexandre Bergel","doi":"10.1145/3196398.3196404","DOIUrl":"https://doi.org/10.1145/3196398.3196404","url":null,"abstract":"Beyond a practical use in code review, source code change detection (SCCD) is an important component of many mining software repositories (MSR) approaches. As such, any error or imprecision in the detection may result in a wrong conclusion while mining repositories. We identified, analyzed, and characterized impressions in GumTree, which is the most advanced algorithm for SCCD. After analyzing its detection accuracy over a curated corpus of 107 C# projects, we diagnosed several imprecisions. Many of our findings confirm that a more language-aware perspective of GumTree can be helpful in reporting more precise changes.","PeriodicalId":6639,"journal":{"name":"2018 IEEE/ACM 15th International Conference on Mining Software Repositories (MSR)","volume":"1 1","pages":"492-502"},"PeriodicalIF":0.0,"publicationDate":"2018-05-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"81463111","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pengcheng Yin, Bowen Deng, Edgar Chen, Bogdan Vasilescu, Graham Neubig
{"title":"Learning to Mine Aligned Code and Natural Language Pairs from Stack Overflow","authors":"Pengcheng Yin, Bowen Deng, Edgar Chen, Bogdan Vasilescu, Graham Neubig","doi":"10.1145/3196398.3196408","DOIUrl":"https://doi.org/10.1145/3196398.3196408","url":null,"abstract":"For tasks like code synthesis from natural language, code retrieval, and code summarization, data-driven models have shown great promise. However, creating these models require parallel data between natural language (NL) and code with fine-grained alignments. StackOverflow (SO) is a promising source to create such a data set: the questions are diverse and most of them have corresponding answers with high-quality code snippets. However, existing heuristic methods (e.g. pairing the title of a post with the code in the accepted answer) are limited both in their coverage and the correctness of the NL-code pairs obtained. In this paper, we propose a novel method to mine high-quality aligned data from SO using two sets of features: hand-crafted features considering the structure of the extracted snippets, and correspondence features obtained by training a probabilistic model to capture the correlation between NL and code using neural networks. These features are fed into a classifier that determines the quality of mined NL-code pairs. Experiments using Python and Java as test beds show that the proposed method greatly expands coverage and accuracy over existing mining methods, even when using only a small number of labeled examples. Further, we find that reasonable results are achieved even when training the classifier on one language and testing on another, showing promise for scaling NL-code mining to a wide variety of programming languages beyond those for which we are able to annotate data.","PeriodicalId":6639,"journal":{"name":"2018 IEEE/ACM 15th International Conference on Mining Software Repositories (MSR)","volume":"19 1","pages":"476-486"},"PeriodicalIF":0.0,"publicationDate":"2018-05-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"74050259","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}