软件产业与工程 Pub Date: 2022-11-07 DOI: 10.1145/3540250.3569448
Robert Dyer, Samuel W. Flint
{"title":"Performing large-scale mining studies: from start to finish (tutorial)","authors":"Robert Dyer, Samuel W. Flint","doi":"10.1145/3540250.3569448","DOIUrl":"https://doi.org/10.1145/3540250.3569448","url":null,"abstract":"Modern software engineering research often relies on mining open-source software repositories to motivate the research problem and/or to evaluate the proposed approach. Mining ultra-large-scale software repositories is still a difficult task, requiring substantial expertise and access to significant hardware. Tools such as Boa can help researchers easily mine large numbers of open-source repositories. There has also recently been more of a push toward open science, with an emphasis on making replication packages available. Building such replication packages incurs additional workload for researchers. In this tutorial, we teach how to use the Boa infrastructure for mining software repository data. We leverage Boa’s VS Code IDE extension to help write and submit Boa queries, and also leverage Boa’s study template to show how researchers can more easily analyze the output from Boa and automatically produce a suitable replication package that is published on Zenodo.","PeriodicalId":68155,"journal":{"name":"软件产业与工程","volume":"141 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2022-11-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"84733290","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
软件产业与工程 Pub Date: 2022-11-07 DOI: 10.1145/3540250.3558909
Deeksha M. Arya
{"title":"This is your cue! assisting search behaviour with resource style properties","authors":"Deeksha M. Arya","doi":"10.1145/3540250.3558909","DOIUrl":"https://doi.org/10.1145/3540250.3558909","url":null,"abstract":"When learning a software technology, programmers face a large variety of resources in different styles and catering to different requirements. Although search engines are helpful to filter relevant resources, programmers are still required to manually go through a number of resources before they find one pertinent to their needs. Prior work has largely concentrated on helping programmers find the precise location of relevant information within a resource. Our work focuses on helping programmers assess the pertinence of resources to differentiate between resources. We investigated how programmers find learning resources online via a diary and interview study, and observed that programmers use certain cues to determine whether to access a resource. Based on our findings, we investigate the extent to which we can support the cue-following process via a prototype tool. Our research supports programmers’ search behaviour for software technology learning resources to inform resource creators on important factors that programmers look for during their search.","PeriodicalId":68155,"journal":{"name":"软件产业与工程","volume":"6 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2022-11-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"84078013","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An empirical study of log analysis at Microsoft","authors":"Shilin He, Xu Zhang, Pinjia He, Yong Xu, Liqun Li, Yu Kang, Minghua Ma, Yining Wei, Yingnong Dang, S. Rajmohan, Qingwei Lin","doi":"10.1145/3540250.3558963","DOIUrl":"https://doi.org/10.1145/3540250.3558963","url":null,"abstract":"Logs are crucial to the management and maintenance of software systems. In recent years, log analysis research has achieved notable progress on various topics such as log parsing and log-based anomaly detection. However, the real voices from front-line practitioners are seldom heard. For example, what are the pain points of log analysis in practice? In this work, we conduct a comprehensive survey study on log analysis at Microsoft. We collected feedback from 105 employees through a questionnaire of 13 questions and individual interviews with 12 employees. We summarize the format, scenario, method, tool, and pain points of log analysis. Additionally, by comparing the industrial practices with academic research, we discuss the gaps between academia and industry, and future opportunities on log analysis with four inspiring findings. Particularly, we observe a huge gap exists between log anomaly detection research and failure alerting practices regarding the goal, technique, efficiency, etc. Moreover, data-driven log parsing, which has been widely studied in recent research, can be alternatively achieved by simply logging template IDs during software development. We hope this paper could uncover the real needs of industrial practitioners and the unnoticed yet significant gap between industry and academia, and inspire interesting future directions that converge efforts from both sides.","PeriodicalId":68155,"journal":{"name":"软件产业与工程","volume":"13 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2022-11-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"73281592","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
软件产业与工程 Pub Date: 2022-11-07 DOI: 10.1145/3540250.3549116
Rekha R. Pai, Abhishek Uppar, Akshatha Shenoy, Pranshul Kushwaha, D. D'Souza
{"title":"Static executes-before analysis for event driven programs","authors":"Rekha R. Pai, Abhishek Uppar, Akshatha Shenoy, Pranshul Kushwaha, D. D'Souza","doi":"10.1145/3540250.3549116","DOIUrl":"https://doi.org/10.1145/3540250.3549116","url":null,"abstract":"The executes-before relation between tasks is fundamental in the analysis of Event Driven Programs with several downstream applications like race detection and identifying redundant synchronizations. We present a sound, efficient, and effective static analysis technique to compute executes-before pairs of tasks for a general class of event driven programs. The analysis is based on a small but comprehensive set of rules evaluated on a novel structure called the task post graph of a program. We show how to use the executes-before information to identify disjoint-blocks in event driven programs and further use them to improve the precision of data race detection for these programs. We have implemented our analysis in the Flowdroid framework in a tool called AndRacer and evaluated it on several Android apps, bringing out the scalability, recall, and improved precision of the analyses.","PeriodicalId":68155,"journal":{"name":"软件产业与工程","volume":"47 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2022-11-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"85596315","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
软件产业与工程 Pub Date: 2022-11-07 DOI: 10.1145/3540250.3549173
Wen Li, Li Li, Haipeng Cai
{"title":"On the vulnerability proneness of multilingual code","authors":"Wen Li, Li Li, Haipeng Cai","doi":"10.1145/3540250.3549173","DOIUrl":"https://doi.org/10.1145/3540250.3549173","url":null,"abstract":"Software construction using multiple languages has long been a norm, yet it is still unclear if multilingual code construction has significant security implications and real security consequences. This paper aims to address this question with a large-scale study of popular multi-language projects on GitHub and their evolution histories, enabled by our novel techniques for multilingual code characterization. We found statistically significant associations between the proneness of multilingual code to vulnerabilities (in general and of specific categories) and its language selection. We also found this association is correlated with that of the language interfacing mechanism, not that of individual languages. We validated our statistical findings with in-depth case studies on actual vulnerabilities, explained via the mechanism and language selection. Our results call for immediate actions to assess and defend against multilingual vulnerabilities, for which we provide practical recommendations.","PeriodicalId":68155,"journal":{"name":"软件产业与工程","volume":"28 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2022-11-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"77916464","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"SPINE: a scalable log parser with feedback guidance","authors":"Xuheng Wang, Xu Zhang, Liqun Li, Shilin He, Hongyu Zhang, Yudong Liu, Ling Zheng, Yu Kang, Qingwei Lin, Yingnong Dang, S. Rajmohan, Dongmei Zhang","doi":"10.1145/3540250.3549176","DOIUrl":"https://doi.org/10.1145/3540250.3549176","url":null,"abstract":"Log parsing, which extracts log templates and parameters, is a critical prerequisite step for automated log analysis techniques. Though existing log parsers have achieved promising accuracy on public log datasets, they still face many challenges when applied in the industry. Through studying the characteristics of real-world log data and analyzing the limitations of existing log parsers, we identify two problems. Firstly, it is non-trivial to scale a log parser to a vast number of logs, especially in real-world scenarios where the log data is extremely imbalanced. Secondly, existing log parsers overlook the importance of user feedback, which is imperative for parser fine-tuning under the continuous evolution of log data. To overcome the challenges, we propose SPINE, which is a highly scalable log parser with user feedback guidance. Based on our log parser equipped with initial grouping and progressive clustering, we propose a novel log data scheduling algorithm to improve the efficiency of parallelization under the large-scale imbalanced log data. Besides, we introduce user feedback to make the parser adapt quickly to the evolving logs. We evaluated SPINE on 16 public log datasets. SPINE achieves more than 0.90 parsing accuracy on average with the highest parsing efficiency, which outperforms the state-of-the-art log parsers. We also evaluated SPINE in the production environment of Microsoft, in which SPINE can parse 30 million logs in less than 8 minutes under 16 executors, achieving near real-time performance. In addition, our evaluations show that SPINE can consistently achieve good accuracy under log evolution with a moderate amount of user feedback.","PeriodicalId":68155,"journal":{"name":"软件产业与工程","volume":"129 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2022-11-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"79187615","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
软件产业与工程 Pub Date: 2022-11-07 DOI: 10.1145/3540250.3558919
Haoxin Tu, Lingxiao Jiang, Xuhua Ding, He Jiang
{"title":"FastKLEE: faster symbolic execution via reducing redundant bound checking of type-safe pointers","authors":"Haoxin Tu, Lingxiao Jiang, Xuhua Ding, He Jiang","doi":"10.1145/3540250.3558919","DOIUrl":"https://doi.org/10.1145/3540250.3558919","url":null,"abstract":"Symbolic execution (SE) has been widely adopted for automatic program analysis and software testing. Many SE engines (e.g., KLEE or Angr) need to interpret certain Intermediate Representations (IR) of code during execution, which may be slow and costly. Although a plurality of studies proposed to accelerate SE, few of them consider optimizing the internal interpretation operations. In this paper, we propose FastKLEE, a faster SE engine that aims to speed up execution via reducing redundant bound checking of type-safe pointers during IR code interpretation. Specifically, in FastKLEE, a type inference system is first leveraged to classify pointer types (i.e., safe or unsafe) for the most frequently interpreted read/write instructions. Then, a customized memory operation is designed to perform bound checking for only the unsafe pointers and omit redundant checking on safe pointers. We implement FastKLEE on top of the well-known SE engine KLEE and combine it with the notable type inference system CCured. Evaluation results demonstrate that FastKLEE reduces the time to explore the same number (i.e., 10k) of execution paths by up to 9.1% (5.6% on average) compared with the state-of-the-art approach KLEE. FastKLEE is open-sourced at https://github.com/haoxintu/FastKLEE. A video demo of FastKLEE is available at https://youtu.be/fjV_a3kt-mo.","PeriodicalId":68155,"journal":{"name":"软件产业与工程","volume":"3 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2022-11-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"72833744","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
软件产业与工程 Pub Date: 2022-11-07 DOI: 10.1145/3540250.3549114
L. Grazia, Michael Pradel
{"title":"The evolution of type annotations in python: an empirical study","authors":"L. Grazia, Michael Pradel","doi":"10.1145/3540250.3549114","DOIUrl":"https://doi.org/10.1145/3540250.3549114","url":null,"abstract":"Type annotations and gradual type checkers attempt to reveal errors and facilitate maintenance in dynamically typed programming languages. Despite the availability of these features and tools, it is currently unclear how quickly developers are adopting them, what strategies they follow when doing so, and whether adding type annotations reveals more type errors. This paper presents the first large-scale empirical study of the evolution of type annotations and type errors in Python. The study is based on an analysis of 1,414,936 type annotation changes, which we extract from 1,123,393 commits among 9,655 projects. Our results show that (i) type annotations are getting more popular, and once added, often remain unchanged in the projects for a long time, (ii) projects follow three evolution patterns for type annotation usage -- regular annotation, type sprints, and occasional uses -- and that the used pattern correlates with the number of contributors, (iii) more type annotations help find more type errors (0.704 correlation), but nevertheless, many commits (78.3%) are committed despite having such errors. Our findings show that better developer training and automated techniques for adding type annotations are needed, as most code still remains unannotated, and they call for a better integration of gradual type checking into the development process.","PeriodicalId":68155,"journal":{"name":"软件产业与工程","volume":"143 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2022-11-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"74034077","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
软件产业与工程 Pub Date: 2022-11-07 DOI: 10.1145/3540250.3558907
Ahmed Khanfir
{"title":"Effective and scalable fault injection using bug reports and generative language models","authors":"Ahmed Khanfir","doi":"10.1145/3540250.3558907","DOIUrl":"https://doi.org/10.1145/3540250.3558907","url":null,"abstract":"Previous research has shown that artificial faults can be useful in many software engineering tasks such as testing, fault-tolerance assessment, debugging, dependability evaluation, risk analysis, etc. However, such artificial-fault-based applications can be questioned or inaccurate when the considered faults misrepresent real bugs. Since fault injection techniques (i.e., mutation testing) typically produce a large number of faults by “blindly” altering the code in arbitrary locations, they are unlikely to produce few but relevant real-like faults. In our work, we tackle this challenge by guiding the injection towards resembling bugs that have been previously introduced by developers. For this purpose, we propose iBiR, the first fault injection approach that leverages information from bug reports to inject “realistic” faults. iBiR injects faults on the locations that are more likely to be related to a given bug report by applying appropriate inverted fix-patterns, which are manually or automatically crafted by automated-program-repair researchers. We assess our approach using bugs from the Defects4J dataset and show that iBiR significantly outperforms conventional mutation testing in terms of injecting faults that semantically resemble and couple with real ones, in the vast majority of the cases. Similarly, the faults produced by iBiR give significantly better fault-tolerance estimates than conventional mutation testing in around 80% of the cases.","PeriodicalId":68155,"journal":{"name":"软件产业与工程","volume":"41 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2022-11-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"74095164","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
软件产业与工程 Pub Date: 2022-11-07 DOI: 10.1145/3540250.3549149
Yanjie Jiang, Hui Liu, Yuxia Zhang, Weixing Ji, Hao Zhong, Lu Zhang
{"title":"Do bugs lead to unnaturalness of source code?","authors":"Yanjie Jiang, Hui Liu, Yuxia Zhang, Weixing Ji, Hao Zhong, Lu Zhang","doi":"10.1145/3540250.3549149","DOIUrl":"https://doi.org/10.1145/3540250.3549149","url":null,"abstract":"Texts in natural languages are highly repetitive and predictable because of the naturalness of natural languages. Recent research validated that source code in programming languages is also repetitive and predictable, and naturalness is an inherent property of source code. It was also reported that buggy code is significantly less natural than bug-free code, and bug fixing substantially improves the naturalness of the involved source code. In this paper, we revisit the naturalness of buggy code and investigate the effect of bug-fixing on the naturalness of source code. Different from the existing investigation, we leverage two large-scale and high-quality bug repositories where bug-irrelevant changes in bug-fixing commits have been explicitly excluded. Our evaluation results confirm that buggy lines are often less natural than bug-free ones. However, fixing bugs could not significantly improve the naturalness of involved code lines. Fixed lines on average are as unnatural as buggy ones. Consequently, bugs are not the root cause of the unnaturalness of source code, and it could be inaccurate to identify buggy code lines solely by the naturalness of source code. Our evaluation results suggest that the naturalness-based buggy line detection results in extremely low precision (less than one percent).","PeriodicalId":68155,"journal":{"name":"软件产业与工程","volume":"331 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2022-11-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"77595836","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}