Evaluation of Tools for Hairy Requirements and Software Engineering Tasks

D. Berry
{"title":"Evaluation of Tools for Hairy Requirements and Software Engineering Tasks","authors":"D. Berry","doi":"10.1109/REW.2017.25","DOIUrl":null,"url":null,"abstract":"Context and Motivation A hairy requirements or software engineering task involving natural language (NL) documents is one that is not inherently difficult for NL-understanding humans on a small scale but becomes unmanageable in the large scale. A hairy task demands tool assistance. Because humans need help in carrying out a hairy task completely, a tool for a hairy task should have as close to 100% recall as possible. A hairy task tool that falls short of close to 100% recall that is applied to the development of a high-dependability system may even be useless, because to find the missing information, a human has to do the entire task manually anyway. For a such a tool to have recall acceptably close to 100%, a human working with the tool on the task must achieve better recall than a human working on the task entirely manually. Problem Traditionally, many hairy requirements and software engineering tools have been evaluated mainly by how high their precision is, possibly leading to incorrect conclusions about how effective they are. Principal Ideas This paper describes using recall, a properly weighted F-measure, and a new measure called summarization to evaluate tools for hairy requirements and software engineering tasks and applies some of these measures to several tools reported in the literature. Contribution The finding is that some of these tools are actually better than they were thought to be when they were evaluated using mainly precision or an unweighted F-measure.","PeriodicalId":382958,"journal":{"name":"2017 IEEE 25th International Requirements Engineering Conference Workshops (REW)","volume":"28 6 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2017-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"51","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2017 IEEE 25th International Requirements Engineering Conference Workshops (REW)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/REW.2017.25","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 51

Abstract

Context and Motivation: A hairy requirements or software engineering task involving natural language (NL) documents is one that is not inherently difficult for NL-understanding humans on a small scale but becomes unmanageable on a large scale. A hairy task demands tool assistance. Because humans need help to carry out a hairy task completely, a tool for a hairy task should have recall as close to 100% as possible. A hairy-task tool that falls short of close to 100% recall may even be useless when applied to the development of a high-dependability system, because to find the missing information, a human has to do the entire task manually anyway. For such a tool to have recall acceptably close to 100%, a human working with the tool on the task must achieve better recall than a human working on the task entirely manually.

Problem: Traditionally, many hairy requirements and software engineering tools have been evaluated mainly by how high their precision is, possibly leading to incorrect conclusions about how effective they are.

Principal Ideas: This paper describes using recall, a properly weighted F-measure, and a new measure called summarization to evaluate tools for hairy requirements and software engineering tasks, and it applies some of these measures to several tools reported in the literature.

Contribution: The finding is that some of these tools are actually better than they were thought to be when they were evaluated using mainly precision or an unweighted F-measure.
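For reference, the following is a minimal LaTeX sketch of the standard definitions of precision, recall, and the weighted F-measure that the abstract refers to. The choice of a recall-favoring weight (beta > 1) reflects the paper's argument; the specific weight the paper recommends and the definition of the summarization measure are given in the paper itself and are not reproduced here.

```latex
% Standard retrieval measures, stated for a tool whose output is
% compared against a gold-standard answer set:
%   TP = relevant items the tool returns (true positives)
%   FP = irrelevant items the tool returns (false positives)
%   FN = relevant items the tool misses  (false negatives)
\[
  \mathrm{precision} = \frac{TP}{TP + FP},
  \qquad
  \mathrm{recall} = \frac{TP}{TP + FN}
\]
% The weighted F-measure; beta > 1 weights recall more heavily than
% precision, which is the direction argued for in evaluating tools
% for hairy tasks. The unweighted F-measure criticized in the paper
% is the special case beta = 1.
\[
  F_{\beta} =
    \frac{(1 + \beta^{2}) \cdot \mathrm{precision} \cdot \mathrm{recall}}
         {\beta^{2} \cdot \mathrm{precision} + \mathrm{recall}}
\]
```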