A Generalized Semantic Filter for Glossary Term Extraction from Large-Sized Software Requirements

14th Innovations in Software Engineering Conference (formerly known as India Software Engineering Conference) Pub Date : 2021-02-25 DOI:10.1145/3452383.3452387

S. Mishra, Arpit Sharma


Citation count: 2

Abstract

A glossary is an essential component of every software requirements document. Manually extracting glossary terms from a large requirements document is expensive in both time and cost, and it is also error-prone. To overcome these issues, we propose a generalized semantic filter that automatically extracts the key technical terms present in a large body of software requirements. Our semantic filter is based on a word embeddings model that can identify domain-specific terms. To this end, we create a domain-neutral reference corpus containing news headlines published over a period of 17 years by the Australian Broadcasting Corporation news website. Candidate glossary terms are first extracted from the requirements document using text chunking and coverage filtering; we then use the domain-neutral corpus to calculate similarity scores for these candidates. The key idea is that if the context of a candidate term in the requirements document differs from the context in which it is used in the domain-neutral corpus, the term is labeled as domain-specific. Since our semantic filter is domain-neutral, it can potentially be applied to requirements documents of any application domain. We have applied the proposed technique to the CrowdRE document, a large document with roughly 3000 user stories for the smart home application domain. Results show that our approach is very effective for glossary extraction from very large requirements documents.
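The key idea in the abstract, comparing the context of a candidate term in the requirements with its context in a domain-neutral corpus, can be illustrated with a small sketch. This is not the authors' implementation: it swaps the word embeddings model for plain co-occurrence counts so that it runs with the standard library only, and the function names (`context_vector`, `cosine`, `semantic_filter`), the window size, the threshold of 0.5, and the toy sentences are all assumptions made for illustration.

```python
from collections import Counter
import math


def context_vector(term, sentences, window=2):
    """Count the words appearing within `window` tokens of `term`.

    Stand-in for a learned embedding: the paper uses a word embeddings
    model, whereas this sketch uses raw co-occurrence counts.
    """
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, token in enumerate(tokens):
            if token != term:
                continue
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            counts.update(t for j, t in enumerate(tokens[lo:hi], start=lo) if j != i)
    return counts


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def semantic_filter(candidates, requirements, neutral, threshold=0.5, window=2):
    """Keep candidates whose context differs between the two corpora.

    Assumption: a term absent from the neutral corpus gets similarity 0.0
    and is therefore kept as domain-specific.
    """
    kept = []
    for term in candidates:
        sim = cosine(
            context_vector(term, requirements, window),
            context_vector(term, neutral, window),
        )
        if sim < threshold:  # hypothetical cut-off, not taken from the paper
            kept.append(term)
    return kept


# Toy data (invented): three user stories and three news-style headlines.
requirements = [
    "as a home owner i want my smart hub to dim the lights",
    "as a home owner i want the hub to lock the doors",
    "as a user i want the system to report the weather every morning",
]
neutral = [
    "airline opens new hub in sydney",
    "transport hub plan delayed again",
    "bureau to report the weather every morning",
]
candidates = ["hub", "weather"]

domain_specific = semantic_filter(candidates, requirements, neutral)
print(domain_specific)
```

In this toy setup "hub" is surrounded by different words in the user stories than in the headlines, so it passes the filter, while "weather" appears in the same context in both corpora and is dropped. With a real embeddings model the comparison would be made between learned vectors rather than raw counts, and the threshold would need to be tuned on the data.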
