A Generalized Semantic Filter for Glossary Term Extraction from Large-Sized Software Requirements
S. Mishra, Arpit Sharma
14th Innovations in Software Engineering Conference (formerly known as India Software Engineering Conference)
Published: 2021-02-25
DOI: 10.1145/3452383.3452387 (https://doi.org/10.1145/3452383.3452387)
Citations: 2
Abstract
A glossary is an essential component of every software requirements document. Extracting glossary terms manually from a large requirements document is costly in both time and effort, and it is also error-prone. To overcome these issues, we propose a generalized semantic filter that automatically extracts the key technical terms present in a large body of software requirements. The filter is based on a word embeddings model that can identify domain-specific terms. To this end, we create a domain-neutral reference corpus of news headlines published over a period of 17 years by the Australian Broadcasting Corporation (ABC) news website. We use this corpus to compute similarity scores for candidate glossary terms, which are extracted from the requirements document using text chunking and coverage filtering. The key idea is that if the context of a candidate term in the requirements document differs from the context in which it is used in the domain-neutral corpus, the term is labeled as domain-specific. Because our semantic filter is domain-neutral, it can potentially be applied to requirements documents from any application domain. We applied the proposed technique to the CrowdRE document, a large document with roughly 3000 user stories for the smart home application domain. Results show that our approach is very effective for glossary extraction from large documents containing software requirements.
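The key idea of comparing a term's context across the two corpora can be illustrated with a minimal sketch. This is not the authors' implementation: it swaps the word embeddings model for plain co-occurrence counts so that it runs with the standard library alone, and the function names, the window size, the 0.5 threshold, and the toy corpora are all hypothetical choices made for illustration.

```python
from collections import Counter
from math import sqrt


def context_vector(sentences, term, window=2):
    """Count the words that appear within `window` tokens of `term`."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, token in enumerate(tokens):
            if token != term:
                continue
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            counts.update(tokens[lo:i] + tokens[i + 1:hi])
    return counts


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def is_domain_specific(term, requirements, reference, threshold=0.5):
    """Label a term domain-specific when its contexts in the two corpora differ.

    The threshold is an assumed value, not one taken from the paper.
    """
    sim = cosine(context_vector(requirements, term), context_vector(reference, term))
    return sim < threshold


# Toy stand-ins for the requirements document and the news-headline corpus.
requirements = [
    "the smart home hub shall send a daily report to the owner",
    "the hub shall pair with each sensor",
]
reference = [
    "airport hub reopens after storm",
    "council to send a daily report to parliament",
]

print(is_domain_specific("hub", requirements, reference))     # True
print(is_domain_specific("report", requirements, reference))  # False
```

In this toy data, "hub" shares no context words between the two corpora, so it is labeled domain-specific, while "report" appears in nearly the same context in both and is filtered out. Note that in this sketch a term that never occurs in the reference corpus gets a similarity of 0 and is therefore also labeled domain-specific; a real pipeline would need to decide how to treat such terms, and would handle multi-word candidates rather than single tokens.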