{"title":"Clustering Glossary Terms Extracted from Large-Sized Software Requirements using FastText","authors":"Kushagra Bhatia, S. Mishra, Arpit Sharma","doi":"10.1145/3385032.3385039","DOIUrl":null,"url":null,"abstract":"Specialized terms used in the requirements document should be defined in a glossary. We propose a technique for automated extraction and clustering of glossary terms from large-sized requirements documents. We use text chunking combined with WordNet removal to extract candidate glossary terms. Next, we apply a state-of-the art neural word embeddings model for clustering glossary terms based on semantic similarity measures. Word embeddings are capable of capturing the context of a word and compute its semantic similarity relation with other words used in a document. Its use for clustering ensures that terms that are used in similar ways belong to the same cluster. We apply our technique to the CrowdRE dataset, which is a large-sized dataset with around 3000 crowd-generated requirements for smart home applications. To measure the effectiveness of our extraction and clustering technique we manually extract and cluster the glossary terms from CrowdRE dataset and use it for computing precision, recall and coverage. 
Results indicate that our approach can be very useful for extracting and clustering of glossary terms from a large body of requirements.","PeriodicalId":382901,"journal":{"name":"Proceedings of the 13th Innovations in Software Engineering Conference on Formerly known as India Software Engineering Conference","volume":"45 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2020-02-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"8","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 13th Innovations in Software Engineering Conference on Formerly known as India Software Engineering Conference","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3385032.3385039","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 8
Abstract
Specialized terms used in a requirements document should be defined in a glossary. We propose a technique for automated extraction and clustering of glossary terms from large requirements documents. We use text chunking combined with WordNet removal to extract candidate glossary terms. Next, we apply a state-of-the-art neural word embeddings model to cluster the glossary terms based on semantic similarity measures. Word embeddings can capture the context of a word and compute its semantic similarity to other words used in a document. Using them for clustering ensures that terms used in similar ways belong to the same cluster. We apply our technique to the CrowdRE dataset, a large dataset of around 3000 crowd-generated requirements for smart home applications. To measure the effectiveness of our extraction and clustering technique, we manually extract and cluster the glossary terms from the CrowdRE dataset and use them as the ground truth for computing precision, recall and coverage. Results indicate that our approach can be very useful for extracting and clustering glossary terms from a large body of requirements.
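To make the clustering and evaluation steps more concrete, the following is a minimal Python sketch under stated assumptions. It is not the authors' implementation. The vectors in `toy_vectors` are hand-written stand-ins for real word embeddings. The greedy threshold clustering in `cluster_terms` is an illustrative choice, since the abstract does not name a clustering algorithm. The example terms, the threshold of 0.8, and all function names are hypothetical. Coverage is left out because the abstract does not define it.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def cluster_terms(vectors, threshold=0.8):
    """Greedy clustering (an assumed, illustrative scheme): a term joins the
    first cluster whose seed term is at least `threshold` similar to it;
    otherwise it starts a new cluster."""
    clusters = []
    for term, vec in vectors.items():
        for cluster in clusters:
            seed_vec = vectors[cluster[0]]
            if cosine(vec, seed_vec) >= threshold:
                cluster.append(term)
                break
        else:
            clusters.append([term])
    return clusters


def precision_recall(extracted, reference):
    """Precision and recall of extracted terms against a manual ground truth."""
    extracted, reference = set(extracted), set(reference)
    hits = len(extracted & reference)
    precision = hits / len(extracted) if extracted else 0.0
    recall = hits / len(reference) if reference else 0.0
    return precision, recall


# Hypothetical 3-dimensional vectors standing in for real embeddings.
toy_vectors = {
    "thermostat": [0.9, 0.1, 0.0],
    "heating system": [0.8, 0.2, 0.1],
    "door lock": [0.1, 0.9, 0.2],
    "smart lock": [0.0, 0.95, 0.1],
}
clusters = cluster_terms(toy_vectors)

# Hypothetical extraction output compared with a hypothetical manual glossary.
extracted_terms = ["thermostat", "door lock", "home"]
manual_terms = ["thermostat", "door lock", "smart lock", "heating system"]
precision, recall = precision_recall(extracted_terms, manual_terms)

print(clusters)
print(round(precision, 3), round(recall, 3))
```

In a real pipeline the toy vectors would be replaced by embeddings from a trained model, and the threshold would need tuning against the manually built clusters.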