{"title":"Creating Domain based Dictionary and its Evaluation using Classification Accuracy","authors":"Mansi Sood, Harmeet Kaur, Jaya Gera","doi":"10.1109/INDIACom51348.2021.00059","DOIUrl":null,"url":null,"abstract":"This paper creates a domain-based dictionary for different categories in the news domain. It extracts terms - unigrams and collocations (specifically bigrams and trigrams) to construct the dictionary. The created dictionary is then used to classify unseen news data. The paper studies how variation in two parameters - (i) occurrence frequency of extracted terms and (ii) window size of extracted bigrams impact the dictionary size and the classification accuracy. Subsequently, dictionary size and classification accuracy are analyzed by creating dictionaries comprising of just unigrams, bigrams, trigrams, or their combinations. A reasonably sized and accurate dictionary can be created using just bigrams. The inclusion of trigrams to the dictionary accounts for a slight accuracy gain. Including both unigrams and bigrams in the dictionary can achieve a high accuracy score with a significantly smaller dictionary size than adding just unigrams or bigrams to the dictionary. Also, the trio combination adding unigrams, bigrams, and trigrams to the dictionary does not lead to a significant increase in accuracy. 
These observations will help in creating a meaningful and compact dictionary.","PeriodicalId":415594,"journal":{"name":"2021 8th International Conference on Computing for Sustainable Global Development (INDIACom)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2021-03-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"3","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2021 8th International Conference on Computing for Sustainable Global Development (INDIACom)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/INDIACom51348.2021.00059","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 3
Abstract
This paper creates a domain-based dictionary for different categories in the news domain. It extracts terms, namely unigrams and collocations (specifically bigrams and trigrams), to construct the dictionary. The resulting dictionary is then used to classify unseen news data. The paper studies how varying two parameters, (i) the occurrence frequency of extracted terms and (ii) the window size of extracted bigrams, affects dictionary size and classification accuracy. It then analyzes dictionary size and classification accuracy for dictionaries comprising only unigrams, only bigrams, only trigrams, or their combinations. A reasonably sized and accurate dictionary can be created using bigrams alone. Adding trigrams to the dictionary yields a slight accuracy gain. Including both unigrams and bigrams achieves high accuracy with a significantly smaller dictionary than one built from unigrams or bigrams alone. Combining all three term types (unigrams, bigrams, and trigrams) does not significantly increase accuracy. These observations will help in creating a meaningful and compact dictionary.
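The two parameters studied in the abstract, term frequency and bigram window size, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`extract_terms`, `build_dictionary`, `classify`), the toy corpus, the reading of "window size" as the span within which two tokens may be paired, the plain frequency cutoff used in place of a collocation measure, and the match-counting classification rule are all assumptions made for illustration.

```python
from collections import Counter


def extract_terms(tokens, window=2, unigrams=True, bigrams=True):
    """Return unigram and windowed-bigram terms from a token list."""
    terms = []
    if unigrams:
        terms.extend(tokens)
    if bigrams:
        for i, left in enumerate(tokens):
            # Assumption: pair each token with the next (window - 1) tokens,
            # so window=2 gives ordinary adjacent bigrams.
            for right in tokens[i + 1 : i + window]:
                terms.append(f"{left} {right}")
    return terms


def build_dictionary(docs_by_category, min_freq=2, **term_opts):
    """Keep, per category, the terms that occur at least min_freq times."""
    dictionary = {}
    for category, docs in docs_by_category.items():
        counts = Counter()
        for tokens in docs:
            counts.update(extract_terms(tokens, **term_opts))
        dictionary[category] = {t for t, c in counts.items() if c >= min_freq}
    return dictionary


def classify(tokens, dictionary, **term_opts):
    """Assumed rule: pick the category whose dictionary matches the most terms."""
    terms = extract_terms(tokens, **term_opts)
    scores = {
        category: sum(term in vocab for term in terms)
        for category, vocab in dictionary.items()
    }
    return max(scores, key=scores.get)


# Hypothetical toy corpus, already tokenized.
train = {
    "sports": [
        "team wins world cup final".split(),
        "world cup final draws record crowd".split(),
    ],
    "business": [
        "stock market falls sharply".split(),
        "stock market rallies after report".split(),
    ],
}

dictionary = build_dictionary(train, min_freq=2, window=2)
dictionary_size = sum(len(vocab) for vocab in dictionary.values())
prediction = classify("fans celebrate world cup win".split(), dictionary, window=2)
```

Raising `min_freq` shrinks the dictionary, while a larger `window` admits more candidate bigrams; rerunning `build_dictionary` with `unigrams=False` or `bigrams=False` gives the single-term-type dictionaries that the abstract compares.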