A New Feature Hashing Approach Based on Term Weight for Dimensional Reduction
Abubakar Ado, N. Samsudin, M. M. Deris
2021 International Congress of Advanced Technology and Engineering (ICOTEN), 2021-07-04
DOI: 10.1109/ICOTEN52080.2021.9493447
Citations: 1
Abstract
Machine learning models often struggle with large-scale text datasets. Such datasets produce sparse, high-dimensional feature spaces that are costly or infeasible for learning models to process. Feature hashing (FH) is a dimensionality reduction technique commonly used in the pre-processing phase to overcome this problem. However, model performance is negatively affected by the collisions inherent in the hashing process. In this study, we propose a new feature hashing approach that hashes similar features to the same bin based on their weight, known as the "term weight", while minimizing certain collisions. The approach effectively reduces collisions between dissimilar features, thus improving model performance. Experimental results on binary and multi-class classification datasets with a very high number of sparse features show that the proposed approach achieves competitive performance compared with conventional FH.
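For readers unfamiliar with the baseline, the following is a minimal sketch of conventional feature hashing only, not of the proposed term-weight approach, whose details the abstract does not give. The function names, the use of MD5 as the hash, and the signed-hash variant are all assumptions made for illustration. It shows how tokens are folded into a fixed number of bins and how collisions arise when distinct tokens share a bin.

```python
import hashlib
from collections import Counter
from typing import List, Tuple


def bucket_and_sign(token: str, n_bins: int) -> Tuple[int, int]:
    """Map a token to (bin index, sign) from a stable MD5 digest.

    MD5 is an illustrative choice; any well-mixed hash would do.
    """
    digest = hashlib.md5(token.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "little") % n_bins
    # A separate digest byte picks the sign (signed-hash variant).
    sign = 1 if digest[8] & 1 else -1
    return index, sign


def hash_features(tokens: List[str], n_bins: int) -> List[float]:
    """Fold a bag of tokens into a fixed-length vector of n_bins entries."""
    vector = [0.0] * n_bins
    for token, count in Counter(tokens).items():
        index, sign = bucket_and_sign(token, n_bins)
        vector[index] += sign * count  # colliding tokens add into one bin
    return vector


def count_collisions(vocabulary: List[str], n_bins: int) -> int:
    """Number of distinct tokens that share a bin with an earlier token."""
    distinct = set(vocabulary)
    occupied = {bucket_and_sign(token, n_bins)[0] for token in distinct}
    return len(distinct) - len(occupied)
```

Calling `hash_features("the cat sat on the mat".split(), 8)` returns an 8-element vector regardless of vocabulary size, and `count_collisions` gives a rough sense of how many distinct tokens were merged for a given `n_bins`. Shrinking `n_bins` reduces dimensionality at the cost of more collisions, which is the trade-off the abstract describes.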