Unsupervised Variable Selection Using a Genetic Algorithm: An Application to Textual Data

Abbas Rammal, Kenneth Ezukwoke, Anis Hoayek, M. Batton-Hubert
2022 International Conference on Smart Systems and Power Management (IC2SPM), published 2022-11-10. DOI: 10.1109/IC2SPM56638.2022.9989008 (https://doi.org/10.1109/IC2SPM56638.2022.9989008)

Abstract

Failure analysis in microelectronics production is an important step in improving product quality and development. Understanding failure mechanisms, and therefore implementing corrective actions that address the cause of a failure, depends on the results of these analyses. The analyses are recorded as text, so the data must first be pre-processed and vectorized (converted to numeric form). A dimension reduction is then applied to overcome the curse of dimensionality introduced by vectorization. We study the potential of an unsupervised variable selection technique to identify the variables that best discriminate between groups of textual data in terms of separation and compactness. Many variable (feature) selection methods have been proposed; some are not adapted to large data sets or are difficult to tune, and others require additional information. This work investigates the use of a genetic algorithm to find, in an unsupervised way, the variables that best discriminate between classes, and to select variables correlated with particular textual groups. The proposed genetic algorithm uses a combination of K-means clustering and validity indices as its fitness function. Such a function favors both cluster compactness and class separation. Experiments on textual datasets demonstrate the effectiveness of the proposed variable selection method, which discriminates textual classes better than K-means clustering applied to all variables.
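The idea described in the abstract, a genetic algorithm whose fitness combines K-means clustering with a validity index, can be illustrated with a minimal sketch. This is not the authors' implementation. The abstract does not name the validity indices, the genetic operators, or any parameter values, so everything below is an assumption made for illustration: the mean silhouette width as the single validity index, tournament selection, uniform crossover, bit-flip mutation, elitism, and all function names and defaults. Rows of `data` are assumed to be documents that have already been vectorized. The all-variables mask is placed in the initial population so that the result can be compared directly with K-means on all variables.

```python
import math
import random


def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, rng, iters=20):
    """Plain Lloyd's K-means; returns one cluster label per point."""
    centers = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: sq_dist(p, centers[c])) for p in points]
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster empties
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def silhouette(points, labels):
    """Mean silhouette width: high when clusters are compact and well separated."""
    clusters = sorted(set(labels))
    if len(clusters) < 2:
        return -1.0
    total = 0.0
    for i, p in enumerate(points):
        mean_d = {}
        for c in clusters:
            d = [math.sqrt(sq_dist(p, q)) for j, q in enumerate(points)
                 if labels[j] == c and j != i]
            if d:
                mean_d[c] = sum(d) / len(d)
        a = mean_d.get(labels[i])
        if a is None:  # singleton cluster contributes 0 by convention
            continue
        b = min(v for c, v in mean_d.items() if c != labels[i])
        total += (b - a) / max(a, b) if max(a, b) > 0 else 0.0
    return total / len(points)


def fitness(mask, data, k, seed):
    """Cluster on the selected columns only and score the partition."""
    cols = [j for j, bit in enumerate(mask) if bit]
    if not cols:
        return -1.0  # an empty selection is the worst possible candidate
    sub = [[row[j] for j in cols] for row in data]
    return silhouette(sub, kmeans(sub, k, random.Random(seed)))


def select_variables(data, k, pop_size=20, generations=15, p_mut=0.1, seed=0):
    rng = random.Random(seed)
    n_vars = len(data[0])
    # Assumption: the first individual keeps every variable (the baseline).
    pop = [[1] * n_vars] + [[rng.randint(0, 1) for _ in range(n_vars)]
                            for _ in range(pop_size - 1)]
    for gen in range(generations + 1):
        scored = [(fitness(m, data, k, seed), m) for m in pop]
        best_score, best = max(scored, key=lambda t: t[0])
        if gen == generations:
            break

        def tournament():
            a, b = rng.sample(scored, 2)
            return a[1] if a[0] >= b[0] else b[1]

        children = [best]  # elitism: the best mask always survives
        while len(children) < pop_size:
            p1, p2 = tournament(), tournament()
            child = [rng.choice(pair) for pair in zip(p1, p2)]  # uniform crossover
            child = [1 - g if rng.random() < p_mut else g for g in child]  # mutation
            children.append(child)
        pop = children
    return best, best_score


# Toy data: columns 0-1 separate two groups, columns 2-5 are pure noise.
toy_rng = random.Random(1)
data = [[(i % 2) * 5 + toy_rng.gauss(0, 0.3) for _ in range(2)]
        + [toy_rng.gauss(0, 3) for _ in range(4)] for i in range(30)]
mask, score = select_variables(data, k=2)
baseline = fitness([1] * 6, data, k=2, seed=0)
print(mask, round(score, 3), round(baseline, 3))
```

Because the all-variables mask starts in the population and the best mask is carried over unchanged each generation, the returned score can never fall below the all-variables baseline under this sketch's fixed K-means seed. On real vectorized text the matrix would be far wider and sparser, so a library K-means and a cheaper validity index would likely be needed.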