Dealing with the data imbalance problem on pulsar candidates sifting based on feature selection

IF 1.8 | Zone 4 (Physics and Astrophysics) | JCR Q3 (ASTRONOMY & ASTROPHYSICS)
Haitao Lin, Xiangru Li
Research in Astronomy and Astrophysics, journal article, published 2023-11-13. DOI: 10.1088/1674-4527/ad0c26 (https://doi.org/10.1088/1674-4527/ad0c26)

Abstract

Pulsar detection has recently become an active research topic in radio astronomy. One of the essential procedures in pulsar detection is pulsar candidate sifting (PCS), the process of finding potential pulsar signals in a survey. However, pulsar candidates are always class-imbalanced: most candidates are non-pulsars, such as radio frequency interference (RFI), and only a tiny fraction come from real pulsars. Class imbalance greatly degrades the performance of machine learning (ML) models, at a heavy cost when real pulsars are misclassified.
To deal with this problem, this work focuses on techniques for choosing the features that are relevant to discriminating pulsars from non-pulsars, known as feature selection. Feature selection is the process of selecting a subset of the most relevant features from a feature pool. Features that distinguish pulsars from non-pulsars can significantly improve classifier performance even when the data are highly imbalanced.
In this work, a feature selection algorithm called the K-fold Relief-Greedy algorithm (KFRG) is designed. KFRG is a two-stage algorithm. In the first stage, it filters out some irrelevant features according to their K-fold Relief scores; in the second stage, it removes redundant features and selects the most relevant ones using a forward greedy search strategy. Experiments on the dataset of the High Time Resolution Universe survey verified that ML models based on KFRG are capable of PCS, correctly separating pulsars from non-pulsars even when the candidates are highly class-imbalanced.
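The two-stage idea described in the abstract can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the abstract does not say how the K-fold Relief scores are computed or which criterion drives the greedy search. The sketch assumes that Relief weights are averaged over K training partitions, that stage one keeps a fixed number of top-ranked features (the hypothetical `keep` parameter), and that stage two maximises a leave-one-out 1-nearest-neighbour F1 score for the pulsar class. All function names and the synthetic data are illustrative.

```python
import random


def relief_weights(X, y):
    """Basic two-class Relief: a feature gains weight when it differs on the
    nearest other-class sample (miss) and agrees on the nearest same-class one (hit)."""
    n, d = len(X), len(X[0])
    lo = [min(row[j] for row in X) for j in range(d)]
    span = [(max(row[j] for row in X) - lo[j]) or 1.0 for j in range(d)]
    Z = [[(row[j] - lo[j]) / span[j] for j in range(d)] for row in X]  # scale to [0, 1]
    w = [0.0] * d
    for i in range(n):
        hit = miss = None
        hit_dist = miss_dist = float("inf")
        for m in range(n):
            if m == i:
                continue
            dist = sum(abs(a - b) for a, b in zip(Z[i], Z[m]))  # Manhattan distance
            if y[m] == y[i] and dist < hit_dist:
                hit, hit_dist = m, dist
            elif y[m] != y[i] and dist < miss_dist:
                miss, miss_dist = m, dist
        if hit is None or miss is None:
            continue  # no same-class or other-class neighbour in this partition
        for j in range(d):
            w[j] += (abs(Z[i][j] - Z[miss][j]) - abs(Z[i][j] - Z[hit][j])) / n
    return w


def kfold_relief_scores(X, y, k=5, seed=0):
    """Assumed reading of 'K-fold Relief': average Relief weights over K partitions,
    each computed with one fold held out."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    folds = [idx[f::k] for f in range(k)]
    total = [0.0] * len(X[0])
    for held_out in folds:
        skip = set(held_out)
        rows = [i for i in idx if i not in skip]
        w = relief_weights([X[i] for i in rows], [y[i] for i in rows])
        total = [t + wj / k for t, wj in zip(total, w)]
    return total


def forward_greedy(candidates, evaluate):
    """Add the single best feature per round; stop when the score stops improving."""
    selected, best = [], float("-inf")
    while True:
        trials = [(evaluate(selected + [f]), f) for f in candidates if f not in selected]
        if not trials:
            break
        score, feat = max(trials, key=lambda t: (t[0], -t[1]))  # ties: lower index
        if score <= best:
            break
        selected.append(feat)
        best = score
    return selected


def loo_f1(X, y, subset):
    """Leave-one-out 1-nearest-neighbour F1 for the pulsar class (label 1)."""
    tp = fp = fn = 0
    for i in range(len(X)):
        others = (m for m in range(len(X)) if m != i)
        nearest = min(others, key=lambda m: sum((X[i][j] - X[m][j]) ** 2 for j in subset))
        tp += y[nearest] == 1 and y[i] == 1
        fp += y[nearest] == 1 and y[i] == 0
        fn += y[nearest] == 0 and y[i] == 1
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0


def kfrg_sketch(X, y, k=5, keep=3):
    scores = kfold_relief_scores(X, y, k)
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    stage1 = ranked[:keep]  # stage 1: drop the lowest-scoring features
    return forward_greedy(stage1, lambda s: loo_f1(X, y, s))  # stage 2


def toy_candidates(n_pulsar=10, n_other=90, seed=1):
    """Synthetic, imbalanced data: feature 0 is informative, feature 1 duplicates it
    (redundant), features 2 and 3 are uniform noise."""
    rng = random.Random(seed)
    X, y = [], []
    for i in range(n_pulsar + n_other):
        label = 1 if i < n_pulsar else 0
        informative = label + (i % 10) / 100  # classes are well separated
        X.append([informative, informative, rng.random(), rng.random()])
        y.append(label)
    return X, y


X, y = toy_candidates()
selected = kfrg_sketch(X, y)
print("selected feature indices:", selected)
```

On this toy data the duplicated column is never added in stage two, because it cannot raise a score that the informative feature already maximises. That is the sense in which a forward greedy search removes redundant features. F1 on the minority class is used here instead of accuracy because a classifier that labels everything as non-pulsar would still be 90% accurate on a 10:90 split.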
Source journal
Research in Astronomy and Astrophysics (Earth sciences and astronomy: astronomy and astrophysics)
CiteScore: 3.20
Self-citation rate: 16.70%
Articles published: 2599
Review time: 6.0 months
Journal description: Research in Astronomy and Astrophysics (RAA) is an international journal publishing original research papers and reviews across all branches of astronomy and astrophysics, with a particular interest in the following topics: large-scale structure of universe formation and evolution of galaxies; high-energy and cataclysmic processes in astrophysics; formation and evolution of stars; astrogeodynamics; solar magnetic activity and heliogeospace environments; dynamics of celestial bodies in the solar system and artificial bodies; space observation and exploration; new astronomical techniques and methods.