Vol 24, No 2

Dealing with the Data Imbalance Problem in Pulsar Candidate Sifting Based on Feature Selection

Haitao Lin and Xiangru Li


Pulsar detection has become an active research topic in radio astronomy recently. One of the essential procedures for pulsar detection is pulsar candidate sifting (PCS), a procedure for identifying potential pulsar signals in a survey. However, pulsar candidates are always class-imbalanced, as most candidates are non-pulsars such as RFI and only a tiny part of them are from real pulsars. Class imbalance can greatly affect the performance of machine learning (ML) models, resulting in a heavy cost as some real pulsars are misjudged. To deal with the problem, techniques of choosing relevant features to discriminate pulsars from non-pulsars are focused on, which is known as feature selection. Feature selection is a process of selecting a subset of the most relevant features from a feature pool. The distinguishing features between pulsars and non-pulsars can significantly improve the performance of the classifier even if the data are highly imbalanced. In this work, an algorithm for feature selection called the K-fold Relief-Greedy (KFRG) algorithm is designed. KFRG is a two-stage algorithm. In the first stage, it filters out some irrelevant features according to their K-fold Relief scores, while in the second stage, it removes the redundant features and selects the most relevant features by a forward greedy search strategy. Experiments on the data set of the High Time Resolution Universe survey verified that ML models based on KFRG are capable of PCS, correctly separating pulsars from non-pulsars even if the candidates are highly class-imbalanced.


Key words: methods: data analysis – (stars:) pulsars: general – methods: statistical

Full Text

There are currently no refbacks.