Measures of statistical dependence for feature selection : Computational study

University essay from Umeå universitet/Statistik

Author: Mohamad Alshalabi; [2022]

Keywords: ;

Abstract: The importance of feature selection for statistical and machine learning models derives from their explainability and the ability to explore new relationships, leading to new discoveries. Straightforward feature selection methods measure the dependencies between the potential features and the response variable. This thesis tries to study the selection of features according to a maximal statistical dependency criterion based ongeneralized Pearson’s correlation coefficients, e.g., Wijayatunga’s coefficient. I present a framework for feature selection based on these coefficients for high dimensional feature variables. The results are compared to the ones obtained by applying an elastic net regression (for high-dimensional data). The generalized Pearson’s correlation coefficient is a metric-based measure where the metric is Hellinger distance. The metric is considered as the distance between probability distributions. The Wijayatunga’s coefficient is originally proposed for the discrete case; here, we generalize it for continuous variables by discretization and kernelization. It is interesting to see how discretization work as we discretize the bins finer. The study employs both synthetic and real-world data to illustrate the validity and power of this feature selection process. Moreover, a new method of normalization for mutual information is included. The results show that both measures have considerable potential in detecting associations. The feature selection experiment shows that elastic net regression is superior to our proposed method; nevertheless, more investigation could be done regarding this subject.

  AT THIS PAGE YOU CAN DOWNLOAD THE WHOLE ESSAY. (follow the link to the next page)