Feature Selection Methods with Applications in Electrical Load Forecasting

University essay from Lunds universitet/Matematik LTH

Abstract: The purpose of this thesis is two-fold: to implement and evaluate a feature selection method, the Fast Correlation-Based Filter (FCBF) by Yu et al., applied to a meteorological data set consisting of 19 weather variables from 606 locations in Scandinavia, and to investigate whether geography can be exploited in the search for relevant features. Four areas are chosen as target areas, where load prediction error is evaluated as a measure of goodness. To lower the computation time, only a subset of the total data set is used: Swedish locations with data from SMHI. The impact of using different subsets of weather features, as well as of selecting features from several locations, is investigated using FCBF and epsilon-Support Vector Regression. A modification to the FCBF algorithm is tested in one of the experiments, using Pearson correlation in place of symmetrical uncertainty (SU). An investigation of how the relationships between features change with distance is performed, and the results are then used to motivate a greedy feature selection method. FCBF, even when implemented with the naive approximation of marginal and conditional entropy, filtered the total data set from 3180 to approximately 20 features, with a prediction error of less than 1% for three of the target areas and 1.71% for the fourth. Further tests lowered the number of features even further without significantly affecting the prediction error. Using FCBF to rank the weather variables for a single area proved less than optimal, which may be attributed to many of the intra-feature SU values being extremely small. Selecting locations based on distance from the target area resulted in prediction errors better than random sampling and comparable to the filter, while still keeping the number of features low.
The very best feature selection results were only slightly lower than a base case, suggesting that the present experimental setting may not be enough to draw definitive conclusions regarding the efficacy of the selection methods. Two possible contributing factors are the unoptimized model used and the choice to investigate the impact on average load over a 24-hour window. Future studies may also wish to extend the geographical investigation to use coordinates or direction in conjunction with distance from the target area, as some indication of latitude-dependent behavior was found, most likely due to the elongated shape of Sweden.
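
For readers unfamiliar with the symmetrical uncertainty measure that FCBF relies on, the following is a minimal Python sketch of how it could be computed with a plug-in (frequency-count) entropy estimate. It uses the standard definition SU(X, Y) = 2 I(X; Y) / (H(X) + H(Y)). The function names, the equal-width binning, and the bin count are illustrative assumptions; the abstract does not specify how the thesis discretizes the weather variables or estimates entropy.

```python
import math
from collections import Counter


def entropy(values):
    """Plug-in estimate of Shannon entropy (in bits) from observed frequencies."""
    n = len(values)
    counts = Counter(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def symmetrical_uncertainty(x, y):
    """SU(X, Y) = 2 * I(X; Y) / (H(X) + H(Y)), a value between 0 and 1."""
    h_x = entropy(x)
    h_y = entropy(y)
    if h_x + h_y == 0:
        # Both variables are constant, so there is no information to share.
        return 0.0
    h_xy = entropy(list(zip(x, y)))      # joint entropy H(X, Y)
    mutual_info = h_x + h_y - h_xy       # I(X; Y)
    return 2.0 * mutual_info / (h_x + h_y)


def discretize(values, bins=4):
    """Equal-width binning (an assumption here, not the thesis's method)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0] * len(values)
    width = (hi - lo) / bins
    return [min(int((v - lo) / width), bins - 1) for v in values]


# Hypothetical usage: continuous readings are binned before SU is computed.
temperature = [1.0, 2.5, 3.0, 7.5, 8.0, 9.5]
load = [90.0, 88.0, 85.0, 60.0, 58.0, 55.0]
su = symmetrical_uncertainty(discretize(temperature), discretize(load))
```

An SU of 1 means that either variable fully determines the other, while 0 means the discretized variables appear independent. With few samples per bin, a plug-in estimate of this kind tends to be biased, so the choice of binning can noticeably affect the resulting values.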
