Redundant and Irrelevant Attribute Elimination using Autoencoders

University essay from KTH/Skolan för datavetenskap och kommunikation (CSC)

Abstract: Real-world data are often high-dimensional and contain redundant or irrelevant attributes. High-dimensional data are problematic for machine learning because the high dimensionality causes learning to take more time and, unless the dataset is large enough to provide an ample number of samples for each class, accuracy suffers. Redundant and irrelevant attributes give the data a higher dimensionality than necessary and obscure the important attributes. It is therefore of interest to reduce the dimensionality of the data while preserving the important attributes. Several techniques have been presented in computer science for reducing the dimensionality of data. One of these is the autoencoder, an unsupervised neural network that uses its input as the target output; by limiting the number of neurons in the hidden layer, the autoencoder is forced to learn a lower-dimensional representation of the data. This study focuses on using the autoencoder to reduce the dimensionality of, and eliminate irrelevant or redundant attributes from, four datasets drawn from different domains. The results show that the autoencoder can eliminate redundant attributes that are linear combinations of the other attributes, and that it can provide a better lower-dimensional representation of the data than the unreduced data. However, for data gathered under controlled and carefully managed conditions, the autoencoder cannot always provide a better lower-dimensional representation than the data with the redundant attributes. Lastly, the results show that the autoencoder cannot eliminate irrelevant attributes that have no correlation with the class or the other attributes.
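
The sketch below illustrates the kind of autoencoder the abstract describes: a network trained to reproduce its own input through a hidden layer narrower than the input, so the hidden activations form a lower-dimensional representation. It is a minimal NumPy illustration only; the layer sizes, learning rate, and synthetic data (with two redundant attributes built as linear combinations of the others) are assumptions for demonstration and do not reflect the datasets or configuration used in the essay.

    # Minimal autoencoder sketch in NumPy (illustrative, not the essay's implementation).
    import numpy as np

    rng = np.random.default_rng(0)

    # Synthetic data: 4 informative attributes plus 2 redundant ones that are
    # linear combinations of the first 4, giving 6 input dimensions in total.
    X_base = rng.normal(size=(500, 4))
    X = np.hstack([X_base, X_base @ rng.normal(size=(4, 2))])
    X = (X - X.mean(axis=0)) / X.std(axis=0)            # standardise attributes

    n_in, n_hidden = X.shape[1], 4                      # bottleneck smaller than the input
    W1 = rng.normal(scale=0.1, size=(n_in, n_hidden))   # encoder weights
    b1 = np.zeros(n_hidden)
    W2 = rng.normal(scale=0.1, size=(n_hidden, n_in))   # decoder weights
    b2 = np.zeros(n_in)
    lr = 0.01

    for epoch in range(2000):
        H = np.tanh(X @ W1 + b1)        # encode to the low-dimensional hidden layer
        X_hat = H @ W2 + b2             # decode back to the input space
        err = X_hat - X                 # reconstruction error (target output = input)

        # Backpropagate the mean squared reconstruction loss.
        dW2 = H.T @ err / len(X)
        db2 = err.mean(axis=0)
        dH = err @ W2.T * (1 - H ** 2)  # derivative of tanh
        dW1 = X.T @ dH / len(X)
        db1 = dH.mean(axis=0)

        W1 -= lr * dW1; b1 -= lr * db1
        W2 -= lr * dW2; b2 -= lr * db2

    # The hidden activations H are the reduced, 4-dimensional representation;
    # the redundant attributes can be reconstructed from it, so little is lost.
    print("reconstruction MSE:", float((err ** 2).mean()))

Because the two redundant columns are exact linear functions of the informative ones, a 4-neuron bottleneck can in principle reconstruct all 6 inputs, which mirrors the abstract's finding that the autoencoder handles redundant attributes well but has no mechanism to discard attributes that are merely irrelevant.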
