Efficient learning on high-dimensional operational data
Abstract: In a networked system, operational data collected by sensors or extracted from system logs can be used for target performance prediction, anomaly detection, etc. However, the number of metrics collected from a networked system is very large and can reach about 10^6 for a medium-sized system. This project analyzes and compares different unsupervised machine learning methods, such as Unsupervised Feature Selection, Principal Component Analysis, and Autoencoders, which can lead to efficient learning from high-dimensional data. The objective is to reduce the dimensionality of the input space while maintaining prediction performance compared with learning on the full feature space. The data used in this project was collected from a KTH testbed that runs a Video-on-Demand service and a Key-Value store under different types of traffic load. The findings confirm the manifold hypothesis, which states that real-world high-dimensional data lie on low-dimensional manifolds embedded within the high-dimensional space. In addition, this project investigates data visualization of infrastructure measurements through two-dimensional plots. The results show that data separation can be achieved by using different mapping methods.
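The comparison described in the abstract, reducing the input dimensionality and checking that prediction error stays close to that on the full feature space, can be illustrated with a minimal sketch. It is not the thesis's pipeline: the data is synthetic (a hypothetical 5-dimensional latent signal embedded in 200 features, not the KTH testbed traces), Principal Component Analysis is computed with a plain SVD, the predictor is ordinary least squares, and the error metric (mean absolute error normalized by the mean target) is an assumption chosen for illustration. The helper names `pca_fit` and `fit_predict_nmae` are likewise hypothetical.

```python
import numpy as np

def pca_fit(X, k):
    """Return the feature mean and the top-k principal directions of X."""
    mean = X.mean(axis=0)
    # Rows of vt are principal directions, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:k]

def fit_predict_nmae(X_train, y_train, X_test, y_test):
    """Fit least-squares linear regression (with intercept) and return
    the mean absolute test error normalized by the mean absolute target."""
    A_train = np.column_stack([X_train, np.ones(len(X_train))])
    A_test = np.column_stack([X_test, np.ones(len(X_test))])
    w, *_ = np.linalg.lstsq(A_train, y_train, rcond=None)
    return np.mean(np.abs(A_test @ w - y_test)) / np.mean(np.abs(y_test))

# Synthetic stand-in for operational data (assumption): n samples of d
# metrics that lie close to a k-dimensional linear subspace.
rng = np.random.default_rng(0)
n, d, k = 600, 200, 5
Z = rng.normal(size=(n, k))                    # latent low-dimensional state
W = rng.normal(size=(k, d))                    # embedding into metric space
X = Z @ W + 0.01 * rng.normal(size=(n, d))     # observed metrics
y = Z @ rng.normal(size=k) + 10.0 + 0.01 * rng.normal(size=n)  # target

X_train, X_test = X[:400], X[400:]
y_train, y_test = y[:400], y[400:]

# Fit PCA on the training set only, then project both sets onto k components.
mean, comps = pca_fit(X_train, k)
P_train = (X_train - mean) @ comps.T
P_test = (X_test - mean) @ comps.T

err_full = fit_predict_nmae(X_train, y_train, X_test, y_test)
err_reduced = fit_predict_nmae(P_train, y_train, P_test, y_test)
print(f"full space ({d} features): NMAE = {err_full:.4f}")
print(f"PCA space ({k} features):  NMAE = {err_reduced:.4f}")
```

Because the synthetic data is linear by construction, PCA is sufficient here; data lying on a curved manifold would be the case where a nonlinear method such as an autoencoder becomes relevant.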