Application of Machine Learning on a Genome-Wide Association Studies Dataset

University essay from KTH/Numerisk analys, NA

Author: Agnes Martine Nielsen; [2015]


Abstract: The number of individuals affected by type 2 diabetes is rapidly increasing. The goal of this thesis is to investigate whether type 2 diabetes can be predicted more accurately from genome-wide association data with machine learning methods than with traditional statistical methods. Variable selection is performed with random forest to identify the variables in the genome, called single nucleotide polymorphisms (SNPs), that show the highest importance for predicting type 2 diabetes. The thesis then examines whether including these SNPs improves performance over models that use only clinical variables or SNPs previously identified by univariate analysis. It also considers whether random forest improves on logistic regression. Through the selected SNPs, the analysis identifies genes involved in biological functions related to type 2 diabetes, including genes that have not been directly associated with the disease. These are interesting for future study. However, the results show little to no improvement in prediction performance over models using only clinical variables, suggesting that the signal for type 2 diabetes in the genome-wide association dataset is weak. Similarly, random forest gives no improvement over logistic regression for the final models, suggesting that the linear signal in the genome data is much stronger than any non-linear signal.
