Improving Missing Data Imputation using Generative Adversarial Network-based Methods

University essay from Lunds universitet/Matematisk statistik

Abstract: Organizations increasingly rely on data analysis, and the importance of data quality has accordingly become even more crucial. In this context, missing values pose a significant challenge, compromising the utility of the data. Ideally, data should be collected in a way that avoids missing values, but practical and cost constraints often make this infeasible. Consequently, various approaches have been developed to address the issue of missing values. Rather than discarding incomplete observations and reducing the sample size, imputing the missing values has the potential to improve predictive outcomes. Furthermore, it is a relatively straightforward process in terms of cost and effort. Generative Adversarial Networks (GANs) have recently gained attention as a breakthrough in machine learning, offering novel possibilities for data handling. This study explores two ways in which GANs can potentially improve data imputation. Firstly, the performance of an imputation-focused GAN model, GAIN, is compared against other state-of-the-art methods through an extensive evaluation. Secondly, the impact of incorporating synthesized data, generated by a GAN framework named CTGAN, into the training data of imputation models is evaluated. Our findings reveal that GAIN was outperformed by other data imputation methods. Despite this, its potential is not questioned, as further optimization of hyperparameters and network structure specific to the data set is believed to enhance its performance. The results of this study nevertheless emphasize the clear challenges posed by the time-consuming training and optimization processes of GANs in general. In contrast, the additional data generated by CTGAN had a significant positive impact on the results of kNN imputation. Not only does the additional data strengthen kNN imputation's position as the strongest method in the study in terms of predictive performance, but it also constitutes the most significant contribution of this report, as the methodology has not been examined in previous research. Further, the practical feasibility of the method, combined with its strong results, makes it suitable for practical applications. To sum up, the findings underscore the potential for further enhancements in data imputation using GANs.
