University essay from Uppsala universitet/Statistiska institutionen

Abstract: Last observation carried forward (LOCF) is a common imputation method, regularly used for clinical data. It replaces each missing value with the most recent known observation. In this thesis, we investigate the effect that variable age has on sepsis prediction when it is used as a conditional decision variable for imputation. In an iterative experiment, we combine the LOCF method with the more passive approach of letting tree-based models handle missing data through their built-in mechanisms. Variable age is measured as the distance in time between a missing observation and the most recent known value. Based on this measurement, different cutoff values, defined by variable-specific percentiles, are evaluated during imputation. When a value is missing and the last known value is at least as recent as the chosen cutoff, the value is imputed through LOCF. The remaining entries are retained as missing and handled by the model during prediction. Results based on out-of-sample prediction performance for increasing variable age percentile cutoffs suggest that overly restrictive constraints on the variable age decrease predictive performance for CART and Random forest, whereas no such decrease is found for XGBoost. In addition, there is a tendency toward a slight decrease in performance for higher variable age percentiles compared to the variable age interval that was found optimal in most cases. Finally, SHAP and LIME values show a clear association between variable age and prediction contributions for some variables. Further research is necessary to confirm and extend the results.
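The imputation rule described in the abstract can be illustrated with a minimal pandas sketch. It is not the thesis's implementation: the function name `locf_with_age_cutoff`, the column names, the toy data, and the choice to compute each variable's percentile over the ages of its missing entries that have an earlier known value are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def locf_with_age_cutoff(df, id_col, time_col, value_cols, percentile):
    """Apply LOCF only where the carried value's age is within a cutoff.

    Hypothetical helper: the cutoff is a per-variable percentile of the
    variable age among missing entries that have an earlier known value.
    """
    df = df.sort_values([id_col, time_col]).reset_index(drop=True)
    out = df.copy()
    cutoffs = {}
    for col in value_cols:
        observed = df[col].notna()
        # Time stamp of the most recent known value within each patient.
        last_seen = df[time_col].where(observed).groupby(df[id_col]).ffill()
        # Variable age: time elapsed since the most recent known value.
        age = df[time_col] - last_seen
        # Candidates: missing entries that have an earlier known value.
        candidate = ~observed & last_seen.notna()
        if not candidate.any():
            continue
        cutoff = np.percentile(age[candidate], percentile)
        cutoffs[col] = cutoff
        # Plain LOCF within each patient, used only where the age allows it.
        carried = df.groupby(id_col)[col].ffill()
        fill = candidate & (age <= cutoff)
        out.loc[fill, col] = carried[fill]
    return out, cutoffs


# Toy data (invented for illustration): hourly heart rate for two patients.
demo = pd.DataFrame({
    "patient": [1, 1, 1, 1, 1, 2, 2, 2],
    "hour": [0, 1, 2, 3, 4, 0, 1, 2],
    "hr": [80.0, np.nan, np.nan, np.nan, 90.0, np.nan, 70.0, np.nan],
})
imputed, cutoffs = locf_with_age_cutoff(demo, "patient", "hour", ["hr"], 50)
print(cutoffs)
print(imputed)
```

Entries older than the cutoff, and entries with no earlier known value, stay as `NaN`, so a model with built-in missing-value handling can deal with them at prediction time.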
