Predictive modelling using a nationally representative database to identify the determinants of prediabetes; a machine learning analytic approach on the National Health and Nutrition Examination Survey (NHANES) 2013-2014

University essay from Lunds universitet/Socialmedicin och global hälsa

Abstract: Background: Prediabetes is a global epidemic with rising prevalence, but its diagnosis based on traditional risk factors is challenging. Applying novel machine intelligence-based methods to public health databases could provide valuable insights into the disease process. Aim: To build predictive models that elucidate the determinants of prediabetes using machine learning algorithms on a nationally representative sample of the US population. Method: Two datasets containing general (n = 6346) and dental (n = 3167) variables were prepared from the National Health and Nutrition Examination Survey (NHANES) 2013-2014 and randomly partitioned into training and internal validation data. Feature selection algorithms were run on the training data (n = 3174) containing 156 pre-selected general variables. Five machine learning algorithms were applied to the training data containing general (n = 3174) and dental (n = 1584) variables, as well as to re-sampled datasets built using 4 resampling methods. Predictive models were tested on internal validation data containing general (n = 3172) and dental (n = 1583) variables. External validation was done on 2 datasets containing general (n = 3000) and dental (n = 1500) variables prepared from NHANES 2011-2012. Model performance was evaluated using the area under the receiver operating characteristic curve (AUC). Determinants were elucidated by odds ratios in logistic regression models and by variable importance values in the other algorithms. The CDC prediabetes screening tool was chosen as the benchmark against which the performance of the optimal models was compared. Results: Seven optimal (AUC > 70%) models built on the dataset containing general variables elucidated 25 determinants of prediabetes, including a few novel associations; 20 were identified by both logistic regression and other non-linear/ensemble models, while 5 were elucidated solely by the latter.
Dental variables by themselves were not predictive of prediabetes, and periodontitis appeared to be the only dental determinant. The optimal machine learning model (AUC = 71.6%) built on the data containing general variables outperformed the chosen benchmark, while the model built on dental data equaled the performance of the screening tool. Conclusion: A range of determinants of prediabetes was identified through validated and benchmarked models, highlighting the potential of a systematic, machine intelligence-based modelling approach on a public health database to elucidate the determinants of prediabetes, including novel predictors. Keywords: prediabetes, determinants, machine learning, feature selection, NHANES
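
The evaluation quantities named in the abstract (a random train/validation partition, AUC, and odds ratios) can be illustrated with a minimal sketch. This is not the essay's code: the function names `split_rows`, `auc` and `odds_ratio` are hypothetical, the rows are placeholder indices rather than NHANES records, and the odds ratio shown is the unadjusted 2x2-table version, which equals the exponentiated coefficient only in a logistic regression with a single binary predictor.

```python
import random


def split_rows(rows, n_train, seed=0):
    """Randomly partition rows into a training part and a validation part."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:]


def auc(labels, scores):
    """AUC as the probability that a random positive case receives a higher
    score than a random negative case (ties count as one half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def odds_ratio(exposed_cases, exposed_noncases,
               unexposed_cases, unexposed_noncases):
    """Unadjusted odds ratio from a 2x2 exposure-by-outcome table."""
    return (exposed_cases * unexposed_noncases) / (
        exposed_noncases * unexposed_cases)


# Placeholder row indices sized like the general-variable dataset.
train, valid = split_rows(range(6346), n_train=3174)
print(len(train), len(valid))

# Toy labels, scores and counts, invented purely for illustration.
print(auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
print(odds_ratio(20, 80, 10, 90))
```

An AUC of 0.5 corresponds to chance-level ranking and 1.0 to perfect ranking, so the abstract's 70% threshold for an "optimal" model sits between the two; an odds ratio above 1 indicates higher odds of the outcome among the exposed.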
