Enhancing decision tree accuracy and compactness with improved categorical split and sampling techniques

University essay from KTH/Skolan för elektroteknik och datavetenskap (EECS)

Author: Gaëtan Millerand; [2020]


Abstract: The decision tree is one of the most popular algorithms in the domain of explainable AI. From its structure, it is simple to induce a set of decision rules that are fully understandable to a human, which is why there is ongoing research on improving decision trees or on mapping other models into trees. Decision trees generated by C4.5 or ID3 suffer from two main issues. The first is that they often perform worse, in terms of accuracy for classification tasks or mean squared error for regression tasks, than state-of-the-art models such as XGBoost or deep neural networks; on almost every task there is a significant gap between top models like XGBoost and decision trees. This thesis addresses the problem with a new method based on data augmentation using state-of-the-art models, which outperforms the existing approaches on these evaluation metrics. The second problem is the compactness of the decision tree: as the depth increases, the set of rules grows exponentially, especially when the split attribute is categorical. The standard solutions for handling categorical values are to turn them into dummy variables or to split on each value, both of which produce complex models. This thesis presents a comparative study of current methods for splitting on categorical values in classification problems, and also studies a new method for the regression case.
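To make the two standard categorical-handling strategies mentioned in the abstract concrete, here is a minimal illustrative sketch (not taken from the thesis), using pandas and a small made-up `color` attribute:

```python
# Illustrative sketch of the two standard ways to handle a categorical
# attribute before/at a tree split. The data and column names are
# hypothetical examples, not from the thesis.
import pandas as pd

df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# Option 1: dummy (one-hot) variables -- one binary column per category.
# A binary tree then needs several levels to isolate a single category.
dummies = pd.get_dummies(df["color"], prefix="color")
print(sorted(dummies.columns))  # ['color_blue', 'color_green', 'color_red']

# Option 2: a multiway split on each value -- one child node per category.
# This keeps the tree shallow but multiplies the branches (and rules).
children = {value: group.index.tolist() for value, group in df.groupby("color")}
print(children)  # {'blue': [2], 'green': [1, 3], 'red': [0]}
```

Both strategies blow up the model as the number of categories grows, which is the compactness problem the thesis's comparative study targets.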
