Evaluating the Effectiveness of Active Learning Methods in Predicting Biochemical Properties

University essay from Uppsala universitet/Institutionen för informationsteknologi

Author: Markus Lucero; [2021]


Abstract: Replacing biological experiments that study the binding activity of compounds with predictive machine learning models is often difficult due to a lack of training data. This thesis examines the possibility of using Active Learning in addition to Supervised Learning to create classifiers that accurately predict binding activity while minimizing the number of experiments needed. Two learners were constructed, one using Random Forest and the other using a committee consisting of k-Nearest-Neighbors and Random Forest. Each learner was trained on 10 data sets. An oracle was then queried using k-batched sampling, in which k compounds per query were selected for labeling by uncertainty sampling, margin sampling, query-by-committee or expected error reduction. For each querying strategy, the ROC-AUC value was calculated and compared with that of the other strategies and of a random sampling control. The results show that it is possible to achieve high ROC-AUC values with the help of Active Learning when predicting compound binding activity. Overall, uncertainty sampling and margin sampling performed best when compared with random sampling. It was also found that increasing the batch size of each query negatively impacted the learning models. Large batch sizes did, however, quickly increase the ROC-AUC values for the tested strategies. In conclusion, Active Learning can be used to effectively predict compound binding activity. However, the machine learning models presented in this thesis do not scale well when placed in a more realistic laboratory context, indicating that new methods need to be developed that scale better in a laboratory environment.
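
The k-batched querying step described in the abstract can be illustrated with a minimal sketch. It is not the thesis implementation: the function names (`uncertainty_score`, `margin_score`, `select_batch`) and the example probabilities are hypothetical, and it assumes the classifier already supplies per-class probabilities for every unlabeled compound in the pool. Only uncertainty sampling and margin sampling are shown.

```python
from typing import List, Sequence


def uncertainty_score(probs: Sequence[float]) -> float:
    """Least-confidence score: higher means the model is less sure."""
    return 1.0 - max(probs)


def margin_score(probs: Sequence[float]) -> float:
    """Gap between the two most probable classes: lower means less sure."""
    top_two = sorted(probs, reverse=True)[:2]
    return top_two[0] - top_two[1]


def select_batch(pool_probs: Sequence[Sequence[float]], k: int,
                 strategy: str = "uncertainty") -> List[int]:
    """Return the indices of the k pool items to send to the oracle.

    pool_probs[i] holds the predicted class probabilities for unlabeled
    compound i (assumed to come from the current classifier).
    """
    indices = range(len(pool_probs))
    if strategy == "uncertainty":
        # Most uncertain first, so sort by descending least-confidence score.
        ranked = sorted(indices, key=lambda i: -uncertainty_score(pool_probs[i]))
    elif strategy == "margin":
        # Smallest margin first.
        ranked = sorted(indices, key=lambda i: margin_score(pool_probs[i]))
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return ranked[:k]


# Hypothetical predicted probabilities (inactive, active) for four compounds.
pool = [[0.90, 0.10], [0.55, 0.45], [0.60, 0.40], [0.99, 0.01]]
batch = select_batch(pool, k=2, strategy="uncertainty")
```

In a full loop, the selected compounds would be labeled by the oracle, moved from the pool to the training set, and the classifier refit before the next query. A larger k means fewer refits but more labels chosen from the same, possibly outdated, model state.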
