Knowledge Distillation of DNABERT for Prediction of Genomic Elements

University essay from KTH/Skolan för kemi, bioteknologi och hälsa (CBH)

Abstract: Understanding the information encoded in the human genome and the influence of each part of the DNA sequence is a fundamental problem of our society that can be key to unveil the mechanism of common diseases. With the latest technological developments in the genomics field, many research institutes have the tools to collect massive amounts of genomic data. Nevertheless, there is a lack of tools that can be used to process and analyse these datasets in a biologically reliable and efficient manner. Many deep learning solutions have been proposed to solve current genomic tasks, but most of the times the main research interest is in the underlying biological mechanisms rather than high scores of the predictive metrics themselves. Recently, state-of-the-art in deep learning has shifted towards large transformer models, which use an attention mechanism that can be leveraged for interpretability. The main drawbacks of these large models is that they require a lot of memory space and have high inference time, which may make their use unfeasible in practical applications. In this work, we test the appropriateness of knowledge distillation to obtain more usable and equally performing models that genomic researchers can easily fine-tune to solve their scientific problems. DNABERT, a transformer model pre-trained on DNA data, is distilled following two strategies: DistilBERT and MiniLM. Four student models with different sizes are obtained and fine-tuned for promoter identification. They are evaluated in three key aspects: classification performance, usability and biological relevance of the predictions. The latter is assessed by visually inspecting the attention maps of TATA-promoter predictions, which are expected to have a peak of attention at the well-known TATA motif present in these sequences. Results show that is indeed possible to obtain significantly smaller models that are equally performant in the promoter identification task without any major differences between the two techniques tested. The smallest distilled model experiences less than 1% decrease in all performance metrics evaluated (accuracy, F1 score and Matthews Correlation Coefficient) and an increase in the inference speed by 7.3x, while only having 15% of the parameters of DNABERT. The attention maps for the student models show that they successfully learn to mimic the general understanding of the DNA that DNABERT possesses.

  AT THIS PAGE YOU CAN DOWNLOAD THE WHOLE ESSAY. (follow the link to the next page)