Clustering unstructured life sciences experiments with unsupervised machine learning : Natural language processing for unstructured life sciences texts

University essay from KTH/Skolan för elektroteknik och datavetenskap (EECS)

Author: Mathias Dail; [2019]

Abstract: The purpose of this master’s thesis is to analyse different types of document representations with the aim of improving, in an unsupervised manner, the searchability of unstructured textual life sciences experiments by clustering similar experiments together. The challenge is to produce, analyse and compare different representations of the life sciences data using traditional and advanced unsupervised machine learning models. The text data analysed in this work is noisy and very heterogeneous, as it comes from a real-world Electronic Lab Notebook. Clustering unstructured and unlabeled text experiments is challenging: it requires representations built only from the relevant information in an experiment. This work studies statistical and generative techniques, word embeddings and some of the most recent deep learning models in Natural Language Processing to create the various representations of the studied data. It explores the possibility of combining multiple techniques and using external life sciences knowledge bases to create richer representations before applying clustering algorithms. Different types of analysis are performed, including an assessment by experts, to evaluate and compare the scientific relevance of the clusters of experiments produced from the different data representations. The results show that traditional statistical techniques can still produce good baselines. Modern deep learning techniques are shown to model the studied data well and to create rich representations. Combining multiple techniques with external knowledge (biomedical and life-science-related ontologies) is shown to produce the best results in grouping similar relevant experiments together. The studied techniques model different and complementary aspects of a text; combining them is therefore key to significantly improving the clustering of unstructured data.
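
As a rough illustration of the kind of statistical baseline the abstract refers to, the sketch below builds TF-IDF vectors for a few toy experiment descriptions and groups them by cosine similarity. The abstract does not name the specific techniques used, so TF-IDF, the single-link threshold grouping, the threshold value, the function names and the toy texts are all assumptions made for this example, not the thesis's method.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one L2-normalised TF-IDF vector (dict: term -> weight) per document."""
    tokenised = [doc.lower().split() for doc in docs]
    n_docs = len(tokenised)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenised for term in set(tokens))
    vectors = []
    for tokens in tokenised:
        counts = Counter(tokens)
        # Smoothed inverse document frequency (an assumed, commonly used variant).
        vec = {
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0)
            for term, count in counts.items()
        }
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({term: w / norm for term, w in vec.items()})
    return vectors


def cosine(a, b):
    """Cosine similarity of two L2-normalised sparse vectors."""
    if len(a) > len(b):
        a, b = b, a
    return sum(w * b.get(term, 0.0) for term, w in a.items())


def cluster_by_threshold(vectors, threshold=0.2):
    """Single-link grouping: documents at least `threshold` similar share a label.

    The threshold is a hypothetical value chosen for the toy data below.
    """
    parent = list(range(len(vectors)))

    def find(i):
        # Follow parent links to the cluster root, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            if cosine(vectors[i], vectors[j]) >= threshold:
                parent[find(j)] = find(i)
    return [find(i) for i in range(len(vectors))]


# Toy experiment descriptions (invented for illustration only).
docs = [
    "pcr amplification of gene sample with primer mix",
    "pcr primer design and gene amplification protocol",
    "western blot protein antibody staining",
    "protein antibody incubation for western blot",
]
labels = cluster_by_threshold(tfidf_vectors(docs))
print(labels)
```

On real, noisy notebook text a baseline like this would need proper tokenisation and stop-word handling. Richer representations such as the embeddings or ontology-derived features mentioned in the abstract could be compared with or combined with these vectors before clustering.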
