Towards terminology-based keyword extraction

University essay from Linköpings universitet/Institutionen för datavetenskap

Abstract: The digitization of information has produced an overflow of data in many areas of society, including the clinical sector. However, confidentiality issues concerning the privacy of both clinicians and patients have hampered research into how best to deal with this kind of "clinical" data. One example of clinical data that can be found in abundance is the Electronic Medical Record, or EMR for short. EMRs contain information about a patient's medical history, such as summaries of earlier visits, prescribed medications and more. These EMRs can be quite extensive, and reading them in full can be time-consuming, especially given the often hectic nature of hospital work. Giving clinicians the ability to see which information is important when dealing with extensive EMRs might be very useful. Keyword extraction refers to methods developed in the field of language technology that aim to automatically extract the most important terms or phrases from a text. Successfully applying these methods to EMR data could give clinicians a helping hand when they are short on time. Clinical data are very domain-specific, however, requiring different kinds of expert knowledge depending on which field of medicine is being investigated. Due to the scarcity of research not only on clinical keyword extraction but on clinical data as a whole, foundational groundwork on how best to deal with the domain-specific demands of a clinical keyword extractor needs to be laid. By exploring how the two unsupervised approaches YAKE! and KeyBERT deal with the domain-specific task of implant-focused keyword extraction, the limitations of clinical keyword extraction are tested. Furthermore, the performance of a general BERT model in comparison to a model finetuned on domain-specific data is investigated. Finally, an attempt is made to create a domain-specific set of gold-standard keywords by combining unsupervised approaches to keyword extraction.
The results show that unsupervised approaches perform poorly when dealing with domain-specific tasks that do not have a clear correlation to the main domain of the data. Finetuned BERT models seem to perform almost as well as a general model when tasked with implant-focused keyword extraction, although further research is needed. Finally, the use of unsupervised approaches in conjunction with manual evaluations provided by domain experts shows some promise.
