Synthetic data generation for domain adaptation of a retriever-reader Question Answering system for the Telecom domain : Comparing dense embeddings with BM25 for Open Domain Question Answering

University essay from KTH/Skolan för elektroteknik och datavetenskap (EECS)

Abstract: Having computer systems capable of answering questions has been a goal within Natural Language Processing research for many years. Machine Learning systems have recently become increasingly proficient at this task with large language models obtaining state-of-the-art performance. Retriever-reader architectures have become a powerful approach for building systems that enable users to enter questions and get factual answers from a corpus of documents. This architecture uses a retriever component that fetches the most relevant documents and a reader which in turn extracts the answer from the documents. These systems commonly use transformer-based models for both components, which have been fine-tuned on a general domain of documents, such as Wikipedia. However, the performance of such systems on new domains, with different vocabularies, can be lacking. Furthermore, new domains of, for instance, company-specific documents often lack annotated data which makes training new models cumbersome. This thesis investigated how a retriever-reader-based architecture can be adapted to a corpus of Telecom documents by generating question-answer data using a large generative language model, GPT3.5. Also, it compared the usage of a dense retriever using BERT to a BM25-based retriever on the domain. Findings suggest that generating training data can be an effective approach for fine-tuning a dense retriever, increasing the Top-K retrieval accuracy by 20 points for k = 10, compared to a dense retriever fine-tuned on Wikipedia. Additionally, it is found that the sparse retriever outperforms the best dense retriever, although, there is reason to believe that the structure of the test dataset could influence this. Finally, the results also indicate that the performance of the reader is not improved by the generated data although future work is needed to draw better conclusions.

  AT THIS PAGE YOU CAN DOWNLOAD THE WHOLE ESSAY. (follow the link to the next page)