Models, Keys, and Cryptanalysis: Evaluating historical statistical language models in cryptanalysis of homophonic substitution ciphers

University essay from Göteborgs universitet/Institutionen för filosofi, lingvistik och vetenskapsteori

Abstract: This thesis presents an empirical study in historical cryptography, carried out within the framework of the research project DECRYPT. One of the research questions in the DECRYPT project concerns the use of language models for automatic cryptanalysis, in particular whether models built from historical language data perform better than large-scale models generated from contemporary language corpora. The present thesis explores this question for the English language, applied to the classical cipher known as homophonic substitution. Key complexity and message length are also taken into consideration. A short survey of real historical cryptological keys is also performed in order to gain insights into key design. Statistical n-gram models are generated from the HistCORP collection of historical language and corpora. Test data is generated from the same dataset and encrypted with keys of different complexity. Each sample of test data is then cryptanalysed with a publicly available algorithm for cryptanalysis, and the results from the different models are evaluated and compared. The results of the experiments indicate that historical texts tend to be analysed more accurately with models based on historical language data. In particular, the performance seems to correlate with the evolution of orthography. Key complexity and message length also influence the results: a less complex key and a longer message generally lead to better accuracy of the cryptanalysis. The results can be viewed as a stepping stone towards the broader question of automatic cryptanalysis of historical ciphers, and of how suitable language models could or should be assembled.
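
For readers unfamiliar with the cipher type, the following is a minimal illustrative sketch of homophonic substitution, in which each plaintext letter may be replaced by any one of several cipher symbols. It is not taken from the thesis: the function names (`make_key`, `encrypt`, `decrypt`), the use of integers as cipher symbols, and the homophone counts are all assumptions chosen for illustration. Key complexity is modelled here simply as the number of homophones assigned to each letter.

```python
import random
import string


def make_key(homophones_per_letter, rng):
    """Build a hypothetical key: each letter gets its own disjoint list of symbols.

    Letters missing from homophones_per_letter get exactly one symbol.
    """
    counts = {ch: homophones_per_letter.get(ch, 1) for ch in string.ascii_lowercase}
    symbols = list(range(sum(counts.values())))
    rng.shuffle(symbols)  # scatter symbols so they do not reveal alphabetical order
    key, pos = {}, 0
    for ch, n in counts.items():
        key[ch] = symbols[pos:pos + n]
        pos += n
    return key


def encrypt(plaintext, key, rng):
    """Replace each letter by a randomly chosen homophone; drop non-letters."""
    return [rng.choice(key[ch]) for ch in plaintext.lower() if ch in key]


def decrypt(ciphertext, key):
    """Invert the key: every symbol maps back to exactly one letter."""
    inverse = {sym: ch for ch, syms in key.items() for sym in syms}
    return "".join(inverse[sym] for sym in ciphertext)


rng = random.Random(0)  # fixed seed so the example is repeatable
key = make_key({"e": 3, "t": 2, "a": 2}, rng)  # assumed homophone counts
ciphertext = encrypt("Attack at dawn", key, rng)
recovered = decrypt(ciphertext, key)
```

Giving frequent letters more homophones flattens the symbol frequency distribution, which is why cryptanalysis of such ciphers typically relies on n-gram statistics rather than single-symbol counts.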
