Automatic fingerprinting of websites

University essay from KTH/Hälsoinformatik och logistik

Author: Alfred Berg; Norton Lamberg; [2020]

Abstract: Fingerprinting a website is the process of identifying what technologies a website uses, such as its web applications and JavaScript frameworks. Current fingerprinting methods use manually created fingerprints for each technology they look for. These fingerprints consist of multiple text strings that are matched against an HTTP response from a website. Creating these fingerprints for each technology can be time-consuming, which limits what technologies fingerprints can be built for. This thesis presents a potential solution by utilizing unsupervised machine learning techniques to cluster websites by their web applications and JavaScript frameworks, without requiring manually created fingerprints. Our solution uses multiple bag-of-words models combined with the dimensionality reduction technique t-SNE and the clustering algorithm OPTICS. Results show that some technologies, for example Drupal, achieve a precision of 0.731 and a recall of 0.485 without any training data. These results lead to the conclusion that the proposed solution could plausibly be used to cluster websites by the web applications and JavaScript frameworks in use. However, further work is needed to increase the precision and recall of the results.

Keywords: Clustering, fingerprinting, OPTICS, t-SNE, headless browser, bag-of-words, unsupervised machine learning
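To make the precision and recall figures above concrete, the following minimal sketch shows one common way to score a single cluster against a single technology. It is an illustration under assumptions, not the thesis's evaluation code: the abstract does not say how clusters are matched to technologies, and the function name `precision_recall` and the example hostnames are hypothetical.

```python
def precision_recall(cluster_sites, true_sites):
    """Return (precision, recall) for one cluster judged against one technology.

    cluster_sites: sites the clustering placed in the cluster (assumed input).
    true_sites:    sites known to use the technology (assumed ground truth).
    """
    cluster_sites = set(cluster_sites)
    true_sites = set(true_sites)
    true_positives = len(cluster_sites & true_sites)
    # Precision: share of the cluster that really uses the technology.
    precision = true_positives / len(cluster_sites) if cluster_sites else 0.0
    # Recall: share of the technology's sites that ended up in the cluster.
    recall = true_positives / len(true_sites) if true_sites else 0.0
    return precision, recall


# Hypothetical data: a cluster of four sites, three of which use Drupal,
# out of six Drupal sites in the whole data set.
cluster = {"a.example", "b.example", "c.example", "d.example"}
drupal = {"a.example", "b.example", "c.example",
          "e.example", "f.example", "g.example"}

precision, recall = precision_recall(cluster, drupal)
print(f"precision={precision:.3f} recall={recall:.3f}")
```

Read this way, a precision of 0.731 would mean that roughly three out of four sites in the cluster use the technology, while a recall of 0.485 would mean that about half of the sites using it were grouped into that cluster.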
