Tree positional encodings for transformer models on HTML DOM tree element classification: Enabling structurally aware transformer models through positional encodings to improve performance on an HTML element classification problem

University essay from KTH/Skolan för elektroteknik och datavetenskap (EECS)

Abstract: As internet access and usage continue to spread, the field of web learning keeps growing, with the aim of automating and improving parts of our experience on the web. Research in web learning has often lagged behind its counterpart in Natural Language Processing (NLP), and novel methods often reach adoption in web learning research with a delay. Web pages are more complex in both content and structure, as they are semi-structured documents divided into sections, often containing a combination of images, text, and markup. For humans, this is not difficult to understand, as we are familiar with the structure of web pages and are in fact often aided by the styling and markup of the pages. For machine learning algorithms, however, this structure and mixture of content pose several challenges that differ in nature from those posed by comparable documents in NLP problems. Transformer models have shown significant performance gains on a multitude of tasks ranging from NLP to image processing. This thesis studies alternative and novel approaches to encoding the positional information of nodes in a HyperText Markup Language (HTML) Document Object Model (DOM) tree in order to enable effective use of transformer models on web page data. The problem studied was an HTML element classification problem, specifically the task of extracting product data from a product web page. Three positional encodings for tree-structured data were studied: Breadth First Search (BFS), Depth First Search (DFS), and novel tree positional encodings. Each encoding was used to train a transformer model, and the three resulting models were compared to a baseline transformer model trained with no positional encoding in order to measure the change in performance that each encoding produced. The analysis of the results shows that the BFS and DFS encodings increased model performance across all measured metrics (precision, recall, F1-score, accuracy) by up to 1% in absolute terms. The novel tree positional encodings resulted in worse model performance across all measured metrics. The results show that transformers benefit from certain tree positional encodings of the HTML elements, and further research should be done to determine how these positions can be effectively encoded for transformer models to process web pages.
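To illustrate the BFS and DFS encodings named in the abstract, the following minimal sketch assigns each node of a small tree its index in breadth-first and in pre-order depth-first visiting order. The abstract does not state how these indices are turned into model inputs, so the sketch stops at the indices; the `Node` class, the function names, and the example tree are hypothetical and are not taken from the thesis. The novel tree positional encodings are not sketched because the abstract does not describe them.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical stand-in for a DOM element: a tag name plus child nodes."""
    tag: str
    children: list = field(default_factory=list)


def bfs_positions(root):
    """Map id(node) -> index in breadth-first (level by level) visiting order."""
    positions = {}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        positions[id(node)] = len(positions)
        queue.extend(node.children)
    return positions


def dfs_positions(root):
    """Map id(node) -> index in pre-order depth-first visiting order."""
    positions = {}
    stack = [root]
    while stack:
        node = stack.pop()
        positions[id(node)] = len(positions)
        # Push children in reverse so the leftmost child is visited first.
        stack.extend(reversed(node.children))
    return positions


# Illustrative tree (an assumption, not data from the thesis):
# html -> [head -> [title], body -> [h1, span]]
title = Node("title")
head = Node("head", [title])
h1 = Node("h1")
price = Node("span")
body = Node("body", [h1, price])
html = Node("html", [head, body])

bfs = bfs_positions(html)
dfs = dfs_positions(html)
```

In this example tree, `body` receives BFS index 2 because it sits on the second level, but DFS index 3 because the `head` subtree is fully visited first. The two orderings therefore expose different aspects of the tree structure to a model that consumes them as positions.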
