A comparison of Data Stores for the Online Feature Store Component: A comparison between NDB and Aerospike

University essay from KTH/Skolan för elektroteknik och datavetenskap (EECS)

Abstract: This thesis investigated which Data Stores are suitable for use as an Online Feature Store, a component of the Machine Learning infrastructure that must serve low-latency Reads at high throughput and with high availability. The Data Stores were evaluated with real feature workloads from Spotify’s Search system. First, an investigation was made to find suitable storage systems. NDB and Aerospike were selected because of their state-of-the-art performance and their suitable functionality. Each was then implemented as the Online Feature Store: feature data was written to the Data Stores using Google Dataflow and batch Read through a Java program. With 1 client, NDB achieved about 35% higher batch Read throughput and around 30% lower P99 latency than Aerospike. With 8 clients, NDB achieved 20% higher batch Read throughput, while the difference in P99 latency relative to Aerospike varied. In an 8-node setup, however, NDB achieved on average 35% lower latency. Aerospike achieved 50% faster Write speeds when writing feature data to the Data Stores. The Read performance of both Data Stores was found to suffer when Writes took place at the same time as Reads, with the P99 Read latency increasing by around 30% for both. It was concluded that both Data Stores would work as an Online Feature Store, but that NDB achieved better Read performance, which is one of the most important factors for this type of Feature Store.
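The abstract reports its results as batch Read throughput, P99 latency, and percentage differences between the two Data Stores. As a minimal sketch of how such figures can be derived from raw timings, the Java helper below computes a nearest-rank percentile, a rows-per-second throughput, and a relative difference. The class and method names are hypothetical, and the nearest-rank definition of the percentile is an assumption; the abstract does not state which method the thesis used.

```java
import java.util.Arrays;

// Hypothetical helper, not taken from the thesis: summarises latency samples
// collected by a benchmark client.
final class LatencyStats {

    // Nearest-rank percentile (an assumed definition): the smallest sample
    // such that at least p percent of all samples are less than or equal to it.
    static long percentile(long[] latenciesMicros, double p) {
        if (latenciesMicros.length == 0) {
            throw new IllegalArgumentException("no samples");
        }
        if (p <= 0.0 || p > 100.0) {
            throw new IllegalArgumentException("p must be in (0, 100]");
        }
        long[] sorted = latenciesMicros.clone(); // leave the caller's array untouched
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p * sorted.length / 100.0);
        return sorted[Math.max(rank, 1) - 1];
    }

    // Rows read per second over a measured wall-clock interval.
    static double throughputPerSecond(long rowsRead, long elapsedNanos) {
        if (elapsedNanos <= 0) {
            throw new IllegalArgumentException("elapsedNanos must be positive");
        }
        return rowsRead / (elapsedNanos / 1_000_000_000.0);
    }

    // Relative difference of candidate against baseline, in percent.
    // Positive means the candidate value is higher than the baseline.
    static double relativeChangePercent(double baseline, double candidate) {
        if (baseline == 0.0) {
            throw new IllegalArgumentException("baseline must be non-zero");
        }
        return (candidate - baseline) * 100.0 / baseline;
    }
}
```

With this helper, a statement such as "35% higher throughput" would correspond to `relativeChangePercent(aerospikeThroughput, ndbThroughput)` returning about 35, where both arguments are placeholder measurements supplied by the caller.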
