Leerec : A scalable product recommendation engine suitable for transaction data.
Abstract: We are currently living in the Internet of Things (IoT) era, which involves devices that are connected to Internet and are communicating with each other. Each year, the number of devices increases rapidly, which result in rapid growth of data that is generated. This large amount of data is sometimes titled as Big Data, which is generated from different sources, such as log data of user behavior. These log files can be collected and analyzed in different ways, such as creating product recommendations. Product recommendations have been around since the late 90s, when the amount of data collected were not at the same level as it is today. The aim of this thesis has been to investigating methods to process and create product recommendations to see how well they are adapted for Big Data. This has been accomplished by three theory studies on how to process user events, how to make the product recommendation algorithm called collaborative filtering scalable and finally how to convert implicit feedback to explicit feedback (ratings). This resulted in a recommendation engine consisting of Apache Spark as the data processing system, which had three functions: read multiple log files and concatenate log files for each month, parsing the log files of the user events to create explicit ratings from the transactions and create four types of recommendations. The NoSQL database MongoDB was chosen as the database to store the different types of product recommendations that was created. To be able to get the recommendations from the recommendation engine and the database, a REST API was implemented which can be used by any third-party. What can be concluded from the results of this thesis work is that the system that was implemented is partial scalable. This means that Apache Spark was scalable for both concatenating files, parse and create ratings and also create the recommendations using the ALS method. However, MongoDB was shown to be not scalable when managing more than 100 concurrent requests. Future work involves making the recommendation engine distributed in a multi-node cluster to utilize the parallelization of Apache Spark. Other recommendations include considering other NoSQL databases that might be more scalable than MongoDB.
AT THIS PAGE YOU CAN DOWNLOAD THE WHOLE ESSAY. (follow the link to the next page)