Large Scale Privacy-Centric Data Collection, Processing, and Presentation

University essay from Luleå tekniska universitet/Institutionen för system- och rymdteknik

Abstract: Collecting statistical data from online sources has become an important part of business development. Information about users and how they interact with an online source can help improve the user experience and increase product sales. Collecting data about users has many benefits for the business owner, but it also raises privacy issues, since more and more information about users is spread over the internet. Tools that collect statistical data from online sources exist, but using them means giving up control over the collected data. A business that implements its own analytics system can more easily make it privacy-centric and keeps control over the collected data. This thesis examines which techniques are most suitable for a system whose purpose is to collect, store, process, and present large-scale privacy-centric data. Research was conducted on which technique to use for collecting data and how to keep track of unique users in a privacy-centric way, as well as on which database can handle many write requests and store large-scale data. A prototype was implemented based on this research, in which JavaScript tagging is used to collect data from several online sources and cookies are used to keep track of unique users. Cassandra was chosen as the database for the prototype because of its high scalability and fast handling of write requests. Two versions of the processing of raw data into statistical reports were implemented in order to evaluate whether the data should be preprocessed or whether the reports could be created when the user asks for them. To evaluate the techniques used in the prototype, load tests were performed; the results showed that a bottleneck was reached after 45 seconds at a workload of 600 write requests per second. The tests also showed that the prototype sustained its performance at a workload of 500 write requests per second for one hour, during which it completed 1 799 953 requests.
Latency tests of the processing of raw data into statistical reports were also performed to evaluate whether the data should be preprocessed or processed when the user asks for the report. The results showed that it took around 30 seconds to process 1 200 000 rows of data from the database, which is too long for a user to wait for a report. Investigating which part of the processing contributed most to the latency showed that it was the retrieval of data from the database: retrieving the data took around 25 seconds, while processing it into statistical reports took only around 5 seconds. The tests showed that Cassandra is slow when retrieving many rows of data but fast when writing data, which is more important in this prototype.
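
The figures quoted in the abstract can be put side by side with a little arithmetic. The following is only an illustrative sketch in Python, not code from the thesis: the function and variable names are hypothetical, and the latency figures are the approximate values ("around 25 seconds", "around 5 seconds") stated above.

```python
# Illustrative arithmetic on the figures quoted in the abstract.
# All names are hypothetical; none of this is code from the prototype.

TARGET_RATE = 500        # write requests per second (one-hour load test)
DURATION_S = 60 * 60     # one hour, in seconds
COMPLETED = 1_799_953    # requests completed during the one-hour test

ROWS = 1_200_000         # rows read when building a report
RETRIEVAL_S = 25.0       # approximate time to retrieve the rows
PROCESSING_S = 5.0       # approximate time to turn them into reports


def load_test_summary(rate, duration_s, completed):
    """Compare completed requests with the nominal request count."""
    expected = rate * duration_s
    return {
        "expected": expected,
        "shortfall": expected - completed,
        "completion_ratio": completed / expected,
        "effective_rate": completed / duration_s,
    }


def latency_breakdown(rows, retrieval_s, processing_s):
    """Split the total report latency into retrieval and processing."""
    total_s = retrieval_s + processing_s
    return {
        "total_s": total_s,
        "retrieval_share": retrieval_s / total_s,
        "rows_per_second_retrieval": rows / retrieval_s,
    }


load = load_test_summary(TARGET_RATE, DURATION_S, COMPLETED)
latency = latency_breakdown(ROWS, RETRIEVAL_S, PROCESSING_S)

print(f"expected requests: {load['expected']}")
print(f"shortfall:         {load['shortfall']}")
print(f"effective rate:    {load['effective_rate']:.2f} req/s")
print(f"retrieval share:   {latency['retrieval_share']:.0%}")
```

Under these numbers, the nominal count for 500 requests per second over one hour is 1 800 000, so the completed total falls 47 requests short of it; the abstract does not say whether that difference reflects failed requests or timing at the edges of the test. Retrieval accounts for roughly five sixths of the 30-second report latency, which is consistent with the conclusion that reading from the database, not the processing itself, dominates.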

