Squeezing and Accelerating Neural Networks on Resource Constrained Hardware for Real Time Inference

University essay from KTH/Skolan för elektroteknik och datavetenskap (EECS)

Abstract: As the internet user base grows over the years, so do the logistical difficulties of handling ever larger volumes of data. This large amount of information is now being exploited by Artificial Intelligence algorithms to deliver value to our society on a global scale. Among the algorithms employed, Neural Networks have seen widespread adoption in industrial settings, where they are enabling the automation of tasks that computers previously could not solve. Today, efficiency limits the applicability of this technology to Big Data, and effort is going into developing new acceleration solutions.

In this project, we analyze the computational capabilities of a multicore Digital Signal Processor (DSP) called the EMCA (Ericsson Many-Core Architecture) when it comes to executing Neural Networks. The EMCA is a proprietary chip used for real-time processing of data in the pipeline of a Radio Base Station.

We developed an inference engine to run Neural Networks on the EMCA. The engine's software was written for Flake OS, a proprietary operating system that runs on the EMCA. On top of the inference engine, we wrote a neural network squeezing pipeline based on quantization. On MNIST, the quantization algorithm reduces the size of the networks by a factor of 4 with less than 1% accuracy degradation. The inference engine has been optimized to exploit the quantization utility and can run quantized neural networks. Tests were carried out to understand the direct implications of using this algorithm. We show that quantization is indeed beneficial for inference on DSPs.

Finally, the EMCA has demonstrated state-of-the-art computational capabilities for neural network inference.
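To make the size reduction concrete, here is a minimal sketch of one common quantization scheme: affine (min/max) mapping of 32-bit floating-point weights to 8-bit integer codes. The abstract does not say which scheme or bit widths the thesis uses, so the 32-bit to 8-bit choice, the function names `quantize` and `dequantize`, and the sample weights are all assumptions made for illustration. Under those assumptions the storage ratio works out to 4, which matches the factor reported above.

```python
import struct


def quantize(weights, num_bits=8):
    """Affine (min/max) quantization of floats to unsigned integer codes.

    Illustrative only: the thesis may use a different scheme.
    """
    qmax = (1 << num_bits) - 1
    w_min, w_max = min(weights), max(weights)
    # Guard against a constant tensor, where the range would be zero.
    scale = (w_max - w_min) / qmax or 1.0
    codes = [round((w - w_min) / scale) for w in weights]
    return codes, scale, w_min


def dequantize(codes, scale, w_min):
    """Map integer codes back to approximate floating-point weights."""
    return [c * scale + w_min for c in codes]


# Hypothetical weights, not taken from the thesis.
weights = [-0.51, -0.12, 0.0, 0.07, 0.33, 0.94]
codes, scale, w_min = quantize(weights)
restored = dequantize(codes, scale, w_min)

# Storage cost: 4 bytes per float32 weight versus 1 byte per 8-bit code.
float_bytes = len(struct.pack(f"{len(weights)}f", *weights))
int8_bytes = len(bytes(codes))

# Rounding to the nearest code bounds the error by half a quantization step.
max_error = max(abs(a - b) for a, b in zip(weights, restored))

print(f"compression ratio: {float_bytes / int8_bytes:.0f}x")
print(f"max reconstruction error: {max_error:.5f} (step = {scale:.5f})")
```

The sketch ignores the small per-tensor overhead of storing `scale` and `w_min`, which becomes negligible for layers with many weights.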
