Gated Recurrent Unit Neural Networks for Hearing Instruments
Abstract: Gated Recurrent Unit (GRU) neural networks have gained popularity for applications such as keyword spotting, speech recognition and other artificial intelligence tasks. In most applications, training and inference are performed on cloud servers, and the results are transferred to the power-constrained device, e.g., a hearing instrument (HI). This approach has disadvantages such as latency, dependence on connectivity, privacy concerns, and a high energy cost per bit for real-time data transfer. Therefore, there is a strong demand to move inference from the cloud to power-constrained devices. However, executing inference on an HI introduces many challenges in terms of throughput, power budget, and memory footprint. This research investigates how efficient it is to execute inference on a dedicated hardware accelerator rather than on an existing audio digital signal processor (xDSP in Oticon’s HI). The two approaches are compared in terms of area, power, energy dissipation and the total number of clock cycles required to perform an inference. A straightforward implementation of the nonlinear activation functions is expensive in hardware; therefore, different approximation methods are evaluated. Of these approximation algorithms, the fast sigmoid and fast tanh approaches were chosen. A pretrained keyword spotting (KWS) model was used as the benchmark. However, it exceeds the memory space available on the xDSP. Instead, three small GRU networks were trained and executed on the xDSP to estimate the energy dissipation and clock cycle count that a bigger network would require on the xDSP. The precision used to store and compute data was reduced to minimize the storage needed while preserving detection accuracy. By reducing the wordlength of the network parameters from 32 bits to 8 bits, the required memory space was reduced by a factor of 4, while accuracy decreased from 91% to 88%. The GRU inference runs on a per-layer basis, and the data flow was optimized to achieve a significant reduction in area and power.
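To illustrate the kind of low-cost activation approximation the abstract refers to, the sketch below uses one common division-based form, x / (1 + |x|), for fast tanh and a rescaled version of it for fast sigmoid. The abstract does not give the exact formulas used in the thesis, so these definitions, and the helper names `fast_tanh`, `fast_sigmoid` and `max_abs_error`, are assumptions made for illustration only.

```python
import math


def fast_tanh(x: float) -> float:
    """Division-based stand-in for tanh: x / (1 + |x|), bounded in (-1, 1).

    Assumed form; the thesis may define its fast tanh differently.
    """
    return x / (1.0 + abs(x))


def fast_sigmoid(x: float) -> float:
    """Same curve rescaled to (0, 1), as a stand-in for 1 / (1 + exp(-x))."""
    return 0.5 * fast_tanh(x) + 0.5


def sigmoid(x: float) -> float:
    """Exact logistic sigmoid, used only as the reference."""
    return 1.0 / (1.0 + math.exp(-x))


def max_abs_error(approx, exact, lo=-8.0, hi=8.0, steps=1601):
    """Largest absolute gap between approx and exact on a uniform grid."""
    width = (hi - lo) / (steps - 1)
    return max(abs(approx(lo + i * width) - exact(lo + i * width))
               for i in range(steps))


if __name__ == "__main__":
    print("tanh    max abs error:", max_abs_error(fast_tanh, math.tanh))
    print("sigmoid max abs error:", max_abs_error(fast_sigmoid, sigmoid))
```

Both functions need only an absolute value, an addition and a division, with no exponential, which is what makes this family attractive for hardware. The price is a visible deviation from the exact curves, so a network would normally be trained or fine-tuned with the approximation in place.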
For a benchmark keyword spotting neural network, the xDSP needs around 2× more clock cycles than the dedicated hardware accelerator to complete a full network inference. The energy dissipation is around 10× higher when using Oticon’s xDSP processor instead of the dedicated accelerator. The xDSP is capable of executing a GRU network with up to 40 neurons per layer, but for bigger networks the hardware accelerator is the better solution. All in all, the dedicated accelerator performs best of the explored solutions and can be integrated into an HI to compute neural networks.
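The 4× memory reduction reported for the move from 32-bit to 8-bit parameters follows directly from the wordlength ratio. The sketch below shows one simple way such a reduction can be done: symmetric linear quantization with a single scale factor per weight list. The abstract does not describe the quantization scheme used in the thesis, so this scheme and the names `quantize_int8`, `dequantize` and `storage_bytes` are assumptions.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using one shared scale.

    Assumed scheme: symmetric, per-list scaling by the largest magnitude.
    """
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the 8-bit integers."""
    return [q * scale for q in quantized]


def storage_bytes(n_params: int, bits: int) -> int:
    """Bytes needed to store n_params values at the given wordlength."""
    return n_params * bits // 8


if __name__ == "__main__":
    weights = [0.5, -1.0, 0.25, 0.0]
    q, scale = quantize_int8(weights)
    print("quantized:", q)
    print("restored: ", dequantize(q, scale))
    # Illustrative parameter count, not taken from the thesis.
    n = 10_000
    print("bytes at 32 bit:", storage_bytes(n, 32))
    print("bytes at 8 bit: ", storage_bytes(n, 8))
```

Each restored weight differs from the original by at most half a quantization step, which is the rounding error that the accuracy drop in the abstract reflects in aggregate. The 91% and 88% accuracy figures come from the thesis and are not reproduced by this sketch.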