Kernel Mechanisms for Efficient GPU Accelerated Deep Neural Network Inference on Embedded Devices