Developing Deep Learning Algorithms on an Embedded Platform

Date: September 3, 2019

Author: Akshay Srinivasa

Machine learning, and deep learning in particular, can fairly be called the buzzword of the decade. And rightly so! Deep learning is transforming the way we perceive and act on things, and making our lives simpler. Mundane tasks are made easier or automated. Complex problems in ADAS, surveillance, security, healthcare, and retail are solved by bringing in intelligence and autonomy that mimic humans. This article explains deep learning in brief and then lists a step-by-step procedure for developing deep learning algorithms on embedded platforms.

What is deep learning?

Deep learning is a subset of machine learning (ML) methods based on artificial neural networks. The human brain is amazing at learning how to analyse real-time data such as visual or audio inputs. Deep learning mimics the human brain to learn from and analyse real-world data and provide solutions. For many such tasks it can perform predictive work faster and more efficiently than other techniques, from something as simple as gauging customer satisfaction with the help of face recognition to identifying various objects on the road for an autonomous vehicle.

A typical deep learning solution has two components: a training part and an inferencing part. Training is usually done offline using a vast dataset that is representative of the final application. Once training is complete, the inferencing part is developed and deployed on the platform best suited to the real-world application. The platforms on which training and inferencing occur can be vastly different, depending on the constraints of the end application.

Bringing deep learning to the edge: Deploying on embedded platforms

Deep learning tends to be computationally intensive and needs a massive amount of data processing for both training and inferencing. While training can be done offline on powerful PCs and cloud servers, inferencing is usually expected to happen in real time and closer to the edge. For many applications, from traffic sign detection to autonomous robots, the inferencing must happen locally to reduce latency and improve reliability. Real-time, local deployment of deep learning algorithms therefore means selecting the right platform under several constraints on the size, performance and power consumption of embedded systems. It is typically the application that governs which embedded platform is right, while addressing the size-performance-power-cost trade-off.

Steps to build Deep Learning Algorithms on an Embedded Platform

Once an embedded platform is selected, there are five key steps in developing and deploying deep learning algorithms, from data collection to the final porting of the algorithms onto the embedded platform. All of them build on a clear problem definition.

Problem definition

The primary step in developing any deep learning algorithm is to understand the problem or end application clearly and to define clear product requirements.

Data Collection

The next crucial step is data collection, annotation and augmentation, all of which determine the accuracy one can achieve.

Choosing the right Framework

Depending on the problem statement and the selected embedded platform, a deep learning framework needs to be selected.

DL algorithm design

Depending on the nature of the problem, the modalities it operates on, and the capabilities of the platform, a complexity-aware DL model is designed.


Optimization

Deep compression is used to simplify the typically over-parameterized models without losing accuracy.

Embedded Porting

The developed deep learning algorithms are ported to the selected embedded platform, optimizing for the platform architecture and using supported inference engines.

Problem definition

Any algorithm development starts with a clear understanding of the problem one wishes to solve, and the same applies to deep learning on an embedded platform. This involves answering the what, how and why of the problem.

Data collection

Data plays a crucial role in deep learning, not only for training the model but also for testing the trained and tuned algorithm. There are quite a few open-source data sets available that are widely used by researchers. Developers, however, can choose to collect a custom data set themselves and annotate it to suit the application.
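As a minimal sketch of the data handling described here, the Python snippet below holds out part of a data set for testing and applies one common augmentation, a horizontal flip, to an image together with its bounding-box annotation. The function names, the 20% test fraction and the (x_min, y_min, x_max, y_max) box format are assumptions made for illustration, not something prescribed by a particular data set or framework.

```python
import random

def split_dataset(samples, test_fraction=0.2, seed=0):
    """Shuffle deterministically and hold out a fraction for testing."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def hflip(image, box):
    """Mirror an image (a list of pixel rows) and its bounding box.

    The box is (x_min, y_min, x_max, y_max) in inclusive pixel indices;
    this layout is an assumption for the sketch.
    """
    width = len(image[0])
    flipped = [row[::-1] for row in image]
    x_min, y_min, x_max, y_max = box
    # Column x moves to column (width - 1 - x), so min and max swap roles.
    return flipped, (width - 1 - x_max, y_min, width - 1 - x_min, y_max)

train, test = split_dataset(range(10))

image = [[0, 1, 2, 3],
         [4, 5, 6, 7]]
flipped, flipped_box = hflip(image, (0, 0, 1, 1))
```

Flipping the annotation along with the pixels matters: an augmented image whose box still points at the old location would quietly corrupt the training labels.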

Selecting the right framework

Machine learning frameworks like Caffe, Keras, and TensorFlow are often described as the backbone of machine learning. These frameworks differ in many aspects, such as the way they handle data and how they implement various operations. For example, Caffe is layer-based, while TensorFlow is operation-based. There is no such thing as a "right framework"; the choice depends entirely on the application. Factors like the amount of data, the computing infrastructure available, the type of network to be trained, and the targeted embedded platform will ultimately decide the framework.

Deep Learning algorithm design

The building blocks of most deep learning models include feedforward convolutional neural networks (CNNs), RNNs, LSTMs and structured learning paradigms like CRFs and HMMs. However, the actual architecture of a deep learning model is typically decided based on:

  • The nature of the problem (information retrieval, inference or regression)
  • The perceived representation complexity of the problem
  • The modalities it operates on (spatial, temporal, multiple sensors), and
  • The computational bandwidth of the target platform (MACs/FLOPS available)
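To make the last point concrete, the following Python sketch estimates the multiply-accumulate (MAC) count of a single convolution layer and checks it against a platform budget. The layer shape, the 30 frames-per-second target and the budget figures are hypothetical values chosen for illustration; real platform numbers should come from the vendor's documentation.

```python
def conv2d_macs(in_h, in_w, in_ch, out_ch, kernel, stride=1, padding=0):
    """MACs for one standard 2-D convolution layer (bias ignored)."""
    out_h = (in_h + 2 * padding - kernel) // stride + 1
    out_w = (in_w + 2 * padding - kernel) // stride + 1
    # Each output value needs in_ch * kernel * kernel multiply-accumulates.
    return out_h * out_w * out_ch * in_ch * kernel * kernel

def fits_budget(macs_per_frame, fps, platform_macs_per_second):
    """True if the workload fits the platform's MAC throughput."""
    return macs_per_frame * fps <= platform_macs_per_second

# Hypothetical first layer: 224x224 RGB input, 64 filters of size 3x3.
layer_macs = conv2d_macs(224, 224, 3, 64, kernel=3, stride=1, padding=1)

# Hypothetical budgets of 1 and 4 giga-MACs per second.
fits_small = fits_budget(layer_macs, fps=30, platform_macs_per_second=1_000_000_000)
fits_large = fits_budget(layer_macs, fps=30, platform_macs_per_second=4_000_000_000)
```

Summing this estimate over all layers gives a first-order view of whether an architecture is feasible on the target before any training effort is spent on it.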

Optimizing the deep learning algorithms 

State-of-the-art deep learning models have hundreds of millions of coefficients and perform billions of operations, making them both computationally and memory intensive. Typically, deep learning models are over-parameterized and have significant redundancy. Deep compression addresses this problem.

Deep compression reduces the model size of deep networks without losing the original accuracy, and it works in multiple stages. The first stage is pruning, in which redundant connections are removed from the network. The second stage addresses the sparsity of the coefficients, followed by thresholding. The final stage is dynamic quantization, which serves as the middle ground between accuracy and complexity.
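The stages can be sketched in a few lines of Python. This is a simplified illustration rather than the exact scheme described above: pruning is done by a plain magnitude threshold, and a basic symmetric 8-bit linear quantization stands in for the quantization stage. The weight values and the threshold are made up for the example.

```python
def prune(weights, threshold):
    """Magnitude pruning: zero out weights whose absolute value is below threshold."""
    return [w if abs(w) >= threshold else 0.0 for w in weights]

def quantize_int8(weights):
    """Symmetric linear quantization to the int8 range [-127, 127].

    Returns the integer codes and the scale needed to recover the values.
    """
    max_abs = max(abs(w) for w in weights) or 1.0  # avoid a zero scale
    scale = max_abs / 127
    codes = [round(w / scale) for w in weights]
    return codes, scale

def dequantize(codes, scale):
    """Map integer codes back to approximate floating-point weights."""
    return [c * scale for c in codes]

weights = [0.02, -0.5, 0.75, -0.01, 0.3, 1.27]   # made-up example weights
pruned = prune(weights, threshold=0.05)
codes, scale = quantize_int8(pruned)
restored = dequantize(codes, scale)
sparsity = pruned.count(0.0) / len(pruned)       # fraction of weights removed
```

In practice the network is usually retrained after pruning and quantization so that the remaining weights compensate for the removed ones; that step is omitted here.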

Deep compression reduces network size and makes the required storage small (a few megabytes) so that all weights can be cached on-chip instead of going to off-chip DRAM, which is slow and power-consuming.
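A rough back-of-the-envelope calculation shows where the savings come from. The Python sketch below compares the weight storage of a dense 32-bit model with a pruned 8-bit one; the 100 million weights, the 10% of weights kept and the 8-bit width are illustrative assumptions, and the overhead of storing sparse indices is ignored.

```python
def model_size_bytes(num_weights, bits_per_weight, nonzero_percent=100):
    """Approximate weight storage in bytes, ignoring sparse-index overhead."""
    kept = num_weights * nonzero_percent // 100
    return kept * bits_per_weight // 8

# Hypothetical model with 100 million weights.
dense_fp32 = model_size_bytes(100_000_000, 32)                      # 32-bit floats, no pruning
compressed = model_size_bytes(100_000_000, 8, nonzero_percent=10)   # 8-bit codes, 10% kept
ratio = dense_fp32 // compressed
```

Under these assumptions the weights shrink from about 400 MB to about 10 MB; whether that fits on-chip depends on the cache or SRAM size of the specific device.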

Developing an embedded version of deep learning algorithms

The power of deep learning can be truly appreciated when a complex computer vision problem runs in real time on a battery-powered device. This is possible because of the massive scaling obtained by running neural network operations in parallel on GPUs, and because of the ever-increasing SIMD width of the CPUs and DSPs embedded in the latest mobile devices.

Chip makers have realized the potential of CNNs and are now focussing on further improving their performance with custom-built inference engines. These inference engines understand both the architecture of the neural network and the device on which the network is to be run. They use the kernels provided by the chip maker for each layer, instead of the kernels that come with the Caffe or TensorFlow frameworks. With the help of these inference engines, deploying complex deep learning algorithms on embedded platforms is massively simplified.
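The kernel substitution described here can be pictured as a lookup that prefers a device-specific implementation and falls back to a portable one. The Python sketch below is purely conceptual: the "npu" device name, the kernel tables and every function in it are hypothetical and do not correspond to any real inference engine's API.

```python
def reference_relu(values):
    """Portable fallback kernel, standing in for a framework-provided one."""
    return [max(0.0, v) for v in values]

def vendor_relu(values):
    """Stand-in for a chip maker's tuned kernel: same result, different code path."""
    return [v if v > 0.0 else 0.0 for v in values]

# Hypothetical kernel tables; a real engine would hold many layer types.
REFERENCE_KERNELS = {"relu": reference_relu}
VENDOR_KERNELS = {"npu": {"relu": vendor_relu}}

def select_kernel(layer_type, device):
    """Prefer a device-specific kernel, else fall back to the reference one."""
    fallback = REFERENCE_KERNELS[layer_type]
    return VENDOR_KERNELS.get(device, {}).get(layer_type, fallback)

def run(layers, values, device):
    """Execute the layers in order, picking a kernel per layer for the device."""
    for layer_type in layers:
        values = select_kernel(layer_type, device)(values)
    return values

output = run(["relu"], [-1.0, 2.0], device="npu")
```

The point of the design is that the network description stays the same while the per-layer implementation changes with the target device.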


We are witnessing an increasing adoption of deep learning in many intelligent applications spanning autonomous vehicles, surveillance systems, retail analytics, healthcare and more. For reasons of reliability, latency, cost, and performance, it is imperative that deep learning algorithms be deployed locally on embedded platforms. However, embedded platforms come with their own constraints on performance, memory and power consumption. The frameworks and tools supported on these platforms may also vary. The step-by-step procedure of data management, selecting the right framework, deep learning algorithm design, optimization and embedded porting can serve as a guiding framework for deploying deep learning algorithms on embedded platforms.
