Embedded Vision Application – Design approach for real time classifiers
April 26, 2016
Overview of classification technique
Object detection and classification is a supervised learning process in machine vision that recognizes patterns or objects in data or images. It is a major component of Advanced Driver Assistance Systems (ADAS), where it is commonly used to detect pedestrians, vehicles, traffic signs and similar objects.
The offline classifier training process takes sets of selected data or images containing objects of interest, extracts features from this input and maps them to the corresponding labelled classes to generate a classification model. In the online process, real-time inputs are categorized using the pre-trained classification model, which finally decides whether the object is present or not.
Feature extraction uses many techniques, such as Histogram of Oriented Gradients (HOG), Local Binary Pattern (LBP) and features extracted using an integral image. The classifier uses a sliding window, raster scanning or dense scanning approach to operate on the extracted feature image. Multiple image pyramid scales are used (as shown in part A of Figure 1) to detect objects of different sizes.
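The scanning scheme described above can be sketched as follows. This is a minimal illustration, not the implementation behind Figure 1: the 64x128 window, the stride of 8 pixels and the scale step of 1.2 are assumed values, and the function names are hypothetical. A fixed-size window scanned over a smaller pyramid level corresponds to a larger object in the original image.

```python
def pyramid_sizes(width, height, scale_step=1.2, min_size=(64, 128)):
    """Return the (width, height) of each pyramid level, largest first.

    Levels are generated until the image is smaller than one window.
    """
    sizes = []
    w, h = float(width), float(height)
    while int(w) >= min_size[0] and int(h) >= min_size[1]:
        sizes.append((int(w), int(h)))
        w /= scale_step
        h /= scale_step
    return sizes


def sliding_windows(width, height, win=(64, 128), stride=8):
    """Yield the top-left (x, y) of every window in raster-scan order."""
    for y in range(0, height - win[1] + 1, stride):
        for x in range(0, width - win[0] + 1, stride):
            yield x, y


def count_windows(width, height, win=(64, 128), stride=8, scale_step=1.2):
    """Total number of windows the classifier visits over all levels."""
    total = 0
    for w, h in pyramid_sizes(width, height, scale_step, win):
        total += sum(1 for _ in sliding_windows(w, h, win, stride))
    return total
```

Calling count_windows on a full camera frame gives a rough sense of how quickly the number of classifier evaluations grows as the scale step shrinks and the pyramid gets denser.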
Computational complexity of classification for real-time applications
Dense image pyramid structures with 30-40 scales are also used to obtain high detection accuracy, and each pyramid scale can have a single level or multiple levels of features depending on the feature extraction method used. Classifiers such as AdaBoost fetch data from scattered, effectively random data points within a pyramid image scale. Single or double precision floating point is used where higher accuracy is required, which makes the operations computationally intensive. The classification process also involves a significantly larger amount of control code at various levels. These computational complexities make the classifier a difficult module to design efficiently for real-time performance in critical embedded systems, such as those used in ADAS applications.
Consider a typical classification technique such as AdaBoost (adaptive boosting), which does not use all the features extracted from a sliding window. That makes it computationally less expensive than a classifier such as SVM, which uses all the features extracted from the sliding windows in a pyramid image scale. Features are of fixed length in most feature extraction techniques, such as HOG, gradient images and LBP. In the case of HOG, features contain many levels of orientation bins, so each pyramid image scale can have multiple levels of orientation bins, and these levels can be computed in any order, as shown in parts B and C of Figure 1.
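The difference in feature usage can be illustrated with a small sketch. It assumes the boosted classifier is a sum of decision stumps, each described by a hypothetical tuple (feature index, threshold, polarity, weight), and that the SVM is linear; the source does not specify either form. The stump classifier reads only the features its stumps index, at scattered positions, while the linear SVM reads every feature.

```python
def adaboost_score(features, stumps):
    """Weighted vote of decision stumps.

    Each stump is (index, threshold, polarity, alpha). Only the feature
    at `index` is read, so features no stump refers to are never touched.
    """
    score = 0.0
    for index, threshold, polarity, alpha in stumps:
        # polarity is +1 or -1 and selects the direction of the comparison
        vote = 1 if polarity * features[index] < polarity * threshold else -1
        score += alpha * vote
    return score


def linear_svm_score(features, weights, bias):
    """Dot product over the full feature vector plus a bias term."""
    return sum(w * f for w, f in zip(weights, features)) + bias


def is_object(score, decision_threshold=0.0):
    """Final present / not-present decision from either score."""
    return score > decision_threshold
```

The stump loop is cheap in arithmetic but dominated by data-dependent loads and comparisons, which is the control-heavy, random-access behaviour described earlier; the dot product is regular and maps naturally onto SIMD multiplies.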
Design constraints for porting classifier to ADAS SoC
Object classification is generally categorized as a high-level vision processing use case, as it operates on extracted features generated by low- and mid-level vision processing. It requires more control code by nature, as it involves comparisons at various levels. Also, as mentioned earlier, it involves single or double precision floating-point arithmetic. These computational characteristics make classification a better fit for a DSP than for vector processors, which offer more parallel data processing power through SIMD operations.
Typical ADAS processors, such as Texas Instruments' TDA2x/TDA3x SoC, incorporate multiple engines/processors targeted at high-, mid- and low-level vision processing. The TMS320C66x DSP in the TDA2x SoC supports fixed- and floating-point operations, with an 8-way VLIW architecture that can issue up to eight new operations every cycle, SIMD operations for fixed point and fully pipelined instructions. It supports up to 32 8-bit or 16-bit multiplies per cycle and up to eight 32-bit multiplies per cycle. The EVE processor of the TDA2x has a 512-bit Vector Coprocessor (VCOP) with built-in mechanisms and vision-specialized instructions for concurrent, low-overhead processing. There are three parallel flat memory interfaces, each with 256-bit load/store memory bandwidth, providing a combined 768-bit-wide memory bandwidth. Efficient management of load/store bandwidth, internal memory, the software pipeline and integer precision are the major design constraints for achieving maximum throughput from these processors.
The classifier framework can be redesigned or modified to adapt to vector processing requirements, thereby processing more data per instruction or achieving more SIMD operations.
Addressing major design constraints in porting classifier to ADAS SoC
Load/Store bandwidth management
Each pyramid scale can be rearranged to fit the limited internal memory. Functional modules and regions can be selected and arranged appropriately to limit DDR load/store bandwidth to the required level.
Efficient utilization of limited internal memory and cache
The image can be processed in blocks of optimum size, and the memory requirements should fit into the hardware buffers for efficient utilization of memory and computation resources.
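One way to reason about the trade-off between internal buffer size and DDR traffic is sketched below. It is only an illustration under stated assumptions: the feature plane is stored row-major, it is processed in horizontal row blocks, and each block must overlap the previous one by one window height minus one row so that windows straddling a block boundary are still covered. The function names and the buffer size in the usage note are hypothetical, not figures from the TDA2x/TDA3x documentation.

```python
def plan_row_blocks(img_w, img_h, win_h, bytes_per_px, buffer_bytes):
    """Split a feature plane into the tallest row blocks that fit the buffer.

    Returns a list of (start_row, end_row) pairs, end exclusive. Consecutive
    blocks overlap by win_h - 1 rows (the halo) so every window position is
    fully contained in at least one block.
    """
    row_bytes = img_w * bytes_per_px
    max_rows = buffer_bytes // row_bytes
    if max_rows < win_h:
        raise ValueError("buffer cannot hold a strip one window high")
    halo = win_h - 1
    blocks = []
    start = 0
    while True:
        end = min(start + max_rows, img_h)
        blocks.append((start, end))
        if end == img_h:
            break
        start = end - halo  # re-fetch the halo rows for the next block
    return blocks


def ddr_bytes_loaded(blocks, img_w, bytes_per_px):
    """Bytes read from external memory, counting halo rows once per block."""
    return sum((end - start) * img_w * bytes_per_px for start, end in blocks)
```

For example, a 640x360 plane of 1-byte features with a 128-row window and an assumed 128 KB buffer is split into four blocks, and the halo rows roughly double the bytes loaded compared with reading the plane once. A larger buffer, a narrower strip or a rearranged pyramid scale reduces that overhead.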
Software pipeline design for achieving maximum throughput
Some of the techniques that can be used to achieve maximum throughput from software pipelining are mentioned below.
- Loop structure and its nested levels should fit into the hardware loop buffer requirements. For example C66x DSP in TDA3x has restrictions over its SPLOOP buffer such as initiation interval should be
- Unaligned memory loads and stores should be avoided, as they are computationally expensive and in most cases take twice as many cycles as aligned loads and stores.
- Data can be arranged, or compiler directives and options can be set, to get the maximum number of SIMD load, store and compute operations.
- Double precision operations can be converted to single precision floating point or fixed point representations, but the offline classifier should be retrained after such precision changes.
- The innermost loop should be kept simple, without much control code, to avoid register pressure and register spilling.
- Division operations can be avoided by using table lookup multiplication or inverse (reciprocal) multiplication instead.
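Two of the techniques in this list, fixed point conversion and replacing division with a table lookup and multiply, can be sketched as follows. The Q15 format, the 16-bit shift and the function names are assumptions chosen for illustration; on a real DSP the same arithmetic would be written with integer types and intrinsics. The table stores rounded-up reciprocals, which for 8-bit operands and a 16-bit shift reproduces integer division.

```python
Q = 15  # number of fractional bits (assumed Q15 format)


def to_q15(x):
    """Quantize a float in [-1, 1) to a signed Q15 integer, saturating."""
    v = int(round(x * (1 << Q)))
    return max(-(1 << Q), min((1 << Q) - 1, v))


def q15_mul(a, b):
    """Multiply two Q15 values and return a Q15 result."""
    return (a * b) >> Q


def build_reciprocal_table(max_divisor, shift=16):
    """table[d] holds ceil(2**shift / d) for d = 1..max_divisor.

    Index 0 is a placeholder because division by zero is undefined.
    """
    return [0] + [((1 << shift) + d - 1) // d
                  for d in range(1, max_divisor + 1)]


def divide_by_table(n, d, table, shift=16):
    """Approximate n // d with one table lookup, one multiply, one shift."""
    return (n * table[d]) >> shift
```

A table of 255 entries would typically be built once offline and kept in internal memory, so the inner loop contains only a load, a multiply and a shift. Any thresholds or weights quantized with to_q15 change the classifier's numerical behaviour, which is why retraining is recommended after the precision change.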
Classification for real-time embedded vision applications is a difficult computational problem because of its dense data processing requirements, floating point precision requirements, multilevel control code and data fetching requirements. These computational complexities significantly limit how much vector processing power a classifier design can exploit. However, the classifier framework on a target platform can be redesigned or modified to leverage the platform's vector processing capability by efficiently applying techniques such as load/store bandwidth management, internal memory and cache management, and software pipeline design.
Dr. Jagadeesh Sankaran (Senior Member Technical Staff, Texas Instruments Incorporated) and Dr. Nikolic Zoran (Member Group Technical Staff, Texas Instruments Incorporated), "TDA2x, a SoC Optimized for Advanced Driver Assistance Systems," 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).