Parallelizing the HEVC Encoder

High Efficiency Video Coding (HEVC) is the next-generation video codec that promises to cut the bit rate in half compared to the H.264/AVC codec. HEVC has bigger block sizes, more prediction modes and a host of other algorithmic tools to achieve this target. However, this comes at a substantially higher computational cost: HEVC, as it stands today, needs about 8 times more computational power to deliver twice the compression ratio.
Keeping pace with Moore's law, general-purpose processors have been cramming more logic into less space, thanks to evolving manufacturing processes that keep adding computational power. This power now arrives in the form of more cores per processor. So the focus of HEVC encoders is not just on having the fastest algorithm with the best quality, but also on one that can be executed in parallel on multiple cores with a minimal penalty on quality. Server-grade processors now pack more than 18 cores into a single socket, increasing the importance of system design in HEVC encoders. This provides a strong new impetus to transform all algorithms and code flows to maximally utilize the available computational power across all cores.

So how does HEVC fare in a multicore scenario?

HEVC includes many tools that are friendly to parallel processing. These multicore-friendly tools are discussed in detail in this whitepaper:
Parallelizing an HEVC encoder using slices and tiles is very simple: an input frame is divided into a number of slices equal to the number of cores or threads available. This is the quickest route to a working multicore encoder, and it delivers a very good multicore scaling factor.
The scaling factor on a multicore system is the speedup achieved over a single core by using multiple cores for the same job. If a job takes 1 s on a single core but only 0.5 s on an N-core system, the scaling factor is 2. If the N-core system is a dual core, an ideal scaling of 100% is achieved; but if it is a quad core, only 50% scaling efficiency is achieved. It is almost impossible to achieve ideal scaling unless the job can be split into fully independent sub-jobs.
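The arithmetic above can be captured in a couple of helpers. This is a minimal sketch; the timing values are the illustrative numbers from the text, not measurements:

```python
def scaling_factor(t_single_core, t_multi_core):
    """Speedup over a single core when running the same job on multiple cores."""
    return t_single_core / t_multi_core

def scaling_efficiency(t_single_core, t_multi_core, n_cores):
    """Fraction of the ideal n-core speedup that is actually achieved."""
    return scaling_factor(t_single_core, t_multi_core) / n_cores

# The example from the text: 1 s on one core, 0.5 s on an N-core system.
print(scaling_factor(1.0, 0.5))          # scaling factor of 2
print(scaling_efficiency(1.0, 0.5, 2))   # dual core: 1.0, i.e. ideal 100%
print(scaling_efficiency(1.0, 0.5, 4))   # quad core: 0.5, i.e. only 50%
```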
Figure 1 : Slices and Tiles
An HEVC frame, when partitioned into slices or tiles and encoded on different cores, should in principle scale ideally, because each slice and tile is independent of the others. In practice this cannot be achieved, because the blocks in each slice or tile differ in complexity, so different cores take different amounts of time to encode them. The time a thread waits after finishing its task is inversely related to the complexity of the slice or tile it encodes, and the scaling factor in turn falls as the wait time grows: the longer a core waits, the lower the scaling. This is further aggravated by the fast mode-decision algorithms present in encoders, which predict the final modes accurately and finish easy blocks early, widening the spread in encoding times.
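The effect of slice complexity imbalance on scaling can be illustrated numerically. In this sketch the per-slice encode times are made-up values, with one core per slice; the frame is finished only when the slowest slice is done, so cores holding easy slices simply wait:

```python
def slice_scaling(slice_times):
    """Scaling factor when each slice is encoded on its own core:
    serial time is the sum of the slice times, parallel time is the
    slowest slice, and every other core waits for it."""
    return sum(slice_times) / max(slice_times)

# Perfectly balanced slices on four cores: ideal scaling of 4.
print(slice_scaling([10.0, 10.0, 10.0, 10.0]))   # 4.0
# Unequal complexity: cores with easy slices idle, and scaling drops.
print(slice_scaling([16.0, 10.0, 8.0, 6.0]))     # 2.5
```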
Also, by encoding with slices or tiles, the frame effectively becomes a collection of independently encoded segments with no interlink between them. This has a large impact on visual quality, with visible compression artefacts at the edges of slices or tiles. These artefacts can be partially avoided by applying the de-blocking and the new SAO filters across slice or tile boundaries. But when encoding a high-motion sequence at a challenging bitrate, the artefacts will definitely be noticeable.
The challenge in multicore HEVC encoding is always to achieve the best possible scaling while sacrificing the least possible video quality. Performance and quality measures of a video encoder are always in battle with each other, and with multicore the battle gets more ammunition. But diplomacy, in the form of Wavefront Parallel Processing (WPP), keeps the peace to a certain extent.


Wavefront Parallel Processing
Figure 2 : Wavefront Parallel Processing
One of the major serial processing blocks in any video encoder is the block-by-block arithmetic coding of the signals and transform coefficients in raster-scan order. Again, this can be parallelized with slices and tiles, but with a penalty in the bits taken to encode. With Wavefront Parallel Processing (WPP), or entropy sync, the arithmetic coding is parallelized with a catch: each row takes its seed context from its top-right neighbor before starting its row encoding. This results in a penalty smaller than that of slice or tile encoding.
The same approach is used to parallelize the mode-decision stage of the encoder, where each CTU waits for the completion of its top-right neighbor.
Parallel row HEVC encoding
Figure 3 : Parallel row HEVC encoding
The above figure shows a quad-core encoder whose pipeline has been built up and is operating in a steady state. Before starting the encoding of each CTU, the completion of its top-right neighbor is checked, and since the top row is ahead of the current row, the check is almost always positive. This design preserves the data dependencies present in the original encoder and gives the best possible performance while sacrificing the least amount of quality. It has an overhead in its pipe-up and pipe-down stages, but when encoding a huge number of CTUs this can be neglected. There will also be small scaling losses due to differences in CTU encoding times.
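The dependency pattern in Figures 2 and 3 can be modeled in a few lines of code. This is a simplified sketch that assumes every CTU takes one time step and ignores the per-CTU time variation discussed above; each CTU waits for its left neighbor in the same row and its top-right neighbor in the row above:

```python
def wavefront_schedule(rows, cols):
    """Earliest step at which each CTU can be encoded, given that a CTU
    depends on its left neighbor and on its top-right neighbor."""
    step = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            deps = []
            if c > 0:
                deps.append(step[r][c - 1])      # left neighbor, same row
            if r > 0 and c + 1 < cols:
                deps.append(step[r - 1][c + 1])  # top-right neighbor
            step[r][c] = 1 + max(deps, default=-1)
    return step

sched = wavefront_schedule(4, 10)   # 4 CTU rows of 10 CTUs each
makespan = 1 + max(max(row) for row in sched)
print(makespan)   # 16 steps in parallel, versus 40 steps serially
```

The first and last few steps keep fewer than four rows busy, which is the pipe-up/pipe-down overhead described above; for a frame with many CTU rows and columns it becomes negligible.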
The multicore parallelizing schemes presented here scale up with more cores and bigger input resolutions, but with a drop in scaling efficiency. Once the cores span more than one socket, the locality of the data used by the encoder becomes a major contributor to performance. PathPartner's multicore HEVC encoder is NUMA-optimized to achieve the best possible scaling on systems with more than one socket.
Shashikantha Srinivas
Technical Lead

Embedded Vision Application – Design approach for real time classifiers

Overview of classification technique

Object detection/classification is a supervised learning process in machine vision that recognizes patterns or objects in data or images. It is a major component of Advanced Driver Assistance Systems (ADAS), where it is commonly used to detect pedestrians, vehicles, traffic signs, etc.
The offline classifier training process takes sets of selected data/images containing objects of interest, extracts features from this input and maps them to corresponding labelled classes to generate a classification model. In the online process, real-time inputs are categorized against this pre-trained model, which finally decides whether the object is present or not.
Feature extraction uses techniques such as Histogram of Oriented Gradients (HOG) and Local Binary Pattern (LBP) features extracted using integral images. The classifier uses a sliding-window, raster-scanning or dense-scanning approach to operate on the extracted feature image. Multiple image pyramids are used (as shown in part A of Figure 1) to detect objects of different sizes.

Computational complexity of classification for real-time applications

Dense pyramid image structures with 30-40 scales are used to achieve high detection accuracy, where each pyramid scale can have single or multiple levels of features depending on the feature extraction method used. Classifiers like AdaBoost use random data fetches from various points in a pyramid image scale. Double- or single-precision floating point is used where higher accuracy is required, at the cost of computationally intensive operations. A significantly higher amount of control code is also used in the classification process at various levels. These computational complexities make the classifier a difficult module to design efficiently for real-time performance in critical embedded systems, such as those used in ADAS applications.
Consider a typical classification technique such as AdaBoost (adaptive boosting), which does not use all the features extracted from a sliding window. That makes it computationally less expensive than a classifier like SVM, which uses all the features extracted from sliding windows in a pyramid image scale. Features are of fixed length in most feature extraction techniques, such as HOG, gradient images and LBP. In the case of HOG, features contain many levels of orientation bins, so each pyramid image scale can have multiple levels of orientation bins, and these levels can be computed in any order, as shown in parts B and C of Figure 1.
Figure 1: Pyramid image scales with multiple orientation bins levels
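The pyramid-scale arithmetic above can be sketched briefly. The scales-per-octave count and minimum window size below are illustrative assumptions, not values taken from the text:

```python
def pyramid_scales(width, height, scales_per_octave=8, min_size=64):
    """Sizes of the pyramid images: each octave halves the image,
    with scales_per_octave intermediate scales in between."""
    ratio = 2.0 ** (-1.0 / scales_per_octave)   # shrink factor per scale
    sizes, s = [], 1.0
    while width * s >= min_size and height * s >= min_size:
        sizes.append((int(width * s), int(height * s)))
        s *= ratio
    return sizes

scales = pyramid_scales(1280, 720)
print(len(scales))              # dozens of scales for a 720p input
print(scales[0], scales[-1])    # full size down to near the minimum
```

A feature level is extracted per scale (or several levels per scale, as with HOG orientation bins), so the scale count directly multiplies the classifier's workload.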

Design constraints for porting classifier to ADAS SoC

Object classification is generally categorized as a high-level vision processing use case, as it operates on extracted features generated by low- and mid-level vision processing. By nature it requires more control code, since it involves comparisons at various levels, and, as mentioned earlier, it involves double/float precision. These computational characteristics make classification look like a problem for a DSP rather than for a vector processor, which offers more parallel data processing power through SIMD operations.
Typical ADAS processors, such as Texas Instruments' TDA2x/TDA3x SoCs, incorporate multiple engines/processors targeted at high-, mid- and low-level vision processing. The TMS320C66x DSP in the TDA2x SoC supports fixed- and floating-point operation, with an 8-way VLIW architecture issuing up to 8 new operations every cycle, SIMD operations for fixed point, and fully pipelined instructions. It supports up to 32 8-bit or 16-bit multiplies per cycle, and up to eight 32-bit multiplies per cycle. The EVE processor of the TDA2x has a 512-bit Vector Coprocessor (VCOP) with built-in mechanisms and vision-specialized instructions for concurrent, low-overhead processing. It has three parallel flat memory interfaces, each with 256-bit load/store bandwidth, providing a combined 768-bit-wide memory bandwidth. Efficient management of load/store bandwidth, internal memory, the software pipeline and integer precision are the major design constraints for extracting maximum throughput from these processors.
The classifier framework can be redesigned/modified to suit vector processing, thereby processing more data per instruction and achieving more SIMD operations.

Addressing major design constraints in porting classifier to ADAS SoC

Load/Store bandwidth management

Each pyramid scale can be rearranged to fit within the limited internal memory. Functional modules and regions can be selected and arranged appropriately to keep the DDR load/store bandwidth at the required level.

Efficient utilization of limited internal memory and cache

Images should be processed at optimum sizes, and memory requirements should fit into the hardware buffers, for efficient utilization of memory and computation resources.

Software pipeline design for achieving maximum throughput

Some of the techniques that can be used to achieve maximum throughput from software pipelining are mentioned below.
  • Loop structure and its nested levels should fit into the hardware loop buffer requirements. For example C66x DSP in TDA3x has restrictions over its SPLOOP buffer such as initiation interval should be
  • Unaligned memory loads/stores should be avoided, as they are computationally expensive; in most cases they take twice the cycles of aligned loads/stores.
  • Data can be arranged or compiler directives and options can be set to get maximum SIMD load, store and compute operations.
  • Double-precision operations can be converted to single-precision floating-point or fixed-point representations, but the offline classifier should be retrained after such precision changes.
  • The innermost loop should be kept simple, without much control code, to avoid register-pressure and register-spilling issues.
  • Division operations can be avoided by replacing them with table look-up or reciprocal (inverse) multiplication operations.
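The last point can be illustrated with a small fixed-point sketch of the table look-up approach: the reciprocal of each divisor is precomputed offline, so the runtime divide becomes a multiply and a shift. The table size and the 32-bit shift are illustrative choices, not taken from the text:

```python
SHIFT = 32  # fixed-point precision of the stored reciprocals

# Built once, offline: rounded-up reciprocals ceil(2^32 / d).
RECIP = [0] + [((1 << SHIFT) + d - 1) // d for d in range(1, 256)]

def div_by_table(x, d):
    """x // d computed as a multiply and a shift, with no runtime divide.
    Exact for 0 <= x < 2**16 and 1 <= d <= 255 with this table."""
    return (x * RECIP[d]) >> SHIFT

print(div_by_table(1000, 8))   # 125
print(div_by_table(77, 7))     # 11
```

On a DSP the multiply would map to a widening multiply instruction; the point is that the per-element divide disappears from the inner loop.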


Classification for real-time embedded vision applications is a difficult computational problem due to its dense data processing, floating-point precision, multilevel control code and data fetching requirements. These complexities significantly limit the vector processing power a classifier can exploit. But the classifier framework on a target platform can be redesigned/modified to leverage the platform's vector processing capability by efficiently applying techniques such as load/store bandwidth management, internal memory and cache management, and software pipeline design.


Reference: Jagadeesh Sankaran and Zoran Nikolic (Texas Instruments Incorporated), "TDA2X, a SoC optimized for advanced driver assistance systems," 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
Sudheesh TV
Technical Lead
Anshuman S Gauriar
Technical Lead

How to build the Angstrom Linux Distribution for Altera SoC FPGA with OpenCV & Camera Driver Support

If your real-time image processing applications on SoC FPGAs, such as a driver monitoring system, depend on OpenCV, you have to set up an OpenCV build environment for the target board. This blog will guide you through the steps to build a Linux OS with OpenCV and camera driver support for the Altera SoC FPGA.
To start building your first Linux distribution for the Altera platform, you must first install the necessary libraries and packages. Follow the initialization steps below for setting up the host PC.
The required packages to be installed on Ubuntu 12.04 are:
$ sudo apt-get update
$ sudo apt-get upgrade
$ sudo apt-get install sed wget cvs subversion git-core coreutils unzip texi2html texinfo libsdl1.2-dev docbook-utils gawk python-pysqlite2 diffstat help2man make gcc build-essential g++ desktop-file-utils chrpath libgl1-mesa-dev libglu1-mesa-dev mercurial autoconf automake groff libtool xterm *


Please note that * indicates that the command is one continuous line of text. Make sure the command is on one line when you paste it.
If the host machine runs a 64-bit version of the OS, you need to install the following additional package:
$ sudo apt-get install ia32-libs
On Ubuntu 12.04 you will also need to make /bin/sh point to bash instead of dash. You can accomplish this by running the following command and selecting 'No' when prompted:
sudo dpkg-reconfigure dash
Alternatively you can run:
sudo ln -sf bash /bin/sh
However, this is not recommended, as it may get undone by Ubuntu software updates.

Angstrom Buildsystem for Altera SOC: (Linux OS)

Download the scripts needed to start building the Linux OS. You can download the scripts for the angstrom build system from
Unzip the files to the angstrom-socfpga folder:
$ unzip -d angstrom-socfpga
$ cd angstrom-socfpga
These are the setup scripts for the Angstrom buildsystem. If you want to (re)build packages or images for Angstrom, this is the thing to use.
The Angstrom buildsystem uses various components from the Yocto Project, most importantly the Openembedded buildsystem, the bitbake task executor and various application/BSP layers.
Navigate to the sources folder and comment out the following line in the layers.txt file:
$ cd sources
$ gedit layers.txt &
meta-kde4,git,master,f45abfd4dd87b0132a2565499392d49f465d847 *
$ cd .. (navigate back to the head folder)
To configure the scripts and download the build metadata
$ MACHINE=socfpga_cyclone5 ./ config socfpga_cyclone5
After downloading the build metadata, you can download meta-kde4 from the link below and place it in the sources folder, as it was disabled earlier in the layers.txt file.
Source the environment file and use the commands below to start a build of the kernel/bootloader/rootfs:
$ . ./environment-angstrom
$ MACHINE=cyclone5 bitbake virtual/kernel virtual/bootloader console-image
Depending on the type of machine used, this will take a few hours to build. After the build is completed, you will find the u-boot, dtb, rootfs and kernel image files in the deploy folder.

Adding OpenCV to the rootfs:

To add OpenCV to the console image (rootfs), we need to modify the local.conf file in the conf folder:
$ cd ~/angstrom-socfpga/conf
$ gedit local.conf &
Navigate to the bottom of the local.conf file, add the following line and save the file:
IMAGE_INSTALL += "opencv opencv-samples opencv-dev opencv-apps opencv-samples-dev opencv-static-dev"
Then build the console image again using the following command:
$ cd ..
$ MACHINE=cyclone5 bitbake console-image
After the image is built, the rootfs will contain all the OpenCV libraries necessary for developing and running OpenCV-based applications.

Enabling Camera Drivers in the Kernel:

The Linux kernel v3.10 has an in-built UVC camera driver which supports a large number of USB cameras. In order to enable it, you need to configure the kernel using the menuconfig option:
$ MACHINE=cyclone5 bitbake virtual/kernel -c menuconfig
The above command opens a configuration menu window. From the menuconfig window, enable the following to turn on UVC:
Device Drivers --->
  Multimedia support --->
    Media USB Adapters --->
      [*] USB Video Class (UVC)
      [*]   UVC input events device support
Save and exit the config menu, then execute the following command:
$ MACHINE=cyclone5 bitbake virtual/kernel
The new kernel will be built with the UVC camera drivers enabled and will be available in the /deploy/cyclone5 folder.
For the camera to work, the coherent pool must be set to 4M. This can be done as follows:

U-Boot Environment Variables

Boot the board, pressing any key to stop at the U-Boot console. The messages displayed on the console will look similar to the following listing:
U-Boot SPL 2013.01.01 (Jan 31 2014 - 13:18:04)
BOARD: Altera SOCFPGA Cyclone V Board
SDRAM: Initializing MMR registers
SDRAM: Calibrating PHY
SEQ.C: Preparing to start memory calibration
SEQ.C: CALIBRATION PASSED

U-Boot 2013.01.01 (Nov 04 2013 - 23:53:26)

CPU : Altera SOCFPGA Platform
BOARD: Altera SOCFPGA Cyclone V Board
DRAM: 1 GiB
In: serial
Out: serial
Err: serial
Net: mii0
Warning: failed to set MAC address

Hit any key to stop autoboot: 0

Configuration of U-Boot Environment Variables

SOCFPGA_CYCLONE5 # setenv bootargs console=ttyS0,115200 vmalloc=16M coherent_pool=4M root=${mmcroot} rw rootwait;bootz ${loadaddr} - ${fdtaddr} *

Saving the U-Boot Environment Variables

SOCFPGA_CYCLONE5 # saveenv

Boot Kernel

Following all the above guidelines, you should be able to build the Angstrom Linux distribution for the Altera SoC FPGA with OpenCV and camera driver support. This build was successfully implemented on an Altera Cyclone V SoC.
Idris Iqbal Tarwala
Sr. VLSI Design Engineer

HEVC without 4K

The consumer electronics industry wants to quadruple the number of pixels on every device. Without HEVC, this will quadruple the storage requirements for video content and clog the already limited network bandwidth. To bring 4K or UHD to the consumer, the HEVC standard is essential, as it can cut these requirements at least in half. There is no doubt about that. But what the consumer electronics industry wants may not be what the consumer wants. A case in point is 3D TVs, which failed to take off due to lack of interest from consumers and original content creators. The same thing might happen to 4K. If 4K/UHD devices fail to take off, where will HEVC be?
Here, we make a case for HEVC.
According to data released by Cisco, video will be the biggest consumer of network bandwidth in the years to come, with the highest growth rate across all segments. With HEVC-based solutions, these overclogged networks can take a breather.
Let’s look at how deploying HEVC solutions is advantageous to the major segments of the video industry.

1. Broadcast:

The broadcast industry is a juggernaut that moves very slowly. Broadcasters stand to gain the most in cost savings and consumer satisfaction by adopting HEVC, but they also have to make the highest initial investment. There are claims that UHD alone does not add much to the visual experience at a reasonable viewing distance for a 40-50 inch TV, so the industry is also pushing additional video enhancement tools such as higher dynamic range, more color information and better color representation (BT.2020). UHD support in broadcast is not feasible without upgrading the infrastructure, and given the advantage HEVC brings to UHD video, including it in the infrastructure upgrade is the optimal choice. But if UHD fails to attract consumers, what happens to HEVC? Without UHD broadcast becoming a reality, the introduction of HEVC into broadcast infrastructure could be heavily delayed. On the other hand, contribution encoding can benefit heavily from HEVC with only reasonable changes in infrastructure. Whether broadcast companies adopt HEVC just for contribution encoding, without the pull of UHD, depends purely on the cost of adoption versus the cost savings it brings.

2. Video surveillance:

Surveillance deployments are increasing every day, and there is now the added overhead of backups to the cloud. The advantage of video surveillance applications is that they do not need backward compatibility. Hence HEVC is an ideal solution for the industry, either to cut the storage costs of current systems or to hold costs level for a new generation of systems that store more surveillance data at higher resolutions. ASIC developers are already building HEVC encoders and decoders, and it is just a matter of time before HEVC-based video surveillance systems hit the market. Upgrading current video surveillance systems, however, may not be feasible without the hard struggle of making legacy hardware support HEVC.

3. Video Conference:

Video conferencing is a trickier situation, as it needs to be highly interoperable and backward compatible with existing systems. Professional video conferencing systems might have to support both HEVC and earlier codecs to work with systems already in the field. General video conferencing solutions such as Google Talk or Skype, on the other hand, face the problem of licensing HEVC, and none of the current browsers or operating systems (except Windows 10, which most probably includes just the decoder) has announced HEVC support. But the advantages HEVC can bring to video conferencing are significant. Despite advances in bandwidth availability and the introduction of high-speed 3G and 4G services, the quality of the video conferencing experience has remained poor. HEVC can improve this massively: it has the potential to enable HD video calling on a 3G network. With or without the help of UHD, at least the professional video conferencing systems will adopt HEVC, unless another codec (the likes of VP9 or Daala) promises better advantages.

4. Storage, streaming and archiving:

The advantage of upgrading archived databases to HEVC needs no explanation: imagine the petabytes of storage that could be saved if all archived videos were converted to HEVC. OTT players like Netflix are already in the process of upgrading to HEVC, as it helps reduce the burden on ISPs and also cuts storage and transmission costs in video streaming applications. Converting such a huge database from one format to another will not be easy. OTT and video streaming applications need scalable HEVC, where each video must be encoded at different resolutions and different data rates; this requires multi-instance encoders running continuously to encode these videos at very high quality. However, the cost savings in these applications from adopting HEVC are very large, and upgrading to HEVC in the storage space will become inevitable.

5. End consumer:

The end consumer gets exposed to HEVC at different levels:
Decoders in the latest gadgets, viz. TVs, mobile phones, tablets, gaming consoles and web browsers; and encoders built in wherever there is a camera, be it video calling on a mobile phone, tablet or laptop, or video recording on a mobile phone or standalone camera.
It is difficult to convince the less tech-savvy end consumer of the advantages of HEVC, but they are a huge market. In the US, data costs around $1 for 10 megabytes. HEVC can give consumers higher video quality for the same cost, or the current video quality at half the cost. Since HEVC matters to consumers, almost all chip manufacturers support HEVC or have it on their roadmap, and there are already consumer electronics products on the market with built-in HEVC encoders. HEVC support will definitely be a differentiating factor for the end consumer looking for a new gadget.
Deploying HEVC-based solutions will definitely yield gains in the long term, once the profits from reduced bandwidth overtake the initial investment. This is true for each of the segments discussed above. With 4K or UHD, the initial investment can be folded into higher-quality offerings and the costs offset. But even without any change in resolution or other features, the returns on HEVC investments are high, and when the entire video industry adopts the newer and better codec standard, the gains will multiply.
Prashanth NS
Technical Lead
Shashikantha Srinivas
Technical Lead

Future trends in IVI solutions

In the immediate future we shall witness the next big phase of technology convergence between the automotive industry and information technology, with 'connectivity' at its epicenter. There are several reasons for this revolutionary fusion. The first is the exponential rise in consumer demand: connectivity on the road is no longer a luxury for our internet-savvy generation, who will make up 40% of new automobile users and expect more than just in-vehicle GPS navigation. In addition, some developed countries are mandating potential safety features such as DSRC over Wi-Fi, 360-degree NLOS detection and V2V information exchange to reduce the number of accidents and casualties. Bringing these safety features to the dashboard is another reason why IVI systems and solutions are currently the hottest sell in the automotive industry. With these trigger points, automobile manufacturers are now bridging the gap between non-IVI technologies and existing dashboard solutions.
The current dashboard solutions are led by software giants Apple and Google, who are well ahead in providing SDKs for IVI solutions. A few other companies also demonstrated dashboards with IVI solutions at CES 2015, but CarPlay and Android Auto have captured the world's attention with their adaptability, portability and expandability across car market segments. Market experts predict that the key IVI solutions will be Android Auto and CarPlay. These current IVI solutions are only the beginning of a fast-evolving technology.
Experts forecast that the evolution of IVI features will follow the maturity of the architecture. In the primary phase, smartphone-attached dashboard models using connectivity such as USB and Bluetooth will continue to be widely used; by 2017, around 60% of vehicles are expected to connect to smartphones. The focus will be on reusing smartphone computation power and mobile-integrated technologies such as telephony, multimedia, GPS, storage, power management and voice-based systems. The next step is connecting beyond the car: conjunction with IoT and cloud technologies will draw in ecosystems such as OEM services, healthcare, education and smart energy systems. Further into the future, external control and integration of the car ecosystem will become possible, including On-Board Diagnostics (OBD) systems, the use of SoCs for smarter, more reliable and better-performing dashboard features, augmented reality and many more options.
Current IVI system architectures are designed to be flexible in adapting to future upgrades (such as changing your mobile phone model or upgrading its OS). An IVI solution core can be broadly divided into three categories: hardware, the underlying operating system, and user-experience-enhancing applications. The architecture mainly consists of horizontal and vertical layers.
The vertical layer is made of:
  • Proprietary protocols.
  • Multimedia rendering solutions.
  • Core stack
  • Security and UI
Whereas the Horizontal layer includes:
  • Hardware
  • OS
  • Security Module
Proprietary portable protocols are the rudiments of IVI connectivity modules, and the UI of IVI head units can be developed with core engines such as Java, Qt and Gtk. For user connectivity and a better user experience, Wi-Fi, LTE, advanced HMI, and OBD tools and protocols help, though some of these are still under development.
From a car manufacturer's point of view, a centralized, upgradable dashboard integrating all IVI solutions and distinct non-IVI technologies will be a prime factor in choosing a car. Here the dashboard middleware needs to play a crucial supporting role, integrating multiple IVI solutions to provide a device-independent experience. Flexibility in accommodating upcoming IVI technology, such as data interfaces for application developers, is also a deciding factor.
Although IVI core solutions are fundamental differentiating factors for the end user, car OEMs need a single-roof vendor with expertise across IVI and non-IVI technologies, including system architecture, OS development, protocols, hardware and applications. PathPartner, a product engineering service provider for automotive embedded systems, meets all these requirements very well. PathPartner has already crossed the threshold into the automotive infotainment industry with a strong business model, which comprises development activities such as porting infotainment systems based on Android, Linux and QNX to multiple platforms such as Freescale and Texas Instruments, and building customized middleware for Android Auto, MirrorLink and CarPlay.
We at PathPartner, with a dedicated team of engineers equipped with niche expertise in embedded products, have successfully delivered certified applications for platforms such as iOS and Windows, and continue to work on other IVI solutions for renowned OEMs.
Kaustubh D Joshi

HEVC Compression Gain Analysis

HEVC, the future video codec, needs no introduction, as it is slowly being injected into the multimedia market. Despite its very high execution complexity compared to earlier video standards, a few software implementations have proven to run in real time on multi-core architectures. While OTT players like Netflix are already in the process of migrating to HEVC, 4K TV sales are rising rapidly because HEVC promises to make UHD resolution a reality. ASIC HEVC encoders and decoders are being developed by major SoC vendors and will enable HEVC in handheld, battery-operated devices in the near future. All these developments are motivated by one major claim:

'HEVC achieves 50% better coding efficiency than its predecessor, AVC.'

Initial experimental results justify this claim, showing on average a 50% improvement with desktop encoders and approximately 35% with real-time encoders compared to AVC. This improvement is achieved mainly by the set of new tools and modifications introduced in HEVC, including but not limited to larger block sizes, larger transform sizes, extended intra modes and SAO filtering. Note that the 50% figure is an average over a set of input contents; it does not guarantee half the bit rate at the same quality for every input. Now we will tear down this 'average' part of the claim and discuss the situations in which HEVC is more efficient, considering four main factors.
  1. Resolution
  2. Bit rate
  3. Content type
  4. Delay configuration

1. Resolution:

HEVC is expected to enable higher-resolution video, and results substantiate this with higher coding-efficiency gains at higher resolutions. At 4K, compression gains can exceed 50%. This makes it possible to encode 4K content at 30 frames per second at bit rates of 10 to 15 Mbps with decent quality. The reason behind this behavior is the feature that contributes more than half of HEVC's coding-efficiency gains: larger coding block sizes. At high resolutions, larger block sizes lead to better compression, as neighboring pixels are more highly correlated. We have observed that 1080p sequences show, on average, 3-4% better compression gains than their 720p counterparts.

2. Bitrate:

Encoder results indicate better compression gains at low and mid-range bit rates than at very high bit rates. At low QPs, transform coefficients contribute more than 80% of the bits. Larger coding units help save the bits spent on MVD and other block-header coding, and larger transform blocks yield better compression gains, as they offer more coefficients for energy compaction. But at high bit rates, header bits make up only a small percentage of the bitstream, suppressing the gains from larger block sizes; and at low QPs, nearly all coefficients survive quantization regardless of transform size, which limits the additional gains from larger transform blocks.
We have observed that, for the ParkJoy sequence, BD-rate gains were 8% better in the QP range 32-40 than in the QP range 24-32. Similar behavior has been found in most sequences.
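The BD-rate figures above come from the Bjøntegaard metric, which compares two rate-distortion curves at equal quality. As a rough illustration, here is a minimal pure-Python sketch that uses piecewise-linear interpolation of log-rate over PSNR; note that the standard Bjøntegaard method fits cubic polynomials, so treat this as an approximation:

```python
import math

def bd_rate(ref, tst):
    """BD-rate sketch: average bitrate difference (%) of `tst` vs. `ref`
    at equal quality. Each argument is a list of (bitrate, psnr) points,
    one per QP. Uses piecewise-linear interpolation of log-rate over the
    overlapping PSNR range (the standard Bjontegaard metric fits cubic
    polynomials instead, so this is an approximation)."""
    def prep(pts):
        pts = sorted(pts, key=lambda p: p[1])             # ascending PSNR
        return [p[1] for p in pts], [math.log(p[0]) for p in pts]

    def interp(qs, ls, q):                                # log-rate at PSNR q
        q = min(max(q, qs[0]), qs[-1])                    # clamp to range
        for i in range(len(qs) - 1):
            if qs[i] <= q <= qs[i + 1]:
                t = (q - qs[i]) / (qs[i + 1] - qs[i])
                return ls[i] + t * (ls[i + 1] - ls[i])

    q_r, lr_r = prep(ref)
    q_t, lr_t = prep(tst)
    lo, hi = max(q_r[0], q_t[0]), min(q_r[-1], q_t[-1])
    n = 100                                               # trapezoid rule
    total = 0.0
    for k in range(n + 1):
        q = lo + (hi - lo) * k / n
        w = 0.5 if k in (0, n) else 1.0
        total += w * (interp(q_t, lr_t, q) - interp(q_r, lr_r, q))
    avg = total / n
    return (math.exp(avg) - 1) * 100.0                    # negative => tst saves bits
```

Feeding it four (bitrate, PSNR) points per codec, with the test codec at exactly half the reference bitrate for every PSNR value, returns -50%.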

3. Content type:

Video frames with uniform content have shown better compression gains, as such content favors larger block sizes. Encoders tend to select smaller block sizes for frames with very high activity, which makes the block division similar to AVC's and reduces HEVC's efficiency; lower correlation between pixels also limits the compression gains from larger transform blocks. Streams such as BlueSky, which have significant uniform content in a frame, produced 10% better gains than high-activity streams like ParkJoy.
On the other hand, videos with stationary content or low motion produced better gains, as larger block sizes are chosen in inter frames for such content.

4. Delay Configuration:

Video encoders can be configured to use different GOP structures based on application needs. Here, we analyze the compression gains of different delay configurations w.r.t. AVC. Firstly, the all-intra case (used in low-delay, high-quality, error-robust broadcast applications) produces roughly a 30-35% gain, deviating from HEVC's 50% claim. Larger block sizes are not very effective in intra frames, and the gains produced by other new tools, such as additional intra modes and SAO, limit the average gain to about 35%. Hence HEVC-Intra will not be 50% better than AVC-Intra. It is also observed that the low-delay IPP GOP configuration produces slightly better gains (approximately 3-4% on average) than the random-access GOP configuration with B frames. This could simply be due to implementation differences in the HM reference software.
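The delay configurations above differ mainly in frame types and encode order. As a scaled-down illustration (HM's random-access configuration actually uses a GOP of 8 with a deeper B pyramid; the mini-GOP of 4 here is a simplification), the two orders can be sketched as:

```python
def low_delay_ipp(n):
    """Low-delay IPP: encode order equals display order (POC); an I frame
    followed by P frames only, so no structural delay."""
    return [("I" if poc == 0 else "P", poc) for poc in range(n)]

def random_access_gop(n, gop=4):
    """Simplified random-access order: each mini-GOP is encoded anchor
    first, then the middle (reference) B, then the non-reference b
    frames. Real HM random-access configurations use a GOP of 8 with a
    deeper B pyramid; this is a scaled-down illustration."""
    order = [("I", 0)]
    for start in range(0, n - 1, gop):
        anchor = min(start + gop, n - 1)
        order.append(("P", anchor))                 # forward anchor first
        mid = (start + anchor) // 2
        if mid > start:
            order.append(("B", mid))                # middle (reference) B
        for poc in range(start + 1, anchor):
            if poc != mid:
                order.append(("b", poc))            # non-reference b frames
    return order
```

For nine frames, the random-access order comes out as I0, P4, B2, b1, b3, P8, B6, b5, b7, which shows why this configuration incurs reordering delay while IPP does not.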
Thus, 50% gains cannot be achieved for all video content, but many encoder implementations have demonstrated average gains of nearly 50% or more over AVC. Such compression efficiency can have a major impact on broadcast and OTT applications, where bandwidth consumption, and therefore cost, can be roughly halved. Though other emerging video compression standards such as VPx and Daala claim similar gains, they are yet to be proven. It will be interesting to see how VPx and Daala affect the future of HEVC, but right now one thing is sure: HEVC is the future video codec, and it is here!
Prashant N S
Technical Lead

Analysis, Architectures and Modelling of HEVC Encoder

High Efficiency Video Codec (HEVC), also known as H.265, has been the talk of the town in the video compression industry for the past two years, as it promises to significantly improve the video experience. HEVC is the next-generation video compression standard developed by the Joint Collaborative Team on Video Coding (JCT-VC), formed by the ITU-T VCEG and ISO/IEC MPEG standards bodies. HEVC claims to save 50% of the bit rate at the same video quality compared to its predecessor, AVC. This comes at the huge cost of increased computational power, as HEVC is several times more complex than AVC: the computational complexity of an HEVC decoder is expected to be about twice that of AVC, while HEVC encoder complexity may be 6x-8x that of AVC.
HEVC is a block-based hybrid video codec similar to earlier standards, with a few new tools that improve coding efficiency but increase computational complexity, since these tools must be configured with appropriate data for every block.
This blog is written in two parts. In the first part, we discuss possible implementations of an HEVC encoder on different homogeneous and heterogeneous architectures, comparing them on video quality, execution speed, power efficiency, memory requirements and development cost. Modelling of the HEVC encoder on these architectures will be discussed in Part 2.

Single CPU

Single-core CPU solutions mainly target the best video quality and do not really focus on encoding time. These solutions are used for generating benchmark video sequences and in archiving applications, which are mainly PC-based. Taking advantage of sequential processing, feedback from each stage can be used in later stages to enhance video quality. Another advantage of a single-core solution is limited memory usage, since single instances of the data structures suffice. These solutions may not be power efficient, as complex algorithms are used to achieve the best video quality. There are H.264 encoders on the market that run on a single core and achieve real-time encoding with decent quality, though they may not generate benchmarking sequences. But when it comes to HEVC, single-core real-time solutions are not available on the market at this point in time and are not practical.

Multicore CPU

HEVC encoders are highly complex due to the increased number of combinations of encoding options. Real-time HEVC encoding can be realized in multi-core solutions with a trade-off in video quality; the size of the trade-off depends on the type of parallelism implemented. Multi-core implementations offer the flexibility of either data partitioning or task partitioning. With data partitioning, there is a high chance of breaking neighbor dependencies, which results in a larger video-quality trade-off. HEVC introduced a new tool, tiles, in addition to slices, targeted at multi-core solutions without much impact on video quality. In a task-partitioned design, achieving the right load balance among tasks for good core utilization is challenging. The achievable performance is limited by the number of cores in a single chip. Multi-core solutions have a bigger memory footprint, as data structures need to be replicated across CPUs; cache usage must be managed carefully, since shared memory can be accessed by all the CPUs; and more CPUs draw more power, making the solution power inefficient. The development cost of multi-core solutions is relatively high compared to single-CPU solutions, as they require complex designs for efficient task scheduling. Though power inefficient, multi-core solutions can deliver real-time performance with decent video quality, finding application in the broadcast domain.
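As an illustration of the data-partitioning approach, the sketch below splits a frame into a uniform grid of tiles (in CTU units, with sizes differing by at most one CTU, as HEVC's uniform tile spacing does) and hands each tile to its own worker thread. `encode_tile` is a hypothetical per-tile worker, not a real encoder API:

```python
from concurrent.futures import ThreadPoolExecutor

def tile_bounds(ctus_w, ctus_h, tcols, trows):
    """Split a frame of ctus_w x ctus_h CTUs into a tcols x trows grid of
    tiles, with column widths and row heights differing by at most one CTU
    (as in HEVC's uniform tile spacing). Returns (x0, y0, x1, y1) per tile."""
    def edges(n, parts):
        base, extra = divmod(n, parts)
        out = [0]
        for i in range(parts):
            out.append(out[-1] + base + (1 if i < extra else 0))
        return out
    xs, ys = edges(ctus_w, tcols), edges(ctus_h, trows)
    return [(xs[i], ys[j], xs[i + 1], ys[j + 1])
            for j in range(trows) for i in range(tcols)]

def encode_frame_tiled(ctus_w, ctus_h, tcols, trows, encode_tile):
    """Encode every tile on its own worker thread; since tiles break
    prediction dependencies across their boundaries, the workers need no
    synchronization. `encode_tile` is a hypothetical per-tile worker."""
    tiles = tile_bounds(ctus_w, ctus_h, tcols, trows)
    with ThreadPoolExecutor(max_workers=len(tiles)) as pool:
        return list(pool.map(encode_tile, tiles))
```

The broken cross-tile dependencies that make this so simple are exactly what costs video quality relative to a sequential encode.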
Fig: Comparison of Key factors across different Architectures


GPU (Heterogeneous Architecture)

Heterogeneous architectures are one way to achieve real-time performance in an HEVC encoder. Since HEVC encoders are highly complex, using a GPU for the many data-parallel processing tasks can improve performance by a great margin, reducing the load on the main CPU and freeing it for other tasks. Introducing a GPU also helps optimize the power consumption of the encoder solution, as GPUs are highly power efficient for such workloads, and using the GPU for video encoding boosts hardware utilization, since GPUs are generally idle during video processing. All these advantages come at the cost of video quality: a heterogeneous architecture poses challenges in handling the functional and neighbor data dependencies of a block-based codec like HEVC, which leads to reduced video quality compared to sequential execution. The independent execution nature of many-core GPU architectures can therefore degrade video quality significantly. Further, GPUs are inefficient for sequential functionality such as entropy coding, which poses the greater challenge of synchronizing between CPU and GPU. Along with synchronization, heterogeneous systems must manage distributed memories; in highly bandwidth-intensive video processing, memory requirements and bandwidth grow with distributed memories. These solutions require in-depth knowledge of the distributed memory architecture and of frameworks supporting heterogeneous platforms, increasing development cost and time. Most SoCs used in consumer electronics devices have GPUs that can be used for video processing.
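The CPU/GPU synchronization problem can be sketched as a producer/consumer pipeline: a worker thread standing in for the GPU produces per-row analysis results (the data-parallel part, e.g. motion estimation), while the CPU consumes them in strict row order for sequential entropy coding. All function names here are illustrative, not a real GPU API:

```python
import queue
import threading

def heterogeneous_pipeline(n_ctu_rows, analyse_on_gpu, entropy_code):
    """Sketch of a CPU/GPU split: data-parallel analysis is offloaded to
    a worker standing in for the GPU, while inherently sequential entropy
    coding stays on the CPU and consumes results strictly in row order.
    The queue is the synchronization point between the two sides."""
    done = queue.Queue()

    def gpu_worker():
        for row in range(n_ctu_rows):
            done.put((row, analyse_on_gpu(row)))    # massively parallel part

    t = threading.Thread(target=gpu_worker)
    t.start()
    bitstream, pending, expected = [], {}, 0
    while expected < n_ctu_rows:
        row, result = done.get()
        pending[row] = result                       # results may arrive early
        while expected in pending:                  # emit in strict row order
            bitstream.append(entropy_code(pending.pop(expected)))
            expected += 1
    t.join()
    return bitstream
```

The reorder buffer (`pending`) is where the sync cost shows up: the sequential entropy coder can only run as fast as the slowest outstanding row.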


ASIC

Application-Specific Integrated Circuits (ASICs) are the best way to achieve real-time, power-efficient HEVC encoder solutions. In ASICs, hardware IPs are built for the different functionalities of the HEVC codec. Though hardware solutions are much faster than their software counterparts, the functional and neighbor data dependencies of HEVC limit their capabilities, as an intelligent pipeline must be built between the hardware modules. These modules have their own memories, thereby increasing the memory footprint. These limitations eventually lead to a drop in video quality, as they compromise the need for sequential execution; the drop can be minimized to a great extent by a proper hardware pipeline and efficient video algorithms. Generally, ASIC solutions are highly power optimized and targeted at consumer electronics with huge volume requirements. Complex design, verification, validation and silicon manufacturing increase development cost, but yield the best-performing, most power-efficient encoders. Not many ASIC HEVC solutions are expected, due to the increased complexity of the codec, the cost of SoC manufacturing on advanced processes, the lack of VC funding for semiconductor startups, and the limited-volume broadcast market where HEVC is required. It is also a well-known fact that video encoders evolve over time, and ASIC solutions cannot adopt new algorithms. For this reason, ASIC solutions may be adopted only once HEVC encoders achieve greater maturity in software.
FPGA

FPGAs sit between GPU solutions and ASICs in terms of performance, with similar video quality. An FPGA provides a hardware implementation yet can adapt to evolving encoder algorithms. For a team with a software background, it is easier to implement an HEVC encoder on a GPU than on an FPGA, but greater performance can be achieved with an FPGA while maintaining video quality. Though GPUs have a price advantage, their power consumption is much higher than that of FPGAs. FPGAs require a larger die area, making them unsuitable for consumer electronics. The cost and time needed to develop FPGA-based video encoders are higher than for GPU solutions, due to hardware programming, but lower than for ASIC solutions, as complex hardware designs are not necessary. FPGA video encoder solutions find use in low-volume markets where ASICs are not a cost-effective option. With the right combination of power, performance, video quality and development cost, these solutions are effective for video surveillance and broadcast applications.
Each of these solutions has its own pros and cons. A single CPU is the best solution for archiving applications, where real-time encoding is not needed, while ASIC SoCs find the best use in consumer electronics, providing power-efficient real-time performance. Every solution has its place, depending on the application's requirements for speed, quality, power consumption and development time and cost. (Stay tuned for part 2 of this blog…)
Prashanth NS
Technical Lead
Praveen GB
Sr. Software Engineer

HEVC Encoder – Software (Embedded) implementation possibilities…

HEVC is the new video compression standard and the successor to AVC/H.264. Jointly developed by the ITU-T VCEG and ISO/IEC MPEG standards bodies, it claims to save 50% of the bit rate at the same quality compared to its predecessor, AVC/H.264. Along with the primary goals of achieving higher compression and accommodating higher resolutions such as 8K, it also embraces parallel architectures by introducing tools for parallel processing. HEVC is also a block-based hybrid video codec and is mostly similar to AVC, except for a completely new in-loop non-linear filter called Sample Adaptive Offset (SAO). To achieve higher compression, larger coding block sizes with hierarchical splitting are introduced, along with tweaks and enhancements to existing video compression modules. The key enhancements are: (i) increasing the number of intra prediction modes from 9 to 35; (ii) replacing MVP and skip with the AMVP and merge-list concepts, plus Asymmetric Motion Partitioning (AMP); (iii) enhanced fractional-pel interpolation; and (iv) variable transform unit sizes from 4×4 to 32×32. For parallel processing, HEVC has eased and modified the deblocking filter operation and introduced tools called tiles and Wavefront Parallel Processing (WPP). At the high-level syntax, some modules are overhauled to improve error resiliency and the interface to the system, and to provide new functionality; the high-level syntax includes NAL unit headers, parameter sets, picture partitioning schemes, reference picture management and SEI messages.
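A quick way to see why the hierarchical splitting explodes the encoder's search space: counting the distinct quadtree partitions of a CTU, where each node either stays whole or splits into four quadrants, gives the following recurrence.

```python
def cu_partitions(size, min_size=8):
    """Number of distinct quadtree partitions of a square block of the
    given size into coding units no smaller than min_size: each node
    either stays whole or splits into four quadrants (HEVC's CU quadtree)."""
    if size == min_size:
        return 1                                    # leaf: cannot split
    return 1 + cu_partitions(size // 2, min_size) ** 4
```

An 8×8 block has 1 partition, a 16×16 has 2, a 32×32 has 1 + 2^4 = 17, and a 64×64 CTU already has 1 + 17^4 = 83,522 possible split patterns, before a single prediction mode or transform size is even considered.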
The computational complexity of an HEVC decoder is not expected to be much higher than that of an H.264/AVC decoder: it is considered 1.2x-1.5x as complex, and real-time software solutions are possible on current hardware with minimal effort. Things look different from the encoder perspective, where the complexity is assumed to be about 10x that of an AVC encoder. Many mode combinations are introduced by the flexibility of the quadtree structures, the increased number of intra prediction modes and the SAO offset-estimation algorithms. An encoder that exploits the full capabilities of HEVC is expected to be several times more complex than an AVC/H.264 encoder. To achieve a real-time encoder, or an encoder with a decent frame rate (fps), the compression rate must be traded off marginally to remove substantial computational complexity. Developing algorithms that balance this trade-off between computational complexity and compression rate is therefore an active subject of research and innovation.
The higher complexity is generally the result of evaluating all mode combinations in a brute-force way. Execution can be accelerated by developing approaches that identify the relevant combinations and evaluate only those, with little compromise in video quality. Along with skipping insignificant combinations, threshold-based early-termination approaches help further improve execution time. In general, spatial and temporal neighbors' statistics are used to predict the relevant mode combinations, and the early-termination thresholds are derived empirically.
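A minimal sketch of the threshold-based early-termination idea described above: if the cost of coding a CU unsplit is already good relative to a predicted cost, the four expensive sub-CU evaluations are skipped. Both the neighbour-based predictor (`predict_fn`) and the 0.95 margin are illustrative assumptions, not a specific encoder's tuning:

```python
def decide_cu(cost_fn, x, y, size, min_size=8, predict_fn=None, margin=0.95):
    """Recursive CU split decision with optional threshold-based early
    termination (a sketch). cost_fn(x, y, size) returns the RD cost of
    coding the block unsplit; predict_fn, if given, predicts a 'typical'
    cost for this block size (e.g. from spatial/temporal neighbour
    statistics -- hypothetical here). Returns (best_cost, was_split)."""
    cost_unsplit = cost_fn(x, y, size)
    if size == min_size:
        return cost_unsplit, False
    # Early termination: the unsplit cost is already good enough, so the
    # four sub-CU evaluations (the expensive part) are skipped entirely.
    if predict_fn is not None and cost_unsplit <= margin * predict_fn(x, y, size):
        return cost_unsplit, False
    half = size // 2
    cost_split = sum(
        decide_cu(cost_fn, x + dx, y + dy, half, min_size, predict_fn, margin)[0]
        for dy in (0, half) for dx in (0, half))
    if cost_split < cost_unsplit:
        return cost_split, True
    return cost_unsplit, False
```

The speed/quality trade-off lives in the margin: a looser threshold terminates more often and encodes faster, at the risk of missing a split that a full search would have chosen.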
We developed a simple and efficient HEVC encoder software solution from scratch by making smart trade-offs between video quality and complexity. This solution is multi-core friendly with good scalability, is easily portable to homogeneous and heterogeneous architectures, and supports almost all HEVC tools. Along with efficient prediction and rate-control algorithms, the additional memory-transfer overheads of a multi-core context are well understood, and the data structures are defined accordingly. In our upcoming webinar, "The simplified & efficient HEVC encoder solution", we present comparisons of complexity and video quality between our solution and HM 13.0. To cover the various tools, the comparison is carried out for the All-Intra (AI), Random Access (RA), Low-delay B (LDB) and Low-delay P (LDP) coding configurations, with both fixed QP and rate control.

Our publications on HEVC:

Ramakrishna Adireddy
Technical Architect

VP9 – Next Generation Open Source Codec

Google's VP9 is a new codec that supports high-resolution coding at about half the bit rate of VP8, and is expected to give fierce competition to HEVC. VP9 is fundamentally not very different from HEVC, since both codecs employ similar compression techniques: larger block sizes, transforms, intra and inter prediction, and arithmetic entropy coding. However, there are differences in the usage and limitations of these techniques.
A comparison of the features supported by VP9 and HEVC is tabulated below:
Features | VP9 | HEVC
YUV format | 420, 422, 444 | 420, 422, 444
Number of reference frames | 3 (last frame, golden frame, AltRef frame); DPB size is 8 frames | 16
Bi-prediction | Yes | Yes
Block size | 4×4 to 64×64, including rectangular partitions (such as 4×8, 16×8, etc.) | 4×4 to 64×64, including rectangular partitions (such as 4×8, 4×16, etc.)
MV precision | 1/8-pel for luma, 1/16-pel for chroma | 1/4-pel for luma, 1/8-pel for chroma
Motion compensation | 8-tap filter for 1/8-, 1/4- and 1/2-pel | 8-tap for 1/2-pel and 7-tap for 1/4-pel
Intra prediction | 10 modes | 35 modes
In-loop deblocking | Yes | Yes
Transform sizes | 32×32, 16×16, 8×8, 4×4 | 32×32, 16×16, 8×8, 4×4
Tools | Tiles, segmentation | Tiles, WPP, SAO
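To make the motion-compensation row concrete: HEVC's half-pel luma interpolation uses the 8-tap filter sketched below (these are the standard HEVC taps; VP9 defines its own 8-tap filter sets, not shown here). A minimal 1-D version with edge replication:

```python
# Standard HEVC 8-tap half-pel luma filter taps; they sum to 64, so the
# result is normalized with a rounded shift by 6.
HEVC_HALF_PEL = [-1, 4, -11, 40, 40, -11, 4, -1]

def interp_half_pel(samples, x):
    """Half-pel sample between integer positions x and x+1 of a 1-D row
    of luma samples, replicating edge pixels for out-of-range taps."""
    acc = 0
    for k, c in enumerate(HEVC_HALF_PEL):
        idx = min(max(x + k - 3, 0), len(samples) - 1)   # edge replication
        acc += c * samples[idx]
    return (acc + 32) >> 6          # round and normalize (taps sum to 64)
```

On a flat area the filter passes the value through unchanged, and on a linear ramp it lands on the midpoint, which is exactly the behavior an interpolation filter should have.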
VP9's compression ratio is expected to be on par with HEVC's, but the efficiency of VP9 is not fully reflected in the WebM reference encoder: as of today, HEVC's reference encoder is better than VP9's. However, the VP9 codec is claimed to be patent free and is available under a BSD license, making it suitable for commercial applications as well as the open-source community. VP9 will take over high-resolution video encoding on YouTube down the line (VP9 is already supported in the Chrome browser through libvpx). Google has also released a hardware IP for VP9, the G2 VP9 decoder IP, for semiconductor companies that would like to ship products with VP9.


VP9 is available from two places


VP8 could not challenge H.264 for several reasons:
  • VP8 entered the video market very late compared to H264 (H264 came 6-7 years before VP8). By the time VP8 arrived, H264 was already widely accepted and deployed by the industry
  • Lack of clarity on the patent rights of VP8 in the initial years
  • No significant differentiation over H264
However, VP9 has entered the scene at almost the same time as HEVC. In addition, VP9 is released under a BSD license and claimed to be patent free. VP9 is also being deployed in the industry through collaboration and a strategic push by Google. Hence, VP9 will give strong competition to HEVC, but only time will tell who is going to win the battle. For now, let us enjoy it!
Ajay Basarur
Business Development Manager – I

Hybrid Thin Client App

Microsoft has standardized the keyboard shortcuts for operations such as open, new, save, print, cut, copy and paste across all Windows applications. Hence, irrespective of the application, one easily knows how to, say, print its contents (Ctrl+P). This standardization can be used to build what we call "hybrid" thin clients, specifically in the form of smartphone/tablet applications.
Today, iOS and Android thin-client apps lack a feature that exposes these shortcuts and leverages the standard commands defined by Microsoft.
ppinng!Top is a pioneer in leveraging these standard commands to provide an easy user interface. We realized that opening the File or Edit menu and selecting a command can be cumbersome on touch-based devices, and that using the on-screen soft keyboard is also difficult, because the default keyboards do not have Ctrl, Alt and similar keys. Hence we felt the need to provide buttons, or application menu options, locally as shortcuts to these standard commands (open, new, save, print, cut, copy, paste, etc.). This makes the application "hybrid", in the sense that some standard features of all Windows applications are provided natively. Since these buttons and menus are part of the app, they are easy to invoke: no hassle of opening the File or Edit menu of the application and then selecting an option.
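The idea can be sketched as a local table mapping command names to key chords, which the app expands into the press/release events it sends over the remote-desktop channel. The table, function and event names below are hypothetical illustrations, not ppinng!Top's actual implementation:

```python
# Microsoft's standard shortcuts, kept locally in the client app.
STANDARD_SHORTCUTS = {
    "open":  ("ctrl", "o"),
    "new":   ("ctrl", "n"),
    "save":  ("ctrl", "s"),
    "print": ("ctrl", "p"),
    "cut":   ("ctrl", "x"),
    "copy":  ("ctrl", "c"),
    "paste": ("ctrl", "v"),
}

def key_events(command, custom=None):
    """Expand a command name into the press/release events to send to the
    remote host. `custom` lets the user register extra shortcuts, like the
    user-defined commands described above."""
    table = dict(STANDARD_SHORTCUTS, **(custom or {}))
    keys = table[command]
    # Press modifiers first, release in reverse order.
    return ([("press", k) for k in keys] +
            [("release", k) for k in reversed(keys)])
```

A native "Print" button then just sends `key_events("print")`, and a user-defined shortcut is one more entry in the `custom` table.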
Well, we have gone two steps ahead!
We allow users to define custom commands and expose them as shortcut buttons or menu options natively in the application. This lets the user launch favorite and frequently used applications and commands. For example, a finance executive may want to open Tally, because that is what he or she generally works with; he or she can create a custom shortcut command and run that favorite application at the touch of a button. This is much more convenient than navigating through the Start menu or desktop icons.
We also provide a local UI to launch an application: instead of pressing "Windows + R" to invoke the Run prompt and typing the application name, the user just touches a button to launch a native dialog that takes the input, and ppinng!Top runs that command on the remote desktop!
ppinng!Top is truly a “hybrid” thin client!
Watch out for this advanced feature in ppinng!Top app coming soon.
Srikanth Peddibhotla