Parallelizing the HEVC Encoder

September 26, 2016
High Efficiency Video Coding (HEVC) is the next-generation video codec, which promises to cut the bit rate in half compared to H.264/AVC. HEVC has bigger block sizes, more prediction modes and a number of other tools to reach that target. However, this comes at a substantially higher computational cost. HEVC, as it stands today, needs about 8 times more computational power to deliver twice the compression ratio.
Keeping pace with Moore's law, general-purpose processors have been packing more logic into less space, thanks to evolving manufacturing processes. This added computational power is available mainly in the form of more cores per processor. So the focus of HEVC encoder design is not just on having the fastest algorithm with the best quality, but on having one that can be executed in parallel on multiple cores with a minimal penalty on quality. Server-grade processors now pack more than 18 cores into a single socket, which increases the importance of system design in HEVC encoders. This provides a strong impetus to transform algorithms and code flows so that they use the computational power available across all the cores.

So how does HEVC fare in a multicore scenario?

HEVC has included many tools which are parallel processing friendly. These multi-core friendly tools are discussed in detail in this whitepaper:
Parallelizing an HEVC encoder using slices and tiles is very simple. An input frame is divided into a number of slices or tiles equal to the number of cores or threads available for processing. This gets a working multicore encoder built in the shortest time, with a very good multicore scaling factor.
The scaling factor on a multicore system is the speedup achieved over a single core by using multiple cores for the same job. If a job takes 1s on a single core but only 0.5s on an N-core system, the scaling factor is 2. If N is 2 (a dual-core system), this is ideal scaling, i.e. 100%. But if it is a quad-core system, only 50% of the ideal scaling is achieved. It is almost impossible to achieve ideal scaling unless the job divides into fully independent sub-jobs.
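The arithmetic in the example above can be written down in a few lines. This is only an illustrative sketch; the function names `scaling_factor` and `scaling_efficiency` are made up for this post and are not part of any encoder API.

```python
def scaling_factor(t_single: float, t_multi: float) -> float:
    """Speedup of the multicore run over the single-core run."""
    return t_single / t_multi


def scaling_efficiency(t_single: float, t_multi: float, cores: int) -> float:
    """Fraction of the ideal (linear) speedup that was actually achieved."""
    return scaling_factor(t_single, t_multi) / cores


if __name__ == "__main__":
    # The job from the text: 1s on one core, 0.5s on an N-core system.
    for cores in (2, 4):
        eff = scaling_efficiency(1.0, 0.5, cores)
        print(f"{cores} cores: factor {scaling_factor(1.0, 0.5):.1f}, "
              f"efficiency {eff:.0%}")
```

The same speedup of 2 counts as ideal on two cores and as half of ideal on four, which is why the scaling factor is usually quoted together with the core count.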
Figure 1 : Slices and Tiles
An HEVC frame partitioned into slices or tiles and encoded on different cores should scale ideally, because the slices and tiles are independent of each other. In practice this is not achieved, because slices and tiles vary in complexity, so different cores take different amounts of time to encode them. The frame is complete only when the slowest slice or tile is done, so the less complex the slice or tile a thread encodes, the longer it waits after finishing its task. And the more the cores wait, the lower the scaling. This is further aggravated by the fast algorithms present in encoders, which predict the coding modes accurately and so finish easy regions much sooner than complex ones.
Also, encoding with slices or tiles turns the frame into a collection of independently encoded segments with no interlink between them. This can have a large impact on visual quality, with visible compression artefacts at the edges of slices or tiles. These artefacts can be partially avoided by applying the de-blocking and the new SAO filters across slice or tile boundaries. But when encoding a high-motion sequence at a challenging bit rate, the artefacts are likely to be noticeable.
The challenge in multicore HEVC encoding is always to achieve the best possible scaling while sacrificing the least possible video quality. The performance and quality of a video encoder are always in battle with each other, and with multicore the battle gets more ammunition. Wavefront Parallel Processing (WPP) plays the diplomat and keeps the peace to a certain extent.
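To make the effect of uneven slice complexity concrete, here is a small sketch. It assumes one slice per core, that a frame is finished only when its slowest slice is finished, and that the single-core time is simply the sum of the slice times (no threading overhead). The function name and the sample timings are hypothetical, not measurements.

```python
def slice_parallel_stats(slice_times):
    """Return (frame_time, waits, speedup, efficiency) for one frame.

    slice_times: encode time of each slice, one slice per core.
    """
    frame_time = max(slice_times)                   # slowest slice ends the frame
    waits = [frame_time - t for t in slice_times]   # idle time of each core
    speedup = sum(slice_times) / frame_time         # versus encoding serially
    efficiency = speedup / len(slice_times)
    return frame_time, waits, speedup, efficiency


if __name__ == "__main__":
    # Hypothetical per-slice encode times (ms) on a quad-core system.
    frame_time, waits, speedup, efficiency = slice_parallel_stats([10, 14, 8, 20])
    print("frame time:", frame_time)
    print("idle time per core:", waits)
    print(f"speedup {speedup:.2f}, efficiency {efficiency:.0%}")
```

With these made-up numbers the core that gets the simplest slice idles the longest, and the four cores together deliver a speedup of only 2.6.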


Wavefront Parallel Processing
Figure 2 : Wavefront Parallel Processing
One of the major serial processing blocks in any video encoder is the block-by-block arithmetic coding of the signalling information and transform coefficients in raster-scan order. Again, this can be parallelized with slices and tiles, but with a penalty on the bits taken to encode. With Wavefront Parallel Processing, or entropy sync, the arithmetic coding is parallelized with a catch: each CTU row takes its seed context from the row above, at the top-right neighbor of its first CTU, before it starts encoding. This results in a penalty which is smaller than with slice or tile encoding.
The same approach is taken to parallelize the mode decision stage of the encoder: each CTU waits for the completion of its top-right neighbor.
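The top-right rule can be turned into a simple schedule. The sketch below assumes every CTU takes one time step and that there are as many threads as rows; under those assumptions CTU (row r, column c) can run at step c + 2r, because its left neighbor runs at step c - 1 + 2r and its top-right neighbor at step c + 2r - 1. The function name `wavefront_steps` is made up for illustration.

```python
def wavefront_steps(rows, cols):
    """Group CTUs into steps; all CTUs in one step can be encoded in parallel.

    Assumes unit-time CTUs and one thread per row. Each row trails the row
    above it by two CTUs, which satisfies the top-right dependency.
    """
    waves = {}
    for r in range(rows):
        for c in range(cols):
            waves.setdefault(c + 2 * r, []).append((r, c))
    return [waves[step] for step in sorted(waves)]


if __name__ == "__main__":
    for step, ctus in enumerate(wavefront_steps(3, 4)):
        print(step, ctus)
```

For a frame of 3 rows by 4 columns this gives 8 steps instead of 12 serial ones, and the diagonal shape of each step is the wavefront that gives the technique its name.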
Parallel row HEVC encoding
Figure 3 : Parallel row HEVC encoding
The above figure shows a quad-core encoder whose pipeline has been built up and is operating in a steady state. Before the encoding of each CTU starts, the completion of its top-right neighbor is checked, and since the top row is ahead of the current row, the check almost always passes. This design preserves the data dependencies present in the original encoder and gives the best possible performance while sacrificing the least amount of quality. It has the overhead of pipeline ramp-up and ramp-down stages, but when encoding a large number of CTUs these can be neglected. There will also be small scaling losses due to differences in CTU encoding times.
The multicore parallelization schemes presented here scale up with more cores and bigger input resolutions, but with a drop in scaling efficiency. When the cores are spread over more than one socket, the locality of the data used by the encoder accounts for a major part of the encoder performance. PathPartner's multicore HEVC encoder is NUMA optimized to achieve the best possible scaling on systems with more than one socket.
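The ramp-up and ramp-down overhead can be estimated with a toy model of the row pipeline. This is a sketch under stated assumptions, not the scheduler of any real encoder: rows are handed out in order to whichever core becomes free first, a CTU starts only when the core has finished the previous CTU in its row and the top-right CTU in the row above is done, and the per-CTU times are supplied by the caller. The names `wavefront_makespan` and `uniform_times` are hypothetical.

```python
import heapq


def uniform_times(rows, cols, ctu_time=1.0):
    """Grid of identical CTU encode times (an idealized frame)."""
    return [[ctu_time] * cols for _ in range(rows)]


def wavefront_makespan(times, cores):
    """Time to encode one frame with row-level wavefront parallelism.

    times[r][c] is the encode time of CTU (r, c).
    """
    cols = len(times[0])
    free_at = [0.0] * cores          # min-heap: when each core becomes free
    heapq.heapify(free_at)
    prev_done = None                 # completion times of the row above
    for row in times:
        t = heapq.heappop(free_at)   # earliest free core takes the next row
        done = []
        for c, ctu_time in enumerate(row):
            if prev_done is not None:
                # Wait for the top-right CTU (the top CTU in the last column).
                t = max(t, prev_done[min(c + 1, cols - 1)])
            t += ctu_time
            done.append(t)
        heapq.heappush(free_at, t)
        prev_done = done
    return max(free_at)


if __name__ == "__main__":
    times = uniform_times(4, 10)
    makespan = wavefront_makespan(times, cores=4)
    total = sum(sum(row) for row in times)
    print(f"makespan {makespan}, speedup {total / makespan:.2f}")
```

For a tiny frame of 4 rows by 10 CTUs on four cores, the model gives a makespan of 16 time units against 40 for a single core, a speedup of 2.5, so the ramp stages dominate. Passing a larger grid, or a grid with unequal CTU times, shows how the overhead shrinks with frame size and how timing differences between CTUs cost additional scaling.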
Shashikantha Srinivas
Technical Lead