Video is becoming all-pervasive, and consumers want it in all types of devices with various levels of video quality. For some devices, such as cell phones or portable music/video players, a single digital signal processor (DSP) can handle all video-processing tasks and meet tight cost budgets.
For high-quality applications such as high-definition (HD) TV or video teleconferencing, however, the processing requirements are beyond those of any single device on the market today. Here multi-chip systems--whether made of multiple DSPs or DSP and FPGA combinations--can implement high-quality video without being excessively complicated or expensive. Deciding intelligently how to partition the tasks among processors requires designers not only to consider the resources available in the devices but also to examine the algorithms.
H.264 (MPEG-4 AVC) has emerged as the industry's hot new video codec. It produces MPEG-2 quality video with roughly half the number of bits, or far higher quality with the same number of bits. These benefits come at a price: H.264 demands significantly more processing power than an MPEG-2 codec.
Furthermore, the computational requirements are not fixed by the H.264 codec standard. They depend on a wide range of variables determined by the codec implementer including the resolution and frame rate, the output bitrate, the H.264 encoder profile, and specific features such as the search range, search algorithm, search partitions, refinement and number of reference frames. These parameters all have an effect on the perceived video quality. Thus, the computational requirements of a system are directly influenced by the video quality requirements and can vary substantially between systems that nominally use the same codec.
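To give a feel for how strongly these parameters drive the processing load, here is a rough back-of-the-envelope sketch. It assumes an exhaustive (full) motion search over 16x16 macroblocks and counts only the pixel differences needed for the sum-of-absolute-differences (SAD) comparisons. The EncoderConfig class and both function names are hypothetical, and real encoders normally use fast search algorithms that evaluate far fewer candidates, so treat the result as a worst-case illustration rather than a measurement.

```python
from dataclasses import dataclass


@dataclass
class EncoderConfig:
    """Hypothetical bundle of the parameters discussed in the text."""
    width: int         # luma width in pixels
    height: int        # luma height in pixels
    fps: int           # frames per second
    search_range: int  # motion search window is +/- this many pixels
    ref_frames: int    # number of reference frames searched


def macroblocks_per_second(cfg: EncoderConfig) -> int:
    # Frames are padded up to a whole number of 16x16 macroblocks.
    mb_cols = -(-cfg.width // 16)   # ceiling division
    mb_rows = -(-cfg.height // 16)
    return mb_cols * mb_rows * cfg.fps


def full_search_pixel_ops_per_second(cfg: EncoderConfig) -> int:
    # Assumption: every integer position in the window is tested on every
    # reference frame, and each test compares all 256 luma pixels.
    candidates = (2 * cfg.search_range + 1) ** 2
    pixels_per_mb = 16 * 16
    return macroblocks_per_second(cfg) * candidates * pixels_per_mb * cfg.ref_frames


cif = EncoderConfig(352, 288, 30, search_range=16, ref_frames=1)
hd = EncoderConfig(1920, 1080, 30, search_range=16, ref_frames=1)

for name, cfg in (("CIF", cif), ("1080p", hd)):
    ops = full_search_pixel_ops_per_second(cfg)
    print(f"{name}: {ops / 1e9:.1f} billion pixel differences per second")
```

Under these assumptions, moving from CIF to 1080p multiplies the search load by about 20, and doubling the number of reference frames doubles it again, which is why two systems that nominally use the same codec can need very different amounts of silicon.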
Look at the tasks: encoding
The major processing tasks in the video chain consist of encoding, transcoding, transrating and decoding. Even without selecting a particular DSP, it is possible to suggest ways to partition the tasks so that multiple devices are best employed.
The most computationally intensive task is encoding, where the system converts raw, uncompressed digital video into a compressed bitstream that is easier to send over bandwidth-limited channels.
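A quick calculation shows why compression is unavoidable on such channels. The sketch below assumes 8-bit 4:2:0 sampling (an average of 12 bits per pixel) and a target bitrate of 8 Mbit/s. Both figures are illustrative assumptions rather than values from any particular system, and the function names are hypothetical.

```python
def raw_bitrate_bps(width: int, height: int, fps: int, bits_per_pixel: int = 12) -> int:
    """Bitrate of uncompressed video; 12 bits/pixel assumes 8-bit 4:2:0 sampling."""
    return width * height * fps * bits_per_pixel


def required_compression_ratio(raw_bps: int, target_bps: int) -> float:
    """How many times smaller the bitstream must be than the raw video."""
    return raw_bps / target_bps


raw = raw_bitrate_bps(1920, 1080, 30)
ratio = required_compression_ratio(raw, 8_000_000)  # assumed 8 Mbit/s channel
print(f"raw 1080p30: {raw / 1e6:.0f} Mbit/s, needs about {ratio:.0f}:1 compression")
```

With these assumptions, raw 1080p video at 30 frames/s runs at roughly 746 Mbit/s, so the encoder has to shrink the stream by about two orders of magnitude.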

(a) Functional encoder partitioning

(b) Sliced encoder partitioning

(c) Combination encoder partitioning
Figure 1: Three primary ways to partition a video encoder. In functional encoding, an optional FPGA can take the role of the bottom block. For interprocessor communications (IPC), these example devices use Serial RapidIO (SRIO).
One straightforward way to divide the encoding tasks is functional partitioning, in which one instance of the encoder runs across multiple devices (Figure 1a). In a scheme that uses two DSPs, the top device focuses on intra-prediction (pixel-domain prediction that uses only information from within the current frame), transform and quantization, and entropy encoding. The bottom DSP handles interblock operations, those that use information from surrounding blocks or other frames. Further, the bottom DSP performs in-loop deblocking to deblock the reconstructed frame. The functional encoder provides higher-quality video, with fewer artifacts, because it works on the entire image at once.
One trade-off for this quality, though, is the number of process interdependencies and the volume of data that must move between the various stages. Thus the devices must have high-performance interprocessor communications (IPC). It is also more difficult to increase the number of processors across which the encoder is spread. For example, the deblocking function can be moved to a separate device to free up the other device for motion estimation, but doing so adds I/O requirements. This dataflow limits the number of DSPs that can be practically applied, so functional encoding has limited scalability.
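The scalability limit can be illustrated with a toy model of the encoder dataflow. The sketch below treats each stage as a node, assigns stages to devices, and lists the links that cross a device boundary and therefore need IPC. The stage names, the simplified set of links and the device labels are assumptions made for illustration. They follow the general shape of a hybrid video encoder, not the exact block diagram in Figure 1.

```python
# Simplified dataflow of a hybrid encoder: (producer stage, consumer stage).
# Assumed for illustration; a real encoder has more links than these.
EDGES = [
    ("intra_prediction", "transform_quant"),
    ("motion_estimation", "transform_quant"),
    ("transform_quant", "entropy_coding"),
    ("transform_quant", "deblocking"),    # reconstructed frame goes to the filter
    ("deblocking", "motion_estimation"),  # filtered frame becomes a reference
]


def cross_device_edges(assignment, edges=EDGES):
    """Return the links whose two stages run on different devices."""
    return [(src, dst) for src, dst in edges if assignment[src] != assignment[dst]]


# Two-DSP functional partitioning as described in the text.
two_dsp = {
    "intra_prediction": "dsp_top",
    "transform_quant": "dsp_top",
    "entropy_coding": "dsp_top",
    "motion_estimation": "dsp_bottom",
    "deblocking": "dsp_bottom",
}

# Same partitioning with deblocking moved to a third (hypothetical) device.
three_dsp = dict(two_dsp, deblocking="dsp_extra")

for name, plan in (("two DSPs", two_dsp), ("three DSPs", three_dsp)):
    links = cross_device_edges(plan)
    print(f"{name}: {len(links)} IPC links -> {links}")
```

In this model the two-DSP split needs two IPC links, while moving deblocking to its own device turns the previously on-chip path from the deblocking filter to motion estimation into a third link. Every additional device tends to convert internal paths into external I/O in the same way.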