Digital video found its first big consumer market in DVD players, and has moved
on from there. Now you can buy digital set-top boxes, camcorders, personal video
recorders (PVRs), portable media players, and even digital-video-enabled cell
phones. Products that can only handle analog video will soon be extinct; they’ll
be relegated to technology museums, sitting next to vinyl records and eight-track
tape players.
The mass migration from analog to digital video has been enabled by video compression
algorithms, or “codecs” (for COmpression/DECompression). In the March 2004
edition of Inside DSP we introduced the basics of video compression.
In this article, we take a closer look at one of the hottest new video codecs,
H.264.
H.264 was jointly developed by the Moving Picture Experts Group (MPEG, part
of ISO) and the International Telecommunication Union (ITU). H.264, published
in 2003, is the standard’s ITU name; it also goes by the somewhat lengthy
“MPEG-4 Part 10 Advanced Video Coding (AVC).” This designation distinguishes
it from MPEG-4 Part 2 (often referred to as MPEG-4), a successor to MPEG-2
that has had limited success in the market.
H.264 is joining a field of established video codecs. The most popular of these
is MPEG-2, which is used in all current DVD players. Windows Media 9 and DivX
are widely used in streaming video applications (i.e., applications where compressed
video “streams” over the Internet and is played back in real time
rather than being stored first).
One key attribute of a video compression application is the bit rate of the
compressed video stream. Codecs that target specific applications are designed
to stay within the bit rate constraints of these applications, while offering
acceptable video quality. For example, DVDs use 6-8 Mbps with MPEG-2; video
conferencing applications require 50-300 kbps using H.263 (a video conferencing
codec). Streaming video applications typically require 50 to 500 kbps, but can
exceed 1 Mbps.
Emerging digital video applications such as HDTV and HD-DVD can easily demand
a staggering 20-40 Mbps using MPEG-2. Such high bit rates translate into huge
storage requirements for HD-DVDs, and a limited number of channels for HDTV.
Thus, a key motivation for developing a new codec is to lower the bit rate while
preserving (or even improving) video quality relative to MPEG-2. This was the
motivation that led to the development of H.264.
As an example of the improvement offered by H.264, Figure 1 shows the same
video frame encoded using MPEG-2 and H.264 at the same bit rate.
Functional overview
Figure 2 shows a simplified block diagram of the H.264 encoder. The encoder
uses either intra-frame prediction or motion estimation and compensation to
predict the pixels of each image block. Intra-frame prediction uses the pixels
of neighboring blocks to predict the pixels of the current block. Motion estimation
finds a block in a previously encoded frame that closely matches the current
block, and motion compensation uses the selected block to predict the current
block. The difference between the predicted pixels and actual pixels is transformed
into the frequency domain, generating a block of frequency coefficients. These
coefficients are quantized, and the output bitstream is further compressed using
entropy coding.
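The data flow described above can be sketched in a few lines of Python. This is a toy illustration rather than an H.264 encoder: the transform and entropy coding stages are left out, the pixel values and the quantization step are made up, and the function names are our own.

```python
def quantize(value, qstep):
    """Uniform quantizer: nearest multiple of qstep, symmetric about zero."""
    sign = -1 if value < 0 else 1
    return sign * ((abs(value) + qstep // 2) // qstep)


def encode_block(actual, predicted, qstep):
    """Return the quantized prediction residual for one block of pixels."""
    residual = [a - p for a, p in zip(actual, predicted)]
    return [quantize(r, qstep) for r in residual]


def decode_block(levels, predicted, qstep):
    """Rebuild the block from the prediction plus the dequantized residual."""
    return [p + level * qstep for p, level in zip(predicted, levels)]


actual = [100, 104, 97, 120]     # hypothetical pixel values
predicted = [98, 103, 99, 110]   # hypothetical prediction of those pixels
levels = encode_block(actual, predicted, qstep=4)
rebuilt = decode_block(levels, predicted, qstep=4)
print(levels, rebuilt)
```

Only the small integers in `levels` need to be transmitted. The better the prediction, the closer they are to zero and the fewer bits they cost; the quantization step controls how much reconstruction error is tolerated.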
For H.264 to be successful, it must overcome a key hurdle: achieving widespread
adoption among product designers and consumers. A new video codec is more likely
to be widely adopted if it can serve a variety of applications. This is challenging,
because different applications require different bit rates and video quality,
among other characteristics. To meet these divergent needs, digital video codec
standards usually specify multiple variants, called “profiles.”
Some profiles are designed for ease of implementation and low processing requirements;
others emphasize reduced bit rate or high video quality. Profiles that target
streaming video applications are often designed for improved error resilience.
H.264 defines three profiles: Baseline, Main, and Extended. The Baseline profile
is the simplest profile; it targets mobile applications with limited processing
resources.
The Main profile is intended for digital television broadcasting and next-generation
DVD applications, and adds features that improve video quality—at the
expense of a significant increase in computational complexity.
The Extended profile targets streaming video, and includes features to improve
error resilience and to facilitate switching between different bit streams.
In addition to profiles, video codecs typically define multiple “levels,”
each of which specifies a set of constraints for key algorithm parameters, such
as the maximum bit rate, frame rate (in terms of frames per second, or fps),
resolution, number of macroblocks per frame, motion vector range, etc. For example,
in H.264’s Level 1, the maximum resolution is QCIF (144 lines and 176
pixels per line) at a frame rate of 15 fps.
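To see what such a constraint means for a decoder, it helps to convert the resolution and frame rate into macroblock counts. The short calculation below assumes 16x16-pixel macroblocks; the function names are ours.

```python
MACROBLOCK_SIZE = 16  # luma pixels along each side of a macroblock


def macroblocks_per_frame(width, height):
    """Number of macroblocks needed to cover a frame (rounding up at the edges)."""
    across = -(-width // MACROBLOCK_SIZE)   # ceiling division
    down = -(-height // MACROBLOCK_SIZE)
    return across * down


def macroblocks_per_second(width, height, fps):
    """Macroblock throughput the decoder must sustain."""
    return macroblocks_per_frame(width, height) * fps


qcif_per_frame = macroblocks_per_frame(176, 144)         # 11 * 9 = 99
qcif_per_second = macroblocks_per_second(176, 144, 15)   # 99 * 15 = 1485
print(qcif_per_frame, qcif_per_second)
```

A QCIF frame is 11 macroblocks wide and 9 high, so a decoder running QCIF at 15 fps has to process 1,485 macroblocks every second.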
Although levels and profiles are independent, in practice using a particular
profile implies the use of a particular set of levels, and vice versa. For example,
inexpensive products (e.g., those with small screens or modest processor speeds)
are likely to use the Baseline profile along with a level that specifies a low
resolution, frame rate, and bit rate.
Comparing key features
Table 1 compares some of the key features and characteristics of the three
H.264 profiles with those of the most common MPEG-2 profile (Main Profile at
Main Level) and the MPEG-4 Advanced Simple Profile.
As shown in Table 1, H.264 has a number of features that differ from those
of MPEG-2 and MPEG-4. Not all features are supported by all three H.264 profiles;
each profile makes a different tradeoff in terms of video quality, bit rate,
and computational complexity, among other characteristics. In this section we
discuss a few key features of H.264 and explain how they help it achieve high
video quality and low bit rates.
Small Transform Size. MPEG-2, MPEG-4, and H.264 all transform
the input video to the frequency domain using a Discrete Cosine Transform (DCT)
or, in the case of H.264, an integer approximation of the DCT.
(This transformation facilitates frequency-based compression techniques.) Unlike
MPEG-2 and MPEG-4, however, H.264 uses a 4x4-pixel base transform rather than
the more common block size of 8x8 pixels. The smaller block size reduces ringing
artifacts, thus improving picture quality.
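As a rough illustration, the 4x4 transform can be written as two small matrix multiplications. The sketch below assumes the integer matrix that is commonly given for H.264's 4x4 core transform (an integer approximation of the DCT) and omits the scaling that a real codec folds into quantization; the helper names are ours.

```python
# Integer approximation of the 4x4 DCT basis (assumed core transform matrix).
CF = [
    [1, 1, 1, 1],
    [2, 1, -1, -2],
    [1, -1, -1, 1],
    [1, -2, 2, -1],
]


def matmul(a, b):
    """Multiply two 4x4 matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def transpose(m):
    return [list(row) for row in zip(*m)]


def forward_4x4(block):
    """Compute CF * block * CF^T using integer arithmetic only."""
    return matmul(matmul(CF, block), transpose(CF))


flat_block = [[10] * 4 for _ in range(4)]   # hypothetical block with no detail
coefficients = forward_4x4(flat_block)
print(coefficients[0])
```

A block with no detail produces a single non-zero (DC) coefficient in the top-left corner; every other coefficient is zero and costs almost nothing to transmit.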
Variable Block Size. H.264 supports macroblock partitioning
(i.e., partitioning a 16x16 macroblock into several smaller blocks). This partitioning
can be done with a number of block sizes: 16x16, 16x8, 8x16, 8x8, 8x4, 4x8,
and 4x4. Larger blocks need fewer motion vectors (and thus fewer bits), but
may produce a bigger error between the original block and the motion-compensated
block (thus requiring more bits). The encoder attempts to partition the frame
in a way that makes an efficient trade-off between the number of bits needed
to transmit the motion vectors and the number of bits needed to transmit the
DCT coefficients of the residual.
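The trade-off can be made concrete with a toy cost model. Everything in the sketch below is hypothetical: the bit counts are invented, only three of the partition sizes are considered, and a real encoder would also weigh distortion rather than bits alone.

```python
def total_bits(num_vectors, bits_per_vector, residual_bits):
    """Bits for the motion vectors plus bits for the residual coefficients."""
    return num_vectors * bits_per_vector + residual_bits


def choose_partition(candidates, bits_per_vector):
    """candidates maps a partition name to (num_vectors, residual_bits)."""
    return min(
        candidates,
        key=lambda name: total_bits(
            candidates[name][0], bits_per_vector, candidates[name][1]
        ),
    )


# Hypothetical figures for one macroblock: smaller blocks need more vectors
# but leave a smaller residual.
candidates = {
    "16x16": (1, 300),
    "8x8": (4, 180),
    "4x4": (16, 120),
}
print(choose_partition(candidates, bits_per_vector=20))
```

With these made-up numbers the 8x8 partitioning wins at 20 bits per vector. If vectors were much cheaper to transmit, the 4x4 partitioning would win; if they were much more expensive, a single 16x16 block would.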
Intra-Frame and Inter-Frame Prediction. Unlike MPEG-2, all
three H.264 profiles use intra-frame block prediction, which estimates pixel
values using previously decoded pixels from the same frame. H.264 also uses
inter-frame block prediction, which uses motion estimation and motion compensation
to exploit similarities between consecutive frames in a video sequence.
For each block, H.264 uses either inter-frame prediction or intra-frame prediction,
selecting the method that yields the most efficient coding. It does this by
performing both methods and then choosing the one that results in the smallest
error between the original and the predicted block. The use of intra-frame and
inter-frame prediction is essential to lowering the bit rate of H.264 while
preserving video quality.
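The selection step can be sketched as follows. We assume the sum of absolute differences (SAD) as the error measure and an exhaustive search over whole-pixel displacements; both are common choices in encoders but are our assumptions here, and the frame data is synthetic.

```python
def sad(block_a, block_b):
    """Sum of absolute differences between two equally sized blocks."""
    return sum(abs(a - b)
               for row_a, row_b in zip(block_a, block_b)
               for a, b in zip(row_a, row_b))


def get_block(frame, top, left, size):
    return [row[left:left + size] for row in frame[top:top + size]]


def motion_search(current, reference, top, left, size, search_range):
    """Return (best_sad, (dx, dy)) over all displacements within search_range."""
    best = None
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            y, x = top + dy, left + dx
            inside = (0 <= y <= len(reference) - size
                      and 0 <= x <= len(reference[0]) - size)
            if inside:
                cost = sad(current, get_block(reference, y, x, size))
                if best is None or cost < best[0]:
                    best = (cost, (dx, dy))
    return best


def choose_mode(current, intra_prediction, reference, top, left, size, search_range):
    """Pick whichever prediction leaves the smaller error."""
    intra_cost = sad(current, intra_prediction)
    inter_cost, vector = motion_search(current, reference, top, left,
                                       size, search_range)
    if inter_cost <= intra_cost:
        return ("inter", vector)
    return ("intra", None)


# Synthetic 8x8 reference frame in which every pixel value is unique.
reference = [[y * 8 + x for x in range(8)] for y in range(8)]
# The current 4x4 block sits at (top=2, left=2) but its content comes from
# (top=3, left=1) in the reference, i.e. a displacement of dx=-1, dy=+1.
current = get_block(reference, 3, 1, 4)
flat_intra = [[30] * 4 for _ in range(4)]   # hypothetical intra prediction
print(choose_mode(current, flat_intra, reference, 2, 2, 4, 2))
```

Here the search recovers the displacement exactly, so the inter prediction has zero error and is chosen over the flat intra prediction.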
Motion Vector Prediction. H.264 tries to predict each motion
vector based on motion vectors in surrounding blocks. It then transmits the
error between the predicted and actual motion vectors rather than transmitting
the motion vector itself. This method is effective in reducing bit rates in
cases where a large moving object spans many blocks and the motion vectors of
those blocks are similar. In this case, the error between the predicted and actual
vectors is small, and requires fewer bits to transmit than the actual motion vector.
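A minimal sketch of the idea follows. It assumes the predictor is the component-wise median of the vectors of three neighboring blocks (left, above, and above-right), which is the usual case in H.264; the special cases for particular partition shapes and for unavailable neighbors are ignored, and the vectors are made up.

```python
def median3(a, b, c):
    return sorted((a, b, c))[1]


def predict_mv(left, above, above_right):
    """Component-wise median of three neighboring motion vectors (x, y)."""
    return (median3(left[0], above[0], above_right[0]),
            median3(left[1], above[1], above_right[1]))


def mv_difference(mv, left, above, above_right):
    """The value that is actually transmitted: actual minus predicted."""
    px, py = predict_mv(left, above, above_right)
    return (mv[0] - px, mv[1] - py)


# Hypothetical neighbors belonging to one large object moving right.
left, above, above_right = (4, 1), (5, 0), (6, 1)
print(mv_difference((5, 1), left, above, above_right))
```

When the current block moves with its neighbors, the difference is zero or close to it. The decoder forms the same prediction from vectors it has already decoded and adds the difference back.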
Quarter-Pixel Motion Vector Resolution. H.264 uses motion
vectors with 1/4-pixel resolution, compared to 1/2-pixel resolution in MPEG-2. The
finer resolution helps to decrease the magnitude of the residuals and thus reduce
the number of bits needed to transmit them.
Achieving sub-pixel resolution requires interpolation between pixels. Interpolating
for 1/4-pixel resolution rather than 1/2-pixel resolution is somewhat more computationally
demanding, and requires higher memory bandwidth. In addition, using 1/4-pixel
resolution means that the encoder has to evaluate more candidate motion vectors
for each block—further increasing the computational load.
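The extra work is easy to see in a one-dimensional sketch. We assume the six-tap filter with weights (1, -5, 20, 20, -5, 1)/32 that is widely described for H.264 half-sample luma interpolation, with quarter-sample values formed by averaging neighboring positions; edge handling here simply repeats the border pixel, and the pixel rows are made up.

```python
TAPS = (1, -5, 20, 20, -5, 1)  # weights sum to 32


def clip_pixel(value):
    return max(0, min(255, value))


def half_sample(row, i):
    """Interpolate the half-sample position between row[i] and row[i + 1]."""
    acc = 0
    for k, tap in enumerate(TAPS):
        j = min(max(i - 2 + k, 0), len(row) - 1)   # repeat edge pixels
        acc += tap * row[j]
    return clip_pixel((acc + 16) >> 5)             # divide by 32 with rounding


def quarter_sample(row, i):
    """Quarter-sample position between row[i] and the half-sample to its right."""
    return (row[i] + half_sample(row, i) + 1) >> 1


edge = [0, 0, 0, 100, 100, 100]   # hypothetical row containing a sharp edge
print(half_sample(edge, 2), quarter_sample(edge, 2))
```

Each half-sample value takes six multiply-accumulates and six pixel reads, and each quarter-sample value needs a half-sample value first, which is where the added computation and memory bandwidth come from.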
Multiple Reference Frames. The Main and Extended profiles,
like other MPEG standards, support bi-directional motion prediction, which uses
both past and future reference frames to predict the contents of the current
block.
However, H.264 differs from other codecs in that the encoder is allowed to
use more than two reference frames (i.e., more than one past and one future)
for motion estimation. Using multiple past or future frames can improve coding
efficiency when encoding video sequences with repetitive motion or brief object
occlusion. For example, the H.264 encoder can reference an older frame from
the sequence and achieve a better compression ratio than an encoder that always
uses the previous frame.
The Main and Extended profiles also use weighted prediction, which blends two
motion-compensated blocks from different reference pictures, thus improving
compression efficiency in fade-ins and fade-outs. This feature, however, doubles
the work performed by the encoder and decoder in motion estimation and compensation.
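The blending itself is simple, as the sketch below shows. It is a simplified weighted average with rounding, not the exact formula from the standard, and the weights and pixel values are hypothetical.

```python
def weighted_prediction(block0, block1, w0, w1):
    """Blend two motion-compensated blocks with integer weights w0 and w1."""
    total = w0 + w1
    return [
        max(0, min(255, (w0 * a + w1 * b + total // 2) // total))
        for a, b in zip(block0, block1)
    ]


bright = [200, 200, 200, 200]   # block from a reference before the fade
dark = [100, 100, 100, 100]     # block from a reference after the fade
# A frame one quarter of the way through the fade leans on the first reference.
print(weighted_prediction(bright, dark, w0=3, w1=1))
```

With equal weights this reduces to a plain average of the two references. Unequal weights let the prediction track the changing brightness of a fade, which leaves a smaller residual to encode. Two blocks must be fetched and motion-compensated for every predicted block, which is the source of the doubled workload.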
In-Loop Deblocking Filter. H.264 uses an adaptive in-loop
deblocking filter to deblock the reconstructed frame. Implementing the deblocking
filter in-loop (i.e., as part of the encoding and decoding algorithms rather
than as a separate post-processing step) can increase image quality, especially
at low bit rates—but it increases computational complexity. Neither MPEG-2
nor MPEG-4 uses in-loop deblocking.
Sophisticated Entropy Coding. Both MPEG-2 and MPEG-4 use Huffman
coding (a type of entropy coding) to encode the output bit stream. Instead of
Huffman coding, the H.264 Baseline and Extended profiles use Context-based Adaptive
Variable Length Coding (CAVLC). CAVLC is somewhat more complicated than Huffman
coding, but is still fairly simple. It uses an integer number of bits to represent
each coded value, and doesn’t require a lot of processing horsepower.
The Main profile uses a more complex entropy coding scheme, called Context-based
Adaptive Binary Arithmetic Coding (CABAC). Since it is based on arithmetic
coding, CABAC can use a fractional number of bits to encode each coded value,
which results in better coding efficiency (and a lower bit rate) than CAVLC—at
the cost of additional computational complexity.
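The difference between integer and fractional bit counts shows up clearly with a skewed source. The sketch below does not implement CAVLC or CABAC; it only compares the average length of a Huffman code (which must spend a whole number of bits per symbol) with the entropy of the source, which an arithmetic coder can approach. The probabilities are made up.

```python
import heapq
import math


def entropy_bits(probs):
    """Theoretical minimum average bits per symbol."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def huffman_lengths(probs):
    """Code length, in whole bits, that a Huffman code assigns to each symbol."""
    heap = [(p, [i]) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    lengths = [0] * len(probs)
    while len(heap) > 1:
        p1, symbols1 = heapq.heappop(heap)
        p2, symbols2 = heapq.heappop(heap)
        for i in symbols1 + symbols2:
            lengths[i] += 1          # merged symbols sit one level deeper
        heapq.heappush(heap, (p1 + p2, symbols1 + symbols2))
    return lengths


def average_length(probs, lengths):
    return sum(p * n for p, n in zip(probs, lengths))


probs = [0.9, 0.1]   # hypothetical two-symbol source, heavily skewed
huffman_avg = average_length(probs, huffman_lengths(probs))
print(huffman_avg, entropy_bits(probs))
```

For this source the integer-length code cannot do better than 1 bit per symbol, while the entropy is a little under half a bit. Closing that gap is what makes arithmetic coding attractive when symbol probabilities are highly skewed.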
Multiple Intra-Prediction Modes. H.264 defines several intra-prediction
modes for predicting blocks. Each mode specifies a method of predicting the
pixels in a 16x16 or 4x4-pixel block using the previously decoded pixels above
and to the left of the current block. A few interpolation directions (in addition
to horizontal and vertical) are supported; there is also a DC mode, which sets
all pixels to the average of the reference pixels, and a special plane-fitting
mode useful for areas with constant luminance gradient.
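Three of these modes are simple enough to sketch for a 4x4 block. The code below is illustrative only: it covers just the vertical, horizontal, and DC modes, uses the sum of absolute differences to pick a mode (our assumption), and works on made-up pixel values.

```python
def predict_vertical(above, left):
    """Copy the row of pixels above the block down every row."""
    return [list(above) for _ in range(4)]


def predict_horizontal(above, left):
    """Copy the column of pixels to the left across every column."""
    return [[value] * 4 for value in left]


def predict_dc(above, left):
    """Fill the block with the rounded average of the eight reference pixels."""
    dc = (sum(above) + sum(left) + 4) >> 3
    return [[dc] * 4 for _ in range(4)]


MODES = {
    "vertical": predict_vertical,
    "horizontal": predict_horizontal,
    "dc": predict_dc,
}


def sad(block_a, block_b):
    return sum(abs(a - b)
               for row_a, row_b in zip(block_a, block_b)
               for a, b in zip(row_a, row_b))


def best_mode(block, above, left):
    """Return the name of the mode with the smallest prediction error."""
    return min(MODES, key=lambda name: sad(block, MODES[name](above, left)))


above = [10, 20, 30, 40]    # previously decoded pixels above the block
left = [10, 10, 10, 10]     # previously decoded pixels to its left
striped = [[10, 20, 30, 40] for _ in range(4)]   # vertical stripes
print(best_mode(striped, above, left))
```

A block of vertical stripes is predicted perfectly by the vertical mode, so only the mode choice needs to be signaled and the residual is zero.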
New Slices. Each frame in H.264 can be partitioned into one
or more “slices;” each slice, in turn, can contain a different number
of macroblocks. There are several different slice types. All three profiles
support “I-slices” that contain only intra-predicted macroblocks,
and “P-slices” that contain inter-predicted (motion-compensated)
macroblocks. The Main and Extended profiles also support “B-slices,”
which implement inter-prediction from two reference frames.
The Extended profile adds “Switching I” (SI) slices and “Switching
P” (SP) slices. These slices are used to facilitate features like random
access and video stream switching, and are useful for streaming video applications.
For example, SP slices can be used for seamless switching between video streams
carrying the same video content but encoded at different bit rates. SI slices
use only intra-frame prediction (not inter-frame prediction), and thus can be
used for switching between unrelated video streams.
The Baseline and Extended profiles also support “slice groups,”
“redundant slices,” and “arbitrary slice order (ASO).”
Slice groups allow the macroblocks that make up a frame to be transmitted in
an order that’s different from the raster display order, which improves
error resilience. Redundant slices carry a reduced-resolution version of the
primary video sequence. These slices are normally ignored by the decoder, but
can be used if the primary video sequence is corrupted. Arbitrary slice order
allows the slices to be transmitted in any order.
Implementation issues
H.264 is more complex and computationally demanding than previous-generation
video codecs. The complexity of the codec translates into additional development
effort and a longer time to market; both of these can be a significant burden
for companies trying to implement H.264 themselves. Reference C code is
available, but it is not a good starting point for a well-optimized implementation
because it is written to illustrate the specification—not for efficiency.
Fortunately, there are companies that specialize in providing optimized H.264
codecs for different platforms. Using such third-party implementations reduces
the software development effort, but still requires a significant system design
effort. Designers also need to ensure that they have sufficient horsepower to
run the codec in real time. The latter can be an important consideration given
that H.264 has higher processing demands than previous-generation codecs. To read
more about the challenges
of implementing video software, see “Developing
Software for a Digital Video Product.”
Watch that codec
The initial reaction to H.264 from the industry has been extremely positive,
which has encouraged many companies to develop H.264-based solutions. (At least
one company—Conexant—has already begun sampling H.264 chips.) The
codec’s high implementation costs and processing requirements, along with
competition from other codecs, may initially slow its rate of adoption—but
it’s already clear that H.264 is the codec to watch.