The algorithms used in video applications are becoming increasingly sophisticated. The continuous evolution of standards and the highly competitive nature of the industry challenge design teams to find more efficient flows for implementing video hardware. Traditional design flows require developing the micro-architecture, coding the RTL, and verifying the RTL against the original functional C or MATLAB specification.
This article presents an overview of a C-based design flow that enables designers to generate high-quality hardware for video algorithms. Algorithmic C synthesis is used to generate optimized RTL from an algorithmic specification written in pure ANSI C++. A wide range of micro-architectures for ASIC and FPGA target technologies can be generated from the same source by interactively setting directives during synthesis. The synthesis flow also generates the necessary wrappers so that the original C++ testbench may be used to verify the generated RTL.
A video line filter example is used to illustrate techniques that are useful for coding video algorithms to produce high-performance hardware using C-based design.
1. Introduction to Algorithmic C Design
Figure 1 shows how the algorithmic C design methodology fits into a typical design flow and how it compares to a traditional RTL design flow. Algorithmic synthesis automates the generation of an optimized micro-architecture that meets the desired performance/area goals. It also enables using the original C testbench to functionally verify the generated RTL [1]. Algorithmic synthesis automates what is otherwise a time-consuming and error-prone process of crafting the micro-architecture, coding the RTL, coding the testbench, and verifying and debugging both the RTL and the testbench. As many as 60% of functional bugs are introduced in the manual generation of RTL. Algorithmic synthesis provides a safer flow by automating the generation of RTL. In the traditional flow, a number of iterations of tweaking the RTL or even revising the micro-architecture may be required to meet the target clock frequency.

Figure 1: Traditional RTL flow compared to the C-Based Design Flow
The advantages of an automated C-based synthesis flow are reflected both in significantly reduced design times and in higher-quality designs, because a variety of micro-architectures can be rapidly explored. The C synthesis product used in this article is Catapult C Synthesis. The algorithmic synthesis process generates the RTL with detailed knowledge of the delay of each component, which eliminates the guesswork that is otherwise unavoidable when the micro-architecture and RTL are generated manually.
The algorithm to be synthesized is written in pure ANSI C++. The word "pure" here emphasizes that there are no extensions to the language and that the functionality to be synthesized mirrors the functionality of the algorithm. That is, no complexities of timing, concurrency, or target technology are coded in the algorithm. The C++ language is the most widely used modeling language for system design and is defined by a very mature standard: ISO/IEC 14882:1998 (approved by ANSI 7/27/1998). Open source C++ datatype libraries (implemented as templatized classes) provide bit-accurate integer and fixed-point datatypes. Such datatypes facilitate the numerical refinement of algorithms to obtain efficient hardware implementations, and they make it possible to write IP blocks that are parameterized in the precision used to implement the algorithm. For example, using templatized bit-accurate datatypes it is possible to write a generic FIR filter that is parameterized on the number of filter taps as well as the precision of the computation.
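The generic FIR filter mentioned above might be sketched roughly as follows. This is only an illustration, not code from the flow described here: the class name `Fir`, its template parameters, and the `step` method are hypothetical, and built-in integer types stand in for the bit-accurate datatypes. Any type that supports construction from 0, multiplication, and addition could be substituted for `SampleT` and `AccT`.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Generic direct-form FIR filter (illustrative sketch; all names are hypothetical).
//   SampleT : type of the input samples and coefficients
//   AccT    : type of the accumulator and output; must be wide enough for the sum
//   N_TAPS  : number of filter taps
template <typename SampleT, typename AccT, std::size_t N_TAPS>
class Fir {
    static_assert(N_TAPS >= 1, "a FIR filter needs at least one tap");

public:
    explicit Fir(const std::array<SampleT, N_TAPS>& coeffs)
        : coeffs_(coeffs), delay_{} {}

    // Consume one input sample and return one output sample.
    AccT step(SampleT x) {
        // Shift the delay line by one position; the oldest sample is discarded.
        for (std::size_t i = N_TAPS - 1; i > 0; --i) {
            delay_[i] = delay_[i - 1];
        }
        delay_[0] = x;

        // Multiply-accumulate over all taps in the accumulator precision.
        AccT acc = AccT(0);
        for (std::size_t i = 0; i < N_TAPS; ++i) {
            acc += static_cast<AccT>(coeffs_[i]) * static_cast<AccT>(delay_[i]);
        }
        return acc;
    }

private:
    std::array<SampleT, N_TAPS> coeffs_;
    std::array<SampleT, N_TAPS> delay_;  // delay_[0] holds the newest sample
};
```

Instantiating `Fir<std::int16_t, std::int32_t, 3>` with coefficients 1, 2, 1 and feeding it a unit impulse returns the coefficients in order, followed by zeros. Changing the tap count or the precision only requires changing the template arguments, which is the kind of parameterization the paragraph above refers to.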
In the algorithmic C flow, the designer explores different architectures by directing how data will move in and out of the block (interface synthesis), mapping arrays to memories, and deciding how much parallelism is required to meet the throughput/latency goals. In traditional flows, defining the architecture and generating the RTL from the C model is done manually, a process that may require several months to complete. In a C-based synthesis flow, the architecture definition and the generation of RTL are often accomplished in a matter of days to weeks.
One key advantage of using the algorithmic flow is that the source remains uncommitted to a target implementation technology and a target micro-architecture. That is, efficient ASIC and FPGA implementations meeting a wide range of performance or interface requirements may be generated from the same C++ source. For instance, the same source may be used to generate FPGA hardware for prototyping and ASIC hardware for production. This is of particular interest for video designs, because FPGA prototyping speeds up the validation of computationally intensive video processing blocks.
The typical design flow for implementing an algorithm starts with writing the algorithm at a functional level using languages such as MATLAB, C, or a combination of the two (C here refers to both C and C++). Because of its faster execution speed, C is typically preferred over MATLAB, especially for modules being implemented in hardware. These modules are often the most computationally intensive, making them the most demanding to simulate. The algorithm may initially be written using floating-point arithmetic and then numerically refined to finite-precision arithmetic using either integer or fixed-point bit-accurate data types.
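As a rough illustration of this refinement step, the sketch below compares a floating-point 3-tap weighted sum of pixels with a version whose coefficients have been quantized to a fixed number of fractional bits. The function names (`to_fixed`, `blend_ref`, `blend_fixed`) and the choice of 8-bit pixels are assumptions made for the example; built-in integers stand in for bit-accurate fixed-point types.

```cpp
#include <cmath>
#include <cstddef>
#include <cstdint>

// Quantize a real value to a signed integer with FRAC_BITS fractional bits,
// rounding to nearest (hypothetical helper).
template <int FRAC_BITS>
std::int32_t to_fixed(double v) {
    static_assert(FRAC_BITS > 0 && FRAC_BITS < 31, "unsupported fraction width");
    return static_cast<std::int32_t>(std::lround(v * (1 << FRAC_BITS)));
}

// Floating-point reference: weighted sum of three 8-bit pixels.
inline double blend_ref(const double (&c)[3], const std::uint8_t (&x)[3]) {
    double acc = 0.0;
    for (std::size_t i = 0; i < 3; ++i) {
        acc += c[i] * x[i];
    }
    return acc;
}

// Finite-precision version: coefficients carry FRAC_BITS fractional bits.
// Assumes non-negative coefficients so the right shift is well defined
// on all compilers.
template <int FRAC_BITS>
std::int32_t blend_fixed(const std::int32_t (&c)[3], const std::uint8_t (&x)[3]) {
    std::int32_t acc = 0;
    for (std::size_t i = 0; i < 3; ++i) {
        acc += c[i] * x[i];
    }
    // Add half an LSB, then drop the fractional bits.
    return (acc + (1 << (FRAC_BITS - 1))) >> FRAC_BITS;
}
```

Running both versions over the same test vectors in the C++ testbench shows how many fractional bits are needed before the fixed-point result tracks the floating-point reference closely enough for the application.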
This article is organized as follows. Section 2 presents an overview of algorithmic C synthesis and the transformations that are essential in that process. Section 3 provides an introduction to general issues that need to be taken into account when coding the C algorithm for synthesis. Section 4 and Section 5 cover coding aspects of a video line filter and cascades of filters to obtain high-performance hardware.