[For more on this topic, see "Optimizing Compilers and Embedded DSP Software" and "Get better DSP code from your compiler."]
As DSP processors become increasingly powerful, the portion of code that can remain at the C level grows. However, compilers cannot produce optimized code without assistance from the programmer. To maximize performance, the programmer must tune the compiler using its various compilation options.
Unfortunately, it is quite common to find DSP applications that don't take advantage of the tuning capabilities of the compiler. Instead, they are compiled with the same set of compilation options throughout the whole application. This method ignores the special needs of each function.
Smart selection of compilation options can yield a dramatic improvement in code performance. For example, code size can be greatly reduced. Code size is often a major factor in the cost of a product, because it directly determines the amount of memory required. This article shows how to reduce code size as well as the consumption of other important resources.
Special needs at the function level
To understand how smart selection of compilation options saves code size, one has to be familiar with the tradeoff between cycle count and code size. A good example of this tradeoff is the common compiler optimization that combines loop unrolling with software pipelining (SWP).
Loop unrolling duplicates the loop body, while SWP copies certain instructions from inside the loop to outside it. The combination is highly beneficial on a multi-issue Very Long Instruction Word (VLIW) processor with a deep pipeline. In this case, SWP breaks many of the dependencies inside the loop and, together with loop unrolling, can dramatically increase Instruction Level Parallelism (ILP).
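The two transformations can be sketched by hand at the C level. The following is a minimal illustration, not compiler output: the function names, the 16-bit inputs with a 32-bit accumulator, and the unroll factor of four are all assumptions chosen for readability. The plain form of the loop is simply `acc += a[i] * b[i]` for each `i`.

```c
#include <stddef.h>
#include <stdint.h>

/* Loop unrolling by 4 (hypothetical factor): the body is duplicated and each
   copy gets its own accumulator, so the four multiply-accumulates inside one
   iteration do not depend on each other. */
int32_t dot_unrolled4(const int16_t *a, const int16_t *b, size_t n)
{
    int32_t acc0 = 0, acc1 = 0, acc2 = 0, acc3 = 0;
    size_t i = 0;

    for (; i + 4 <= n; i += 4) {
        acc0 += (int32_t)a[i]     * b[i];
        acc1 += (int32_t)a[i + 1] * b[i + 1];
        acc2 += (int32_t)a[i + 2] * b[i + 2];
        acc3 += (int32_t)a[i + 3] * b[i + 3];
    }
    for (; i < n; i++)              /* leftover iterations when n % 4 != 0 */
        acc0 += (int32_t)a[i] * b[i];

    return acc0 + acc1 + acc2 + acc3;
}

/* Software pipelining: the loads for element i are issued in the same loop
   iteration that multiplies element i - 1. The first loads move in front of
   the loop and the last multiply-accumulate moves behind it. */
int32_t dot_pipelined(const int16_t *a, const int16_t *b, size_t n)
{
    if (n == 0)
        return 0;

    int32_t acc = 0;
    int16_t x = a[0], y = b[0];     /* loads copied out in front of the loop */

    for (size_t i = 1; i < n; i++) {
        int32_t prod = (int32_t)x * y;  /* uses data loaded one iteration ago */
        x = a[i];                       /* load for the next multiply */
        y = b[i];
        acc += prod;
    }
    acc += (int32_t)x * y;          /* final multiply copied out behind the loop */
    return acc;
}
```

Both versions contain more code than the plain loop, which is the size cost the tradeoff refers to. As long as the 32-bit accumulator does not overflow, both return the same result as the plain loop; with saturating or floating-point accumulation, splitting the sum across several accumulators could change the result.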
Here is a simple multiply-and-accumulate (MAC) loop to demonstrate loop unrolling and SWP.

Figure 1. A basic C-level MAC loop.
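A loop of this kind might look like the following sketch. The function name, the parameter names, and the choice of 16-bit samples with a 32-bit accumulator are assumptions for illustration and are not taken from the article's listing.

```c
#include <stddef.h>
#include <stdint.h>

/* Multiply-and-accumulate: sum of x[i] * h[i] over n elements.
   Names and types are hypothetical; the accumulator is assumed not to overflow. */
int32_t mac_loop(const int16_t *x, const int16_t *h, size_t n)
{
    int32_t acc = 0;

    for (size_t i = 0; i < n; i++)
        acc += (int32_t)x[i] * h[i];    /* one MAC per iteration */

    return acc;
}
```

Each iteration performs two loads, one multiply, and one add into the same accumulator, which is why a compiler has to restructure the loop before it can keep several MAC units busy.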
Below is a compact but rather inefficient assembly excerpt, generated when the CEVA-X1641 DSP Core compiler was tuned to minimize code size. It uses SWP but no unrolling. The loop takes about 959*2 = 1918 cycles to execute and occupies fewer than 20 bytes of code.

Figure 2. A compact yet relatively inefficient assembly implementation of a MAC loop generated by the CEVA-X1641 compiler.
In contrast, the CEVA-X1641 compiler yields totally different results when tuned to minimize cycle count. In the example below, the compiler runs at full power, utilizing every bit of hardware the CEVA-X1641 Quad-MAC architecture has to offer. All four MAC units are used simultaneously, and the entire 128-bit memory bandwidth is used in each cycle. The compiler unrolls the loop eight times and uses SWP extensively. The loop takes about 119*2 = 238 cycles to execute and occupies over 100 bytes of code. This is roughly eight times faster than the previous implementation (1918/238 ≈ 8.06), but at the cost of about a 5x increase in code size.

Figure 3. An optimal yet relatively large assembly implementation of a MAC loop generated by the CEVA-X1641 compiler.
The increased ILP definitely improves performance. However, the massive code duplication described above requires extra memory, which may be beyond the reach of a typical embedded application. How, then, can we enjoy the performance boost without paying the painful price of extra memory? We will soon find out. One thing is certain, though: in many embedded architectures, improving cycle count usually increases code size, and vice versa.