Tutorial: Programming High-Performance DSPs, Part 2

This second of a three-part series explains how to optimize code for high-performance DSPs, with a focus on loop unrolling and software pipelining. It shows how to minimize loop overhead, and how to keep a DSP's execution units busy.

DSP DesignLine

[Part 1 of this series introduces VLIW pipelines, multi-level memory architectures, and Direct Memory Access (DMA).

Part 3 shows how you can help the compiler produce faster code. It also explains how to optimize for minimum power consumption.]

PARALLELISM IS THE KEY
The standard rule when programming superscalar and VLIW devices is "Keep the pipelines full!" A full pipe means efficient code. To determine how full the pipelines are, you need to spend some time inspecting the assembly language code generated by the compiler. You can usually spot inefficient code by an abundance of NOPs, each of which marks a cycle in which the pipelines do no useful work.

To demonstrate the advantages of parallelism in VLIW-based superscalar machines, let's start with the simple looping program shown in Figure 7. If we were to write a serial assembly language implementation of this, the code would be similar to that in Figure 8. This loop uses only one of the two available sides of the superscalar machine. Counting up the instructions and the NOPs shows that it takes 26 cycles to execute each iteration of the loop. We should be able to do much better.

There are two things to notice in this example so far. First, many of the execution units are not being used and are sitting idle. This is a waste of processor hardware. Second, there are many delay slots in this piece of assembly (20, to be exact) during which the CPU is stalled waiting for data to be loaded or stored. When the CPU is stalled, nothing is happening. This is the worst thing you can do to a processor when trying to crunch large amounts of data.

There are ways to keep the CPU busy while it is waiting for data to arrive. We can perform other operations that do not depend on the data we are waiting for. We can also use both sides of the superscalar architecture to help us load and store other data values. The code in Figure 9 is an improvement over the serial version. We have reduced the number of NOPs from 20 to 5. We are also performing some steps in parallel. Lines 4 and 5 execute two loads at the same time, one on each of the two load units (D1 and D2) of the device. This code also performs the branch operation earlier in the loop and then takes advantage of the delays associated with that operation to complete the remaining operations of the current iteration. Notice that there is a new column in this code, one that specifies which execution unit you want to use for a particular operation. This flexibility to specify execution units allows you to manage your operations better.


Figure 7. Simple for loop in C


Figure 8. Serial Assembly Language Implementation of C loop


Figure 9. A more parallel implementation of the C loop
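
The figures are not reproduced here, so the following is only a rough C-level sketch of the scheduling idea described above, not the code from Figures 7 through 9. It assumes a hypothetical sum-of-products loop. The second function issues the two independent loads for the next element before using the pair that has already arrived, so the multiply-accumulate has no dependency on the loads in flight. The function and variable names are made up for illustration, and an optimizing compiler is free to reorder either version.

```c
#include <stddef.h>

/* Hypothetical serial form: each iteration loads two values and
 * cannot start the multiply until both loads have completed. */
int sum_products_serial(const short *x, const short *y, size_t n)
{
    int sum = 0;
    for (size_t i = 0; i < n; i++) {
        short xv = x[i];          /* load, then wait for the data */
        short yv = y[i];          /* load, then wait for the data */
        sum += xv * yv;           /* depends on both loads above  */
    }
    return sum;
}

/* Hypothetical overlapped form: the loads for element i do not
 * depend on the multiply for element i-1, so a scheduler can issue
 * both loads together and do the multiply while they are in flight. */
int sum_products_overlapped(const short *x, const short *y, size_t n)
{
    int sum = 0;
    if (n == 0)
        return 0;

    short xv = x[0];              /* first pair of loads, before the loop */
    short yv = y[0];

    for (size_t i = 1; i < n; i++) {
        short xn = x[i];          /* two independent loads for the next  */
        short yn = y[i];          /* element; candidates for two load units */
        sum += xv * yv;           /* uses data that has already arrived  */
        xv = xn;                  /* hand the new pair to the next pass  */
        yv = yn;
    }

    sum += xv * yv;               /* last pair, loaded in the final pass */
    return sum;
}
```

Both functions return the same result for any input. Whether the second form actually removes NOPs depends on the target and the compiler, so the generated assembly is still the thing to check.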

Loop Unrolling
Loop unrolling is a technique used to increase the number of instructions executed between executions of the loop branch logic. This reduces the number of times the loop branch logic is executed. Since the loop branch logic is overhead, executing it fewer times reduces that overhead and makes the loop body, the important part of the structure, run faster (Figure 11). A loop can be unrolled by replicating the loop body a number of times and then changing the termination logic to account for the multiple copies of the loop body.

One drawback to loop unrolling is that it uses more on-chip registers, because different registers need to be used for each replicated iteration. Once the available registers are used up, the processor starts going to the stack to store required data. Going to the off-chip stack is expensive and may wipe out the gains achieved by unrolling the loop in the first place. Loop unrolling should only be used when the operations in a single iteration of the loop do not use all of the available resources of the processor architecture. Check the assembly language output if you are not sure of this. Another drawback is the increase in code size. An unrolled loop requires more instructions and, therefore, more memory.
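
As a minimal sketch of the transformation, assuming a hypothetical dot-product loop (not the loop from the article's figures), the code below shows a rolled loop and the same loop unrolled by a factor of four. The names `dot_rolled` and `dot_unrolled4` and the unroll factor are illustrative choices. Note how the unrolled version uses four separate accumulators, which is exactly the extra register demand described above, and how the termination logic changes to step by four with a short cleanup loop for leftover elements.

```c
#include <stddef.h>

/* Rolled loop: the branch logic runs once per element. */
int dot_rolled(const short *a, const short *b, size_t n)
{
    int sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += a[i] * b[i];
    return sum;
}

/* Unrolled by 4: the branch logic runs once per four elements.
 * Each copy of the body gets its own accumulator so the four
 * multiply-accumulates are independent of one another. */
int dot_unrolled4(const short *a, const short *b, size_t n)
{
    int s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    size_t i = 0;

    /* Main loop: termination test accounts for four elements per pass. */
    for (; i + 4 <= n; i += 4) {
        s0 += a[i]     * b[i];
        s1 += a[i + 1] * b[i + 1];
        s2 += a[i + 2] * b[i + 2];
        s3 += a[i + 3] * b[i + 3];
    }

    /* Cleanup loop: handles the 0 to 3 elements left over when n
     * is not a multiple of 4. */
    for (; i < n; i++)
        s0 += a[i] * b[i];

    return s0 + s1 + s2 + s3;
}
```

If the trip count is known to be a multiple of the unroll factor, the cleanup loop can be dropped, which saves code size. A larger unroll factor needs more accumulators and address registers, so it is worth checking the assembly output for spills to the stack before settling on one.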
