
Programming and optimizing C code, part 5

Part five of this five-part series shows how to optimize memory performance, and how to make speed vs. size tradeoffs.


DSP DesignLine

[Editor's note: Part 4 explains why it is important to optimize "control code," and shows how to do so.]

Memory Performance
Compilers tend to treat memory as a uniform and infinite resource that can be accessed at no cost. In reality, memory access costs may dominate the performance of your application. Thus, it is worth learning a little about memory effects.

Most DSPs access both internal (on-chip) and external (off-chip) memories. There is an enormous performance gap between the fast on-chip memories (commonly SRAM) and the slower external memories (commonly SDRAM). Compilers seldom engage in intelligent data placement. Thus, it is up to the programmer to place the most critical program code and data into on-chip memory. This is usually done via a linker control file. You can also tell the linker to place specific arrays into internal memory: a single statement can ask the linker to place a given array in internal memory if there is room.
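
One way such a placement request can look in C is sketched here. This is not the statement from the original article. It assumes a GCC-style compiler that accepts a section attribute, and the section name ".fast_data", the macro FAST_DATA, and the array fir_coeffs are all hypothetical. A real project would use a section name that its own linker control file maps to on-chip memory.

```c
/* Hypothetical placement macro. The section name ".fast_data" is an
 * assumption; it only has an effect if the linker control file maps
 * that section to internal memory. */
#if defined(__GNUC__) && defined(__ELF__)
#define FAST_DATA __attribute__((section(".fast_data")))
#else
#define FAST_DATA /* no placement request on other toolchains */
#endif

#define NUM_TAPS 8

/* Ask for the filter coefficients to be placed in the fast section. */
FAST_DATA short fir_coeffs[NUM_TAPS] = { 1, 2, 3, 4, 4, 3, 2, 1 };

/* Sum the coefficients. The code that uses the array does not change,
 * whichever memory the linker finally assigns it to. */
int coeff_sum(void)
{
    int sum = 0;
    int i;

    for (i = 0; i < NUM_TAPS; i++)
        sum += fir_coeffs[i];
    return sum;
}
```

The point of the sketch is that placement is a property of the declaration and the linker configuration, not of the code that reads the data.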


The speed of internal memory accesses depends on the data access patterns. Under ideal conditions, most DSPs can make two data accesses per cycle. However, DSP memories are usually split into multiple "banks," and a bank may stall if it receives two simultaneous access requests. Thus, on-chip memory may stall if two simultaneous accesses target addresses that are close to each other, because nearby addresses tend to fall in the same bank.
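
The bank rule can be made concrete with a small model. The sketch below assumes that banks are contiguous 4-kbyte regions; the bank size, the mapping, and the function names are illustrative assumptions, so check the hardware reference for your processor before relying on them.

```c
#include <stdint.h>

/* Assumed bank size in bytes; real parts differ. */
#define BANK_BYTES 4096u

/* Index of the (assumed contiguous) bank holding a byte address. */
unsigned bank_of(uintptr_t addr)
{
    return (unsigned)(addr / BANK_BYTES);
}

/* Nonzero if two addresses fall in the same bank, in which case a
 * dual access to them in one cycle could stall under this model. */
int same_bank(uintptr_t a, uintptr_t b)
{
    return bank_of(a) == bank_of(b);
}
```

Under this model, two arrays that sit next to each other in a single bank cannot be read in the same cycle without a stall, while two arrays in different banks can.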

Unfortunately, the compiler does not know whether two data entities are close by in space, because memory layout is a linker function. How can the compiler decide whether it should generate dual-access code? Part of the answer is to use pragmas. For example, Blackfin has a pragma "different_banks," which tells the compiler that the data accessed will reside in different banks, so it can schedule aggressively for dual access.
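
As a rough illustration, the pragma might be applied to a dot-product loop as sketched below. The function name is hypothetical, and the guard macro is an assumption about what the Blackfin toolchain predefines; it is there only so that other compilers skip the pragma. The pragma is a promise, not a request: the programmer must still arrange, through the linker, for the two arrays to land in different banks.

```c
/* Dot product of two 16-bit arrays. On Blackfin the pragma tells the
 * compiler that a[] and b[] are in different banks, so it may issue
 * both loads in the same cycle. */
int dot_product(const short *a, const short *b, int n)
{
    int sum = 0;
    int i;

#if defined(__ADSPBLACKFIN__) /* assumed predefined macro */
#pragma different_banks
#endif
    for (i = 0; i < n; i++)
        sum += a[i] * b[i];
    return sum;
}
```

If the promise is false, the result is still correct, but the aggressively scheduled loop may stall on every iteration.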

The speed of external memory accesses depends upon properties of the memory as well as the speed and width of the bus linking memory to processor. On many DSPs, the programmer can select the speed of the memory bus. To save power, you can start by selecting a low bus speed. If you discover that your application performance is memory-bound, you can ramp up the bus speed. This is trickier than it sounds, because the processor and bus speeds are normally provided as multiples of an input clock signal. (More precisely, the processor speed is set as a multiple of the input clock, and the bus speed is set as a fraction of the processor speed.) Thus, only certain combinations of processor and bus speeds will be useful. For example, suppose you are using the ADSP-BF533 Blackfin and you want to run the memory bus at its maximum rate of 133 MHz. As illustrated in Figure 1, only specific combinations of processor and bus speeds meet this goal.
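
The arithmetic behind this constraint can be sketched as follows. The code assumes the bus clock is the processor clock divided by a whole number, which is one reading of "a fraction of the processor speed"; the function names are hypothetical, and real parts also limit the range of allowed dividers.

```c
/* Smallest whole-number divider that keeps the bus at or below its
 * maximum rate. Both speeds are in MHz. */
int min_bus_divider(int core_mhz, int max_bus_mhz)
{
    /* Integer ceiling of core_mhz / max_bus_mhz. */
    return (core_mhz + max_bus_mhz - 1) / max_bus_mhz;
}

/* Bus speed in MHz (rounded down) for a given core speed and divider. */
int bus_mhz(int core_mhz, int divider)
{
    return core_mhz / divider;
}
```

With a 750 MHz processor clock and a 133 MHz limit, the smallest usable divider under this assumption is 6, which gives a 125 MHz bus; a divider of 5 would give 150 MHz, which exceeds the limit. A 399 MHz processor clock with a divider of 3 hits 133 MHz exactly.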


Figure 1. The yellow bars show combinations that come closest to meeting the target of a 133 MHz memory bus. The values shown are for a 750 MHz ADSP-BF533.

Figure 2 shows why external memory performance matters. The first row (L1) shows that internal memory has single-cycle access. The second row (L3 cached) shows the results for cached external memory. The third row (L3) shows what happens in an uncached system: sequential 16-bit transfers take 40 cycles per item. Note that a memory access cost of this size could swamp any benefit of an optimizing compiler.

External memory has a structure of rows (where one row occupies perhaps 4 kbytes) and a larger structure of banks. When memory accesses move from one row to the next, an extra delay is incurred. This delay is not a significant issue for sequential accesses, because a sequential pattern generates many accesses within each row before moving on to the next. The alternate rows column shows a worse case arising from a more random access pattern. You can recover most of the lost performance by scattering data amongst the banks, as illustrated in the last column.
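
To see why the access pattern matters, it can help to count how often a sequence of addresses moves to a different row. The sketch below assumes 4-kbyte rows, following the "perhaps 4 kbytes" estimate above, and ignores banks entirely; the function name is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed row size in bytes. */
#define ROW_BYTES 4096u

/* Count how many consecutive accesses land in a different row from
 * the access before them. Each such change costs an extra delay. */
size_t count_row_changes(const uint32_t *addrs, size_t n)
{
    size_t changes = 0;
    size_t i;

    for (i = 1; i < n; i++) {
        if (addrs[i] / ROW_BYTES != addrs[i - 1] / ROW_BYTES)
            changes++;
    }
    return changes;
}
```

A sequential walk over 16-bit items changes row only once every 2048 accesses under this assumption, whereas a pattern that alternates between two rows changes row on every access.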


Figure 2. Memory access times for various scenarios.

A natural reaction to Figure 2 might be to rely on caching. Clearly, caching generates a significant advantage over the raw memory costs in row three. However, note that sequential access from cached external memory still costs 7.7 times more than accessing internal memory. Caches work best if you re-use data, so you should think about your data access patterns. Try to craft loops that re-use data as much as possible, rather than constantly fetching new data from external memory.
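
One common way to increase re-use is to fuse two passes over a buffer into one, so that each item is processed fully while it is still in the cache. The sketch below is an illustrative example with hypothetical function names, not code from this series. Both functions scale a buffer in place and return the energy of the scaled data.

```c
#include <stddef.h>

/* Two passes: if the buffer is larger than the cache, the second loop
 * has to fetch every item from external memory again. */
long energy_two_pass(short *buf, size_t n, int gain)
{
    long energy = 0;
    size_t i;

    for (i = 0; i < n; i++)
        buf[i] = (short)(buf[i] * gain);
    for (i = 0; i < n; i++)
        energy += (long)buf[i] * buf[i];
    return energy;
}

/* One pass: each item is scaled and accumulated while it is at hand. */
long energy_fused(short *buf, size_t n, int gain)
{
    long energy = 0;
    size_t i;

    for (i = 0; i < n; i++) {
        short s = (short)(buf[i] * gain);

        buf[i] = s;
        energy += (long)s * s;
    }
    return energy;
}
```

For a buffer that fits comfortably in the cache the two versions should behave similarly; the fused version is more likely to pay off when the data is too large to stay cached between passes.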

The compiler can give you very little help with all this. As mentioned in part 1, you will often spot a memory problem by using a statistical profiler. Large numbers of unexplained stalls on a load or store instruction are a useful hint that you need to re-think your use of memory.


