Newsletter

DSP DesignLine  >  Design Center

Optimizing for instruction caches, part 2

Part 2 of this 3-part series looks at the tradeoffs between program and data cache optimizations, and shows how to choose the best compromise

Page 1 of 3

DSP DesignLine

Part 1 shows how to increase performance through code partitioning, function inlining, and other techniques. Part 3 shows how to optimize code by modifying the placement of functions in memory.

The trend in high-performance DSPs has been towards cache-based memory systems, where the speed of the DSP core greatly exceeds the speed of external memory. Software for such DSPs must be optimized for high cache utilization to keep the core fed with data. Cache optimization also reduces power consumption by allowing the core to run at a lower frequency.

In optimizing the application, we attempt to increase memory reuse by increasing memory locality. In this article we present the basic tradeoffs that exist between optimizing an application for two types of locality: data locality and program locality. (For definitions of locality, see part 1.)

Most algorithms are defined in a sequential manner, where input quanta are processed through a chain of blocks, as shown in Image 1.


Image 1. Sequential algorithm implementation.

Out of convenience, most algorithms are also implemented this way, as shown in Example 1 below.


Example 1. Sequential algorithm pseudo code.

Notice that the variable 'temp' stores the intermediate data passed between different stages of the algorithm.

For every loop iteration, the program goes through all stages of the algorithm sequentially. This results in poor temporal locality in the instruction stream—that is, instructions are not reused in the cache—and therefore poor cache performance.

A better option in many cases is to parallelize the application. Processing more than one input at the same time increases program locality. Image 2 below shows a parallelized implementation of Example 1.


Image 2. Parallel algorithm implementation.

Example 2 shows pseudo code for the parallelized example.


Example 2. Parallel algorithm pseudo code.

This parallelization will have the following effects:

Positive effects – Program locality is greatly increased. We now iterate N times on stage I of the algorithm before moving to stage_II. In doing so, we must partition the algorithm in a way that takes the size and construction of the instruction cache into account. Each loop must have a memory footprint small enough to provide good instruction cache utilization.

Negative effects – The data footprint has grown by xN times. Instead of having one 'temp' variable for passing data between stages, we now have an array of N 'temp' variables. This is an important consideration for us to take into account when parallelizing our algorithm. Different partitioning points will generate different intermediate data sizes and effectively set the algorithm's data footprint.

Side effect – Increased code complexity. In our example this seems trivial, but in real implementations this can mean significant additional overhead.

There is a limit to the benefits of parallelization. At a certain N value, the increased data footprint will cause data cache impact. Eventually that impact will offset the benefits we get from the increasing program locality in the instruction stream.

Example 3 gives psuedo code that illustrates the instruction/data tradeoff.


Example 3. Parallel/sequential algorithm pseudo code.

The goal is to find the value of M that optimizes utilization of the cache memory subsystem. This can be done easily using empirical methods. We simply profile the application with typical inputs and different M values to pinpoint the optimal M value.

Of course, there are always complicating considerations. One must also ask the following questions:

  • Is the algorithm latency is affected? In the original implementation of our example, the latency is equal to T_ stage_I+ T_ stage_II+ T_ stage_III. The latency in the optimized version is: M(T_ stage_I+ T_ stage_II)+ T_ stage_III. In some cases the cache-optimal value of M may result in an unacceptable latency.
  • Have data dependencies been broken? Many algorithms relay on feedback from previous iterations as inputs to the current calculation. In the optimized implementation, the added latencies may make this feedback unavailable.


Page 2: MPEG4 example  

Page 1 | 2 | 3



Rate this article
WORSE | BETTER
1 2 3 4 5




Related Content

TECH PAPER
1. Power-supply design for high-speed ADCs

TECH PAPER
2. Li-ion battery-charger solutions for JEITA compliance

TECH PAPER
3. Ultra-Low Power MSP430 MCU Value Line

COURSE
4. New C5514/15 DSPs extend the industry's lowest power 16-bit DSP platform

 


 Featured Jobs
Accenture seeking Project Management Team Lead in Charlotte, NC

Accenture seeking Software Engineer in Salt Lake City, UT

Boeing Company seeking Software Engineer in Herndon, VA

Switch and Data seeking Customer Solutions Engineer in Dallas, TX

Chart Industries seeking Sr. Developer in Cleveland, OH

More jobs on EETimesCareers
 Sponsor
 CAREER CENTER
Ready to take that job and shove it?
SEARCH JOBS:

 SPONSOR

 RECENT JOB POSTINGS
For more great jobs, career related news, features and services, please visit EETimes' Career Center.