
Architecture-oriented C optimization, part 1: DSP features

Here's how C optimizations can take advantage of zero overhead loop mechanisms, hardware saturation, modulo registers, and more.


DSP DesignLine

Part 2 looks at memory-related optimizations. It will be published September 3.

Know your hardware! That's what it's all about. Programming guidelines derived from the processor's architecture can dramatically improve the performance of C applications. In some cases, they can even make the difference between implementing an application in C and implementing it in assembly. Well-written C code and an advanced compiler that exploits the architecture's features can often reach performance similar to that of hand-written assembly code. A quick survey of the drawbacks of assembly coding should make it fairly clear why real-time programmers need architecture-oriented programming guidelines in their toolkit.

Loop mechanisms
Loops are responsible for most of the cycles in an average application. It is therefore essential to know the hardware loop mechanism and how to comply with its needs when writing critical loops. The simplest loop mechanisms consist of compare and branch sequences. The compare instruction evaluates the loop condition and the branch instruction jumps according to the comparison result. This method has two major drawbacks:

  1. Branch instructions break the pipeline. This introduces an overhead, which increases further when a branch is mispredicted or when branch delay slots are not completely filled. Because the overhead sits inside a loop, it is multiplied by the number of iterations.
  2. "While" loops need extra code before the loop. A "do-while" loop always executes at least once, so the compiler can safely evaluate the loop condition at the bottom of the loop. In contrast, a "while" loop executes zero times if the condition is false at the top of the loop. For these loops, the compiler must add a compare and branch sequence, also known as a guarding-if, before the loop. This extra sequence adds significant overhead, particularly in nested loops, where it is multiplied by the total number of iterations of all outer loops.
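The guarding-if described in the second drawback can be pictured in C itself. The sketch below is only an illustration: the function names are hypothetical, and the second function shows roughly the shape a compiler might give the first one, not the output of any particular compiler.

```c
#include <stdint.h>

/* "while" form: the body may run zero times, so the loop needs a guard. */
int32_t sum_while(const int16_t *x, int n)
{
    int32_t acc = 0;
    int i = 0;
    while (i < n) {
        acc += x[i];
        i++;
    }
    return acc;
}

/* Equivalent form with the guard written out explicitly:
 * a guarding-if followed by a loop tested at the bottom. */
int32_t sum_guarded(const int16_t *x, int n)
{
    int32_t acc = 0;
    int i = 0;
    if (i < n) {             /* guarding-if: runs once per loop entry */
        do {
            acc += x[i];
            i++;
        } while (i < n);     /* loop condition evaluated at the bottom */
    }
    return acc;
}
```

Both functions return the same result for any n. If the loop sits inside an outer loop, the guarding-if in the second form is executed once per outer iteration, which is where the extra cost comes from.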

Zero overhead loop mechanisms
In zero overhead loop mechanisms, the number of iterations is calculated before entering the loop body. If the iteration count is too low (normally negative), the loop is skipped with little overhead. This mechanism has one major requirement: the number of iterations has to be known before entering the loop. Therefore, the loop condition has to be simple enough to be pre-calculated. Consequently, the following elements are not recommended as loop delimiters:

  • Function calls
  • Variables that may change inside the loop, including volatile ones
  • Global variables that may be changed by functions called in the loop
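One common way to follow these recommendations is to copy the loop bound into a local variable before the loop. The sketch below assumes a hypothetical global g_len and a hypothetical helper scale_sample(); neither comes from a real library. The rewrite is only valid when the bound really does not change while the loop runs.

```c
#include <stdint.h>

/* Hypothetical global bound; other functions could modify it. */
int g_len = 4;

/* Hypothetical helper called from inside the loop. */
int16_t scale_sample(int16_t s)
{
    return (int16_t)(s / 2);
}

/* Harder to pre-calculate: the bound is a global, and the compiler
 * may have to assume the called function can change it. */
void scale_all_global(int16_t *x)
{
    for (int i = 0; i < g_len; i++)
        x[i] = scale_sample(x[i]);
}

/* Easier to pre-calculate: the bound is a local that nothing
 * inside the loop can modify. */
void scale_all_local(int16_t *x)
{
    const int n = g_len;     /* read the bound once, before the loop */
    for (int i = 0; i < n; i++)
        x[i] = scale_sample(x[i]);
}
```

With the local copy, the iteration count is fixed at loop entry, which is what a zero overhead loop mechanism needs.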

Branch and decrement loop mechanisms
In branch and decrement loop mechanisms, the number of iterations is stored in a loop counter. The loop counter is automatically decremented when jumping back to the beginning of the loop. When it reaches zero, the loop exits. This mechanism also has the potential for zero overhead, but unlike the zero overhead loop mechanism, it cannot automatically skip a loop that should not iterate at all. Therefore, in some cases, dedicated assembly code is required in order to skip the loop. This extra code involves an overhead, and the following guidelines can be used to eliminate it:

  • Do-while loops iterate at least once and make skip code unnecessary. Use them whenever possible.
  • Use dedicated pragmas to specify that a loop iterates at least once. These are normally referred to as "trip count" or "must iterate" pragmas and they too make skip code unnecessary.
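The first guideline can be sketched as a count-down do-while loop. The function name is hypothetical, and the assert only documents the assumption that the caller always passes at least one element. Trip count pragmas are compiler-specific, so none is shown here; check your compiler's manual for the exact name and syntax.

```c
#include <assert.h>
#include <stdint.h>

/* Halves every sample in a block. The caller guarantees n >= 1,
 * so the loop body can run unconditionally the first time. */
void halve_block(int16_t *x, int n)
{
    assert(n >= 1);          /* precondition: at least one iteration */
    do {
        *x = (int16_t)(*x / 2);
        x++;
    } while (--n != 0);      /* counter is decremented toward zero */
}
```

Because the body always executes once and the counter simply runs down to zero, the loop has the shape a branch and decrement mechanism expects, with no skip code in front of it.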

DSP fundamentals
DSP fundamentals are architectural features that are found in most DSPs and are extremely important for the performance of many DSP applications.

Multipliers
Multiplication is one of the most common operations in DSP code. It is often combined with accumulation to form a multiply-accumulate (MAC) operation, which is used heavily in filter implementations, for example. A very important aspect of a multiplier is its natural input width. Multiplying C operands of the same width as the multiplier's natural input width is essential for efficient single-cycle multiplication. For example, multipliers with 16-bit inputs should be fed 16-bit operands in C code multiplications. Narrower operands are also fine, as they are automatically extended.

Problems start when C multiplication operands are too wide for the multiplier. This often forces the compiler to use expensive sequences of several multiplications to compensate for the narrow multiplier inputs. In some cases the algorithm calls for wide multiplications and there is no alternative. However, C variables that are wider than necessary are often used needlessly, mostly in code ported from architectures with wider multipliers or from a PC.
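As an illustration, assume a target whose multiplier takes 16-bit inputs. The two hypothetical dot-product functions below compute the same MAC loop, but only the first uses operands that match that assumed multiplier width. How each one is actually compiled depends on the target and the compiler.

```c
#include <stdint.h>

/* 16-bit operands: each product fits a 16x16 multiplier, and the
 * 32-bit accumulator holds the running sum. */
int32_t dot16(const int16_t *a, const int16_t *b, int n)
{
    int32_t acc = 0;
    for (int i = 0; i < n; i++)
        acc += (int32_t)a[i] * b[i];   /* 16-bit x 16-bit -> 32-bit */
    return acc;
}

/* 32-bit operands holding the same small values: on a 16-bit
 * multiplier the compiler may need several multiplications per
 * product. Assumes the values are small enough that each product
 * and the sum fit in 32 bits. */
int32_t dot32(const int32_t *a, const int32_t *b, int n)
{
    int32_t acc = 0;
    for (int i = 0; i < n; i++)
        acc += a[i] * b[i];            /* 32-bit x 32-bit */
    return acc;
}
```

If the data is known to fit in 16 bits, declaring the arrays as int16_t, as in dot16, keeps each multiplication within the multiplier's natural input width.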

Some multipliers yield outputs that have no equivalent C type. The two multipliers of the CEVA-TeakLite-III DSP Core, for example, yield a 72-bit output when multiplying two 36-bit inputs, as shown in Figure 1. The output is stored in two 36-bit accumulators, and there is no native ANSI C type that can hold them. To solve this problem, the CEVA-TeakLite-III Compiler provides assembly macros that enable direct access to 72-bit operations from C code. The two 36-bit outputs are stored in two variables of type 'acc_t' (CEVA's C language extension for an accumulator type).


Figure 1. CEVA-TeakLite-III computation and bit manipulation unit (CBU).

