|
[Editor's note: Part 2 shows how to optimize DSP kernels (i.e., inner loops), and how to write fast floating-point and fractional code. Part 4 explains why it is important to optimize "control code," and shows how to do so.]
To optimize your code, the compiler will analyze your program and make deductions. Compilers will often attempt optimization of wide scope, so this analysis can span multiple functions, multiple files, or even the entire width of the application. This approach gives the compiler maximum flexibility, but it can slow compile times. More importantly, such schemes can be badly affected by aliasing and flow-of-control uncertainty.
Thus, you must consider what the compiler can deduce about your program. You can often improve compiler performance by simplifying the code and by adding information. Information can be added to a program at a high level by manipulating compilation options. It can also be added at a low level by recoding and adding informative statements. We will look at this latter approach in this article.
Pointers or Arrays
A common question is whether data access in tight loops should be coded using pointers or
arrays. It doesn't make a lot of difference to a modern optimizing compiler. Arrays have the useful property of an implied length, which helps the compiler understand if two data reference are definitely not going to overlap, which is often vital for aggressive optimization. On the other hand, pointers are closer to the hardware.

Figure 1. Alternatives for addressing in tight loops.
Loop Structure
It often helps to experiment with the loop structure of the program. Compilers focus their optimization efforts on the inner loop, so pulling operations from the outer loop into the inner loop can improve performance. Sometimes you can even collapse the inner and the outer loop into a single loop. The downside to this approach is that the single remaining loop may be too complicated for the compiler to optimize.
You can also reverse the order of the loop nest, that is, you can make the inner loop become the outer loop. One reason to make this transformation is to achieve sequential data access. As we mentioned in the last article, sequential data access is a requirement for vectorization. In order to achieve sequential access patterns, you may need to re-order the data after reversing the order of the loop nest.
To illustrate this point, consider the initial access pattern to array "Input" has us striding through the data with a large gap. It is impossible to load multiple data elements simultaneously, but if we notice this an associative operation we can reorder. In the second form, we are stepping through sequentially controlled by "k" and getting about six times the performance on a Blackfin processor.
Here's an example of non-sequential access.

Figure 2. Two equivalent loops with non-sequential (top) and sequential (bottom) data access. The original form moves through memory jumping "NC" words at a time, which prevents the compiler from vectorizing.
When looking at loop structure, keep in mind that memory access costs can be significant. If a number of loops access the same data in external memory, it may make sense to unify those loops into one large loop. This will allow you to fetch the data only once.
If you still don't get the desired vectorization, it is time to explore vectorizing built-ins. Here is a vectorizing built-in that tells the compiler to perform a dual operation:
Compilers often fail to vectorize operations because they cannot determine if it is safe to do so. With built-ins, the compiler does not have to work out that vectorization is safe: You have told it explicitly how to proceed.
|