|
Part 2 shows how to tune internal and external memory, as well as how to manage external bus access, DMA transfers, and interrupts.
Let us now elaborate on some memory and DMA discussions from earlier in the series, and then conclude with an example that illustrates a practical application of data buffer dynamics.
Memory considerations
Let's start with instruction placement. That is, where in memory should you place the code you write? We have already discussed the idea of placing as much code as possible in the fastest on-chip memory. When the entire application fits in internal memory, your work is easy. Unfortunately, this is not the norm. Instead, you will most likely have to partition code between internal memory and external, cacheable memory.
How should you start this process? A good first step is to measure the size (in bytes) of the critical code set. By "critical," we mean the subset of code that runs most often. A simulation- or execution-based code profiling tool provides a good starting point by illustrating which instructions are executed most frequently. It is this set of code that should be placed into the fastest internal SRAM (known as Level 1, or L1, memory).
One example of a profiling tool that Analog Devices provides is the PGO Linker. This tool works with the VisualDSP++ development suite to provide link-time profile-guided optimization (PGO). It couples an application's execution profile with program optimization techniques to yield the most efficient code layout (Figure 1). This tool is described in more detail in the PGO Linker application note.

Figure 1. The PGO Linker design flow
The more your code is broken into small functions, the more flexibility the tool has to place the most frequently executed routines in the fastest memory. In the extreme case, if your code consisted of a single function larger than internal memory, the tool would have to place the entire module in external memory.
What if the critical code footprint is too large to completely fit within L1 memory? Some processors feature an on-chip L2 memory that, while operating slower than L1, still performs considerably faster than external memory. If on-chip L2 is available, it should be the next destination for critical code. Otherwise, or if there's still insufficient space in on-chip memory, you'll need to use external memory and define it as cacheable.
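The tiered placement described above can be sketched in a few lines of C. This is only an illustration of the idea, not how the PGO Linker works internally: the `func_profile_t` record, the `place_functions()` helper, and the free-space figures are all hypothetical, and a real tool would also weigh call relationships and alignment.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical memory tiers, fastest first. */
typedef enum { MEM_L1, MEM_L2, MEM_EXTERNAL } mem_region_t;

/* Hypothetical profile record: one entry per function. */
typedef struct {
    const char   *name;        /* function name from the profile  */
    size_t        size_bytes;  /* code size of the function       */
    unsigned long exec_count;  /* how often the profiler saw it   */
    mem_region_t  region;      /* filled in by place_functions()  */
} func_profile_t;

/* qsort comparator: most frequently executed functions first. */
static int by_exec_count_desc(const void *a, const void *b)
{
    const func_profile_t *fa = a;
    const func_profile_t *fb = b;
    if (fa->exec_count < fb->exec_count) return 1;
    if (fa->exec_count > fb->exec_count) return -1;
    return 0;
}

/* Greedy placement: try L1 first, then on-chip L2, then external
 * (cacheable) memory. Reorders the array by execution count. */
void place_functions(func_profile_t *funcs, size_t n,
                     size_t l1_free, size_t l2_free)
{
    assert(funcs != NULL);
    qsort(funcs, n, sizeof funcs[0], by_exec_count_desc);

    for (size_t i = 0; i < n; i++) {
        if (funcs[i].size_bytes <= l1_free) {
            funcs[i].region = MEM_L1;
            l1_free -= funcs[i].size_bytes;
        } else if (funcs[i].size_bytes <= l2_free) {
            funcs[i].region = MEM_L2;
            l2_free -= funcs[i].size_bytes;
        } else {
            funcs[i].region = MEM_EXTERNAL;
        }
    }
}
```

Note how a single oversized function falls straight through to external memory regardless of how often it runs, which is why splitting code into smaller functions gives the placement step more room to work.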
As another important memory allocation technique, if you are using setup code, such as libraries that perform one-time setup routines, you may want to use a boot process to bring this code into on-chip memory, execute it, and then load your application code over the initialization code. Doing this will maximize the amount of code running from L1 memory in the steady-state application.
Also, with some operating systems, such as uClinux, the entire operating system and its applications reside in cacheable external memory. For more information on the extensive Blackfin collateral in this area, visit http://www.blackfin.uclinux.org.
If possible, try to allocate your code to an external memory bank that isn't being used for other purposes. For example, if you're connecting to a DDR SDRAM with four internal banks, select one bank in which to store only your code. You can be more flexible on this suggestion if you feel confident that most of your key code is in internal SRAM and/or cache. To help determine if the cache hit rate is adequate, explore the options provided by your processor and its tools suite. For instance, VisualDSP++ provides a Cache Viewer utility in its Simulator (Figure 2), while Blackfin processors contain Performance Monitor registers that can also be used to calculate the hit rate.
For more on this topic, see Optimizing for instruction caches.

Figure 2. The VisualDSP++ Cache Viewer.
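However the hit and miss counts are gathered, the hit rate itself is simple arithmetic. The helper below is a minimal sketch that assumes you have already sampled two event counts; the function name and its parameters are hypothetical, and it does not touch any actual Performance Monitor registers.

```c
#include <assert.h>
#include <stdint.h>

/* Returns the cache hit rate as a percentage.
 * 'hits' and 'misses' are event counts sampled by the caller
 * (hypothetical inputs; how they are read is device specific). */
double cache_hit_rate_percent(uint64_t hits, uint64_t misses)
{
    assert(hits <= UINT64_MAX - misses);  /* guard against overflow */

    uint64_t total = hits + misses;
    if (total == 0) {
        return 0.0;                       /* nothing measured yet */
    }
    return 100.0 * (double)hits / (double)total;
}
```

Sampling the counts over a representative stretch of steady-state execution, rather than over startup code, gives a figure that better reflects how the critical code actually behaves.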
Data movement considerations
Now let's turn our attention to data movement. In general, for transfers that must occur at precise intervals in order to maintain a streaming system, register-based ("autobuffer") DMA is the best option. Descriptor-based DMA can also be configured to act like autobuffer DMA, and this is useful when audio and video streams need to be synchronized on the output of a system. The descriptors define what is transferred, as well as the source and/or destination of the data. If possible, locate the descriptors in internal memory so that the DMA controller can fetch them as efficiently as possible.
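One way to picture descriptor-based DMA behaving like autobuffer DMA is a ring of descriptors whose last entry points back to the first, so the transfer sequence repeats indefinitely. The sketch below assumes a simplified, hypothetical descriptor layout (`dma_descriptor_t`); real descriptor formats are defined by the processor's hardware reference and carry additional configuration fields.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified descriptor: not an actual hardware layout. */
typedef struct dma_descriptor {
    struct dma_descriptor *next;   /* descriptor to fetch after this one */
    void                  *start;  /* start address of this data block   */
    uint32_t               count;  /* number of bytes to transfer        */
} dma_descriptor_t;

/* Splits 'buffer' into n_desc equal blocks and links the descriptors
 * into a ring, so the sequence wraps around like an autobuffer. */
void build_descriptor_ring(dma_descriptor_t *desc, uint8_t *buffer,
                           size_t n_desc, uint32_t block_bytes)
{
    assert(desc != NULL && buffer != NULL && n_desc > 0);

    for (size_t i = 0; i < n_desc; i++) {
        desc[i].start = buffer + i * block_bytes;
        desc[i].count = block_bytes;
        desc[i].next  = &desc[(i + 1) % n_desc];  /* last wraps to first */
    }
}
```

In a real system the `desc` array itself would be placed in internal memory, for the fetch-efficiency reason given above.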
Consider a video-based application involving pixel decode and display. Assume a parallel video port provides clocking and data to drive either an LCD panel or a video encoder. Each frame of data must be in place in memory before the video port sends it out. A DMA channel fetches data from this memory buffer in time to keep the display fed. Viewed from outside the processor, if a DMA channel is held off and the video port's FIFO empties, glitches will corrupt the display, primarily because old data is sent repeatedly when no new data is available. From the processor's point of view, the video port underflows, a condition that can be detected by enabling an error/status interrupt.
Note that, if this were a video capture discussion instead of video display, DMA holdoff would result in a video port overflow error, indicating that incoming camera data is being overwritten before it can be processed or stored.
For a video display system, you will need at least two buffers (and usually more) on which to operate. There are a few reasons for this. First, you want to be able to write to one buffer while the video port accesses the other buffer for display. We say "buffers" because you will have multiple frames of processed data to send out. If possible, you should perform all necessary operations on data while it is in L1 memory. Moreover, it is best to use a memory-to-memory DMA channel to move the processed data to external memory.
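The two-buffer minimum can be sketched as a "ping-pong" arrangement: the video port reads one frame while the core renders into the other, and the roles swap at a frame boundary. The names below (`pingpong_t`, `pingpong_swap()`, and so on) are hypothetical, and the sketch assumes the swap is triggered from something like a DMA frame-complete interrupt, which is not shown.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical bookkeeping for two frame buffers. */
typedef struct {
    uint8_t *frame[2];      /* the two frame buffers              */
    unsigned display_idx;   /* buffer the video port is reading   */
} pingpong_t;

void pingpong_init(pingpong_t *pp, uint8_t *buf0, uint8_t *buf1)
{
    assert(pp != NULL && buf0 != NULL && buf1 != NULL && buf0 != buf1);
    pp->frame[0] = buf0;
    pp->frame[1] = buf1;
    pp->display_idx = 0;
}

/* Buffer currently being sent out by the video port. */
uint8_t *pingpong_display_source(const pingpong_t *pp)
{
    return pp->frame[pp->display_idx];
}

/* Buffer the core may safely write the next frame into. */
uint8_t *pingpong_render_target(const pingpong_t *pp)
{
    return pp->frame[pp->display_idx ^ 1u];
}

/* Swap roles; assumed to be called once per completed frame. */
void pingpong_swap(pingpong_t *pp)
{
    pp->display_idx ^= 1u;
}
```

With more than two buffers the same idea extends to a ring of frames, which gives the processing stage extra slack when frame times vary.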
Where should video buffers be placed in external memory? As with all trade-offs, it depends, but here is a rule of thumb: if the display refresh rate and the pixel processing rate are among the highest rates in your system, you should place the input and output buffers in separate external DRAM banks. This will ensure you have the lowest number of page opens and closes in DRAM. The external memory controller can keep track of which rows are open across the separate banks, so you should do all you can to exploit this feature. If there's not much change in the display from frame to frame, you may be able to place the input and output buffers in the same bank.
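To see why separate banks reduce page opens, consider a toy model in which each DRAM bank keeps exactly one row open, and an access to a different row in that bank forces a page open. The model below is a deliberate simplification with hypothetical names (`dram_model_t`, `count_page_opens()`); it ignores refresh, burst length, and controller reordering, and it assumes input reads and output writes strictly alternate.

```c
#include <assert.h>
#include <stdint.h>

#define NUM_BANKS 4            /* e.g. an SDRAM with four internal banks */
#define NO_ROW    UINT32_MAX   /* marker: no row open in this bank       */

typedef struct {
    uint32_t      open_row[NUM_BANKS];  /* currently open row per bank */
    unsigned long page_opens;           /* number of row activations   */
} dram_model_t;

static void dram_init(dram_model_t *m)
{
    for (unsigned b = 0; b < NUM_BANKS; b++) {
        m->open_row[b] = NO_ROW;
    }
    m->page_opens = 0;
}

/* One access: opens a new page only if the bank's open row differs. */
static void dram_access(dram_model_t *m, unsigned bank, uint32_t row)
{
    assert(bank < NUM_BANKS);
    if (m->open_row[bank] != row) {
        m->open_row[bank] = row;
        m->page_opens++;
    }
}

/* Alternates one input-buffer read with one output-buffer write,
 * n_pairs times, and returns how many page opens were needed. */
unsigned long count_page_opens(unsigned in_bank, uint32_t in_row,
                               unsigned out_bank, uint32_t out_row,
                               unsigned n_pairs)
{
    dram_model_t m;
    dram_init(&m);
    for (unsigned i = 0; i < n_pairs; i++) {
        dram_access(&m, in_bank, in_row);
        dram_access(&m, out_bank, out_row);
    }
    return m.page_opens;
}
```

In this model, 100 alternating read/write pairs cost 2 page opens when the buffers sit in different banks, but 200 when both buffers share a bank and occupy different rows, because every access evicts the row the other stream needs.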
One more point: when audio is present along with video in a system, try to locate the audio source in internal memory (L1 or on-chip L2 memory). This will help avoid peripheral underruns or overruns by reducing competition with the video flows into and out of external memory.
This concludes our article series on performance tuning. Hopefully the series has generated some new ideas for how you can optimize your embedded processing applications, reducing debugging time and time-to-market in the process.
|