Caches are increasingly common in DSPs, but many DSP programmers are unfamiliar with their operation. This article explains how caches work, using the two-level cache in TI's C64x as an example. It also outlines the causes for cache misses. In part 2, the author will explain how to minimize cache misses, giving some practical examples.
Why use caches?
High-performance DSP systems process digital data quickly. For example, the fastest members of the C6000 family, the TMS320C64x devices, operate at a clock rate of 1 GHz. Processing data at these rates requires fast memory that is directly connected to the central processing unit (CPU). However, processor speed has increased dramatically while memory speed has not, and this has created a bandwidth dilemma. Advanced process technologies have allowed the CPU clock speed to increase and more memory to be integrated on-chip, but the access time of the on-chip memory has not decreased proportionally. By nature, SRAMs are large arrays of static storage elements. As the number of storage elements increases, so does the capacitance on the data lines between these storage elements and the CPU, and more capacitance means slower switching. Thus, the speed gains from process technology are often offset by the increased capacitance of large memory arrays, and the memory to which the CPU is connected often becomes a processing bottleneck.
Cache memories can greatly reduce the CPU-to-memory bottleneck. Caches are small, fast memories that reside between the CPU and slower system memory. The cache provides code and data to the CPU at the speed of the processor while automatically managing the data movement from the slower main memory, which is frequently located off-chip.
Many high-end processors, including the C64x DSP, employ a two-level cache architecture for on-chip program and data accesses. In this hierarchy, the CPU interfaces directly to small level-one program (L1P) and data (L1D) caches. Dedicated L1 caches eliminate conflicts for memory resources between the program and data buses, thus increasing speed. These L1 caches operate at the same speed as the CPU. The L1 memories are also connected to a larger, second-level block of on-chip memory called L2. L2 is a unified memory block that contains both program and data, which allows memory to be allocated flexibly between program and data for accesses that miss in L1. The L2 memory acts as a bridge between the L1 memory and memory located off-chip. See Figure 1.

Figure 1. C64x two-level cache architecture.
Cache memories are still a new concept to many DSP programmers. This two-part article provides a high-level overview of cache-based system performance. It covers cache fundamentals, provides an overview of the C64x cache architecture, discusses code behavior in caches, and points out techniques for optimizing code for cache-based systems.
Cache concepts
Caches are based on two concepts: temporal locality (if an item is referenced, it will be referenced again soon) and spatial locality (if an item is referenced, items that are located nearby will also be referenced soon). Figure 2 shows a piece of C code that illustrates both these concepts.

Figure 2. C code for a dot product.
In this loop, we are multiplying elements in the "x" array by corresponding elements in the "y" array, and we are keeping a running sum of the product. Note that the same piece of code is executed ten times in a row. The data value "sum" is also accessed ten times in a row. This illustrates temporal locality; the same pieces of code and data are accessed several times in succession. However, the individual elements in the x and y arrays are accessed only once. The data required for each iteration of the loop is the succeeding element in the array. In the first iteration, x[0] and y[0] are accessed; in the next iteration, x[1] and y[1] are accessed, etc. This is an example of spatial locality. Note that while the values within the individual arrays are in close proximity to each other, the two arrays may not be in close proximity to each other.
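The loop described above can be sketched as follows. This is a reconstruction from the description, not the original figure: the function name `dot_product`, the `short` element type, and the `int` accumulator are assumptions chosen for illustration.

```c
#include <stddef.h>

/* Illustrative dot product; names and types are assumed, not taken from the figure. */
int dot_product(const short *x, const short *y, size_t n)
{
    int sum = 0;                     /* touched on every iteration: temporal locality */
    for (size_t i = 0; i < n; i++) {
        sum += x[i] * y[i];          /* x[i], y[i] follow x[i-1], y[i-1]: spatial locality */
    }
    return sum;
}
```

Calling `dot_product(x, y, 10)` on two ten-element arrays runs the loop body ten times, matching the walk-through above.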
Caches exploit temporal locality: the instructions and data the CPU is currently using are kept close to it. Caches also take advantage of spatial locality: instructions and data that are in close proximity to those currently in use are kept close to the CPU as well. Instructions and data that have not been used recently are located farther away from the CPU, in a separate level of memory. This arrangement is not very different from the way a flat memory architecture is used. When we want the fastest possible performance in a flat memory model, we keep the code and data we need in on-chip memory, and keep instructions and data that are not of immediate importance in slower, off-chip memory. The main difference between a flat memory model and a cache model is how the memory hierarchy is managed. In the flat memory model, the programmer manages the hierarchy (bringing new code and data on-chip when needed), while in a cache system, the cache controller manages the hierarchy.
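To make the effect of locality concrete, here is a minimal sketch of a toy cache model that counts hits and misses. It assumes a direct-mapped organization, which is only one of several possible cache designs, and the line size and line count are arbitrary illustrative values, not the C64x parameters. The names `toy_cache`, `toy_cache_reset`, and `toy_cache_access` are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define LINE_SIZE 64u   /* bytes per cache line (assumed for illustration) */
#define NUM_LINES 32u   /* number of lines in the toy cache (assumed) */

typedef struct {
    uint32_t tag[NUM_LINES];    /* which memory block each line currently holds */
    int      valid[NUM_LINES];  /* 0 until the line has been filled once */
    unsigned hits;
    unsigned misses;
} toy_cache;

/* Empty the cache and clear the counters. */
void toy_cache_reset(toy_cache *c)
{
    for (size_t i = 0; i < NUM_LINES; i++) {
        c->valid[i] = 0;
    }
    c->hits = 0;
    c->misses = 0;
}

/* Model one access to byte address addr; returns 1 on a hit, 0 on a miss. */
int toy_cache_access(toy_cache *c, uint32_t addr)
{
    uint32_t line_addr = addr / LINE_SIZE;      /* which memory line holds addr */
    uint32_t index     = line_addr % NUM_LINES; /* the one slot it may occupy */
    uint32_t tag       = line_addr / NUM_LINES; /* distinguishes lines sharing a slot */

    if (c->valid[index] && c->tag[index] == tag) {
        c->hits++;
        return 1;
    }
    /* Miss: the whole line is fetched, so neighboring bytes will hit later. */
    c->valid[index] = 1;
    c->tag[index]   = tag;
    c->misses++;
    return 0;
}
```

Under these assumed parameters, walking through 64 consecutive 2-byte elements starting at address 0 misses only twice, once per 64-byte line, which is spatial locality at work. Repeating the same walk then hits on every access, which is temporal locality.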
Since the cache controller greatly simplifies managing the memory hierarchy, a programmer can develop a system quickly. This is especially true in large systems, where a flat memory architecture would require either programming a very complex data flow using Direct Memory Access (DMA) or, very inefficiently, using the CPU to manage the data flow. Thus, cache-based systems provide a large advantage over flat memory systems in terms of data management simplicity and rapid development time.
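For contrast, here is a rough sketch of the bookkeeping a programmer takes on in a flat memory model. Data in slow memory is staged block by block through a small fast buffer before it is processed. Everything here is an assumption for illustration: `memcpy` merely stands in for a DMA transfer, `fast_buf` stands in for on-chip memory, and `BLOCK` and `sum_staged` are hypothetical names. A real system would use the device's DMA controller and would typically overlap transfers with computation.

```c
#include <stddef.h>
#include <string.h>

#define BLOCK 256   /* elements that fit in the fast buffer (assumed) */

/* Sum an array held in slow memory by staging it through a fast buffer. */
long sum_staged(const short *slow_mem, size_t n)
{
    static short fast_buf[BLOCK];   /* stand-in for on-chip memory */
    long sum = 0;

    for (size_t base = 0; base < n; base += BLOCK) {
        /* The last block may be shorter than BLOCK. */
        size_t count = (n - base < BLOCK) ? (n - base) : BLOCK;

        /* Stand-in for a DMA transfer from off-chip to on-chip memory. */
        memcpy(fast_buf, slow_mem + base, count * sizeof(short));

        /* The CPU then works only on the fast copy. */
        for (size_t i = 0; i < count; i++) {
            sum += fast_buf[i];
        }
    }
    return sum;
}
```

In a cache-based system the blocking, copying, and tail handling shown here are not written by the programmer, because the cache controller fetches lines on demand.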