21st century multiprocessor design, part 2

You don't need clock trees, OSs, or C compilers. In fact, you should avoid them.

DSP DesignLine

In the first part of this series, we explained the limitations of 20th-century multiprocessors and showed how to break through those limitations. In this second part, we explore the benefits of eliminating clocks and operating systems. We also examine how to save power by choosing the right instruction set and by shutting down cores.

Real-time clocks
As traditional processors have grown in speed and complexity, their behavior has become harder to predict: the time to execute a given piece of code is indeterminate and can vary from one run to the next. This is largely due to the increasingly large caches that processors use to reduce external memory accesses. On one pass through a loop the instructions are all fetched from external memory, but on the next pass they are already in the cache. At the same time, as processor complexity has grown, the number of CPU registers has increased as well, and so has the time required to save those registers during interrupt handling. All of this makes modern processors ill-suited for embedded applications, to say nothing of their large memory requirements and sheer chip cost.
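To make the cache effect concrete, here is a minimal sketch of the same loop timed on a cold pass and a warm pass. The function name and the cycle counts (10 cycles per external fetch, 1 per cache hit) are illustrative assumptions, not figures from any particular processor.

```python
def loop_pass_cycles(num_instructions, cached,
                     external_fetch_cycles=10, cache_fetch_cycles=1):
    """Fetch cycles for one pass through a loop (hypothetical cost model).

    The per-fetch costs are assumed values chosen only to show the spread
    between a cold pass and a warm pass.
    """
    per_fetch = cache_fetch_cycles if cached else external_fetch_cycles
    return num_instructions * per_fetch


# Same 100-instruction loop, two consecutive passes.
first_pass = loop_pass_cycles(100, cached=False)   # every fetch goes off-chip
second_pass = loop_pass_cycles(100, cached=True)   # every fetch hits the cache
```

Under these assumed costs the identical code takes ten times longer on the first pass than on the second, which is the kind of run-to-run variation that makes a guaranteed time slot hard to promise.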

Embedded processors have always stressed the ability to handle real-time applications, that is, to process code in a guaranteed time slot and to handle events within a tightly controlled (and shrinking) time allotment. Single-processor chips use a real-time clock, supplied by an external reference, to set up and control those tasks. But what is the ideal arrangement in a multicore chip?

Thinking about the application as a set of related tasks and subtasks, with cores assigned to each, provides an answer. Modern applications, especially those that are multimedia intensive, are not characterized by only one or two real-time tasks. Instead, many if not most of the tasks have a real-time component. This means that most of the cores will need access to the real-time clock. One way to provide it is to give one core access to the real-time clock and have that core pass status signals to the other cores. Alternatively, each core can access the reference clock directly. Of the two, the latter is a much better solution.

In part 1, we saw that status signals are not needed to transfer data between cores. If they can also be eliminated from timekeeping by giving each core its own access to the real-time clock, that goes a long way toward eliminating the status-signal form of communication between cores altogether. Notice that we are not suggesting that a system clock signal be distributed across the cores; this would require millions of nodes to be switched synchronously to the beat of that clock. For the real-time clock to be effective, only a handful of nodes in each core must be switched, and the effect on power dissipation is negligible. A simple counter on each core is more than enough to make that core self-sufficient in terms of real-time processing.
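The per-core counter can be sketched in a few lines. This is only a behavioral model under stated assumptions: the class and function names, the tick periods, and the idea of calling a method on each reference edge are all hypothetical, standing in for a small hardware counter.

```python
class CoreTimer:
    """Hypothetical per-core counter driven by the shared reference clock."""

    def __init__(self, period_ticks):
        if period_ticks <= 0:
            raise ValueError("period_ticks must be positive")
        self.period_ticks = period_ticks
        self.count = 0

    def on_reference_edge(self):
        """Count one reference tick; return True when the period has elapsed."""
        self.count += 1
        if self.count == self.period_ticks:
            self.count = 0
            return True
        return False


def deadlines(period_ticks, total_ticks):
    """Reference-tick numbers at which a core's real-time task would fire."""
    timer = CoreTimer(period_ticks)
    return [tick for tick in range(1, total_ticks + 1)
            if timer.on_reference_edge()]


# Two cores count the same reference independently (assumed periods);
# neither sends a status signal to the other.
audio_core = deadlines(period_ticks=4, total_ticks=12)
video_core = deadlines(period_ticks=6, total_ticks=12)
```

Each core derives its own schedule from the reference alone, so no core has to act as timekeeper for the others.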

Low power by design
As more and more embedded processor chips find themselves in mobile applications, the requirement for low power dissipation has become critically important. In traditional designs this is achieved through excruciating attention to detail, carefully determining the speed at which each signal path must operate and then choosing transistor sizes appropriate to that speed. Only the highest speed paths are implemented with large, power-hungry transistors.

But the multicore chip, with the ability to start and stop core processors as data is presented or denied, has a much simpler power-saving mechanism. Cores that are not processing data are not running and therefore are not dissipating any power. As explained in part 1, cores can be powered on and off automatically, without any intervention by the program. Not only is this approach simpler, it is also more effective: The benefit of shutting down cores is much larger than the impact of carefully sizing signal paths.

This approach has a second benefit. Because of the automatic synchronization of data passing between cores, there is no reason to make the cores themselves synchronous. That means there is no reason to have a central clock. Data transfers always take place at the highest possible speed—an external clock adds nothing but complexity. Now the central clock is replaced by an individual clock for each core—a simple ring oscillator—that runs as fast as the native speed of the silicon allows. No central clock means there is no giant clock tree with millions of transistor nodes dissipating power at each tick. Instead, the tiny individual clock oscillators run on each core, but only if that core is running. If a core has been stopped because data is either unavailable at its shared register or has not yet been read by a neighbor, the ring oscillator is also stopped. Clock dissipation only occurs in running cores, and even then these are fully asynchronous with regard to each other so that the power dissipation is spread over time.
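The stop-and-start behavior can be illustrated with a small step-based simulation. It is a sketch under assumptions, not a description of the actual hardware: the SharedRegister class, the one-word register, the per-step ordering (consumer first, then producer), and the fixed consumer cost are all hypothetical. A core's oscillator is counted as ticking only on steps where that core does work.

```python
class SharedRegister:
    """Hypothetical one-word register shared by two neighboring cores."""

    def __init__(self):
        self.value = None
        self.full = False

    def try_write(self, value):
        """Return False if the previous word has not been read yet."""
        if self.full:
            return False
        self.value, self.full = value, True
        return True

    def try_read(self):
        """Return (True, word) if a word is waiting, else (False, None)."""
        if not self.full:
            return False, None
        self.full = False
        return True, self.value


def simulate(items, consumer_cost, steps):
    """Count oscillator ticks for a producer core and a consumer core."""
    reg = SharedRegister()
    pending = list(items)
    received = []
    busy = 0                      # steps left on the consumer's current word
    producer_ticks = consumer_ticks = 0
    for _ in range(steps):
        # Consumer core: runs while processing or when a word is available.
        if busy > 0:
            busy -= 1
            consumer_ticks += 1
        else:
            ok, word = reg.try_read()
            if ok:
                received.append(word)
                busy = consumer_cost - 1
                consumer_ticks += 1
            # Otherwise no data: core and oscillator are stopped this step.
        # Producer core: stopped while its last word is still unread.
        if pending and reg.try_write(pending[0]):
            pending.pop(0)
            producer_ticks += 1
    return received, producer_ticks, consumer_ticks


received, producer_ticks, consumer_ticks = simulate([10, 20, 30],
                                                    consumer_cost=3, steps=12)
```

In this run the producer's oscillator ticks on only 3 of the 12 steps and the consumer's on 9, yet all three words arrive in order. The synchronization comes entirely from the register being full or empty, with no shared clock involved.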

In a chip such as this, with dozens of core processors, only a fraction of the cores are running at any given time. Some of these cores will be off for significant amounts of time because the chip is in a mode that does not run tasks involving those cores. But even the cores that are running do so in short spurts, executing code as fast as the silicon will allow and then turning back off when the data is exhausted. In this type of environment, we estimate that only a third of the cores would be running at any given instant. A few nanoseconds later a different group of cores would be active, but still only about a third of them. (See Figure 1.) This effectively reduces the power dissipation of the entire chip by about two-thirds, compared with running every core continuously, while ensuring that each core runs at the maximum possible speed of the silicon.


Figure 1. Typically, only about 1/3 of the chip is active at any given time.
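The arithmetic behind the one-third estimate can be checked with a short calculation. The core count (24) and the power per running core (9 mW) are made-up illustrative values, and the model assumes that a stopped core dissipates nothing, ignoring leakage.

```python
from fractions import Fraction


def average_power_mw(num_cores, active_fraction, mw_per_running_core):
    """Average chip power in mW, assuming stopped cores draw no power."""
    if not 0 <= active_fraction <= 1:
        raise ValueError("active_fraction must be between 0 and 1")
    return num_cores * active_fraction * mw_per_running_core


# Hypothetical chip: 24 cores, 9 mW per running core.
all_running = average_power_mw(24, Fraction(1), 9)      # every core active
one_third = average_power_mw(24, Fraction(1, 3), 9)     # a third active
saving = 1 - one_third / all_running                    # fraction saved
```

With these assumed numbers the chip averages 72 mW instead of 216 mW, a saving of two-thirds. The saving depends only on the active fraction, not on the particular core count or per-core power chosen.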

