The Pentium still kicks butt. The Pentium
IV, Intel's latest version built on the NetBurst microarchitecture, is
designed to deliver high-level performance at next-generation
clock rates. Interestingly, the Pentium IV does not represent a
performance boost over the current Pentium III. In fact, performance
levels are the same. But the Pentium IV deploys the Pentium
architecture over higher resolution fabs with multi-Gigahertz clock
rates. The first Pentium IV is due out next year running at 1.4
GHz, with higher clock rates to follow. A 2.0 GHz version is
already up and running.
Higher resolution fabs and faster clocks require different chip
design techniques for a processor architecture. Line delays,
capacitance, and power all have to be carefully handled, especially
for an architecture to be able to scale up on the next process
generations and clock rates. The Pentium IV was designed to do just
that. It is a complete redesign of the Pentium CPU, including an
additional 144 Streaming SIMD Extension instructions (SSE2) for
multimedia applications. SSE2 supports:
- 128-bit floating-point
- Video capture
- Speech
- Imaging
- 3D graphics rendering.
The Pentium IV is designed for a 0.13-micron CMOS process, not
Intel's current 0.18-micron process. Higher densities allow the use of a
larger cache memory and larger intermediate cached instruction
storage. For example, it has a 256 KB unified L2 cache and can
store (or queue) up to 12 K micro operations, the mini RISC wide
words (almost like microcode) that program the CPU logic. This will
be a hefty chip with 42 M transistors, compared to 28 M for the
earlier Pentium III.
Aiming For Clock Rate Performance
The Pentium IV implementation is tuned for high-clock rate
execution. Design features include:
- 400 MHz Front-Side Bus
- 256 KB Unified L2 cache
- 12 K cached decoded uOPs
- Queued uOPs (instead of reservation stations) for execution
units
- Support for RDRAM (RamBus memory).
The new front-side bus (FSB) provides a higher bandwidth
connection between the processor and the off-chip memory controller
and main memory. This is a sophisticated split-transaction bus
(commands are separated from the actual operation on the bus),
pipelined for higher efficiencies. The FSB supports 128-byte line
transfers with 64-byte accesses and can deliver a 3.2 Gbyte/sec
transfer rate.
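The quoted transfer rate can be checked with simple arithmetic. The sketch below assumes a 64-bit (8-byte) data path, which the article does not state; the function and constant names are illustrative only.

```python
# Back-of-the-envelope check of the front-side bus figures.
# Assumption (not stated in the article): the data bus is 64 bits (8 bytes) wide.

FSB_TRANSFERS_PER_SEC = 400_000_000  # 400 MHz effective transfer rate
BUS_WIDTH_BYTES = 8                  # assumed 64-bit data path
ACCESS_BYTES = 64                    # access size quoted in the article


def peak_bandwidth(transfers_per_sec: int, width_bytes: int) -> int:
    """Return peak bandwidth in bytes per second."""
    return transfers_per_sec * width_bytes


def transfers_per_access(access_bytes: int, width_bytes: int) -> int:
    """Return how many bus transfers one access needs."""
    return access_bytes // width_bytes


peak = peak_bandwidth(FSB_TRANSFERS_PER_SEC, BUS_WIDTH_BYTES)
print(f"peak bandwidth: {peak / 1e9:.1f} Gbyte/sec")
print(f"transfers per access: {transfers_per_access(ACCESS_BYTES, BUS_WIDTH_BYTES)}")
```

Under that assumption the product comes out to the 3.2 Gbyte/sec quoted above. This is a peak figure; sustained throughput would be lower.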
To accommodate higher clock rates, the Pentium IV was designed
with a very long pipeline: 20 stages, as compared to the Pentium III's
10. The long pipeline allows designers to segment operations into
multiple stages and fit them into the narrow clock
periods. To keep performance up with the longer pipeline, Intel
engineers reworked the Pentium's branch prediction logic to get a
higher hit rate. The branch hit rate is reputed to be in the low 90
percent range.
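To see why the hit rate matters more as the pipeline gets longer, consider a rough model of cycles per instruction (CPI). This is a textbook-style sketch, not Intel's own analysis. It assumes that 20 percent of instructions are branches and that a misprediction costs about one full pipeline refill. Both figures are illustrative and do not come from the article.

```python
# Toy model: how branch mispredictions inflate cycles per instruction (CPI).
# Assumptions (illustrative only): 20% of instructions are branches, and a
# misprediction costs roughly as many cycles as the pipeline has stages.

BRANCH_FRACTION = 0.20  # assumed share of branch instructions


def effective_cpi(base_cpi, branch_fraction, hit_rate, flush_penalty_cycles):
    """Return base CPI plus the average misprediction cost per instruction."""
    miss_rate = 1.0 - hit_rate
    return base_cpi + branch_fraction * miss_rate * flush_penalty_cycles


for stages in (10, 20):
    for hit_rate in (0.85, 0.92):
        cpi = effective_cpi(1.0, BRANCH_FRACTION, hit_rate, stages)
        print(f"{stages} stages, {hit_rate:.0%} hit rate -> CPI {cpi:.2f}")
```

In this model the penalty term doubles when the pipeline depth doubles, so a better predictor is needed just to hold CPI steady.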
The key to the Pentium IV's high frequency performance is its
Execution Trace Cache. Here, unlike most RISC processors, the
Pentium IV caches decoded instructions, eliminating the
complex decoding stages, except for the first pass through the
thread.
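The decode-once idea can be illustrated with a small software analogy. The sketch below is a toy model only: the decode() helper and its uOP names are hypothetical, and the cache ignores capacity limits, eviction, and the way a real trace cache builds traces along the execution path.

```python
# Toy decode-once cache in the spirit of a trace cache.
# Assumptions: decode() and its uOP names are hypothetical; capacity limits,
# eviction, and trace building across branches are ignored.


def decode(instruction):
    """Hypothetical decoder: return a tuple of uOP names for one instruction."""
    opcode = instruction.split()[0]
    if "[" in instruction:
        # A memory operand becomes a separate load uOP ahead of the ALU uOP.
        return ("load", opcode)
    return (opcode,)


class TraceCache:
    def __init__(self, decoder):
        self._decoder = decoder
        self._decoded = {}     # instruction address -> tuple of uOPs
        self.decode_count = 0  # how often the slow decoder actually ran

    def fetch(self, address, instruction):
        """Return decoded uOPs, decoding only on the first pass."""
        if address not in self._decoded:
            self.decode_count += 1
            self._decoded[address] = self._decoder(instruction)
        return self._decoded[address]


loop_body = {0x100: "add eax, [ebx]", 0x104: "inc ecx"}
cache = TraceCache(decode)
for _ in range(1000):
    for address, instruction in loop_body.items():
        cache.fetch(address, instruction)

print("decoder invocations:", cache.decode_count)
```

The loop body is fetched 2,000 times, but the decoder runs only once per instruction. Every later pass is served from the cache.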
The Pentium IV CPU gets more bang from its clocks by running its
integer execution units at twice the core clock frequency. As a result, the CPU can
execute 4 integer operations per clock. The execution units also
include a Load Unit, a Store Unit, an FPU Move/Store Unit, and an
FPU MUL/ADD/MMX Unit.
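The figure of 4 integer operations per clock can be reproduced if one assumes two fast integer units, each running at twice the main clock. The unit count is inferred from the article's numbers rather than stated in it, and the names below are illustrative.

```python
# Peak integer throughput from double-speed execution units.
# Assumption: two fast integer units (inferred from the article, not stated).

CORE_CLOCK_HZ = 1_400_000_000  # 1.4 GHz launch clock from the article
ALU_CLOCK_MULTIPLIER = 2       # integer units run at twice the main clock
FAST_INTEGER_UNITS = 2         # assumed number of double-speed units


def integer_ops_per_clock(units: int, multiplier: int) -> int:
    """Return peak integer operations per main clock cycle."""
    return units * multiplier


ops_per_clock = integer_ops_per_clock(FAST_INTEGER_UNITS, ALU_CLOCK_MULTIPLIER)
peak_ops_per_sec = ops_per_clock * CORE_CLOCK_HZ
print(f"{ops_per_clock} integer ops per clock")
print(f"{peak_ops_per_sec / 1e9:.1f} billion integer ops/sec peak at 1.4 GHz")
```

This is a peak figure. It holds only while the schedulers can keep both units fed with independent operations.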
Superscalar Operation
Like earlier Pentiums, the Pentium IV is a superscalar CPU and
can issue multiple instructions per clock cycle, up to 3 uOPs from
the Trace Cache per clock. The uOPs are decoded instructions that
are then passed through the Rename/Allocation logic, which maps
the operations onto available registers (the Pentium IV has 128
128-bit registers). The logic then assigns them to one
or more execution units and passes them to the uOP Queues, where
the operations are queued up waiting for the resources to be
executed. During the next stage, the Schedulers schedule the
operations for execution in the addressed execution units.
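The renaming step can be sketched in software. The model below is a minimal one, assuming a simple free list of 128 physical registers. It leaves out register reclamation at retirement, and the class and method names are hypothetical, not Intel's.

```python
# Minimal register-renaming sketch.
# Assumptions: a plain free list of physical registers; no reclamation at
# retirement; class and method names are hypothetical.


class Renamer:
    def __init__(self, architectural, num_physical=128):
        self.free = list(range(num_physical))
        # Give every architectural register an initial physical home.
        self.mapping = {name: self.free.pop(0) for name in architectural}

    def rename(self, dest, sources):
        """Map one uOP's registers; return (dest_physical, source_physicals)."""
        # Sources are read under the current mapping, before dest is redirected.
        source_physicals = [self.mapping[name] for name in sources]
        if not self.free:
            raise RuntimeError("no free physical register; the uOP must stall")
        dest_physical = self.free.pop(0)
        self.mapping[dest] = dest_physical
        return dest_physical, source_physicals


renamer = Renamer(["eax", "ebx"])
first = renamer.rename("eax", ["ebx"])   # eax <- f(ebx)
second = renamer.rename("ebx", ["eax"])  # ebx <- g(eax), reads the new eax
third = renamer.rename("eax", ["eax"])   # eax <- h(eax), gets a fresh register
print(first, second, third)
```

Each write to eax lands in a different physical register. That removes false dependencies between uOPs that only reuse a register name, which is what lets them execute out of order.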
The CPU supports out-of-order, speculative execution.
Instruction operations can be executed out of instruction order (if
there are no resource conflicts or dependencies), and the logic will
choose a path from a branch (speculate on the winning branch) for
execution. If the branch choice is wrong, the logic can roll back
the trace execution and "replay" by going down the correct
execution path. This speculative execution is made possible by the
Branch Target Buffer (4 K addresses), which buffers past execution
choices, and the Trace Cache, which caches the decoded instructions
that were executed.
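A buffer of past branch choices can be modeled very simply. The sketch below predicts that a branch will repeat its last outcome, a deliberately naive policy chosen for illustration. The article does not describe the Pentium IV's actual prediction algorithm, and the class name and eviction rule here are assumptions.

```python
# Toy branch-history buffer with "same as last time" prediction.
# Assumptions: last-outcome prediction and oldest-first eviction are
# illustrative choices, not the Pentium IV's actual algorithm.


class BranchTargetBuffer:
    def __init__(self, capacity=4096):  # 4 K entries, as in the article
        self.capacity = capacity
        self.last_outcome = {}          # branch address -> taken?

    def predict(self, branch_address):
        """Predict taken/not taken; unknown branches default to not taken."""
        return self.last_outcome.get(branch_address, False)

    def update(self, branch_address, taken):
        """Record the real outcome, evicting the oldest entry if full."""
        if (branch_address not in self.last_outcome
                and len(self.last_outcome) >= self.capacity):
            oldest = next(iter(self.last_outcome))
            del self.last_outcome[oldest]
        self.last_outcome[branch_address] = taken


def count_hits(btb, branch_address, outcomes):
    """Speculate on each branch, then learn the real outcome."""
    hits = 0
    for taken in outcomes:
        if btb.predict(branch_address) == taken:
            hits += 1
        # On a miss, real hardware would discard wrong-path work and replay.
        btb.update(branch_address, taken)
    return hits


loop_outcomes = [True] * 9 + [False]  # loop branch: taken 9 times, then exits
hits = count_hits(BranchTargetBuffer(), 0x400, loop_outcomes)
print(f"{hits} of {len(loop_outcomes)} predictions correct")
```

Even this naive policy misses only on the first encounter and on the loop exit, which shows why remembering past choices pays off for loop-heavy code.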
The Pentium IV supports very "deep" speculative execution. It
has the resources to keep 126 instructions "in flight," i.e., in
execution mode, three times as many as the older Pentiums based
on the P6 microarchitecture. Of those instructions, the Pentium IV
can juggle 48 loads and 24 stores.
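These limits can be pictured as a simple admission check. The sketch below is a toy bookkeeping model built on the three figures from the article. The class and method names are hypothetical, and real hardware tracks far more state than three counters.

```python
# Toy bookkeeping for the in-flight limits quoted in the article.
# Assumptions: class and method names are hypothetical; real hardware
# tracks much more than three counters.


class InFlightWindow:
    MAX_TOTAL = 126   # instructions in flight
    MAX_LOADS = 48    # of which loads
    MAX_STORES = 24   # of which stores

    def __init__(self):
        self.total = 0
        self.loads = 0
        self.stores = 0

    def try_allocate(self, kind):
        """Admit one instruction ('load', 'store', or 'other') if room exists."""
        if self.total >= self.MAX_TOTAL:
            return False
        if kind == "load" and self.loads >= self.MAX_LOADS:
            return False
        if kind == "store" and self.stores >= self.MAX_STORES:
            return False
        self.total += 1
        if kind == "load":
            self.loads += 1
        elif kind == "store":
            self.stores += 1
        return True

    def retire(self, kind):
        """Release the resources of one completed instruction."""
        self.total -= 1
        if kind == "load":
            self.loads -= 1
        elif kind == "store":
            self.stores -= 1


window = InFlightWindow()
accepted_loads = sum(window.try_allocate("load") for _ in range(60))
print("loads admitted before stalling:", accepted_loads)
```

When a request is refused, the front end has to stall until an older instruction retires and frees a slot.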