CHAPTER 10 TRENDS IN COMPUTER ARCHITECTURE
10.7 VLIW Machines
There is an architecture that is in a sense competitive with superscalar architectures, referred to as the VLIW (Very Long Instruction Word) architecture. In VLIW machines, multiple operations are packed into a single instruction word that may be 128 or more bits wide. The VLIW machine has multiple execution units, similar to the superscalar machine. A typical VLIW CPU might have two IUs, two FPUs, two load/store units, and a BPU. It is the responsibility of the compiler to organize multiple operations into the instruction word. This relieves the CPU of the need to examine instructions for dependencies, or to order or reorder instructions. A disadvantage is that the compiler must of necessity be pessimistic in its estimates of dependencies. If it cannot find enough instructions to fill the instruction word, it must fill the blank spots with NOP instructions. Furthermore, VLIW architectural improvements require software to be recompiled to take advantage of them.
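The packing job that falls to the compiler can be illustrated with a minimal sketch. It assumes the seven-slot machine described above (two IUs, two FPUs, two load/store units, and a BPU) and a deliberately pessimistic in-order packer that closes the current word whenever an operation depends on one already in that word. The names Op, SLOTS, and pack are hypothetical, chosen only for this illustration, and a real VLIW compiler would also reorder operations.

```python
from dataclasses import dataclass

NOP = "nop"
# One slot per execution unit of the "typical" VLIW CPU in the text
# (LSU stands for a load/store unit; the slot order is an assumption).
SLOTS = ("IU", "IU", "FPU", "FPU", "LSU", "LSU", "BPU")


@dataclass(frozen=True)
class Op:
    text: str          # assembly-like label, for display only
    unit: str          # execution unit type the operation needs
    dest: str          # register written
    srcs: tuple = ()   # registers read


def pack(ops):
    """Pack ops, in program order, into VLIW words padded with NOPs."""
    words = []
    word, written, read = [NOP] * len(SLOTS), set(), set()
    for op in ops:
        if op.unit not in SLOTS:
            raise ValueError(f"no execution unit of type {op.unit!r}")
        free = [i for i, u in enumerate(SLOTS) if u == op.unit and word[i] == NOP]
        # Pessimistic dependence test against everything already in the word.
        conflict = (bool(written & set(op.srcs))   # read after write
                    or op.dest in written          # write after write
                    or op.dest in read)            # write after read
        if not free or conflict:
            words.append(word)                     # close the word; gaps stay NOPs
            word, written, read = [NOP] * len(SLOTS), set(), set()
            free = [i for i, u in enumerate(SLOTS) if u == op.unit]
        word[free[0]] = op.text
        written.add(op.dest)
        read.update(op.srcs)
    if any(slot != NOP for slot in word):
        words.append(word)
    return words


ops = [
    Op("add r1,r2,r3", "IU", "r1", ("r2", "r3")),
    Op("ld r4,[r5]", "LSU", "r4", ("r5",)),
    Op("sub r6,r1,r4", "IU", "r6", ("r1", "r4")),    # needs r1 and r4
    Op("fmul f1,f2,f3", "FPU", "f1", ("f2", "f3")),
]
words = pack(ops)
nop_count = sum(slot == NOP for word in words for slot in word)
for word in words:
    print(word)
```

In this example the four operations occupy two instruction words, so ten of the fourteen slots hold NOPs, which illustrates the code-density cost noted above.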
There have been a number of attempts to market VLIW machines, but VLIW machines have largely fallen out of favor in recent years. Performance is the primary culprit, for the reasons above, among others.
10.8 Case Study: The Intel IA-64 Merced Architecture
This section discusses a microprocessor family in development by an alliance between Intel and Hewlett-Packard, which it is hoped will take the consortium into the 21st century. We first look into the background that led to the decision to develop a new architecture, and then we look at what is currently known about the architecture. The information in this section is taken from various publications and Web sites, and has not been confirmed by Intel or Hewlett-Packard.
10.8.1 BACKGROUND: THE 80X86 CISC ARCHITECTURE
The current Intel 80x86 architecture, which runs on some 80% of desktop computers in the late 1990s, had its roots in the 8086 microprocessor, designed in the late 1970s. The architectural roots of the family go back to the original Intel 8080, designed in the early 1970s. Being a persistent advocate of upward compatibility, Intel has been in a sense hobbled by a CISC architecture that is over 20 years old. Other vendors such as Motorola abandoned hardware compatibility for modernization, relying upon emulators to ease the transition to a new ISA.
In any case, Intel and Hewlett-Packard decided several years ago that the x86
architecture would soon reach the end of its useful life, and they began joint research on a new architecture. Intel and Hewlett-Packard have been quoted as saying that RISC architectures have “run out of gas,” so to speak, so their search led in other directions. Their research led to the IA-64, which stands for “Intel Architecture-64.” The first of the IA-64 family is known by the code name Merced, after the Merced River in California.
10.8.2 THE MERCED: AN EPIC ARCHITECTURE
Although Intel has not released significant details of the Merced ISA, it refers to its architecture as Explicitly Parallel Instruction Computing, or EPIC. Intel takes pains to point out that it is not a VLIW or even an LIW machine, perhaps out of sensitivity to the bad reputation that VLIW machines have received; however, some industry analysts refer to it as “the VLIW-like EPIC architecture.”
Features
While exact details are not publicly known as of this writing, published sources report that the Merced is expected to have the following characteristics:
• 128 64-bit GPRs and perhaps 128 80-bit FPRs;
• 64 1-bit predicate registers (explained later);
• Instruction words contain three instructions packed into one 128-bit parcel;
• Execution units, roughly equivalent to IU, FPU, and BPU, appear in multiples of three, and the IA-64 will be able to schedule instructions into these multiples;
• It will be the burden of the compiler to schedule the instructions to take advantage of the multiple execution units;
• Most of the instructions seem to be RISC-like, although it is rumored that the processor will still execute 80x86 binary codes, in a dedicated execution unit, known as the DXU;
• Speculative loads. The processor will be able to load values from memory well in advance of when they are needed. Exceptions caused by the loads are postponed until execution has proceeded to the place where the loads would normally have occurred;
• Predication (not prediction), where both sides of a conditional branch instruction are executed and the results from the side not taken are discarded.
These latter two features are discussed in more detail later.
The Instruction Word
The 128-bit instruction word, shown in Figure 10-13, has three 40-bit instructions and an 8-bit template. The template is placed by the compiler to tell the CPU which instructions in and near that instruction word can execute in parallel, thus the term “Explicit.” The CPU need not analyze the code at runtime to expose instructions that can be executed in parallel because the compiler determines that ahead of time. Compilers for most VLIW machines must place NOP instructions in slots where instructions cannot be executed in parallel. In the IA-64 scheme, the presence of the template identifies those instructions in the word that can and cannot be executed in parallel, so the compiler is free to schedule instructions into all three slots, regardless of whether they can be executed in parallel.
The 6-bit predicate field in each instruction represents a tag placed there by the compiler to identify which leg of a conditional branch the instruction is part of, and is used in branch predication.
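The field widths reported for the instruction word can be checked with a short bit-packing sketch. The widths (an 8-bit template, three 40-bit instructions, and within each instruction a 13-bit op code, three 7-bit GPR fields, and a 6-bit predicate) come from the description of the instruction word. The ordering of the fields within the word is not given, so the ordering used here is purely an assumption, and the function names are hypothetical.

```python
TEMPLATE_BITS = 8
INSTR_BITS = 40
OPCODE_BITS, GPR_BITS, PRED_BITS = 13, 7, 6   # 13 + 3*7 + 6 = 40


def _check(value, bits, name):
    if not 0 <= value < (1 << bits):
        raise ValueError(f"{name} does not fit in {bits} bits")


def encode_instruction(opcode, pred, r1, r2, r3):
    """Build one 40-bit instruction (assumed order: opcode|r1|r2|r3|pred)."""
    _check(opcode, OPCODE_BITS, "opcode")
    _check(pred, PRED_BITS, "predicate")
    word = opcode
    for reg in (r1, r2, r3):
        _check(reg, GPR_BITS, "GPR number")
        word = (word << GPR_BITS) | reg
    return (word << PRED_BITS) | pred


def decode_instruction(word):
    """Inverse of encode_instruction; returns (opcode, pred, r1, r2, r3)."""
    pred = word & ((1 << PRED_BITS) - 1)
    word >>= PRED_BITS
    regs = []
    for _ in range(3):
        regs.append(word & ((1 << GPR_BITS) - 1))
        word >>= GPR_BITS
    r3, r2, r1 = regs               # fields come off the low end in reverse
    return word, pred, r1, r2, r3


def pack_bundle(template, instructions):
    """Combine a template and three instructions into one 128-bit integer."""
    _check(template, TEMPLATE_BITS, "template")
    if len(instructions) != 3:
        raise ValueError("an instruction word holds exactly three instructions")
    bundle = template               # assumption: template in the low 8 bits
    for slot, instr in enumerate(instructions):
        _check(instr, INSTR_BITS, "instruction")
        bundle |= instr << (TEMPLATE_BITS + slot * INSTR_BITS)
    return bundle


def unpack_bundle(bundle):
    """Inverse of pack_bundle; returns (template, [instr0, instr1, instr2])."""
    template = bundle & ((1 << TEMPLATE_BITS) - 1)
    mask = (1 << INSTR_BITS) - 1
    instrs = [(bundle >> (TEMPLATE_BITS + s * INSTR_BITS)) & mask for s in range(3)]
    return template, instrs


instrs = [
    encode_instruction(1, 0, 1, 2, 3),
    encode_instruction(2, 5, 4, 5, 6),
    encode_instruction(0x1FFF, 63, 127, 127, 127),
]
bundle = pack_bundle(0xA5, instrs)
print(f"{bundle:032x}")             # 32 hex digits = 128 bits
```

The 7-bit register fields are just wide enough to name any of the 128 GPRs, and the 6-bit predicate field any of the 64 predicate registers.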
Figure 10-13 The 128-bit IA-64 instruction word. [The 128-bit word holds an 8-bit template and three 40-bit instructions; each 40-bit instruction holds a 13-bit op code, a 6-bit predicate, and three 7-bit GPR fields.]
Branch Predication
Rather than using branch prediction, the IA-64 architecture uses branch predication to remove penalties due to mispredicted branches. When the compiler encounters a conditional branch instruction that is a candidate for predication, it selects two unique labels and labels the instructions in each leg of the branch instruction with one of the two labels, identifying which leg they belong to. Both legs can then be executed in parallel. There are 64 one-bit predicate registers, one corresponding to each of the 64 possible predicate identifiers.
When the actual branch outcome is known, the predicate register corresponding to the TRUE label is set if the branch outcome is TRUE, and the predicate register corresponding to the FALSE label is cleared; if the outcome is FALSE, the roles are reversed. Then the results from instructions having the correct predicate label are kept, and results from instructions having the incorrect label are discarded.
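The predication mechanism can be modeled in a few lines. The sketch below is a simplified software model, not a description of the actual hardware: it assumes that each instruction is a (predicate label, destination, computation) triple, that every instruction reads only the register values from before the branch, and that labels 1 and 2 were chosen by the compiler. The function name run_predicated and its parameters are hypothetical.

```python
NUM_PREDICATES = 64


def run_predicated(condition, regs, instructions, p_true=1, p_false=2):
    """Execute both legs of a branch, then keep only the correct leg's results.

    instructions: list of (predicate_label, dest_register, compute), where
    compute(regs) returns the value the instruction would write.
    """
    predicates = [False] * NUM_PREDICATES

    # Both legs execute before the branch outcome is known; results are
    # held back rather than written to the register file.
    pending = [(label, dest, compute(regs)) for label, dest, compute in instructions]

    # The outcome is now known: one predicate register is set, the other cleared.
    predicates[p_true] = bool(condition)
    predicates[p_false] = not condition

    # Keep results whose predicate is set; discard the rest.
    committed = dict(regs)
    for label, dest, value in pending:
        if predicates[label]:
            committed[dest] = value
    return committed


# if (a > b) m = a - b; else m = b - a;
legs = [
    (1, "m", lambda r: r["a"] - r["b"]),   # TRUE leg, labeled 1
    (2, "m", lambda r: r["b"] - r["a"]),   # FALSE leg, labeled 2
]
first = run_predicated(7 > 3, {"a": 7, "b": 3}, legs)
second = run_predicated(2 > 9, {"a": 2, "b": 9}, legs)
print(first["m"], second["m"])
```

Both subtractions are computed in each call, but only the result tagged with the set predicate reaches register m, so no branch has to be predicted.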
Speculative Loads
The architecture also employs speculative loads, that is, examining the instruction stream for upcoming load instructions and loading the value ahead of time, speculating that the value will actually be needed and will not have been altered by intervening operations. If successful, this eliminates the normal latency inherent in memory accesses. The compiler examines the instruction stream for candidate load operations that it can “hoist” to a location earlier in the instruction sequence. It inserts a check instruction at the point where the load instruction was originally located. The data value is thus available in the CPU when the check instruction is encountered.
The problem that is normally faced by speculative loads is that the load operation may generate an exception, for example because the address is invalid. However, the exception may not be genuine, because the load may be beyond a branch instruction that is not taken, and thus would never actually be executed. The IA-64 architecture postpones processing the exception until the check instruction is encountered. If the branch is not taken then the check instruction will not be executed, and thus the exception will not be processed.
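The interplay between a hoisted load and its check instruction can be sketched as follows. This is a toy model under stated assumptions: memory is a dictionary, an invalid address is simply a missing key, and the deferred exception is represented by a marker object. The names DeferredFault, LoadFault, speculative_load, check, and guarded_read are all hypothetical.

```python
class LoadFault(Exception):
    """Raised when a postponed load exception is finally processed."""


class DeferredFault:
    """Marker standing in for a load whose exception has been postponed."""

    def __init__(self, address):
        self.address = address


def speculative_load(memory, address):
    """Hoisted load: never raises; a bad address yields a DeferredFault."""
    if address in memory:
        return memory[address]
    return DeferredFault(address)


def check(value):
    """Check instruction at the load's original location."""
    if isinstance(value, DeferredFault):
        raise LoadFault(f"invalid address {value.address:#x}")
    return value


def guarded_read(memory, address, branch_taken):
    """Model of: if (branch_taken) use(load(address)), with the load hoisted."""
    value = speculative_load(memory, address)   # moved above the branch
    if branch_taken:
        return check(value)    # the exception, if any, surfaces only here
    return None                # check never runs, so no exception is processed


memory = {0x100: 42}
print(guarded_read(memory, 0x100, branch_taken=True))     # value is ready early
print(guarded_read(memory, 0xDEAD, branch_taken=False))   # bad address, no fault
```

Calling guarded_read(memory, 0xDEAD, branch_taken=True) raises LoadFault, because in that case the check instruction is reached and the postponed exception is genuine.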
All of this complexity places a heavy burden on the compiler, which must be clever about how it schedules operations into the instruction words.
80x86 Compatibility
Intel was recently granted a patent for a method, presumably to be used with IA-64, for supporting two instruction sets, one of which is the x86 instruction set. It describes instructions to allow switching between the two execution modes, and for data sharing between them.
Estimated Performance
It has been estimated that the first Merced implementation will appear sometime in the year 2000, and will have an 800 MHz clock speed. Goals are for it to have performance several times that of current-generation processors when running in EPIC mode, and that of a 500 MHz Pentium II in x86 mode. Intel has stated that initially the IA-64 microprocessor will be reserved for use in high-performance workstations and servers, and at an estimated initial price of $5000 each this will undoubtedly be the case.
On the other hand, skeptics, who seem to abound when new technology is announced, say that the technology is unlikely to meet expectations, and that the
IA-64 may never see the light of day. Time will tell.
10.9 Parallel Architecture