The CRAY-1 Computer System

Richard M. Russell
Presented by Andrew Waterman

ECE259 Spring 2008

Background

• CRAY-1 by no means first vector machine
– 1960s: Westinghouse Solomon/ILLIAC IV
– 1974: CDC STAR 100
• “I never, ever want to be a pioneer” --Cray
– STAR 100, ILLIAC IV: who's this Amdahl dude?
• 1972: Cray Research formed after spat with CDC
– Seymour Cray wanted to start from scratch on 8600; CDC brass, not so much
• 1976: first CRAY-1 deployed at Los Alamos

CRAY-1 Hardware

Look Ma, No ASICs!

CRAY-1 Architecture

• 5-ton, vector uniprocessor
• Word size = 64 bits
• 80 MHz clock
• 8MB RAM in 16 banks @ 20 MHz
– f_cpu/f_mem = 4 (!!)
• Fairly RISCy 16- or 32-bit instructions
– Load/store; register-register operations

Scalar Operation and Octal Annoyance

• 10₈ (= 8) A-registers for 24-bit address calculations
• 100₈ (= 64) B-registers serve as backing store for A-registers
• 10₈ (= 8) S-registers for source/dest of scalar integer/FP insns
• T is to S as B is to A
• 11₈ (= 9) pipelined scalar FUs
– Address add, mult
– Integer add, shift, logic, pop count
– FP add, mult, reciprocal

Scalar Operation

• Protection without virtual memory
– Base & limit address regs
  • Ld $dest,$addr actually loads from $base+$addr
  • Program killed if $base+$addr >= $limit
• A handful of registers for interrupts, exceptions, etc.
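
The base-and-limit check can be sketched in a few lines of Python. This is a toy model, not a description of the CRAY-1 hardware: the `Memory` class, the `ProtectionError` exception, and the register values are hypothetical names and numbers chosen for illustration. Only the rule itself (load from base + addr, kill the program if base + addr >= limit) comes from the slide.

```python
class ProtectionError(Exception):
    """Raised when a program-relative address falls outside its region."""


class Memory:
    """Toy word-addressed memory guarded by base and limit registers."""

    def __init__(self, words, base, limit):
        self.words = words    # physical memory, one list entry per word
        self.base = base      # added to every program-relative address
        self.limit = limit    # first physical address the program may NOT touch

    def translate(self, addr):
        """Map a program-relative address to a physical one, or fault."""
        physical = self.base + addr
        if physical >= self.limit:
            raise ProtectionError(
                f"address {addr} maps to {physical}, limit is {self.limit}"
            )
        return physical

    def load(self, addr):
        """Model of `Ld $dest,$addr`: the access goes to base + addr."""
        return self.words[self.translate(addr)]


# Hypothetical setup: 32 physical words, program confined to words 8..15.
mem = Memory(words=list(range(100, 132)), base=8, limit=16)
print(mem.load(3))   # reads physical word 11
```

In this model `mem.load(8)` raises `ProtectionError`, since 8 + 8 reaches the limit of 16.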

OS and Front End

• COS (CRAY OS) handles job scheduling, storage management (tapes!), other I/O, checkpointing
– Packaged with CAL (assembler)
– ...and CFT (Fortran compiler), more later

• Command-line interface and job submission via separate front-end computer, e.g. VAX

Vector Operation (Finally!)

• 8x64-word V-registers
• Vector Length Register
– Indicates # ops performed by vector insns
– Set from contents of an A-register
• Vector Mask Register
– Indicates which elements in vector to operate on
– Set by vector test insns (e.g. VM[i] := ($Vk[i] == 0))
• 6 Vector FUs
– integer add, shift, bitwise logic
– FP via scalar FPU: add, mult, reciprocal
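
A rough Python model of how the Vector Length and Vector Mask registers shape a vector instruction. The function names are hypothetical, Python lists stand in for V-registers, and the merge operation is just one illustrative way a mask can select elements. The test `vk[i] == 0` is the example from the slide.

```python
VLEN_MAX = 64  # each V-register holds 64 words


def vector_add(va, vb, vl):
    """Element-wise add over the first `vl` elements, as VL would dictate."""
    assert 0 < vl <= VLEN_MAX
    return [va[i] + vb[i] for i in range(vl)]


def vector_test_zero(vk, vl):
    """Build a mask: entry i is True when vk[i] == 0."""
    return [vk[i] == 0 for i in range(vl)]


def vector_merge(va, vb, vm, vl):
    """Take va[i] where the mask is set, vb[i] where it is clear."""
    return [va[i] if vm[i] else vb[i] for i in range(vl)]


# VL = 3: only the first three elements take part in the add.
print(vector_add([1, 2, 3, 4], [10, 20, 30, 40], vl=3))   # [11, 22, 33]

# Mask set where vk is zero, then used to choose between two sources.
vm = vector_test_zero([0, 5, 0, 7], vl=4)
print(vector_merge([1, 1, 1, 1], [9, 9, 9, 9], vm, vl=4))  # [1, 9, 1, 9]
```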

Vector Load/Store Architecture

• Big departure from STAR 100: register-register ops
• CRAY-1 memory bandwidth == 80Mword/s == 1 word/cycle
– If all 2-source insns are memory-memory, then IPC=1/3! (and that assumes no bank conflicts!)
– Solution: the RISC approach

• Combined with chaining (next), can sustain >> 1 flop/cycle
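
The IPC=1/3 figure is plain bandwidth arithmetic: a two-source memory-memory operation reads two words and writes one, and memory delivers one word per cycle. A small sketch of that calculation; the helper name `ops_per_cycle` is made up for illustration.

```python
from fractions import Fraction

WORDS_PER_CYCLE = 1  # memory bandwidth from the slide: 80 Mword/s at 80 MHz


def ops_per_cycle(words_moved_per_op, bandwidth=WORDS_PER_CYCLE):
    """Upper bound on operation rate when every operand goes through memory."""
    return Fraction(bandwidth, words_moved_per_op)


# Memory-memory, two sources: read 2 words and write 1 word per operation.
memory_memory = ops_per_cycle(2 + 1)
print(memory_memory)  # 1/3
```

With operands held in V-registers, the functional units no longer wait on memory for every element, which is what lets the register-register design get past this bound.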

Chaining

• Pipeline bypass meets vectors
• Consider SAXPY vector expression a*X+Y
– Slow approach: compute a*X (64 mults), then compute a*X+Y (64 adds)
  • Total latency: 128+mult latency+add latency
  – since, in CRAY-1, all FUs are pipelined
– But... no fundamental serialization requirement
  • As soon as a*X[0] is computed, can compute a*X[0]+Y[0]
  • Total latency: 64+mult latency+add latency (speedup of almost 2)
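
The two latency formulas above, written out as a small Python sketch. The function names are hypothetical, and the FU latencies of 7 cycles (multiply) and 6 cycles (add) are assumed values for illustration that do not appear on the slide. The accounting follows the slide's formulas as written.

```python
def unchained_latency(n, mult_latency, add_latency):
    """All n multiplies drain before the first add starts."""
    return (n + mult_latency) + (n + add_latency)


def chained_latency(n, mult_latency, add_latency):
    """Each product feeds the adder as soon as it leaves the multiplier."""
    return n + mult_latency + add_latency


N = 64            # elements per V-register
MULT_LATENCY = 7  # assumed cycles, for illustration only
ADD_LATENCY = 6   # assumed cycles, for illustration only

slow = unchained_latency(N, MULT_LATENCY, ADD_LATENCY)
fast = chained_latency(N, MULT_LATENCY, ADD_LATENCY)
print(slow, fast, round(slow / fast, 2))  # 141 77 1.83
```

The speedup approaches 2 as the vector length grows relative to the FU latencies.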

Chaining Example

• Assume: 8-element vectors, single-cycle ops
    mul.ds $v2,$v3,$s1
    add.d $v1,$v2,$v1
• Without chaining:
    m m m m m m m m a a a a a a a a
• With chaining:
    m m m m m m m m
      a a a a a a a a

Vector Startup Times

• For vector ops to be worth their cost, startup overhead must be small
• CRAY-1 can issue a vector insn every cycle, assuming no structural hazards on FUs
– Result: vector performance > scalar performance for as few as four elements/vector

Cray Fortran Compiler (CFT)

• Important insight: hand-coding assembly sucks
• The actual important insight: most vectorizable code is of the embarrassingly-parallel variety
– Even with 1970s compiler technology, innermost-loop parallelism is low-hanging fruit
– Exploit this: make the compiler do the heavy lifting
• CFT is pretty good for branchless inner loops
• ...but doesn't even attempt to vectorize code with IFs
– So any use of the Vector Mask register must be hand-coded

• Upshot: a good start, but not quite there

Analysis

• Extremely fast computer for 1976
• Thought experiment: what if CRAY-1's parameters scaled with Moore's Law? (32 years == 21 doublings)
– 200,000 transistors => 400 billion transistors
– 8MB main memory => 16TB main memory
– 80 MHz clock => ~170 THz (if only)
• For a (merely) 2nd-generation vector processor, the CRAY-1 was ahead of its time (I think)
– I'm not the only one: it was commercially phenomenal
• However, design techniques (discrete logic) are totally unscalable
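
The Moore's Law thought experiment is easy to check. A quick sketch that applies 21 doublings to the three starting figures from the slide; the variable names are my own.

```python
DOUBLINGS = 21             # 32 years at one doubling every 18 months
factor = 2 ** DOUBLINGS    # 2,097,152

transistors = 200_000 * factor       # slide's starting transistor count
memory_bytes = 8 * 2**20 * factor    # 8 MB of main memory
clock_hz = 80e6 * factor             # 80 MHz clock

print(f"{transistors / 1e9:.0f} billion transistors")   # 419 billion transistors
print(f"{memory_bytes / 2**40:.0f} TiB of main memory")  # 16 TiB of main memory
print(f"{clock_hz / 1e12:.0f} THz clock")                # 168 THz clock
```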

Questions?

