Research at the Computer Engineering Laboratory of Delft University of Technology


Ben Juurlink

Universidad Polytecnica de Catalunya

Outline
  General Information
    Group Location
    Group Formation
    Group Funding
  Group Interests
  Group Projects
    Molen
    Iliad
    MOVE
    Pamela
  PUB library
  Concluding Remarks



Group Location
Delft University of Technology: 7 faculties, 13,000 students, 2,100 researchers
Faculties:
  Aerospace Engineering
  Applied Sciences
  Architecture
  Civil Engineering and Geosciences
  Design, Engineering and Production
  Information Technology and Systems
  Technology, Policy and Management
Computer Science, Electrical Engineering, Mathematics
Telecommunication, Software Engineering, Microelectronics, Energy, Mediamatica, Mathematical Analysis; Control, Risk, Optimization, Stochastics, and Systems



Group Formation
  7 Faculty Members (18%)
  2 Post-docs (5%)
  4 Scientific/administrative staff (10%)
  15 PhD Students (37%)
  12 MSc Students (30%)



Group Funding 94-98 (in Kfl)
  E.E. Graduate Program: 1000
  Dutch "NSF": 1900
  TU Delft Special Projects: 640
  Ministry of Economic Affairs: 700
  Philips: 1300
  IBM USA: 200
  Lucent USA: 300
Total financing: 6000 Kfl



Group Output (‘94-’98)
Degrees:
  PhD Theses: 9
  Eng. degrees: 5
  MSc: 87
Publications:
  Books/Chapters: 7
  Journal articles: 47
  Conference papers: 165
  Patents: 50
Five start-ups



Computer Engineering
Computer Engineering: Analysis of data processing requirements for electronic data processing units and systems, and the design (synthesis) of their architecture, implementation, and realization
  Architecture: Determine the function to perform
  Implementation: Establish a method to achieve the function
  Realization: Use available means to materialize the method



Computer Engineering Interests



Group Projects
  MOLEN: Embedded system architecture, multimedia, Java.
  MOVE: Embedded system synthesis, compilers, hardware software co-design.
  PAMELA: Performance analysis and languages.
  ILIAD: Computer architecture, implementation, computer arithmetic, switches.



MOLEN: Embedded System Design
Topics:
  Embedded Processor Architectures
  Multimedia
  Java
  Embedded System Tools
  Embedded Agents
Current Contributions:
  Java Processor
  Multimedia Instructions
  Specialized Units
  FPGA Units
Future Directions:
  Reconfigurable embedded processors



Molen Multimedia Instruction and Functional Unit
Published at EUROMICRO’98
Motion estimation, sum of absolute differences:

    s = 0;
    for (j=0; j<h; j++){
      if ((v = p1[0]-p2[0])<0) v = -v; s += v;
      if ((v = p1[1]-p2[1])<0) v = -v; s += v;
      ...
      if ((v = p1[15]-p2[15])<0) v = -v; s += v;
      if (s >= distlim) break;
      p1 += lx;
      p2 += lx;
    }

Formula:

$$\mathrm{SAD}(x,y,r,s) = \sum_{i=0}^{15} \sum_{j=0}^{15} \left| A_{(x+i,\,y+j)} - B_{((x+r)+i,\,(y+s)+j)} \right|$$



Efficient Implementation of the SAD Operation
Straightforward approach:
  Compute Ai-Bi for all pairs of pixels
  Take absolute values
  Accumulate absolute values
  Cost: 4 cycles
MOLEN solution:
  Observation: |Ai-Bi| = max{Ai,Bi}-min{Ai,Bi}
  Problem: determining and negating min{Ai,Bi} takes > 1 cycle
  Solution: pass min{Ai,Bi} to the accumulate stage and correct there
  Cost: 3 cycles
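
The observation |Ai-Bi| = max{Ai,Bi}-min{Ai,Bi} can be checked in software. The following is a minimal C sketch, not the MOLEN hardware unit: the function names `sad_abs` and `sad_maxmin` and the 8-bit pixel type are assumptions made for illustration. It only shows that the straightforward form and the max/min form give the same sum.

```c
#include <assert.h>
#include <stddef.h>

/* Straightforward form: subtract, take the absolute value, accumulate.
   (Hypothetical helper; pixels are assumed to be 8-bit unsigned.) */
unsigned sad_abs(const unsigned char *a, const unsigned char *b, size_t n) {
    unsigned s = 0;
    assert(a != NULL && b != NULL);
    for (size_t i = 0; i < n; i++) {
        int v = (int)a[i] - (int)b[i];
        s += (unsigned)(v < 0 ? -v : v);
    }
    return s;
}

/* Identity from the slide: |a - b| = max{a, b} - min{a, b}.
   No sign test on a difference is needed, only a comparison. */
unsigned sad_maxmin(const unsigned char *a, const unsigned char *b, size_t n) {
    unsigned s = 0;
    assert(a != NULL && b != NULL);
    for (size_t i = 0; i < n; i++) {
        unsigned hi = a[i] > b[i] ? a[i] : b[i];
        unsigned lo = a[i] > b[i] ? b[i] : a[i];
        s += hi - lo;
    }
    return s;
}
```

Calling both functions on the same pair of pixel rows should return the same value; the cycle counts on the slide refer to the hardware pipeline and are not modelled here.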



MOVE
Semi-automatic generation of application specific processors
[Figure: design flow with the blocks Optimizer, Parametric compiler and Hardware generator; labels: user interaction, architecture parameters, feedback, parallel object code. Inset plot: solution space, cost versus exec. time.]



MOVE
Current Contributions:
  Transport triggered architecture
  Operational design framework (add any unit you like, no restrictions)
  Several cheap designs (data logger, video-enhancer, MPEG-decoder, wireless communications)
Future Directions:
  Tune your application to suit your processor
  System design
  Multiprocessor TTA
  Low-power processors



Transport Triggered Architecture
Published in e.g. Jnl. of Systems Architecture ‘99
Transport triggered architecture:
  Only one instruction: MOVE!
  FU operations are triggered by moving data to their input ports
Example:

    add r1,r2,r3
    sub r4,r2,r6
    st r4,r1

TTA code:

    r2->O1add.alu1; r3->O2add.alu1; r2->O1sub.alu2; r6->O2sub.alu2
    Radd.alu1->r1; Rsub.alu2->r4
    r1->O1st.ls; r4->O2st.ls

After bypassing:

    r2->O1add.alu1; r3->O2add.alu1; r2->O1sub.alu2; r6->O2sub.alu2
    Radd.alu1->r1; Rsub.alu2->r4; Radd.alu1->O1st.ls; Rsub.alu2->O2st.ls
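
The idea that a move to an input port triggers the operation can be imitated in a few lines of C. This is only a sketch under stated assumptions: the types `FuncUnit` and `Opcode`, the helper names, and the choice of O2 as the triggering port are hypothetical and are not taken from the MOVE framework. It replays the add and sub of the example as data transports.

```c
#include <assert.h>

/* Hypothetical two-input functional unit. Assumption for this sketch:
   O1 only latches an operand, O2 is the port that triggers the operation. */
typedef enum { OP_ADD, OP_SUB } Opcode;

typedef struct {
    int o1;      /* operand latched by a move to O1 */
    int result;  /* value readable from the result port R */
} FuncUnit;

/* A move to the operand port stores the value and does nothing else. */
void move_to_operand(FuncUnit *fu, int value) {
    assert(fu != 0);
    fu->o1 = value;
}

/* A move to the trigger port starts the operation as a side effect. */
void move_to_trigger(FuncUnit *fu, Opcode op, int value) {
    assert(fu != 0);
    fu->result = (op == OP_ADD) ? fu->o1 + value : fu->o1 - value;
}

/* Replays "add r1,r2,r3" and "sub r4,r2,r6" from the slide as moves. */
void run_example(int r[8]) {
    FuncUnit alu1 = {0, 0}, alu2 = {0, 0};
    assert(r != 0);
    move_to_operand(&alu1, r[2]);           /* r2 -> O1add.alu1 */
    move_to_trigger(&alu1, OP_ADD, r[3]);   /* r3 -> O2add.alu1 */
    move_to_operand(&alu2, r[2]);           /* r2 -> O1sub.alu2 */
    move_to_trigger(&alu2, OP_SUB, r[6]);   /* r6 -> O2sub.alu2 */
    r[1] = alu1.result;                     /* Radd.alu1 -> r1 */
    r[4] = alu2.result;                     /* Rsub.alu2 -> r4 */
}
```

Bypassing, as on the slide, would correspond to passing `alu1.result` and `alu2.result` straight to the ports of a load/store unit in addition to writing them back to `r[1]` and `r[4]`.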



PAMELA: Performance Analysis of Computer Systems
Current Contributions:
  Specialized Languages
  Simulation Tools & Methodology
  Parallel Algorithms
  Delft Architecture Workbench
Future Directions:
  Complete the Delft Architecture Workbench
[Figure: diagram with the labels Implementation, Simulation, Evaluation, Architecture, Analytic Evaluation.]



Static Branch Prediction
Data dependent branches:

    for (i=0; i<n-1; i++){
      minIndex = i;
      for (j=i+1; j<n; j++)
        if (a[j] < a[minIndex])   /* branch B */
          minIndex = j;
      swap(&a[i], &a[minIndex]);
    }

Oblivious static branch predictor: B will be taken 50% of the time
Bernoulli model with truth probability p (profiling): large variance of the prediction error
New model based on alternating renewal processes reduces the variance of the prediction error by an order of magnitude
Let D (U) = number of consecutive 0’s (1’s). Then

$$E[P_A] = \frac{E[U]}{E[D] + E[U]}$$

$$\mathrm{Var}[P_A] = \frac{E[D]^2\,\mathrm{Var}[U] + E[U]^2\,\mathrm{Var}[D]}{(E[D] + E[U])^2}$$

Example: 110011001100. Then E[P_A] = 0.5, Var[P_A] = 0
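
The quantities D and U can be measured directly on a recorded branch trace. The sketch below is an illustration only: the `RunStats` type, the `run_stats` function and the string encoding of the trace are assumptions, not part of the original tooling. It computes the mean and population variance of the run lengths and the ratio E[U]/(E[D]+E[U]).

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    double mean_u, var_u;   /* U: lengths of runs of 1's */
    double mean_d, var_d;   /* D: lengths of runs of 0's */
    double expected;        /* E[U] / (E[D] + E[U]) */
} RunStats;

/* trace is a string of '0'/'1' branch outcomes (hypothetical encoding);
   any other character is treated as '0'. The trace must contain at
   least one run of each kind. */
RunStats run_stats(const char *trace) {
    double sum[2] = {0.0, 0.0}, sumsq[2] = {0.0, 0.0};
    int count[2] = {0, 0};
    size_t i = 0;
    RunStats r;

    assert(trace != NULL);
    while (trace[i] != '\0') {
        int bit = (trace[i] == '1');
        double len = 0.0;
        /* consume one maximal run of identical outcomes */
        while (trace[i] != '\0' && (trace[i] == '1') == bit) {
            len += 1.0;
            i++;
        }
        sum[bit] += len;
        sumsq[bit] += len * len;
        count[bit]++;
    }
    assert(count[0] > 0 && count[1] > 0);

    r.mean_d = sum[0] / count[0];
    r.mean_u = sum[1] / count[1];
    r.var_d = sumsq[0] / count[0] - r.mean_d * r.mean_d;  /* population variance */
    r.var_u = sumsq[1] / count[1] - r.mean_u * r.mean_u;
    r.expected = r.mean_u / (r.mean_d + r.mean_u);
    return r;
}
```

For the trace 110011001100 every run has length 2, so both run-length variances vanish and the ratio is 0.5, which agrees with the example on the slide.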



Iliad: High Performance General Purpose Computers
Topics:
  Uni & Multiprocessors
  Internet Processing
  Computer Design
  High Speed Switches
Current Contributions:
  Instruction level parallel machines (Superscalar, SCISM)
  New “Complex” Instructions
  New Designs of Arithmetic Processing
  New Switch Design
Future Directions:
  New Architectural paradigm



Complex Streamed Instructions
See PACT’01, EuroPar’01
Drawbacks of MMX-like extensions:
  Multimedia (MM) register size is architecturally visible and fixed. Ways out:
    add MM FUs and increase issue width
      • expensive
    increase MM register size
      • existing codes have to be recompiled/rewritten
      • not beneficial due to small sub-matrices
  overhead for converting between packed data types and alignment
Proposed solution: Complex Streamed Instructions (CSI)
  two-dimensional vector (stream) architecture, streams of arbitrary length
  stream is specified by a set of stream control registers
  conversion between data types in hardware
  no loop control and address generation overhead



The Need for a Parallel Computation Model
Parallel computing has not been very successful
One reason: lack of a standard parallel computation model
Properties that a suitable parallel computation model should possess:
  Scalability
  Portability
  Predictability
Model proposed by Valiant (1990): Bulk-Synchronous Parallel (BSP) model



BSP Model
BSP architectural model:
  set of p processors communicating by sending point-to-point messages
BSP programming model:
  computations proceed in phases (supersteps), separated by barrier synchronizations
BSP cost model:
  superstep takes time w + g · h + L, where
    w: max. work
    h: max. messages (h-relation)
    g: bandwidth reciprocal
    L: latency/synchronization cost

[Figure: processors (P) and memories (M) attached to a communication network; supersteps delimited by barrier syncs.]
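
The cost model w + g · h + L can be written out directly. The C sketch below is illustrative: the struct and function names are assumptions, and the time unit is whatever unit w, g and L are measured in. A program's predicted cost is taken as the sum over its supersteps.

```c
#include <assert.h>

/* Machine parameters of the BSP cost model (names are hypothetical). */
typedef struct {
    double g;  /* bandwidth reciprocal: time per message in an h-relation */
    double L;  /* latency/synchronization cost per superstep */
} BspMachine;

/* Per-superstep program parameters. */
typedef struct {
    double w;  /* max. work over all processors */
    double h;  /* max. messages sent or received by any processor */
} Superstep;

/* Cost of one superstep: w + g * h + L. */
double superstep_cost(BspMachine m, Superstep s) {
    return s.w + m.g * s.h + m.L;
}

/* Predicted cost of a program: sum of its superstep costs. */
double program_cost(BspMachine m, const Superstep *steps, int n) {
    double total = 0.0;
    assert(n == 0 || steps != 0);
    for (int i = 0; i < n; i++)
        total += superstep_cost(m, steps[i]);
    return total;
}
```

With g = 4 and L = 100, a superstep with w = 1000 and h = 10 costs 1000 + 40 + 100 = 1140 time units.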



PUB Library
Paderborn University BSP (PUB) library (IPDPS’99)
Basics:
  SPMD
  no receive operation; barrier synchronization signifies end of all communication operations
  only non-blocking communication primitives
  buffered and unbuffered communication
  message is placed in buffer associated with destination processor, from which it can be retrieved after the next barrier sync
Additional features:
  (non-blocking) collective communication primitives
  ability to partition the processors
  running different BSP computations on the same system (in different threads)



PUB Example: Parallel Binary Multisearch
[Figure: search butterfly across Proc 0 to Proc 3 with local search trees. Keys: 11 13 14 17 18 19 21 23. Node rows as extracted: 11 14 18 21; 19 19 13 13; 17 17 17 17.]



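Parallel Binary Multisearch Using PUB

    /* bsp, query, gkey, key, n, me and msg are assumed to be declared elsewhere */
    void bin_search(int d, int m){
      int i, new_m;
      /* keep the queries that belong to this half, send the others across */
      for (i=new_m=0; i<m; i++)
        if ((query[i]<=gkey[d] && inRight(d,me)) ||
            (query[i]>gkey[d] && inLeft(d,me)))
          bsp_send(&bsp,Opposite(d,me),&query[i],sizeof(int));
        else
          query[new_m++] = query[i];
      bsp_sync(&bsp);

      /* after the barrier, append the queries received from the partner */
      for (i=0; i<bsp_nmsgs(&bsp); i++){
        msg = bsp_getmsg(&bsp,i);
        query[new_m++] = *(int *)bspmsg_data(msg);
      }
      if (d==0) local_search(new_m,query,n,key);
      else bin_search(d-1,new_m);
    }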



Concluding Remarks
Not discussed:
  testing
  ISA extensions for sparse matrix computations
  computer arithmetic using single-electron technology
  reconfigurable processors
  network processors
  low power
  ...
For further information, please contact me ([email protected]) or see
  ce.et.tudelft.nl
  ce.et.tudelft.nl/~benj
  www.upb.de/~pub

Thank You
