Research at the Computer Engineering Laboratory of Delft University of Technology
Ben Juurlink
Universidad Polytecnica de
Catalunya

Outline
  General Information
    Group Location
    Group Formation
    Group Funding
    Group Interests
  Group Projects
    Molen
    Iliad
    MOVE
    Pamela
  PUB library
  Concluding Remarks

Group Location
Delft University of Technology: 7 faculties, 13,000 students, 2,100 researchers

Faculties:
  Aerospace Engineering
  Applied Sciences
  Architecture
  Civil Engineering and Geosciences
  Design, Engineering and Production
  Information Technology and Systems
  Technology, Policy and Management

Information Technology and Systems:
  Computer Science, Electrical Engineering, Mathematics
  Telecommunication, Software Engineering, Microelectronics, Energy, Mediamatica, Mathematical Analysis
  Control, Risk, Optimization, Stochastics, and Systems

Group Formation
  7 Faculty Members (18%)
  2 Post-docs (5%)
  4 Scientific/administrative staff (10%)
  15 PhD Students (37%)
  12 MSc Students (30%)

Group Funding 94-98 (in Kfl)
  E.E. Graduate Program: 1000
  Dutch "NSF": 1900
  TU Delft Special Projects: 640
  Ministry of Economic Affairs: 700
  Philips: 1300
  IBM USA: 200
  Lucent USA: 300
Total financing: 6000 Kfl

Group Output (‘94-’98)
Degrees:
  PhD Theses: 9
  Eng. degrees: 5
  MSc: 87
Publications:
  Books/Chapters: 7
  Journal articles: 47
  Conference papers: 165
Patents: 50
Five start-ups

Computer Engineering
Computer Engineering: Analysis of data processing requirements for electronic data processing units and systems, and the design (synthesis) of their architecture, implementation, and realization
  Architecture: Determine the function to perform
  Implementation: Establish a method to achieve the function
  Realization: Use available means to materialize the method

Group Projects
  MOLEN: Embedded system architecture, multimedia, Java.
  MOVE: Embedded system synthesis, compilers, hardware/software co-design.
  PAMELA: Performance analysis and languages.
  ILIAD: Computer architecture, implementation, computer arithmetic, switches.

MOLEN: Embedded System Design
Topics:
  Embedded Processor Architectures
  Multimedia
  Java
  Embedded System Tools
  Embedded Agents
Current Contributions:
  Java Processor
  Multimedia Instructions
  Specialized Units
  FPGA Units
Future Directions:
  Reconfigurable embedded processors

Molen Multimedia Instruction and Functional Unit
Published at EUROMICRO’98
Motion estimation, sum of absolute differences:
s = 0;
for (j=0; j<h; j++){
if ((v = p1[0]-p2[0])<0) v = -v; s += v;
if ((v = p1[1]-p2[1])<0) v = -v; s += v;
...
if ((v = p1[15]-p2[15])<0) v = -v; s += v;
if (s >= distlim) break;
p1 += lx;
p2 += lx;
}
Formula:

$$\mathrm{SAD}(x,y,r,s) = \sum_{i=0}^{15} \sum_{j=0}^{15} \left| A_{(x+i,\,y+j)} - B_{((x+r)+i,\,(y+s)+j)} \right|$$

Efficient Implementation of the SAD Operation
Straightforward approach:
  Compute Ai-Bi for all pairs of pixels
  Take absolute values
  Accumulate absolute values
  Cost: 4 cycles
MOLEN solution:
  Observation: |Ai-Bi| = max{Ai,Bi} - min{Ai,Bi}
  Problem: determining and negating min(Ai,Bi) takes > 1 cycle
  Solution: pass min(Ai,Bi) to accumulate stage and correct
  Cost: 3 cycles

MOVE
Semi-automatic generation of application specific processors
[Figure: solution space shown as a scatter plot of design points (axes: cost, exec. time), and a tool-flow diagram with the blocks Optimizer, Parametric compiler and Hardware generator, labelled with Parallel object code, feedback, User interaction and Architecture parameters.]

MOVE
Current Contributions:
  Transport triggered architecture
  Operational design framework (add any unit you like, no restrictions)
  Several cheap designs (data logger, video-enhancer, MPEG-decoder, wireless communications)
Future Directions:
  Tune your application to suit your processor
  System design
  Multiprocessor TTA
  Low-power processors

Transport Triggered Architecture
Published in e.g. Jnl. of Systems Architecture ‘99
Transport triggered architecture:
  Only one instruction: MOVE!
  FU operations are triggered by moving data to their input ports
Example:
  add r1,r2,r3
  sub r4,r2,r6
  st r4,r1
TTA code:
  r2->O1add.alu1; r3->O2add.alu1; r2->O1sub.alu2; r6->O2sub.alu2
  Radd.alu1->r1; Rsub.alu2->r4
  r1->O1st.ls; r4->O2st.ls
After bypassing:
  r2->O1add.alu1; r3->O2add.alu1; r2->O1sub.alu2; r6->O2sub.alu2
  Radd.alu1->r1; Rsub.alu2->r4; Radd.alu1->O1st.ls; Rsub.alu2->O2st.ls

PAMELA: Performance Analysis of Computer Systems
Current Contributions:
  Specialized Languages
  Simulation Tools & Methodology
  Parallel Algorithms
  Delft Architecture Workbench
Future Directions:
  Complete the Delft Architecture Workbench
[Figure: diagram with the labels Architecture, Implementation, Simulation, Analytic Evaluation and Evaluation.]

Static Branch Prediction
Data dependent branches:
for (i=0; i<n-1; i++){
  minIndex = i;
  for (j=i+1; j<n; j++)
    if (a[j] < a[minIndex])   /* branch B */
      minIndex = j;
  swap(&a[i], &a[minIndex]);
}
Oblivious static branch predictor: B will be taken 50% of the time
Bernoulli model with truth probability p (profiling): large variance of the prediction error
New model based on alternating renewal processes reduces the variance of the prediction error by an order of magnitude
Let D (U) = number of consecutive 0’s (1’s). Then

$$E[P_A] = \frac{E[U]}{E[D] + E[U]}$$

$$\mathrm{Var}[P_A] = \frac{E[D]^2\,\mathrm{Var}[U] + E[U]^2\,\mathrm{Var}[D]}{(E[D] + E[U])^2}$$

Example: 110011001100. Then E[P_A] = 0.5, Var[P_A] = 0

Iliad: High Performance General Purpose Computers
Topics:
  Uni & Multiprocessors
  Internet Processing
  Computer Design
  High Speed Switches
Current Contributions:
  Instruction level parallel machines (Superscalar, SCISM)
  New “Complex” Instructions
  New Designs of Arithmetic Processing
  New Switch Design
Future Directions:
  New Architectural paradigm

Complex Streamed Instructions
See PACT’01, EuroPar’01
Drawbacks of MMX-like extensions:
  Multimedia (MM) register size architecturally visible and fixed. Ways out:
    add MM FUs and increase issue width
      • expensive
    increase MM register size
      • existing codes have to be recompiled/rewritten
      • not beneficial due to small sub-matrices
  overhead for converting between packed data types and alignment
Proposed solution: Complex Streamed Instructions (CSI)
  two-dimensional vector (stream) architecture, streams of arbitrary length
  stream is specified by set of stream control registers
  conversion between data types in hardware
  no loop control and address generation overhead

The Need for a Parallel Computation Model
Parallel computing has not been very successful
One reason: lack of a standard parallel computation model
Properties that a suitable parallel computation model should possess:
  Scalability
  Portability
  Predictability
Model proposed by Valiant (1990): Bulk-Synchronous Parallel (BSP) model

BSP Model
BSP architectural model:
  set of p processors communicating by sending point-to-point messages
BSP programming model:
  computations proceed in phases (supersteps), separated by barrier synchronizations
BSP cost model:
  superstep takes time
w + g · h + L
where
w: max. work
h: max. messages (h-relation)
g: bandwidth reciprocal
L: latency/synchronization cost
[Figure: processor/memory (P/M) pairs attached to a communication network; supersteps delimited by barrier syncs.]

PUB Library
Paderborn University BSP (PUB) library (IPDPS’99)
Basics:
  SPMD
  no receive operation; barrier synchronization signifies end of all communication operations
  only non-blocking communication primitives
  buffered and unbuffered communication
  message is placed in buffer associated with destination processor, from which it can be retrieved after the next barrier sync
Additional features:
  (non-blocking) collective communication primitives
  ability to partition the processors
  running different BSP computations on the same system (in different threads)
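
PUB Example: Parallel Binary Multisearch
[Figure: search butterfly across Proc 0 to Proc 3, followed by a local search tree on each processor. Rows of values in the figure: 11 13 14 17 18 19 21 23; 11 14 18 21; 19 19 13 13; 17 17 17 17. Further values shown: 11169, 14.]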
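
Parallel Binary Multisearch Using PUB
/* query, gkey, key, n, me, bsp and msg are declared elsewhere (not shown on the slide) */
void bin_search(int d, int m)
{
  int i, new_m;

  /* Forward each query that belongs to the other half; keep the rest. */
  for (i = new_m = 0; i < m; i++)
    if ((query[i] <= gkey[d] && inRight(d, me)) ||
        (query[i] > gkey[d] && inLeft(d, me)))
      bsp_send(&bsp, Opposite(d, me), &query[i], sizeof(int));
    else
      query[new_m++] = query[i];
  bsp_sync(&bsp);

  /* Append the queries received during this superstep. */
  for (i = 0; i < bsp_nmsgs(&bsp); i++) {
    msg = bsp_getmsg(&bsp, i);
    query[new_m++] = *(int *)bspmsg_data(msg);  /* cast the pointer, then dereference */
  }
  if (d == 0)
    local_search(new_m, query, n, key);
  else
    bin_search(d - 1, new_m);
}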

Concluding Remarks
Not discussed:
  testing
  ISA extensions for sparse matrix computations
  computer arithmetic using single-electron technology
  reconfigurable processors
  network processors
  low power
  ...
For further information, please contact me ([email protected]) or see
  ce.et.tudelft.nl
  ce.et.tudelft.nl/~benj
  www.upb.de/~pub
Thank You