An Implementation and Performance Evaluation of Language with Fine-Grain Thread Creation on Shared Memory Parallel Computer Yoshihiro Oyama, Kenjiro Taura, Toshio Endo, Akinori Yonezawa Department of Information Science, Faculty of Science, University of Tokyo

Page 1

An Implementation and Performance Evaluation of Language with Fine-Grain Thread Creation on Shared Memory Parallel Computer

Yoshihiro Oyama, Kenjiro Taura, Toshio Endo, Akinori Yonezawa

Department of Information Science, Faculty of Science, University of Tokyo

Page 2

Background

“Irregular” parallel applications
• Tasks are not identified until runtime
• Synchronization structure is complicated

Languages with fine-grain threads
• A promising approach to handling the complexity

Page 3

Motivation

Q: Are fine-grain threads really effective?
• Easy to describe irregular parallelism?
• Scalable?
• Fast?

Many sophisticated designs and implementation techniques have been proposed so far, but case studies that answer this question are few.

Page 4

Goal

Case study to better understand the effectiveness of fine-grain threads

C + Solaris threads (approach w/o fine-grain threads)

VS.

our language Schematic (approach with fine-grain threads)

in terms of
• program description cost
• speed on 1 PE
• scalability on a 64-PE SMP

Page 5

Overview

Applications (RNA & CKY)

Solutions without fine-grain threads

Solutions with fine-grain threads

Performance evaluation

Page 6

Case Study 1: RNA - protein secondary structure prediction -

Algorithm: simple node traversal + pruning

Finding a path
• satisfying a certain condition
• with the largest weight

Unbalanced tree

Page 7

Case Study 2: CKY - context-free grammar parser -

Example sentence: She is a girl whose mother is a teacher.

Calculation of matrix elements
• depends on all s
• calculation time varies significantly from element to element

Actual size ≒ 100

Page 9

Solution without Fine-grain Threads (RNA)

Creating a thread for each node: large overhead

Task Pool

[Figure: a task pool and processors (P P P); communication with memory]
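A minimal sketch of the task-pool approach named on this slide, assuming POSIX threads in place of Solaris threads. The type and function names (`task_t`, `task_pool`, `pool_push`, `pool_pop`) and the fixed capacity are hypothetical; the slides do not show the authors' implementation.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical task: a tree node that still has to be searched. */
typedef struct {
    int node_id;
} task_t;

#define POOL_CAPACITY 1024  /* illustrative fixed size */

/* A shared LIFO task pool protected by one lock. Every push and pop
 * goes through shared memory. */
typedef struct {
    pthread_mutex_t lock;
    task_t items[POOL_CAPACITY];
    size_t count;
} task_pool;

void pool_init(task_pool *p) {
    assert(p != NULL);
    pthread_mutex_init(&p->lock, NULL);
    p->count = 0;
}

/* Returns false if the pool is full; the caller can then process the
 * task itself instead of sharing it. */
bool pool_push(task_pool *p, task_t t) {
    bool ok = false;
    pthread_mutex_lock(&p->lock);
    if (p->count < POOL_CAPACITY) {
        p->items[p->count++] = t;
        ok = true;
    }
    pthread_mutex_unlock(&p->lock);
    return ok;
}

/* Returns false if the pool is empty. */
bool pool_pop(task_pool *p, task_t *out) {
    bool ok = false;
    pthread_mutex_lock(&p->lock);
    if (p->count > 0) {
        *out = p->items[--p->count];
        ok = true;
    }
    pthread_mutex_unlock(&p->lock);
    return ok;
}
```

In such a design each worker thread would loop on `pool_pop`, search the node, and `pool_push` its children. The single lock keeps the sketch short, but it would likely become a point of contention as the number of processors grows.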

Page 10

Solution without Fine-grain Threads (CKY)

Calculating 1 element → 0 ~ 200 synchronizations

[Figure: processors (P P P)]

How to implement?
• small delay → simple spin
• large delay → block wait

Decision strategy?
• trial & error
• prediction
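One common way to combine the two waiting styles listed above is to spin for a bounded number of iterations and then fall back to a blocking wait. The sketch below illustrates that idea only; it is not the strategy chosen in the study. It assumes C11 atomics and POSIX threads, and `cell_t`, `cell_put`, `cell_get`, and `SPIN_LIMIT` are hypothetical names.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define SPIN_LIMIT 1000  /* illustrative threshold; would need tuning */

/* Hypothetical single-assignment cell holding one matrix element. */
typedef struct {
    atomic_bool ready;
    int value;
    pthread_mutex_t lock;
    pthread_cond_t filled;
} cell_t;

void cell_init(cell_t *c) {
    assert(c != NULL);
    atomic_init(&c->ready, false);
    c->value = 0;
    pthread_mutex_init(&c->lock, NULL);
    pthread_cond_init(&c->filled, NULL);
}

/* Producer: store the value, then wake any blocked readers. */
void cell_put(cell_t *c, int v) {
    pthread_mutex_lock(&c->lock);
    c->value = v;
    atomic_store(&c->ready, true);
    pthread_cond_broadcast(&c->filled);
    pthread_mutex_unlock(&c->lock);
}

/* Consumer: spin briefly in case the delay is small, then block in
 * case it turns out to be large. */
int cell_get(cell_t *c) {
    for (int i = 0; i < SPIN_LIMIT; i++) {
        if (atomic_load(&c->ready)) {
            return c->value;
        }
    }
    pthread_mutex_lock(&c->lock);
    while (!atomic_load(&c->ready)) {
        pthread_cond_wait(&c->filled, &c->lock);
    }
    int v = c->value;
    pthread_mutex_unlock(&c->lock);
    return v;
}
```

Choosing `SPIN_LIMIT` is exactly the decision problem the slide raises: the right value depends on how long the producer takes, which varies from element to element.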

Page 12

Language with Fine-grain Threads

Schematic [Taura et al. 96] = Scheme + future + touch [Halstead 85]

(define (fib x)
  (if (< x 2)
      1
      (let ((r1 (future (fib (- x 1))))
            (r2 (future (fib (- x 2)))))
        (+ (touch r1) (touch r2)))))

Annotations on the code: thread creation (future), synchronization (touch), channel
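For comparison, here is a rough C sketch of what `future` and `touch` correspond to when written by hand with coarse threads. It assumes POSIX threads rather than Solaris threads, and `future_t`, `future_fib`, `touch`, and the `depth` cutoff are hypothetical additions of this sketch, not part of Schematic or of the authors' C code.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical future: the thread computing the value plus its slot. */
typedef struct {
    pthread_t thread;
    int arg;
    int depth;
    long result;
} future_t;

long fib(int x, int depth);

static void *fib_entry(void *p) {
    future_t *f = p;
    f->result = fib(f->arg, f->depth);
    return NULL;
}

/* "future": start computing fib(x) in a new thread. */
static void future_fib(future_t *f, int x, int depth) {
    f->arg = x;
    f->depth = depth;
    int rc = pthread_create(&f->thread, NULL, fib_entry, f);
    assert(rc == 0);
    (void)rc;
}

/* "touch": wait for the thread, then read its result. */
static long touch(future_t *f) {
    pthread_join(f->thread, NULL);
    return f->result;
}

/* Same recurrence as the slide (fib(0) = fib(1) = 1). Threads are
 * created only for the first `depth` levels; below that the code
 * recurses sequentially, because one OS-level thread per call would
 * be far too costly. */
long fib(int x, int depth) {
    if (x < 2) return 1;
    if (depth <= 0) return fib(x - 1, 0) + fib(x - 2, 0);
    future_t r1, r2;
    future_fib(&r1, x - 1, depth - 1);
    future_fib(&r2, x - 2, depth - 1);
    return touch(&r1) + touch(&r2);
}
```

Calling `fib(10, 2)` creates threads for the top two levels of the recursion only. The manual cutoff is the kind of task-granularity bookkeeping that the Schematic version leaves to the language implementation.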

Page 13

Thread Management in Schematic
• Lazy Task Creation [Mohr et al. 91]

[Figure: stacks of futures on PE A and PE B]

Page 14

Synchronization on Register
• StackThreads [Taura 97]

[Figure: PE A and PE B; boxes labeled register and memory]

Page 15

Synchronization by Code Duplication

Source: work A, (touch r), work B

Generated code:

work A
if (r has value) {
    work B ver. 1;
} else {
    c = closure(cont, fv1, ...);
    put_closure(r, c);
    /* switch to another work */
    ...
}

cont(c, v) {
    work B ver. 2;
}

+ heuristics to decide which to duplicate

Labels in the figure: simple spin, block wait
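The pseudocode on this slide can be made concrete with a small single-threaded C sketch. Everything here is an illustrative assumption: `channel_t`, `closure_t`, `touch_then_work_b`, and `channel_put` are hypothetical names, "work B" is reduced to one addition, and the locking a real multiprocessor runtime would need is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical closure: a continuation plus one saved free variable. */
typedef struct closure {
    void (*cont)(struct closure *c, int v);
    int fv1;
    int *out;  /* where the continuation writes its result */
} closure_t;

/* Hypothetical channel: holds either a value or a waiting closure. */
typedef struct {
    bool has_value;
    int value;
    closure_t *waiter;
} channel_t;

/* "work B ver. 2": runs later, with state restored from the closure. */
static void work_b_cont(closure_t *c, int v) {
    *c->out = c->fv1 + v;
}

/* Code that follows "work A": touch r, then do work B. */
void touch_then_work_b(channel_t *r, closure_t *c, int fv1, int *out) {
    if (r->has_value) {
        *out = fv1 + r->value;   /* "work B ver. 1": fast path */
    } else {
        c->cont = work_b_cont;   /* c = closure(cont, fv1, ...) */
        c->fv1 = fv1;
        c->out = out;
        r->waiter = c;           /* put_closure(r, c) */
        /* the caller would now switch to other work */
    }
}

/* Producer side: store the value and resume a waiting closure, if any. */
void channel_put(channel_t *r, int v) {
    assert(!r->has_value);       /* single assignment */
    r->value = v;
    r->has_value = true;
    if (r->waiter != NULL) {
        closure_t *c = r->waiter;
        r->waiter = NULL;
        c->cont(c, v);
    }
}
```

The two versions of work B compute the same thing. The first keeps its state in local variables because the value was already there; the second reads its state back from the closure because it runs after a suspension.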

Page 17

What description can be omitted in Schematic?

Schematic ⇔ C + thread

Management of fine-grain tasks
• future ⇔ manipulation of task pool + load balance

Synchronization details
• touch ⇔ manipulation of comm. medium + aggressive optimizations

Page 18

Code for Parallel Execution (RNA)

C

int search_node(...)
{
    if (condition) {
    } else {
        child = ...;
        ...
        search_node(...);
        ...
        ...
        ...
    }
}

Schematic

(define (search_node)
  (if condition
      'done
      (let ((child ..))
        ...
        ...
        (search_node)
        ...
        ...
        ...)))

Program size (RNA):

                                C                   Schematic
whole                           1566 lines          453 lines
parallel (for parallel exec.)   537 lines (34 %)    29 lines (6.4 %)

Page 19

Performance Evaluation (Condition)

Sun Ultra Enterprise 10000 (UltraSparc 250 MHz × 64)
Solaris 2.5.1
Solaris thread (user-level thread)

GC time not included
Runtime type check omitted

Page 20

Performance Evaluation (Sequential)

[Figure: bar chart of normalized elapsed time (axis 0 to 3) for RNA and CKY, comparing C and Schematic]

Page 21

Performance Evaluation (Parallel)

[Figure: speedup (axis 0 to 50) versus # of PEs (axis 0 to 60) for C (RNA), Schematic (RNA), C (CKY), and Schematic (CKY)]

Page 22

Related Work

ICC++ [Chien et al. 97]
• Similar study using 7 apps
• Experiments on distributed memory machines
• Focus on
  • namespace management
  • data locality
  • object-consistency model

Page 23

Conclusion

We demonstrated the usefulness of fine-grain multithread languages
• Task pool-like execution with simple description
• Aggressive optimizations for synchronization

We showed the experimental results
• A factor of 2.8 slower than C
• Scalability comparable to C

Page 24

Performance Evaluation (Other Applications 1/2)

[Figure: bar chart of normalized elapsed time (axis 0 to 4) for Fib, Tak, Qsort, Knapsack, Grobner, and SPLASH2, comparing C and Schematic; a value of 14.7 is labeled on the chart]

Page 25

Performance Evaluation (Other Applications 2/2)

[Figure: speedup (axis 0 to 50) versus # of PEs (axis 0 to 60) for Fib, Tak, Nqueen, Qsort, Knapsack, Puzzle, QAP, and SPLASH2]

Page 26

Identifying Overheads

[Figure: bar chart of normalized elapsed time (axis 0 to 1000) for the configurations normal, no poll, no GC check, stolen tag opt., flag check, use small tag, global var opt., and C]