End-to-end Data-flow Parallelism for Throughput Optimization in
High-speed Networks
Esma Yildirim
Data Intensive Distributed Computing Laboratory
University at Buffalo (SUNY)
Condor Week 2011
Motivation
Ø Data grows larger, hence the need for speed to transfer it
Ø Technology develops with the introduction of high-speed networks and complex computer architectures, which are not yet fully utilized
Ø Still, many questions remain open:
"I cannot get the speed I am supposed to get from the network."
"I have a 10G high-speed network and supercomputers connecting. Why do I still get under 1G throughput?"
"I can't wait for a new protocol to replace the current ones; why can't I get high throughput with what I have at hand?"
"OK, maybe I am asking too much, but I want optimal settings to achieve maximal throughput."
"I want to get high throughput without congesting the traffic too much. How can I do it at the application level?"
2
Introduction
Ø Users of data-intensive applications need intelligent services and schedulers that will provide models and strategies to optimize their data transfer jobs
Ø Goals:
Ø Maximize throughput
Ø Minimize model overhead
Ø Do not cause contention among users
Ø Use minimum number of end-system resources
3
Introduction
Ø Current optical technology supports 100G transport; hence, utilizing the network poses a challenge to the middleware to provide faster data transfer speeds
Ø Achieving multiple-Gbps throughput has become a burden over TCP-based networks
Ø Parallel streams can solve the network-utilization inefficiency of TCP
Ø Finding the optimal number of streams is a challenging task
Ø With faster networks, end-systems have become the major bottleneck
Ø CPU, NIC and disk bottlenecks
Ø We provide models to decide on the optimal level of parallelism and the number of CPU/disk stripes
4
Outline
Ø Stork Overview
Ø End-system Bottlenecks
Ø End-to-end Data-flow Parallelism
Ø Optimization Algorithm
Ø Conclusions and Future Work
5
Stork Data Scheduler
Ø Implements state-of-the-art models and algorithms for data scheduling and optimization
Ø Started as part of the Condor project, as the PhD thesis of Dr. Tevfik Kosar
Ø Currently developed at the University at Buffalo and funded by NSF
Ø Heavily uses some Condor libraries such as ClassAds and DaemonCore
6
Stork Data Scheduler (cont.)
Ø Stork v.2.0 is available with enhanced features
Ø http://www.storkproject.org
Ø Supports more than 20 platforms (mostly Linux flavors)
Ø Windows and Azure Cloud support planned soon
Ø The most recent enhancement:
Ø Throughput Estimation and Optimization Service
7
End-to-end Data Transfer
Ø Method to improve the end-to-end data transfer throughput
Ø Application-level Data Flow Parallelism
Ø Network-level parallelism (parallel streams)
Ø Disk/CPU-level parallelism (stripes)
8
Network Bottleneck
Step 1: Effect of Parallel Streams on Disk-to-disk Transfers
Ø Parallel streams can improve the data throughput, but only to a certain extent
Ø Disk speed presents a major limitation
Ø Parallel streams may have an adverse effect if the disk-speed upper limit is already reached
[Figure: disk-to-disk throughput (Mbps) vs. number of streams for a) LONI-GridFTP-disk, b) Teragrid-GridFTP-disk, c) Inter-node-GridFTP-disk]
9
Disk Bottleneck
Step 2: Effect of Parallel Streams on Memory-to-memory Transfers and CPU Utilization
Ø Once the disk bottleneck is eliminated, parallel streams improve the throughput dramatically
Ø Throughput either becomes stable or falls after reaching its peak, due to network or end-system limitations (e.g., the 10G network interface card limit could not be reached: 7.5 Gbps inter-node)
[Figure: memory-to-memory throughput (Mbps) vs. number of streams for a) LONI-GridFTP, b) Teragrid-GridFTP, c) Inter-node-GridFTP]
10
CPU Bottleneck
Step 3: Effect of Striping and Removal of the CPU Bottleneck
Ø Striped transfers improve the throughput dramatically
Ø The network card limit is reached for inter-node transfers (9 Gbps)
[Figure: throughput (Mbps) vs. number of streams with 1, 2 and 4 stripes for a) LONI-GridFTP, b) Teragrid-GridFTP, c) Inter-node-GridFTP]
11
Prediction of Optimal Parallel Stream Number
Ø Throughput formulation: Newton's Iteration Model (see the formula and the sketch below)
Ø a', b' and c' are three unknowns to be solved; hence 3 throughput measurements at different parallelism levels (n) are needed
Ø Sampling strategy:
Ø Exponentially increasing parallelism levels
Ø Choose points not close to each other
Ø Select points that are powers of 2: 1, 2, 4, 8, ..., 2^k
Ø Stop when the throughput starts to decrease, or increases very slowly compared to the previous level
Ø Selection of 3 data points:
Ø From the available sampling points
Ø For every 3-point combination, calculate the predicted throughput curve
Ø Find the distance between the actual and predicted throughput curves
Ø Choose the combination with the minimum distance
[Figure: LAN-WAN throughput (Mbps) vs. number of parallel streams for the Newton's Method Model and the Full Second Order Model, comparing GridFTP measurements with the Newton 1_8_16, Full 1_8_16 and Dinda 1_16 predictions]
Th_n = n / sqrt(a'·n^{c'} + b')
12
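As a concrete illustration of the slide's idea, the sketch below fits the model form reconstructed above to three sampling points and then picks the stream count with the highest predicted throughput. This is a minimal Python sketch under stated assumptions, not the Stork implementation: the sample values are made up, the Newton step uses a numeric derivative, and the simple peak-picking rule stands in for the slide's stopping criterion.

```python
import math

def fit_newton_model(samples):
    """Fit Th_n = n / sqrt(a*n^c + b) to three (n, throughput) samples.
    Solves for c with a simple Newton iteration (numeric derivative), then
    recovers a and b by back-substitution."""
    (n1, t1), (n2, t2), (n3, t3) = samples
    # Rewrite the model as (n/Th)^2 = a*n^c + b and eliminate a and b.
    y1, y2, y3 = (n1 / t1) ** 2, (n2 / t2) ** 2, (n3 / t3) ** 2
    ratio = (y3 - y2) / (y2 - y1)

    def f(c):
        return (n3 ** c - n2 ** c) / (n2 ** c - n1 ** c) - ratio

    c = 2.0                                   # initial guess
    for _ in range(50):
        h = 1e-6
        derivative = (f(c + h) - f(c - h)) / (2 * h)
        step = f(c) / derivative
        c -= step
        if abs(step) < 1e-9:
            break
    a = (y2 - y1) / (n2 ** c - n1 ** c)
    b = y1 - a * n1 ** c
    return a, b, c

def predicted_throughput(n, a, b, c):
    return n / math.sqrt(a * n ** c + b)

def optimal_streams(a, b, c, max_n=64):
    """Stream count with the highest predicted throughput (a simple stand-in
    for the slide's stopping rule)."""
    return max(range(1, max_n + 1), key=lambda n: predicted_throughput(n, a, b, c))

# Hypothetical sampling points at parallelism levels 1, 8, 16 (Mbps).
a, b, c = fit_newton_model([(1, 350.0), (8, 950.0), (16, 920.0)])
print(optimal_streams(a, b, c))               # prints 8 for these made-up samples
```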
Flow Model of End-to-end Throughput
Ø CPU nodes are considered as nodes of a maximum-flow problem
Ø Memory-to-memory transfers are simulated with dummy source and sink nodes
Ø The capacities of the disk and the network are found by applying the parallel stream model, taking into consideration the resource capacities (NIC & CPU)
13
Flow Model of End-to-end Throughput
Ø Convert the end-system and network capacities into a flow problem
Ø Goal: Provide the maximal possible data transfer throughput given real-time traffic (maximize Th) by choosing:
Ø Number of streams per stripe (Nsi)
Ø Number of stripes per node (Sx)
Ø Number of nodes (Nn)
(A toy flow-network illustration follows.)
14
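To make the flow formulation concrete, here is a toy illustration, not Stork's actual solver: the CPU nodes at each site become graph nodes, a dummy source and sink stand in for the memory/disk systems, and the per-resource limits become arc capacities. The two-node sites, the Mbps values, and the use of networkx are illustrative assumptions.

```python
import networkx as nx

# Two hypothetical CPU nodes per site; capacities in Mbps.
G = nx.DiGraph()
for i in (1, 2):
    G.add_edge("src", f"s{i}", capacity=4000)        # source disk share feeding node s_i
    G.add_edge(f"d{i}", "sink", capacity=4000)       # destination disk share drained by node d_i
    for j in (1, 2):
        G.add_edge(f"s{i}", f"d{j}", capacity=2500)  # network capacity between node pairs

throughput_bound, flows = nx.maximum_flow(G, "src", "sink")
print(throughput_bound)   # aggregate end-to-end throughput implied by the capacities
```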
Ø Assumptions
Ø Parameters not given and found by the model:
Ø Available network capacity (Unetwork)
Ø Available disk system capacity (Udisk)
Ø Parameters given:
Ø CPU capacity (UCPU) (100%, assuming the nodes are idle at the beginning of the transfer)
Ø NIC capacity (UNIC)
Ø Number of available nodes (Navail)
Flow Model of End-to-end Throughput
Ø Variables:
Ø Uij = Total capacity of each arc from node i to node j
Ø Uf = Maximal (optimal) capacity of each flow (stripe)
Ø Nopt = Number of streams for Uf
Ø Xij = Total amount of flow passing i -> j
Ø Xfk = Amount of each flow (stripe)
Ø Nsi = Number of streams to be used for Xfk
Ø Sxij = Number of stripes passing i -> j
Ø Nn = Number of nodes
Ø Inequalities (below):
Ø There is a high positive correlation between the throughput of parallel streams and CPU utilization
Ø The linear relation between CPU utilization and throughput is given below
Ø The a and b variables are solved by using the sampling throughput and CPU utilization measurements in a least-squares regression (a sketch follows the inequalities below)
15
0 <= Xij <= Uij
0 <= Xfk <= Uf
Ucpu = a + b·Th
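A minimal sketch of how a and b might be obtained, assuming an ordinary least-squares fit over the sampling measurements as the slide describes, and of how the fitted line then bounds the throughput a node's CPU can sustain. The measurement pairs and the 100% CPU capacity are illustrative assumptions, not Stork's code.

```python
def fit_cpu_line(samples):
    """Ordinary least-squares fit of Ucpu = a + b*Th from (Th, Ucpu) samples."""
    n = len(samples)
    sx = sum(th for th, _ in samples)
    sy = sum(u for _, u in samples)
    sxx = sum(th * th for th, _ in samples)
    sxy = sum(th * u for th, u in samples)
    b = (n * sxy - sx * sy) / (n * sxx - sx * sx)    # slope
    a = (sy - b * sx) / n                            # intercept
    return a, b

def cpu_limited_throughput(a, b, cpu_capacity=100.0):
    """Throughput at which the fitted line reaches the CPU capacity."""
    return (cpu_capacity - a) / b

# Hypothetical sampling measurements: (throughput in Mbps, CPU utilization in %).
a, b = fit_cpu_line([(900, 18.0), (1800, 33.0), (3500, 62.0)])
print(cpu_limited_throughput(a, b))   # rough per-node ceiling implied by the CPU
```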
OPTB Algorithm for Homogeneous Resources
Ø This algorithm finds the best parallelism values for maximal throughput in homogeneous resources
Ø Input parameters:
Ø A set of sampling values from the sampling algorithm (ThN)
Ø Destination CPU and NIC capacities (UCPU, UNIC)
Ø Available number of nodes (Navail)
Ø Output:
Ø Number of streams per stripe (Nsi)
Ø Number of stripes per node (Sx)
Ø Number of nodes (Nn)
Ø Assumes both source and destination nodes are idle (a sketch of the selection logic follows)
16
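The slide lists OPTB's inputs and outputs but not its steps. The sketch below captures one reading of the selection logic for homogeneous resources: pack stripes per node up to the CPU and NIC limits, then add nodes until the network capacity or the pool of available nodes is exhausted. The function name, parameters and numbers are hypothetical (loosely echoing the case study on the next slides); this is not the published pseudocode.

```python
import math

def optb_homogeneous(per_stripe_Mbps, cpu_pct_per_stripe, nic_Mbps,
                     cores_per_node, n_avail, network_Mbps):
    """Pack as many stripes per node as the CPU and NIC allow, then add nodes
    until the network capacity or the pool of available nodes is exhausted."""
    # Stripes one node can sustain before its CPU, NIC or core count saturates.
    by_cpu = int(100.0 // cpu_pct_per_stripe)
    by_nic = int(nic_Mbps // per_stripe_Mbps)
    stripes_per_node = max(1, min(cores_per_node, by_cpu, by_nic))

    # Total stripes the end-to-end network path can absorb.
    max_total_stripes = max(1, int(network_Mbps // per_stripe_Mbps))
    nodes = min(n_avail, math.ceil(max_total_stripes / stripes_per_node))

    total_stripes = min(nodes * stripes_per_node, max_total_stripes)
    throughput = min(total_stripes * per_stripe_Mbps, network_Mbps)
    return nodes, stripes_per_node, throughput

# Hypothetical numbers: ~1 Gbps per stripe, 4 cores and a 10 GigE NIC per node,
# 2 available nodes, ~9 Gbps achievable on the network path.
print(optb_homogeneous(per_stripe_Mbps=990, cpu_pct_per_stripe=25,
                       nic_Mbps=10000, cores_per_node=4,
                       n_avail=2, network_Mbps=9000))
```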
OPTB-Application Case Study
17
9Gbps
Ø Systems: Oliver, Eric
Ø Network: LONI (Local Area)
Ø Processor: 4 cores
Ø Network Interface: 10GigE Ethernet
Ø Transfer: Disk-to-disk (Lustre)
Ø Available number of nodes: 2
OPTB-Application Case Study
18
9Gbps
Ø ThNsi = 903.41 Mbps (p=1)
Ø ThNsi = 954.84 Mbps (p=2)
Ø ThNsi = 990.91 Mbps (p=4)
Ø ThNsi = 953.43 Mbps (p=8)
Ø Nopt = 3, Nsi = 2
[Figure: flow-graph diagrams annotated with Nsi and Sxij values for stream levels 1, 2, 4 and 8, and a chart of throughput (Mbps) vs. number of streams per stripe (GridFTP measurement vs. model prediction)]
OPTB-Application Case Study
19
9Gbps
Ø Sx=2: ThSx1,2,2 = 1638.48 Mbps
Ø Sx=4: ThSx1,4,2 = 3527.23 Mbps
Ø Sx=8: ThSx2,4,2 = 4229.33 Mbps
[Figure: flow-graph diagrams annotated with Nsi and Sxij values for the striped configurations, and a chart of throughput (Mbps) for increasing stripe/stream combinations (GridFTP measurement vs. model prediction)]
OPTB-LONI-memory-to-memory-10G
20
[Figure: throughput (Mbps) and CPU utilization (%) vs. number of streams, stripes and nodes for the LONI memory-to-memory transfers (GridFTP measurement vs. model prediction)]
OPTB-LONI-memory-to-memory-1G-Algorithm Overhead
[Figure: transfer time (sec) vs. data size (GB) for a) Oliver-Eric 1G NIC, 1GB sampling size; b) Oliver-Eric 1G NIC, 2GB sampling size; c) Oliver-Eric 10G NIC, 1GB sampling size; d) Oliver-Eric 10G NIC, 2GB sampling size; comparing sampling overhead, total optimized time, optimized time with history, and non-optimized time]
21
Conclusions
Ø We have achieved end-to-end data transfer throughput optimization with data-flow parallelism
Ø Network-level parallelism
Ø Parallel streams
Ø End-system parallelism
Ø CPU/disk striping
Ø At both levels we have developed models that predict the best combination of stream and stripe numbers
22
Future work
Ø We have focused on the TCP and GridFTP protocols, and we would like to adapt our models to other protocols
Ø We have tested these models on a 10G network, and we plan to test them on faster networks
Ø We would like to increase the heterogeneity among the nodes at the source or destination
23
Acknowledgements
Ø This project is sponsored in part by the National Science Foundation under award numbers:
Ø CNS-1131889 (CAREER) – Research & Theory
Ø OCI-0926701 (Stork) – SW Design & Implementation
Ø CCF-1115805 (CiC) – Stork for Windows Azure
Ø We would also like to thank Dr. Miron Livny and the Condor Team for their continuous support of the Stork project.
Ø http://www.storkproject.org
24