Twister4Azure - Iterative MapReduce for Azure Cloud

CCA 2011, April 12 – 13, 2011

Thilina Gunarathne, Judy Qiu, Geoffrey Fox

{tgunarat, xqiu, gcf}@indiana.edu

MapReduceRoles for Azure

Familiar MapReduce programming model

Built using highly available and scalable Azure cloud services

Co-exists with the eventual consistency and high latency of cloud services

Decentralized control

No single point of failure

Supports dynamic scaling up and down of the compute resources

MapReduce fault tolerance

Twister for Azure

[Figure: Twister4Azure iteration flow. Job Start -> Map/Combine tasks (reading from the Data Cache) -> Reduce tasks -> Merge -> "Add Iteration?" If Yes: hybrid scheduling of the new iteration. If No: Job Finish.]
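The iteration flow on this slide (Map/Combine, Reduce, Merge, then a decision on whether to add another iteration) can be illustrated with a small single-process sketch. It uses KMeans on 1-D points because the deck benchmarks KMeans clustering; all function names (`map_combine`, `reduce_step`, `merge_step`, `run_kmeans`) are hypothetical and are not the Twister4Azure API. The partitions stay in memory across iterations, standing in for the data cache.

```python
from collections import defaultdict


def map_combine(partition, centroids):
    """Map + Combine: assign each 1-D point to its nearest centroid and
    pre-aggregate a (sum, count) pair per centroid index locally."""
    partial = defaultdict(lambda: [0.0, 0])
    for x in partition:
        idx = min(range(len(centroids)), key=lambda i: abs(x - centroids[i]))
        partial[idx][0] += x
        partial[idx][1] += 1
    return dict(partial)


def reduce_step(partials):
    """Reduce: add up the combined (sum, count) pairs per centroid index."""
    totals = defaultdict(lambda: [0.0, 0])
    for partial in partials:
        for idx, (s, n) in partial.items():
            totals[idx][0] += s
            totals[idx][1] += n
    return totals


def merge_step(totals, old_centroids, tol):
    """Merge: build the new centroids and decide whether to iterate again."""
    new_centroids = []
    for i, old in enumerate(old_centroids):
        s, n = totals.get(i, (0.0, 0))
        new_centroids.append(s / n if n else old)  # empty cluster keeps its centroid
    shift = max(abs(a - b) for a, b in zip(new_centroids, old_centroids))
    return new_centroids, shift > tol


def run_kmeans(partitions, centroids, tol=1e-9, max_iterations=100):
    """Driver loop: the static partitions are reused (cached) every iteration;
    only the small centroid list changes between iterations."""
    iterations = 0
    iterate = True
    while iterate and iterations < max_iterations:
        partials = [map_combine(p, centroids) for p in partitions]
        totals = reduce_step(partials)
        centroids, iterate = merge_step(totals, centroids, tol)
        iterations += 1
    return centroids, iterations
```

For example, `run_kmeans([[1.0, 2.0, 9.0], [3.0, 10.0, 11.0]], [0.0, 5.0])` converges to the centroids `[2.0, 10.0]`. In a distributed setting the map and reduce calls would run on separate workers; this sketch only shows the control flow.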

Merge step

In-memory caching of static data

Cache-aware hybrid scheduling using queues as well as a bulletin board (special table)
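One way to picture cache-aware hybrid scheduling is the in-memory simulation below. It is a sketch under assumptions, not the Twister4Azure implementation: the classes `BulletinBoard` and `Worker`, the task and partition names, and the rule that a worker checks the board before the queue are all hypothetical. The slide only states that queues and a bulletin board table are combined. No Azure services are called; a `deque` and a `dict` stand in for the queue and the table.

```python
from collections import deque


class BulletinBoard:
    """Stand-in for the 'special table' that advertises new-iteration tasks."""

    def __init__(self):
        self.open_tasks = {}  # task_id -> data_id, for tasks not yet claimed

    def post(self, task_id, data_id):
        self.open_tasks[task_id] = data_id

    def claim_cached(self, cached_ids):
        """Claim one task whose input is already in the caller's cache."""
        for task_id, data_id in list(self.open_tasks.items()):
            if data_id in cached_ids:
                del self.open_tasks[task_id]
                return task_id, data_id
        return None


class Worker:
    def __init__(self, name):
        self.name = name
        self.cache = set()  # data ids kept in memory from earlier iterations

    def next_task(self, board, queue):
        """Prefer a board task with cached input; otherwise use the queue."""
        claimed = board.claim_cached(self.cache)
        if claimed is not None:
            return claimed + ("cache",)
        if queue:
            task_id, data_id = queue.popleft()
            self.cache.add(data_id)  # loaded once, reused in later iterations
            return task_id, data_id, "queue"
        return None


def run_demo():
    board, queue = BulletinBoard(), deque()
    workers = [Worker("w1"), Worker("w2")]
    log = []

    # Iteration 0: nothing is cached yet, so every task goes through the queue.
    for i in range(4):
        queue.append((f"iter0-map{i}", f"part{i}"))
    while queue:
        for w in workers:
            task = w.next_task(board, queue)
            if task:
                log.append((w.name,) + task)

    # Iteration 1: tasks are posted on the board and claimed by the worker
    # that already holds the matching partition in memory.
    for i in range(4):
        board.post(f"iter1-map{i}", f"part{i}")
    progress = True
    while board.open_tasks and progress:
        progress = False
        for w in workers:
            task = w.next_task(board, queue)
            if task:
                log.append((w.name,) + task)
                progress = True
    return log
```

Calling `run_demo()` returns a log in which the first four tasks are served from the queue and the next four are claimed from the board by the worker that cached the corresponding partition. How tasks that no worker has cached are handed back to the queue is not described on the slide and is left out of the sketch.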

[Figure: Twister4Azure architecture. A Worker Role hosts Map Workers (Map 1, Map 2, ... Map n), Reduce Workers (Red 1, Red 2, ... Red n), an In-Memory Data Cache, Task Monitoring, and Role Monitoring. It interacts with the Scheduling Queue, the Job Bulletin Board, and the Map Task Table (columns: MapID, ..., Status).]

Performance – Kmeans Clustering

[Figure: Relative parallel efficiency and time (s) versus Num Instances X Num Data Points (8 X 16M, 16 X 32M, 32 X 64M, 48 X 96M, 64 X 128M).]

Performance with/without data caching

Speedup gained using data cache

Scaling speedup

Increasing number of iterations

Performance Comparisons

[Figure: Parallel efficiency versus number of query files (128 to 728). Series: Twister4Azure, Hadoop-Blast, DryadLINQ-Blast.]

[Figure: Adjusted time (s) versus Num. of Cores * Num. of Blocks. Series: Twister4Azure, Amazon EMR, Apache Hadoop.]

[Figure: Parallel efficiency versus Num. of Cores * Num. of Files. Series: Twister4Azure, Amazon EMR, Apache Hadoop.]

BLAST Sequence Search

Cap3 Sequence Assembly

Smith-Waterman Sequence Alignment

Conclusion

Enables users to easily and efficiently perform large-scale iterative data analysis and scientific computations on the Azure cloud.

Utilizes a novel hybrid scheduling mechanism to support caching of static data across iterations.

Utilizes cloud infrastructure services effectively to deliver robust and efficient applications.

http://salsahpc.indiana.edu/twister4azure
