A Flexible GridFTP Client for Implementation of
Intelligent Cloud Data Scheduling Services
Esma Yildirim
Department of Computer Engineering
Fatih University
Istanbul, Turkey
DATACLOUD 2013
Outline
- Data Scheduling Services in the Cloud
- File Transfer Scheduling Problem: History
- Implementation Details of the Client
- Example Algorithms
- Amazon EC2 Experiments
- Conclusions
Cloud Data Scheduling Services
- Data clouds strive for novel services for the management, analysis, access and scheduling of Big Data
- Application-level protocols that provide high performance on high-speed networks are an integral part of data scheduling services
- GridFTP and UDP-based protocols are used frequently in modern-day schedulers (e.g. GlobusOnline, StorkCloud)
Bottlenecks in Data Scheduling Services
- Data is large, diverse and complex
- Transferring large datasets faces many bottlenecks:
  - The transport protocol's underutilization of the network
  - End-system limitations (e.g. CPU, NIC and disk speed)
  - Dataset characteristics: many short-duration transfers, connection startup and teardown overhead
- Optimizations in the GridFTP protocol: pipelining, parallelism and concurrency
Application in Data Scheduling Services
- Setting optimal parameters for different datasets is a challenging task
- Data scheduling services set static values based on experience
- The provided tools do not accommodate dynamic, intelligent algorithms that might change settings on the fly
Goals of the Flexible Client
- Flexibility for scalable data scheduling algorithms
- On-the-fly changes to the optimization parameters
- Reshaping the dataset characteristics
File Transfer Scheduling Problem
- Lies at the origin of data scheduling services
- Dates back to the 1980s
- Earliest approaches: list scheduling
  - Sort the transfers based on size, bandwidth of the path or duration of the transfer
  - Near-optimal solution
- Integer programming: not feasible to implement
File Transfer Scheduling Problem
- Scalable approaches:
  - Transferring from multiple replicas
  - Divided datasets sent over different paths to make use of additional network bandwidth
- Adaptive approaches:
  - Divide files into multiple portions to send over parallel streams
  - Divide the dataset into multiple portions and send them at the same time
  - Adaptively change the level of concurrency or parallelism based on network throughput
- Optimization algorithms:
  - Find optimal settings via modeling and set the optimal parameters once and for all
File Transfer Scheduling Problem
- Modern-day data scheduling service examples:
  - Globus Online
    - Hosted SaaS
    - Statically set pipelining, concurrency and parallelism
  - Stork
    - Multi-protocol support
    - Finds the optimal parallelism level based on modeling
    - Static job concurrency
Ideal Client Interface
- Allow dataset transfers to be (sketched below):
  - Enqueued and dequeued
  - Sorted based on a property
  - Divided into and combined from chunks
  - Grouped by source-destination paths
  - Done from multiple replicas
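A minimal sketch of what such an interface could look like, written in Python for illustration; the class and method names are assumptions, not the client's actual API:

    from collections import deque

    class TransferQueue:
        """Hypothetical dataset queue supporting the operations above.

        Entries are file records with at least `size`, `src` and `dst`
        attributes (see the file data structure later in the talk).
        """
        def __init__(self, files=()):
            self.files = deque(files)

        def enqueue(self, f):
            self.files.append(f)

        def dequeue(self):
            return self.files.popleft()

        def sort_by(self, prop):
            # e.g. sort_by("size") orders transfers for list scheduling
            self.files = deque(sorted(self.files,
                                      key=lambda f: getattr(f, prop)))

        def divide(self, files_per_chunk):
            # split into chunks of at most `files_per_chunk` files
            fs = list(self.files)
            return [fs[i:i + files_per_chunk]
                    for i in range(0, len(fs), files_per_chunk)]

        def group_by_path(self):
            # group entries by their (source, destination) pair
            groups = {}
            for f in self.files:
                groups.setdefault((f.src, f.dst), []).append(f)
            return groups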
Implementation Details
- Shortcomings of globus-url-copy:
  - Does not allow even a static setting of the pipelining depth; it uses its own default value, invisible to the user:
    globus-url-copy -pp -p 5 -cc 4 src_url dest_url
  - A directory of files cannot be divided and assigned different optimization parameters
  - The filelist option helps, but it cannot apply pipelining to the list, as the developers indicate:
    globus-url-copy -pp -p 5 -cc 4 -f filelist.txt
Implementation Details
- File data structure properties (sketched below):
  - File size: used to construct data chunks based on total size, and for throughput and transfer duration calculations
  - Source and destination paths: necessary for combining and dividing datasets, and for changing the source path based on replica location
  - File name: necessary to reconstruct full paths
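A minimal Python sketch of such a file record; the field names are assumptions based on the properties above:

    from dataclasses import dataclass

    @dataclass
    class FileEntry:
        name: str   # file name, used to reconstruct full paths
        size: int   # bytes; drives chunk construction and duration estimates
        src: str    # source directory (swappable per replica location)
        dst: str    # destination directory

        def src_url(self) -> str:
            return f"{self.src.rstrip('/')}/{self.name}"

        def dst_url(self) -> str:
            return f"{self.dst.rstrip('/')}/{self.name}"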
Implementation Details
- Listing the files for a given path (see the sketch below):
  - Contacts the GridFTP server and pulls information about the files in the given path
  - Provides a list of file data structures, including the number of files
  - Makes it easier to divide, combine, sort, enqueue and dequeue a list of files
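A sketch of how the listing could be turned into file records; instead of opening a real control channel, it parses MLSD-style fact lines (the RFC 3659 listing format GridFTP servers support) into the FileEntry records sketched above:

    def parse_listing(lines, src, dst):
        """Turn MLSD-style fact lines into FileEntry records."""
        files = []
        for line in lines:
            facts_part, _, name = line.partition(" ")
            facts = dict(f.split("=", 1)
                         for f in facts_part.rstrip(";").split(";")
                         if "=" in f)
            if facts.get("type") == "file":
                files.append(FileEntry(name=name, size=int(facts["size"]),
                                       src=src, dst=dst))
        return files

    # Example with a canned listing:
    listing = ["type=file;size=1048576; data0001.dat",
               "type=dir;size=4096; subdir"]
    files = parse_listing(listing,
                          "gsiftp://srchost/data/", "gsiftp://dsthost/data/")
    print(len(files), files[0].src_url())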
Implementation Details
- Performing the actual transfer (see the sketch below):
  - Sets the optimization parameters on a list of files returned by the list function and manipulated by different algorithms
  - For a data chunk, it sets the parallel stream, concurrency and pipelining values
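A sketch of a chunk transfer that shells out to globus-url-copy as a stand-in; note it inherits the tool's limitation of only toggling pipelining with -pp rather than setting its depth, which the actual client avoids by driving the GridFTP libraries directly:

    import subprocess
    import tempfile

    def transfer_chunk(chunk, parallelism=1, concurrency=1, pipelining=False):
        """Transfer a list of FileEntry records as one chunk."""
        # write "src_url dst_url" pairs to a temporary filelist
        with tempfile.NamedTemporaryFile("w", suffix=".txt",
                                         delete=False) as fl:
            for f in chunk:
                fl.write(f"{f.src_url()} {f.dst_url()}\n")
        cmd = ["globus-url-copy", "-p", str(parallelism),
               "-cc", str(concurrency)]
        if pipelining:
            cmd.append("-pp")
        cmd += ["-f", fl.name]
        subprocess.run(cmd, check=True)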
Example Algorithm 1: Adaptive Concurrency
- Takes a file list structure returned by the list function as input
- Divides the file list into chunks based on the number of files in a chunk
- Starting with a concurrency level of 1, transfers each chunk with an exponentially increasing concurrency level as long as the throughput increases with each chunk transfer
- If the throughput drops, the concurrency level is adaptively decreased for the subsequent chunk transfer (see the sketch below)
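A Python sketch of the algorithm, assuming a transfer(chunk, cc) helper that performs one chunk transfer and returns the achieved throughput:

    def adaptive_concurrency(chunks, transfer, max_cc=32):
        """Grow cc exponentially while throughput rises; back off on drops."""
        cc, prev_tput = 1, 0.0
        for chunk in chunks:
            tput = transfer(chunk, cc)
            if tput > prev_tput:
                cc = min(cc * 2, max_cc)   # still rising: double concurrency
            elif cc > 1:
                cc //= 2                   # dropped: halve for the next chunk
            prev_tput = tput
        return cc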
Example Algorithm 2: Optimal Pipelining
- Mean-based algorithm to construct clusters of files with different optimal pipelining levels
- Calculates the optimal pipelining level by dividing the BDP (bandwidth-delay product) by the mean file size of the chunk
- The dataset is recursively divided at the mean file size index as long as the following conditions are met (see the sketch below):
  - A chunk is only divided further if its pipelining level differs from its parent chunk's
  - A chunk cannot be smaller than a preset minimum chunk size
  - The optimal pipelining level for a chunk cannot be greater than a preset maximum pipelining level
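A Python sketch of the recursive clustering, assuming a size-sorted file list of FileEntry records and pp_opt = ceil(BDP / mean file size):

    import math

    def opt_pp(files, bdp):
        # optimal pipelining = BDP divided by the chunk's mean file size
        mean = sum(f.size for f in files) / len(files)
        return max(1, math.ceil(bdp / mean))

    def cluster(files, bdp, min_chunk, max_pp, parent_pp=None):
        """Recursively split a size-sorted file list into (chunk, pp) pairs."""
        pp = min(opt_pp(files, bdp), max_pp)   # cap at the preset maximum
        if pp == parent_pp:                    # same level as parent: stop
            return [(files, pp)]
        mean = sum(f.size for f in files) / len(files)
        i = next((k for k, f in enumerate(files) if f.size > mean),
                 len(files))
        left, right = files[:i], files[i:]
        if (not left or not right or
                sum(f.size for f in left) < min_chunk or
                sum(f.size for f in right) < min_chunk):
            return [(files, pp)]               # a child would be too small
        return (cluster(left, bdp, min_chunk, max_pp, pp) +
                cluster(right, bdp, min_chunk, max_pp, pp))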
Example Algorithm 2-b: Optimal Pipelining and Concurrency
- After the recursive division of chunks, pp_opt is set for each chunk
- Chunks go through a revision phase where smaller chunks are combined and larger chunks are further divided
- Starting with cc = 1, each chunk is transferred with exponentially increasing cc levels until the throughput drops
- The rest of the chunks are transferred with the optimal cc level (see the sketch below)
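A sketch of the concurrency phase, again assuming a transfer(chunk, cc) helper that returns the achieved throughput; unlike Algorithm 1, the level is locked in once the throughput drops:

    def optimal_cc_transfer(chunks, transfer, max_cc=32):
        """Double cc per chunk until throughput drops; keep the best level."""
        cc, best_cc, prev_tput = 1, 1, 0.0
        it = iter(chunks)
        for chunk in it:
            tput = transfer(chunk, cc)
            if tput < prev_tput:
                break                          # dropped: stop probing
            best_cc, prev_tput = cc, tput
            cc = min(cc * 2, max_cc)
        for chunk in it:
            transfer(chunk, best_cc)           # the rest at the optimal cc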
Amazon EC2 Experiments
- Large nodes with 2 vCPUs, 8 GB storage, 7.5 GB memory and moderate network performance
- 50 ms artificial delay
- Globus Provision is used for automatic setup of the servers
- Datasets comprise many small files (the most difficult optimization case):
  - 5000 1 MB files
  - 1000 random-size files in the range 1 byte to 10 MB
Amazon EC2 Experiments: 5000 1 MB files
- Baseline performance: default pipelining + data channel caching
- The achieved throughput is higher than the baseline in the majority of cases
Amazon EC2 Experiments: 1000 random size files
Conclusions
- The flexible GridFTP client can accommodate data scheduling algorithms of different natures
- Adaptive and optimization algorithms can easily sort, divide and combine datasets
- It becomes easier to implement intelligent cloud scheduling services
Questions?