A Flexible GridFTP Client for Implementation of
Intelligent Cloud Data Scheduling Services
Esma Yildirim
Department of Computer Engineering
Fatih University
Istanbul, Turkey
DATACLOUD 2013
Outline
- Data Scheduling Services in the Cloud
- File Transfer Scheduling Problem: History
- Implementation Details of the Client
- Example Algorithms
- Amazon EC2 Experiments
- Conclusions
Cloud Data Scheduling Services
- Data clouds strive for novel services for the management, analysis, access and scheduling of Big Data
- Application-level protocols that provide high performance on high-speed networks are an integral part of data scheduling services
- GridFTP and UDP-based protocols are used frequently in modern-day schedulers (e.g. GlobusOnline, StorkCloud)
Bottlenecks in Data Scheduling Services
- Data is large, diverse and complex
- Transferring large datasets faces many bottlenecks:
  - The transport protocol's underutilization of the network
  - End-system limitations (e.g. CPU, NIC and disk speed)
  - Dataset characteristics: many short-duration transfers, connection startup and teardown overhead
- Optimizations in the GridFTP protocol: pipelining, parallelism and concurrency
Application in Data Scheduling Services
- Setting optimal parameters for different datasets is a challenging task
- Data scheduling services set static values based on experience
- The provided tools do not accommodate dynamic, intelligent algorithms that might change settings on the fly
Goals of the Flexible Client
- Flexibility for scalable data scheduling algorithms
- On-the-fly changes to the optimization parameters
- Reshaping the dataset characteristics
File Transfer Scheduling Problem
- Lies at the origin of data scheduling services
- Dates back to the 1980s
- Earliest approaches: list scheduling
  - Sort the transfers based on size, bandwidth of the path or duration of the transfer
  - Near-optimal solution
- Integer programming: not feasible to implement
File Transfer Scheduling Problem
- Scalable approaches:
  - Transferring from multiple replicas
  - Divided datasets sent over different paths to make use of additional network bandwidth
- Adaptive approaches:
  - Divide files into multiple portions to send over parallel streams
  - Divide the dataset into multiple portions and send them at the same time
  - Adaptively change the level of concurrency or parallelism based on network throughput
- Optimization algorithms:
  - Find optimal settings via modeling and set the optimal parameters once and for all
File Transfer Scheduling Problem
- Modern-day data scheduling service examples:
  - Globus Online
    - Hosted SaaS
    - Statically set pipelining, concurrency and parallelism
  - Stork
    - Multi-protocol support
    - Finds the optimal parallelism level based on modeling
    - Static job concurrency
Ideal Client Interface
- Allow dataset transfers to be (sketched below):
  - Enqueued and dequeued
  - Sorted based on a property
  - Divided into and combined from chunks
  - Grouped by source-destination paths
  - Done from multiple replicas
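A minimal sketch of what such an interface could look like, written in Python for illustration; the class and method names are assumptions, not the client's actual API:

    from collections import deque

    class TransferQueue:
        """Hypothetical dataset queue supporting the operations above.

        Entries are file records with at least `size`, `src` and `dst`
        attributes (see the file data structure later in the talk).
        """
        def __init__(self, files=()):
            self.files = deque(files)

        def enqueue(self, f):
            self.files.append(f)

        def dequeue(self):
            return self.files.popleft()

        def sort_by(self, prop):
            # e.g. sort_by("size") orders transfers for list scheduling
            self.files = deque(sorted(self.files,
                                      key=lambda f: getattr(f, prop)))

        def divide(self, files_per_chunk):
            # split into chunks of at most `files_per_chunk` files
            fs = list(self.files)
            return [fs[i:i + files_per_chunk]
                    for i in range(0, len(fs), files_per_chunk)]

        def group_by_path(self):
            # group entries by their (source, destination) pair
            groups = {}
            for f in self.files:
                groups.setdefault((f.src, f.dst), []).append(f)
            return groups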
Implementation Details
- Shortcomings of globus-url-copy:
  - Does not allow even a static setting of the pipelining depth; it uses its own default value, invisible to the user:
    globus-url-copy -pp -p 5 -cc 4 src_url dest_url
  - A directory of files cannot be divided and assigned different optimization parameters
  - The filelist option helps, but it cannot apply pipelining to the list, as the developers indicate:
    globus-url-copy -pp -p 5 -cc 4 -f filelist.txt
Implementation Details
- File data structure properties (sketched below):
  - File size: used to construct data chunks based on total size, and for throughput and transfer duration calculations
  - Source and destination paths: necessary for combining and dividing datasets, and for changing the source path based on replica location
  - File name: necessary to reconstruct full paths
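A minimal Python sketch of such a file record; the field names are assumptions based on the properties above:

    from dataclasses import dataclass

    @dataclass
    class FileEntry:
        name: str   # file name, used to reconstruct full paths
        size: int   # bytes; drives chunk construction and duration estimates
        src: str    # source directory (swappable per replica location)
        dst: str    # destination directory

        def src_url(self) -> str:
            return f"{self.src.rstrip('/')}/{self.name}"

        def dst_url(self) -> str:
            return f"{self.dst.rstrip('/')}/{self.name}"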
Implementation Details
- Listing the files for a given path (see the sketch below):
  - Contacts the GridFTP server and pulls information about the files in the given path
  - Provides a list of file data structures, including the number of files
  - Makes it easier to divide, combine, sort, enqueue and dequeue a list of files
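A sketch of how the listing could be turned into file records; instead of opening a real control channel, it parses MLSD-style fact lines (the RFC 3659 listing format GridFTP servers support) into the FileEntry records sketched above:

    def parse_listing(lines, src, dst):
        """Turn MLSD-style fact lines into FileEntry records."""
        files = []
        for line in lines:
            facts_part, _, name = line.partition(" ")
            facts = dict(f.split("=", 1)
                         for f in facts_part.rstrip(";").split(";")
                         if "=" in f)
            if facts.get("type") == "file":
                files.append(FileEntry(name=name, size=int(facts["size"]),
                                       src=src, dst=dst))
        return files

    # Example with a canned listing:
    listing = ["type=file;size=1048576; data0001.dat",
               "type=dir;size=4096; subdir"]
    files = parse_listing(listing,
                          "gsiftp://srchost/data/", "gsiftp://dsthost/data/")
    print(len(files), files[0].src_url())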
Implementation Details
- Performing the actual transfer (see the sketch below):
  - Sets the optimization parameters on a list of files returned by the list function and manipulated by different algorithms
  - For a data chunk, it sets the parallel stream, concurrency and pipelining values
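A sketch of a chunk transfer that shells out to globus-url-copy as a stand-in; note it inherits the tool's limitation of only toggling pipelining with -pp rather than setting its depth, which the actual client avoids by driving the GridFTP libraries directly:

    import subprocess
    import tempfile

    def transfer_chunk(chunk, parallelism=1, concurrency=1, pipelining=False):
        """Transfer a list of FileEntry records as one chunk."""
        # write "src_url dst_url" pairs to a temporary filelist
        with tempfile.NamedTemporaryFile("w", suffix=".txt",
                                         delete=False) as fl:
            for f in chunk:
                fl.write(f"{f.src_url()} {f.dst_url()}\n")
        cmd = ["globus-url-copy", "-p", str(parallelism),
               "-cc", str(concurrency)]
        if pipelining:
            cmd.append("-pp")
        cmd += ["-f", fl.name]
        subprocess.run(cmd, check=True)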
Example Algorithm 1: Adaptive Concurrency
- Takes a file list structure returned by the list function as input
- Divides the file list into chunks based on the number of files in a chunk
- Starting with a concurrency level of 1, transfers each chunk with an exponentially increasing concurrency level as long as the throughput increases with each chunk transfer
- If the throughput drops, the concurrency level is adaptively decreased for the subsequent chunk transfer (see the sketch below)
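A Python sketch of the algorithm, assuming a transfer(chunk, cc) helper that performs one chunk transfer and returns the achieved throughput:

    def adaptive_concurrency(chunks, transfer, max_cc=32):
        """Grow cc exponentially while throughput rises; back off on drops."""
        cc, prev_tput = 1, 0.0
        for chunk in chunks:
            tput = transfer(chunk, cc)
            if tput > prev_tput:
                cc = min(cc * 2, max_cc)   # still rising: double concurrency
            elif cc > 1:
                cc //= 2                   # dropped: halve for the next chunk
            prev_tput = tput
        return cc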
Example Algorithm 2: Optimal Pipelining
- Mean-based algorithm to construct clusters of files with different optimal pipelining levels
- Calculates the optimal pipelining level by dividing the BDP (bandwidth-delay product) by the mean file size of the chunk
- The dataset is recursively divided at the mean file size index as long as the following conditions are met (see the sketch below):
  - A chunk is only divided further if its pipelining level differs from its parent chunk's
  - A chunk cannot be smaller than a preset minimum chunk size
  - The optimal pipelining level for a chunk cannot be greater than a preset maximum pipelining level
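A Python sketch of the recursive clustering, assuming a size-sorted file list of FileEntry records and pp_opt = ceil(BDP / mean file size):

    import math

    def opt_pp(files, bdp):
        # optimal pipelining = BDP divided by the chunk's mean file size
        mean = sum(f.size for f in files) / len(files)
        return max(1, math.ceil(bdp / mean))

    def cluster(files, bdp, min_chunk, max_pp, parent_pp=None):
        """Recursively split a size-sorted file list into (chunk, pp) pairs."""
        pp = min(opt_pp(files, bdp), max_pp)   # cap at the preset maximum
        if pp == parent_pp:                    # same level as parent: stop
            return [(files, pp)]
        mean = sum(f.size for f in files) / len(files)
        i = next((k for k, f in enumerate(files) if f.size > mean),
                 len(files))
        left, right = files[:i], files[i:]
        if (not left or not right or
                sum(f.size for f in left) < min_chunk or
                sum(f.size for f in right) < min_chunk):
            return [(files, pp)]               # a child would be too small
        return (cluster(left, bdp, min_chunk, max_pp, pp) +
                cluster(right, bdp, min_chunk, max_pp, pp))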
Example Algorithm 2-b: Optimal Pipelining and Concurrency
- After the recursive division of chunks, pp_opt is set for each chunk
- Chunks go through a revision phase where smaller chunks are combined and larger chunks are further divided
- Starting with cc = 1, each chunk is transferred with exponentially increasing cc levels until the throughput drops
- The rest of the chunks are transferred with the optimal cc level (see the sketch below)
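A sketch of the concurrency phase, again assuming a transfer(chunk, cc) helper that returns the achieved throughput; unlike Algorithm 1, the level is locked in once the throughput drops:

    def optimal_cc_transfer(chunks, transfer, max_cc=32):
        """Double cc per chunk until throughput drops; keep the best level."""
        cc, best_cc, prev_tput = 1, 1, 0.0
        it = iter(chunks)
        for chunk in it:
            tput = transfer(chunk, cc)
            if tput < prev_tput:
                break                          # dropped: stop probing
            best_cc, prev_tput = cc, tput
            cc = min(cc * 2, max_cc)
        for chunk in it:
            transfer(chunk, best_cc)           # the rest at the optimal cc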
Amazon EC2 Experiments
- Large nodes with 2 vCPUs, 8 GB storage, 7.5 GB memory and moderate network performance
- 50 ms artificial delay
- Globus Provision is used for automatic setup of the servers
- Datasets comprise many small files (the most difficult optimization case):
  - 5000 1 MB files
  - 1000 random-size files in the range 1 byte to 10 MB
Amazon EC2 Experiments: 5000 1 MB files
- Baseline performance: default pipelining + data channel caching
- The achieved throughput is higher than the baseline in the majority of cases
Amazon EC2 Experiments: 1000 random size files
Conclusions
- The flexible GridFTP client can accommodate data scheduling algorithms of different natures
- Adaptive and optimization algorithms can easily sort, divide and combine datasets
- It becomes easier to implement intelligent cloud scheduling services
Questions?