33
Automated Sharded MongoDB Deployment and Benchmarking for Big Data Analysis Gregor von Laszewski Mark McCombe [email protected] Indiana University Intelligent Systems Engineering Department [email protected]

Automated ShardedMongoDB Deployment and Benchmarking for ... · Automated ShardedMongoDB Deployment and Benchmarking for Big Data Analysis ... Mongo DB -Sharding •User ... •We

  • Upload
    dodat

  • View
    232

  • Download
    0

Embed Size (px)

Citation preview

AutomatedSharded MongoDBDeploymentandBenchmarking

forBigDataAnalysisGregorvonLaszewski

[email protected]

IntelligentSystemsEngineeringDepartment

[email protected]

Acknowledgement

• ThisstudyhasbeenconductedaspartoftheI524classwiththetopicBigDataandSoftwareProjects• Theclassusedthefollowingresources

• Studentscomputers• FutureSystems (DSC@IndianaUniversity)acontinuationoftheFutureGrid (NSF)• ChameleonCloud(NSF):ProjectCH-818664,KVM• Jetstream(NSF)

• Somestudentsalsoelectedtouse• AWS• Azure

• Allresourcesasfarasweknowwereprovidedtousforfree.

[email protected]

Outline

• Motivationfortheproject• IUeducatesdatascientists

• Sharded MongoDBdeployment

• Benchmarks

• UsageObservations

• Conclusion• Whywedidnotdoalargescalestudy…• Implicationforfutureclasses…

[email protected]

DataScientistAnalysis• Statistics• MachineLearning• Optimization

Programming• Python,JavaScript• DistributedComp.• CloudProgramming

Infrastructure• CloudComputing• DistributedSystems• DevOps

Visualization• BasicSkills• CustomizeforDataSet

DomainKnowledge

Communication• Paperwrite-up• OnlinePublication

• Requiresintegratedknowledgeinseveralkeyareas.Weuseaprojectthataddresses:

• Communication• Analysis• Visualization• Programming• Infrastructure• Domainknowledge

• EducationProgramsneedtoaddressallofthem

DataScientist

ShardedMongoDB

Deployments

ShardedMongoDB

Deployments

[email protected]

ContinuousImprovementvs.ContinuousDeploymentviaDevOps

design&modification

Cloudmeshscript

deployment

data

execution

verification

Continuousimprovement

• DevOpsisintegrated• Leadstoimprovementwhennotonlytargetingapplicationbutalsodeploymentenvironment.

CloudmeshShell– MakeBootingSimple

$emacs cloudmesh.yaml$cms defaultcloud=NAME$cms defaultimage=NAME$cmd defaultflavor=NAME$cms vm boot

$cms vm login

$cms vm delete

• cloudmesh.yaml

• Preparedefaults

• Boot

• Login

• Management …

CloudmeshShell– ManageHybridClouds

$cms aws boot$cms vm boot

$cms defaultcloud=chameleon$cms vm boot

$cms defaultcloud=IUCloud$cms vm boot

• BootCloudA

• BootCloudB

• BootCloudC

CloudmeshShell– CreateaHadoopCluster

$cmdefaultcloud=chameleon$cmclusterdefine- -count=10

- -flavor=m1.large$cmhadoop definespark

$cmhadoop sync#~30sec

$cmhadoop deploy#~7min

• Setcloud

• Definecluster

• Definehadoop Cluster

• Syncdefinitiontodb

• Deploythecluster

CloudmeshShell– CreateaHadoopCluster

$cmdefaultcloud=IUCloud$cmclusterdefine- -count=10

- -flavor=m1.large

$cmnist fingerprint #~30min

• Setcloud

• Definecluster

• RunNISTusecase

Additionalresources:https://github.com/cloudmesh/classes/blob/master/docs/source/notebooks/fingerprint_matching.ipynb

MongoDBFeatures

• DocumentorientedNoSQLdatabase• JSON-likedocuments• Specifiedthroughschemas

• Cross-platformcompatible• Freeopensource

• NoSQL=datathatismodeledinmeansotherthanthetabularrelationsusedinrelationaldatabases.

• Ad-hocqueries• Indexing• Replication• LoadBalancingwithSharding• FileStorage• Aggregation• ServerSideJavaScript• Cappedcollections

[email protected]

MongoDB- Sharding

• Userselectsshardkey thatdetermineshowthedatainacollectionwillbedistributed.• dataissplitintoranges(basedontheshardkey)• distributedacrossmultipleshards.

• (a)ashardisamasterwithoneormoreslaves.• (b)ortheshardkeycanbehashedtomaptoashardallowingevendatadistribution.

• MongoDBcanrunovermultipleservers,• balancingtheload• duplicatingdataforfaulttolerance

[email protected]

Architecture

[email protected]

BenchmarksonClouds

• Threecloudswereselectedfordeployment:• ChameleonCloud• Futuresystems• Jetstream

• Goal• Comparewithintheallocationlimitationsofaclassmultiplecloudperformancesbyvaryinganumberofparameters.

• ScriptedDeployments• Wedevelopedautomatedscripteddeploymentandbenchmarkingprocess• cloudnameispassedasaparameter• CustomizationforthedeploymentofMongoDBispassedviacommandline

[email protected]

CloudComparison

FutureSystems Chameleon JetstreamCPU XeonE5-2670 XeonX5550 HaswellE-2680Cores 1024 1008 7680Speed 2.66GHz 2.3GHz 2.5GHzRAM 3072GB 5376GB 40TBrStorage 335TB 2TB 2TBDeploymentyear 2010 Early 2015 OS2016

[email protected]

FlavorandOS

• Ubuntu16.04LTS(Xenial Xerus)operatingsystem.• Flavors– slightlydifferentbetweencloudsweusemostalike• m1.mediumChameleonCloud• m1.mediumFutureSystems• m1.smallwasusedonJetstream

• FlavorshavemoreresourcesthanChameleonandFutureSystems• Storageisloweronjetstream

Cloud Flavor VCPU RAM Size Chameleon m1.medium 2 4 40 FutureSystems m1.medium 2 4 40 Jetstream m1.small 2 4 20

[email protected]

Requirements• ResourceRequirements• 60users->VMhourswerelimited.

• CapabilityRequirements• creationofVMsandtheexecutionofourapplicationswithintheseVMs

• MonitoringRequirements• Monitoringandbenchmarkingwasconductedbyhandwithoutneedforspecializedservices.

• Newsoftwarecreated• improvedthecloudmesh clientsoftware[5][6][7],essentialtothesuccessoftheclass.

• PerformanceComparison• Wehaveconductedasignificantperformancecomparisonamongallclouds.

[email protected]

Benchmark

• Deploymenttimes• ComparingMongoDBversions• ComparingClouds

[email protected]

Deploymenttimes

[email protected]

Deployments

• DeploymentA• asimpledeploymentwithonlyoneofeachcomponentbeingcreated..

• DeploymentB• variationinconfig serversandshardsandanadditionalMongosinstance.

• DeploymentC• focusonhighperformance.• 9shardsnoreplication

Config Mongos Shards Replicas Seconds

A 1 1 1 1 330

B 3 2 3 3 1059

C 1 1 9 1 [email protected]

Variing otherDeploymenttimes

ConfigServers -c Mongos -m Shards -s Replicas -r Time in

Seconds

5 1 1 1 534

1 5 1 1 556

1 1 5 1 607

1 1 1 5 524

[email protected]

Data

• MajorLeagueBaseballPITCHf/xdataobtainedbyusingtheprogramBaseballonaStick(BBOS).

• BBOSisaPythonprogramcreatedby"willkoky"andhostedonsourceforge.netwhichextractsdatafrommlb.com andloadsitintoaMySQLdatabase.

• datawascapturedlocallytothedefaultMySQLdatabaseandthenextractedtoaCSV

• CSVfilewasimported• Contains5,508,014rowsand61columns.1.58GBinsizeuncompressed.

[email protected]

VersionComparison

[email protected]

VersionComparison:3.2vs3.4(ChameleonCloud)

FindCommand Mongoimport Command

[email protected]

VersionComparison:3.2vs3.4MapReduce(ChameleonCloud)

MapReduce

Result:nottoomanychanges

[email protected]

CloudComparisonSharding Test

[email protected]

Figure2:Mongoimport Command- Sharding Test

[email protected]

Figure1:FindCommand- Sharding Test

• Chameleon– Jetstream• Same

• FutureSystems• Acceptableresultswithhighernumberofshards

[email protected]

Figure3:MapReduce- Sharding Test

• Chameleon– jetstream• Same

• Futuresystems• Significantlyworse

[email protected]

Figure4:FindCommand- ReplicationTest

• Replication• Chameleoncloudseemstoperformslightlybetter• Futiresystems performssurprisinglywell

[email protected]

CloudComparisonReplicationTest

[email protected]

Figure5:Mongoimport Command- ReplicationTest

• Chameleon– jetstream• Same

• Futuresystems• Significantlyworse

[email protected]

Figure6:MapReduce- ReplicationTest

• Chameleon• SlightlybetterthanJetstream

• Futuresystems• Significantlyworse

[email protected]

Conclusion

• JetstreamandChameleonCloudareessentiallythesame.• InsomeinstancesChameleonCloudperformsslightlybetter• (disks/network…)

• AsexpectedFutureSystem isoldermachineandperformsnotaswell• ForsomequeriesFutureSystem issurprisinglygood

• Experimentswerelimitedbynumberofnodehoursfor60studentsinclass.• Afterclassisovernotimetorunonlargerexamples• Itsnotobviousforateacherwhentogivelargerallocationsforastudentthatperformswell.• Allocationprocessbroken• Futuresystems allocationprocessissuperior

[email protected]