Storage Locality and counting words

Storage Locality and counting words - GitHub Pages...Task: count the number of occurrences of each word in very long text. •Input:Call me Ishmael.Some years ago--never mind how long


Counting the words in a long text

• 218718 words long
• 17150 different words
• How many times does each word occur?

Task: count the number of occurrences of each word in a very long text.

• Input: Call me Ishmael. Some years ago--never mind how long precisely--having little or no money in my purse, and nothing particular to interest me on shore, I thought I would sail…
• Moby Dick: about 1.3 MByte
• Desired output:
  • Call: 354
  • Me: 53423
  • Ishmael: 1322
  • ….

Simple solution

• Iterate over words. Update counter for current word.
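
The simple solution can be sketched in a few lines of Python. This is a minimal illustration, not the course's code: the function name `count_words` and the tokenization rule (lowercased runs of letters and apostrophes) are assumptions made for the example.

```python
import re
from collections import defaultdict

def count_words(text):
    """Count occurrences of each word using a dictionary of counters."""
    counts = defaultdict(int)
    # Assumed tokenization: lowercase runs of letters and apostrophes.
    for word in re.findall(r"[a-z']+", text.lower()):
        counts[word] += 1  # update the counter for the current word
    return dict(counts)

sample = "Call me Ishmael. Some years ago--never mind how long precisely"
counts = count_words(sample)
print(counts["call"], counts["me"])
```

Each word triggers a lookup at whatever location that word's counter happens to occupy, so consecutive lookups can land far apart in memory.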

Let's use a sorted list

Sort-based solution

• Sort words
• Iterate over sorted list
• Count occurrences of same word
• Switch on word boundary
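
The four steps above can be sketched as follows. The function name `count_words_sorted` and the tokenization rule are assumptions for illustration only.

```python
import re

def count_words_sorted(text):
    """Count words by sorting them and counting runs of equal words."""
    # Step 1: sort the words (assumed tokenization: letters and apostrophes).
    words = sorted(re.findall(r"[a-z']+", text.lower()))
    result = []
    current, run = None, 0
    # Step 2: iterate over the sorted list.
    for word in words:
        if word == current:
            run += 1  # Step 3: same word as before, extend the run
        else:
            # Step 4: word boundary, emit the finished run and start a new one
            if current is not None:
                result.append((current, run))
            current, run = word, 1
    if current is not None:
        result.append((current, run))
    return result

print(count_words_sorted("b a b c a b"))
```

After sorting, only one counter is live at a time and the output is written sequentially, which is the access pattern the later slides describe as having good locality.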

Summary

• Sorting improves memory locality for word counting.
• Improved memory locality reduces run-time.
• Why? Because computer memory is organized in a hierarchy.

Storage Latency: Small and Fast vs. Large and Slow

[Figure: the CPU computes C = A * B; A, B and C are held in storage.]

Latencies

1. Read A
2. Read B
3. C = A * B
4. Write C

With big data, most of the latency is memory latency (1, 2, 4), not computation (3).

Storage types:

• Main memory (RAM)
• Spinning disk
• Remote computer

[Figure: timeline (axis: time) showing Latency 1, Latency 2, Latency 3 and Latency 4 adding up to the total latency.]
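
As a rough worked example of how the four latencies add up, the sketch below sums them for two storage types. All numbers are assumptions chosen for illustration (100 ns for main memory, 2 ms for a spinning disk, 1 ns for the multiplication); they are not measurements.

```python
# Assumed, illustrative latencies in seconds (not measurements).
STORAGE_LATENCY = {
    "main memory": 100e-9,
    "spinning disk": 2e-3,
}
COMPUTE_LATENCY = 1e-9  # assumed cost of the multiplication itself

def total_latency(storage):
    """Total time for: read A, read B, C = A * B, write C."""
    access = STORAGE_LATENCY[storage]
    # Steps 1, 2 and 4 touch storage; step 3 is the computation.
    return access + access + COMPUTE_LATENCY + access

def storage_fraction(storage):
    """Fraction of the total latency spent on storage access."""
    return 3 * STORAGE_LATENCY[storage] / total_latency(storage)

for name in STORAGE_LATENCY:
    print(name, total_latency(name), storage_fraction(name))
```

Under these assumed numbers, storage access accounts for more than 99% of the total in both cases.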

Non-local storage access

[Figure: the CPU computes C = A * B, with A, B and C at widely separated storage addresses; address labels: middle, 50…00, 50…01, 50…02.]

Local storage access

    A = 0
    for i in range(100000):
        A += X[i]
    for i in range(100000):
        A -= Y[i]

[Figure: the CPU and the arrays X and Y laid out in storage; address labels: middle, 50…00, 50…01, 50…02.]
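
The loop on this slide can be made runnable and set beside an interleaved variant. The interleaved version is an assumption about what a non-local access pattern would look like for the same computation; the function names and the test data are hypothetical.

```python
import random

n = 100000
rng = random.Random(0)
# Hypothetical test data: two arrays of integers.
X = [rng.randrange(1000) for _ in range(n)]
Y = [rng.randrange(1000) for _ in range(n)]

def difference_two_passes(X, Y):
    """The loop from the slide: sweep all of X, then all of Y."""
    A = 0
    for i in range(len(X)):
        A += X[i]
    for i in range(len(Y)):
        A -= Y[i]
    return A

def difference_interleaved(X, Y):
    """Assumed non-local variant: alternate between X and Y."""
    A = 0
    for i in range(len(X)):
        A += X[i]
        A -= Y[i]
    return A

print(difference_two_passes(X, Y) == difference_interleaved(X, Y))
```

Both functions compute the same difference of sums; only the order in which storage is touched differs. In CPython a list stores references to objects rather than the numbers themselves, so any timing gap is likely to be much smaller than with contiguous arrays in C or NumPy.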

Summary

• The major source of latency in data analysis is reading and writing to storage.
• Different types of storage offer different latency, capacity and price.
• Big data analytics revolves around methods for organizing storage and computation in ways that maximize speed while minimizing cost.
• Next: caches and the memory hierarchy.

Caches and the Memory Hierarchy

Latency, size and price of computer memory

Given a budget, we need to trade off:

• $10: Fast & Small
• $10: Slow & Large

Cache: The basic idea

[Figure: the CPU sits next to a fast, small cache holding memory cells 12, 67, 50, 51, 52, 53, 32 and 33; behind it is a slow, large memory with cells 0 to 79.]

Cache Hit

[Figure: memory cells 0 to 79 and a cache holding cells 12, 67, 50, 51, 52, 53, 32 and 33; the CPU's request is served from the cache.]

Cache Miss

[Figure: memory cells 0 to 79 and a cache holding cells 12, 67, 50, 51, 52, 53, 32 and 33; the requested cell is not in the cache.]

Cache Miss Service: 1) Choose byte to drop

[Figure: memory cells 0 to 79 and a cache holding cells 12, 67, 50, 51, 52, 53, 32 and 33; cell 67 is chosen to be dropped.]

Cache Miss Service: 2) Write back

[Figure: the cache now holds cells 12, 50, 51, 52, 53, 32 and 33; cell 67 is written back to memory.]

Cache Miss Service: 3) Read in

[Figure: the cache holds cells 12, 50, 51, 52, 53, 32 and 33; cell 47 is read from memory into the freed slot.]
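
The three steps of servicing a cache miss can be sketched with a toy cache. The class `TinyCache` is hypothetical, and the eviction policy (drop the least recently used cell) is an assumption; the slides do not say how the cell to drop is chosen.

```python
from collections import OrderedDict

class TinyCache:
    """Toy cache of single cells in front of a slow memory (a list)."""

    def __init__(self, memory, capacity):
        self.memory = memory        # the large, slow storage
        self.capacity = capacity    # number of cells the cache can hold
        self.cells = OrderedDict()  # address -> (value, dirty flag)
        self.hits = 0
        self.misses = 0

    def _load(self, address):
        if address in self.cells:
            self.hits += 1
            self.cells.move_to_end(address)  # mark as most recently used
            return
        self.misses += 1
        if len(self.cells) >= self.capacity:
            # 1) Choose a cell to drop (assumed policy: least recently used).
            victim, (value, dirty) = self.cells.popitem(last=False)
            # 2) Write back, needed only if the cached copy was modified.
            if dirty:
                self.memory[victim] = value
        # 3) Read in the requested cell.
        self.cells[address] = (self.memory[address], False)

    def read(self, address):
        self._load(address)
        return self.cells[address][0]

    def write(self, address, value):
        self._load(address)
        self.cells[address] = (value, True)

memory = list(range(80))
cache = TinyCache(memory, capacity=8)
cache.write(67, -1)                       # miss; cell 67 is now modified
for address in (12, 50, 51, 52, 53, 32, 33):
    cache.read(address)                   # seven misses fill the cache
cache.read(12)                            # hit
cache.read(47)                            # miss: drop 67, write back, read 47
print(cache.hits, cache.misses, memory[67])
```

The access sequence is arranged so that cell 67 is the least recently used when cell 47 is requested, mirroring the figures: 67 is dropped and written back, then 47 is read in.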

Access Locality

• The cache is effective if most accesses are hits.
  • Cache hit rate is high.
• Temporal locality: multiple accesses to the same address within a short time period.

Spatial locality

• Spatial locality: multiple accesses to close-together addresses within a short time period.
  • The difference between two sums.
  • Counting words by sorting.
• Benefiting from spatial locality:
  • Memory is partitioned into blocks/lines rather than single bytes.
  • Moving a block of memory takes much less time than moving each byte individually.
  • Memory locations that are close to each other are likely to fall in the same block.
  • Resulting in more cache hits.
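
The effect of blocks on the hit rate can be estimated with a small simulation. This is a sketch under stated assumptions: the function `hit_rate`, the block size of 64, the four cache lines and the least-recently-used replacement are all illustrative choices, not parameters from the slides.

```python
import random
from collections import OrderedDict

def hit_rate(addresses, block_size, num_lines):
    """Fraction of accesses served by a cache that holds whole blocks."""
    lines = OrderedDict()  # block number -> None, ordered by recency
    hits = 0
    for address in addresses:
        block = address // block_size  # the block this address falls in
        if block in lines:
            hits += 1
            lines.move_to_end(block)
        else:
            if len(lines) >= num_lines:
                lines.popitem(last=False)  # evict least recently used block
            lines[block] = None
    return hits / len(addresses)

n = 100000
sequential = list(range(n))                       # neighboring addresses
rng = random.Random(0)
scattered = [rng.randrange(n) for _ in range(n)]  # no spatial locality

print(hit_rate(sequential, block_size=64, num_lines=4))
print(hit_rate(scattered, block_size=64, num_lines=4))
```

With sequential access, one miss brings in a block and the following accesses to that block are hits. With scattered access, almost every request lands in a block that is not cached.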

Cache: Lines/Blocks

[Figure: memory cells 0 to 79; the cache holds two-cell blocks (50, 51), (52, 53), (32, 33) and (34, 35). Storing whole blocks supports spatial locality.]

Unsorted word count / poor locality

• Consider the memory access to the dictionary D:
• Count without sort: D[the]=12332, …, D[but]=943, ………, D[vernacular]=10, ……….., D[for]=..
• Temporal locality for very common words like “the”
• No spatial locality

Sorted word count / good locality

Entries to D are added one at a time.

1. D[lines]=33
2. D[lines]=33, D[lingered]=5
3. D[lines]=33, D[lingered]=5, D[lingering]=8

Assuming new entries are added at the end, this gives spatial locality.
Spatial locality makes code run faster.
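
One way to see the difference is to model D as an array with one slot per distinct word and measure how far consecutive accesses jump. This is a simplified model for illustration only: the slot layout is an assumption, and a real Python dictionary is not organized this way.

```python
def slot_jumps(words):
    """Distance between the slots touched by consecutive accesses."""
    # Assumed layout: one slot per distinct word, in alphabetical order.
    slot = {word: i for i, word in enumerate(sorted(set(words)))}
    accesses = [slot[word] for word in words]
    return [abs(b - a) for a, b in zip(accesses, accesses[1:])]

words = "the whale and the sea and the ship and the whale".split()
unsorted_jumps = slot_jumps(words)          # access order of the raw text
sorted_jumps = slot_jumps(sorted(words))    # access order after sorting

print(max(unsorted_jumps), max(sorted_jumps))
```

After sorting, each access touches either the same slot as the previous one or the next slot, so the largest jump is at most one slot.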

Summary

• Caching reduces storage latency by bringing relevant data close to the CPU.
• This requires that code exhibits access locality:
  • Temporal locality: accessing the same location multiple times.
  • Spatial locality: accessing neighboring locations.

The Memory Hierarchy

The Memory Hierarchy

• Real systems have several levels of storage:
  • Top of hierarchy: small and fast storage close to the CPU.
  • Bottom of hierarchy: large and slow storage further from the CPU.
• Caching is used to transfer data between different levels of the hierarchy.
• The programmer/compiler is oblivious:
  • The hardware provides an abstraction: memory looks like a single large array.
• But performance depends on the program’s access pattern.

The Memory Hierarchy

Computer clusters extend the memory hierarchy

• A data processing cluster is simply many computers linked through an Ethernet connection.
• Storage is shared.
• Locality: data should reside on the computer that will use it.
• “Caching” is replaced by “shuffling”.
• The abstraction is the Spark RDD.

Sizes and latencies in a typical memory hierarchy.

Level                Size (bytes)      Latency    Block size
CPU (registers)      1 KB              300 ps     64 B
L1 cache             64 KB             1 ns       64 B
L2 cache             256 KB            5 ns       64 B
L3 cache             4 MB              20 ns      64 B
Main memory          4-16 GB           100 ns     32 KB
Disk storage         4-16 TB           2-10 ms    64 KB
Local area network   16 TB to 10 PB    2-10 ms    1.5-64 KB

Annotations on the figure: 12 orders of magnitude; 6 orders of magnitude.

Summary

• Memory hierarchy: combining storage banks with different latencies.
• Clusters: multiple computers, connected by Ethernet, that share their storage.

A short history of affordable massive computing

Supercomputers

• Cray, Deep Blue, Blue Gene…
• Specialized hardware
• Very expensive
• Created to solve specialized important problems

Data Centers

Data Centers

• The physical aspect of “the cloud”
• Collection of commodity computers
• VAST number of computers (100,000’s)
• Created to provide computation for large and small organizations.
• Computation as a commodity.

Making History: Google 2003

• Google, the company founded by Larry Page and Sergey Brin, publishes the Google File System (GFS), a method for storing very large files on multiple commodity computers.
• Each file is broken into fixed-size chunks.
• Each chunk is stored on multiple chunk servers.
• The locations of the chunks are managed by the master.

HDFS: Chunking files

[Figure: File 1 and File 2 are each split into Chunk 1 and Chunk 2. Each chunk is then copied (Copy 1, Copy 2); File 1, Chunk 2 also has a Copy 3.]

HDFS: Distributing Chunks

[Figure: the chunk copies (Copy 1 and Copy 2 of every chunk of File 1 and File 2, plus Copy 3 of File 1, Chunk 2) are distributed across computers.]
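
The chunk, copy and distribute steps can be sketched in plain Python. This is a toy model, not the HDFS or GFS implementation: the functions `split_into_chunks`, `store_file` and `read_file`, the server names, the chunk size and the number of copies are all hypothetical.

```python
import random

def split_into_chunks(data, chunk_size):
    """Break a file's bytes into fixed-size chunks (the last may be shorter)."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]

def store_file(name, data, chunk_size, servers, copies, rng):
    """Store each chunk on `copies` distinct servers chosen at random.

    Returns the master's table: (file name, chunk index) -> server names.
    """
    locations = {}
    for index, chunk in enumerate(split_into_chunks(data, chunk_size)):
        chosen = rng.sample(sorted(servers), copies)
        for server in chosen:
            servers[server][(name, index)] = chunk
        locations[(name, index)] = chosen
    return locations

def read_file(name, locations, servers, failed=()):
    """Reassemble a file, skipping servers listed as failed."""
    parts = []
    index = 0
    while (name, index) in locations:
        alive = [s for s in locations[(name, index)] if s not in failed]
        # Raises IndexError if every copy of this chunk is on a failed server.
        parts.append(servers[alive[0]][(name, index)])
        index += 1
    return b"".join(parts)

servers = {f"server{i}": {} for i in range(4)}  # hypothetical chunk servers
rng = random.Random(0)
data = b"Call me Ishmael. Some years ago"
locations = store_file("file1", data, 8, servers, 2, rng)
print(read_file("file1", locations, servers, failed={"server0"}) == data)
```

Because every chunk lives on two distinct servers in this sketch, the file can still be read when any single server fails.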

Properties of GFS/HDFS

• Commodity hardware: low cost per byte of storage.
• Locality: data stored close to the CPU.
• Redundancy: can recover from server failures.
• Simple abstraction: looks to the user like a standard file system (files, directories, etc.). The chunk mechanism is hidden.

Redundancy

Parallelism

Assume File 1 contains a list of numbers.

Task: sum all of the numbers in File 1.

• Serial computation: do everything on one computer.
• Parallel method: process each chunk on a separate computer, then combine.
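
The serial and parallel methods can be compared in a short sketch. Threads stand in for separate computers here, which is an assumption made to keep the example self-contained; the function names and the chunk size are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

def serial_sum(numbers):
    """Serial computation: do everything in one place."""
    return sum(numbers)

def parallel_sum(numbers, chunk_size, workers=4):
    """Parallel method: sum each chunk separately, then combine."""
    chunks = [numbers[i:i + chunk_size]
              for i in range(0, len(numbers), chunk_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_sums = list(pool.map(sum, chunks))  # one task per chunk
    return sum(partial_sums)                        # combine step

numbers = list(range(1, 1001))
print(serial_sum(numbers), parallel_sum(numbers, chunk_size=100))
```

The sketch shows the structure of the computation rather than a speedup: in CPython, threads do not run CPU-bound work in parallel. On a cluster, each chunk would be summed on the machine that stores it and only the partial sums would travel over the network.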

Locality

Because of redundancy, it is likely that at any moment there exists an available worker that contains the chunk the master wishes to process.

Map-Reduce

• HDFS is a storage abstraction.
• Map-Reduce is a computation abstraction that works well with HDFS.
• It allows the programmer to specify parallel computation without knowing how the hardware is organized.
• We will describe Map-Reduce, using Spark, in a later section.

Spark

• Developed by Matei Zaharia, AMPLab, 2014
• Hadoop uses a shared file system (disk).
• Spark uses shared memory: faster, lower latency.
• Will be used in this course.
• Recall word count by sorting; we will redo it using map-reduce!
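
As a preview, the shape of a map-reduce word count can be sketched in plain Python, without Spark. This only illustrates the three phases; the function names and the tokenization rule are assumptions, and nothing here is the Spark API.

```python
import re
from collections import defaultdict

def map_phase(chunk):
    """Map: turn one chunk of text into (word, 1) pairs."""
    return [(word, 1) for word in re.findall(r"[a-z']+", chunk.lower())]

def shuffle_phase(pairs):
    """Shuffle: bring all values for the same key together."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: combine the values of each key into one count."""
    return {key: sum(values) for key, values in groups.items()}

def word_count(chunks):
    # On a cluster, each chunk would be mapped on the machine that stores it.
    pairs = [pair for chunk in chunks for pair in map_phase(chunk)]
    return reduce_phase(shuffle_phase(pairs))

chunks = ["Call me Ishmael.", "Call me anything, call me often."]
print(word_count(chunks))
```

The shuffle step plays the role that sorting played earlier: it brings equal words together so that each count is produced in one place.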

Summary

• Big data analysis is performed on large clusters of commodity computers.
• HDFS (Hadoop file system): break down files into chunks, make copies, distribute randomly.
• Hadoop Map-Reduce: a computation abstraction that works well with HDFS.
• Spark: sharing memory instead of sharing disk.