
Google Base Infrastructure: GFS and MapReduce

Johannes Passing

Agenda

- Context
- Google File System
  - Aims
  - Design
  - Walk through
- MapReduce
  - Aims
  - Design
  - Walk through
- Conclusion

Context

- Google's scalability strategy
  - Cheap hardware
  - Lots of it
  - More than 450,000 servers (NYT, 2006)
- Basic problems
  - Faults are the norm
  - Software must be massively parallelized
  - Writing distributed software is hard

GFS: Aims

- Aims
  - Must scale to thousands of servers
  - Fault tolerance / data redundancy
  - Optimize for large WORM files
  - Optimize for large reads
  - Favor throughput over latency
- Non-aims
  - Space efficiency
  - Client-side caching (futile)
  - POSIX compliance
- Significant deviation from standard DFS (NFS, AFS, DFS, …)

Design Choices

- File system cluster
  - Spread data over thousands of machines
  - Provide safety by redundant storage
- Master/slave architecture

Design Choices

- Files split into chunks
  - Unit of management and distribution
  - Replicated across servers
  - Subdivided into blocks for integrity checks
  - Accept internal fragmentation for better throughput

(Diagram: GFS layered over the local file system)

Architecture: Chunkserver

- Hosts chunks
  - Stores chunks as files in the Linux file system
  - Verifies data integrity
  - Implemented entirely in user mode
- Fault tolerance
  - Fails requests when an integrity violation is detected
  - All chunks are stored on at least one other server
  - Re-replication restores lost replicas after downtime
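To make "verifies data integrity" concrete, here is a minimal sketch of per-block checksum verification, assuming the 64 KB blocks with 32-bit checksums described in the GFS paper; CRC-32 and all function names are illustrative, not the actual chunkserver code.

```python
import zlib

BLOCK_SIZE = 64 * 1024  # GFS checks integrity per 64 KB block, one 32-bit checksum each

def block_checksums(chunk_data: bytes) -> list[int]:
    """Compute one 32-bit checksum per block of a chunk (CRC-32 used here as an example)."""
    return [zlib.crc32(chunk_data[off:off + BLOCK_SIZE])
            for off in range(0, len(chunk_data), BLOCK_SIZE)]

def read_verified(chunk_data: bytes, stored_checksums: list[int], offset: int, length: int) -> bytes:
    """Serve a byte range only if every block it touches matches its stored checksum."""
    first = offset // BLOCK_SIZE
    last = (offset + length - 1) // BLOCK_SIZE
    for i in range(first, last + 1):
        block = chunk_data[i * BLOCK_SIZE:(i + 1) * BLOCK_SIZE]
        if zlib.crc32(block) != stored_checksums[i]:
            # Integrity violation: fail the request; the client falls back to another replica
            raise IOError(f"checksum mismatch in block {i}")
    return chunk_data[offset:offset + length]
```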

Architecture: Master

- Namespace and metadata management
- Cluster management
  - Chunk location registry (transient)
  - Chunk placement decisions
  - Re-replication, garbage collection
  - Health monitoring
- Fault tolerance
  - Backed by shadow masters
  - Mirrored operations log
  - Periodic snapshots

Reading a chunk

1. Client sends filename and offset to the master
2. Master determines file and chunk
3. Master returns chunk locations and version
4. Client chooses the closest chunkserver and requests the chunk
5. Chunkserver verifies chunk version and block checksums
6. Chunkserver returns the data if valid, else fails the request
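A client-side sketch of this read path. master.find_chunk, server.read, and network_distance are hypothetical stubs; the real GFS client library is not public, so only the shape of the interaction is taken from the slide.

```python
CHUNK_SIZE = 64 * 1024 * 1024  # GFS uses fixed 64 MB chunks

def gfs_read(master, filename: str, offset: int, length: int) -> bytes:
    """Sketch of a single-chunk read: resolve the chunk via the master, then read from a replica."""
    chunk_index = offset // CHUNK_SIZE
    # 1. Ask the master which chunk holds this offset and where its replicas live
    handle, version, replicas = master.find_chunk(filename, chunk_index)
    # 2. Read from the closest replica; fall back to others on version or checksum failures
    for server in sorted(replicas, key=lambda r: r.network_distance):
        try:
            return server.read(handle, version, offset % CHUNK_SIZE, length)
        except IOError:
            continue  # replica reported a stale version or corrupt blocks; try the next one
    raise IOError("no valid replica available")
```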

Writing a chunk

1. Client sends filename and offset to the master
2. Master determines the lease owner (granting a lease if necessary)
3. Master returns chunk locations and lease owner
4. Client pushes data to any chunkserver
5. Data is forwarded along the replica chain
6. Client sends the write request to the primary
7. Primary creates a mutation order and applies it
8. Primary forwards the write request to the replicas
9. Replicas validate blocks, apply the modifications, increment the version, and acknowledge
10. Primary replies to the client
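The same steps from the client's perspective, again as a hedged sketch: get_lease_info, push_data, and write are illustrative stubs that only mirror the flow above, where data travels along the replica chain while control flows through the primary.

```python
def gfs_write(master, filename: str, offset: int, data: bytes) -> None:
    """Sketch of the write path: data flow (replica chain) is separate from control flow (primary)."""
    # Steps 1-3: the master names the replicas and the current lease owner (the primary)
    handle, primary, replicas = master.get_lease_info(filename, offset)
    # Steps 4-5: push the data once; chunkservers forward it along the replica chain and buffer it
    push_id = replicas[0].push_data(data, forward_to=replicas[1:])
    # Steps 6-10: ask the primary to commit; it picks a mutation order, applies it, forwards the
    # write request to the secondaries, and replies only after every replica has acknowledged
    primary.write(handle, offset, push_id)
```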

Appending a record (1/2)

1. Client sends an append request to the master
2. Master returns chunk locations and lease owner
3. Client pushes data to any chunkserver
4. Data is forwarded along the replica chain
5. Client sends the append request to the primary, which checks for space left in the chunk
   - Enough space: choose the offset and append
   - Not enough space: pad the chunk and request a retry
6. On retry, the client requests allocation of a new chunk from the master
7. Master returns the new chunk locations and lease owner, and the append is retried

Appending a record (2/2)

1. Client sends the append request to the primary
2. Primary writes the data and forwards the request to the replicas
3. A replica fails while writing the data
4. Primary fails the request
5. Client retries: the append request is sent again
6. Primary writes the data at a new offset and forwards the request to the replicas
7. All replicas write the data; the request succeeds
8. Result: the record is duplicated on replicas that completed the first attempt, and the failed attempt leaves a garbage region on the others

(Diagram: chunk contents of three replicas after a failed and a retried append: padding, record A, record B stored twice on one replica, garbage regions on the others)
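A sketch of the client-side retry loop behind these diagrams; record_append is a hypothetical stub for the record-append RPC. The point is that a failed attempt may already have written the payload on some replicas, so retrying gives at-least-once semantics.

```python
def append_record(primary, handle: bytes, payload: bytes, max_attempts: int = 5) -> int:
    """Retry a record append until some offset is committed on all replicas.

    Each failed attempt may still have written the payload on a subset of replicas,
    so the same record can end up stored more than once (at-least-once semantics)."""
    for attempt in range(max_attempts):
        ok, offset = primary.record_append(handle, payload)  # hypothetical RPC stub
        if ok:
            return offset  # GFS chose the offset; it is identical on every replica
    raise IOError("append failed on all attempts")
```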

File Region Consistency

- File region states
  - Consistent: all clients see the same data
  - Defined: consistent, and clients see the mutation in its entirety
  - Inconsistent: clients may see different data on different replicas

(Diagram: example chunk regions across three replicas, labeled defined, consistent, and inconsistent)

- Consequences
  - Unique record IDs to identify duplicates
  - Records should be self-identifying and self-validating
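One possible way to make records self-identifying and self-validating, sketched below: prefix each payload with a unique ID, its length, and a checksum, so a reader can skip padding, discard garbage regions, and drop duplicates from retried appends. The framing format is an illustrative assumption, not one documented by Google.

```python
import struct, uuid, zlib

HEADER = struct.Struct("<16sII")  # record ID (16 bytes), payload length, CRC-32

def encode_record(payload: bytes) -> bytes:
    """Frame a payload so it is self-identifying (ID) and self-validating (length + checksum)."""
    record_id = uuid.uuid4().bytes
    return HEADER.pack(record_id, len(payload), zlib.crc32(payload)) + payload

def decode_records(region: bytes):
    """Yield valid (id, payload) pairs; duplicates from retried appends are dropped by ID."""
    seen, pos = set(), 0
    while pos + HEADER.size <= len(region):
        record_id, length, crc = HEADER.unpack_from(region, pos)
        payload = region[pos + HEADER.size : pos + HEADER.size + length]
        if len(payload) == length and zlib.crc32(payload) == crc:
            if record_id not in seen:
                seen.add(record_id)
                yield record_id, payload
            pos += HEADER.size + length
        else:
            pos += 1  # padding or a garbage region: resynchronize byte by byte
```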

Fault tolerance GFS

- Automatic re-replication
- Replicas are spread across racks
- Fault mitigations
  - Data corruption → choose a different replica
  - Machine crash → choose a replica on a different machine
  - Network outage → choose a replica in a different network
  - Master crash → shadow masters

Wrap-up GFS

- Proven scalability, fault tolerance and performance
- Highly specialized
- Distribution transparent to clients
- Only partial abstraction: clients must cooperate
- Network is the bottleneck

MapReduce: Aims

- Aims
  - Unified model for large-scale distributed data processing
  - Massively parallelizable
  - Distribution transparent to the developer
  - Fault tolerance
  - Allow moving computation close to the data

Higher order functions

- Functional: map(f, [a, b, c, …])
  - Applies f elementwise: [a, b, c, …] → [f(a), f(b), f(c), …]
- Google: map(k, v)
  - Applied to each key/value pair, emitting a list of intermediate key/value pairs
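A small Python illustration of the two shapes of map on this slide; the word-count style example is added here for concreteness and is not part of the original talk.

```python
# Functional map: apply f to every element of a list
squares = list(map(lambda x: x * x, [1, 2, 3]))  # [1, 4, 9]

# MapReduce-style map: one key/value pair in, a list of intermediate key/value pairs out
def map_fn(key: str, value: str) -> list[tuple[str, int]]:
    """E.g. key = document name, value = document contents."""
    return [(word, 1) for word in value.split()]

print(map_fn("doc1", "to be or not to be"))
# [('to', 1), ('be', 1), ('or', 1), ('not', 1), ('to', 1), ('be', 1)]
```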

Higher order functions

- Functional: foldl/reduce(f, [a, b, c, …], i)
  - Folds the list into a single value: f(f(f(i, a), b), c) …
- Google: reduce(k, [a, b, c, d, …])
  - Applied once per intermediate key, to the list of all values for that key, emitting output values
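The matching illustration for the reduce side, continuing the word-count example:

```python
from functools import reduce

# Functional left fold: ((((0 + 1) + 2) + 3) + 4) = 10
total = reduce(lambda acc, x: acc + x, [1, 2, 3, 4], 0)

# MapReduce-style reduce: one intermediate key together with all of its values
def reduce_fn(key: str, values: list[int]) -> int:
    """E.g. key = word, values = all the 1s emitted for that word by the mappers."""
    return sum(values)

print(reduce_fn("be", [1, 1]))  # 2
```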

Idea

- Observation
  - map can be easily parallelized
  - reduce might be parallelized
- Idea
  - Design the application around the map and reduce scheme
  - Infrastructure manages scheduling and distribution
  - User implements map and reduce
- Constraints
  - All data coerced into key/value pairs
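A minimal single-process sketch of this division of labour: the user supplies map_fn and reduce_fn (as in the word-count example above), while the toy "framework" below does the grouping by key that the real infrastructure performs in a distributed, fault-tolerant way.

```python
from collections import defaultdict
from typing import Callable, Iterable

def run_mapreduce(inputs: Iterable[tuple[str, str]],
                  map_fn: Callable[[str, str], list[tuple[str, int]]],
                  reduce_fn: Callable[[str, list[int]], int]) -> dict[str, int]:
    """Toy sequential MapReduce driver: map every input pair, group by key, reduce each group."""
    intermediate = defaultdict(list)
    for key, value in inputs:
        for k2, v2 in map_fn(key, value):   # user-supplied map
            intermediate[k2].append(v2)     # shuffle/group-by-key, done by the framework
    return {k2: reduce_fn(k2, values)       # user-supplied reduce, once per key
            for k2, values in intermediate.items()}

# Word count over two "documents"
docs = [("doc1", "to be or not to be"), ("doc2", "be quick")]
map_fn = lambda k, v: [(w, 1) for w in v.split()]
reduce_fn = lambda k, vs: sum(vs)
print(run_mapreduce(docs, map_fn, reduce_fn))
# {'to': 2, 'be': 3, 'or': 1, 'not': 1, 'quick': 1}
```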

Walkthrough

- Provide data
  - e.g. from GFS or BigTable
  - Create M splits
- Spawn master
- Map
  - Spawn M mappers
  - map runs once per key/value pair
  - Partition output into R buckets
- Combine
  - Spawn up to R*M combiners
- Barrier
- Reduce
  - Spawn up to R reducers

(Diagram: data flow from mappers through combiners to reducers)
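The "partition into R buckets" step is typically just a hash of the intermediate key modulo R (the default partitioner described in the MapReduce paper); the sketch below uses CRC-32 only as a deterministic stand-in for whatever hash the real implementation uses.

```python
import zlib

def partition(key: str, R: int) -> int:
    """Assign an intermediate key to one of R reduce buckets: hash(key) mod R.

    CRC-32 is a stand-in here; Python's built-in hash() is not stable across
    worker processes, so a real partitioner needs a deterministic hash."""
    return zlib.crc32(key.encode("utf-8")) % R
```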

Scheduling

- Locality
  - Mappers scheduled close to the data
  - Chunk replication improves locality
  - Reducers run on the same machines as mappers
- Choosing M and R
  - Load balancing & fast recovery vs. number of output files
  - M >> R, R a small multiple of the number of machines
- Schedule backup tasks to avoid stragglers

Fault tolerance MapReduce

- Fault mitigations
  - Map crash
    - All intermediate data is in a questionable state
    - → Repeat all map tasks of that machine (rationale for the barrier)
  - Reduce crash
    - Data is global, completed stages are marked
    - → Repeat only the crashed reduce task
  - Master crash
    - → Start over
  - Repeated crashes on certain records
    - → Skip those records

Wrap-up MapReduce

- Proven scalability
- Restrict the programming model for better runtime support
- Tailored to Google's needs
- Programming model is fairly low-level

Conclusion

- Rethinking infrastructure
- Rigorous design for scalability and performance
- Highly specialized solutions
- Not generally applicable
- Solutions more low-level than usual
- Maintenance efforts may be significant

"We believe we get tremendous competitive advantage by essentially building our own infrastructures."
-- Eric Schmidt

References

- Jeffrey Dean and Sanjay Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters", 2004
- Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung, "The Google File System", 2003
- Ralf Lämmel, "Google's MapReduce Programming Model – Revisited", SCP journal, 2006
- Google, "Cluster Computing and MapReduce", http://code.google.com/edu/content/submissions/mapreduce-minilecture/listing.html, 2007
- David F. Carr, "How Google Works", http://www.baselinemag.com/c/a/Projects-Networks-and-Storage/How-Google-Works/, 2006
- Colby Ranger, Ramanan Raghuraman, Arun Penmetsa, Gary Bradski, Christos Kozyrakis, "Evaluating MapReduce for Multi-core and Multiprocessor Systems", 2007