


Distributed Data Storage & Access

Zachary G. Ives, University of Pennsylvania

CIS 455 / 555 – Internet and Web Systems

April 19, 2023

Some slide content courtesy Tanenbaum & van Steen


Homework 2 Milestone 1 deadline imminent

Homework 2 Milestone 2 due Monday after Spring Break

Wed: Marie Jacob on the Q query answering system

Next week: Spring Break


Building Over a DHT

“Message passing” architecture to coordinate behavior among different nodes in an application

Send a request to the “owner” of a key
  Request contains a custom-formatted message type

Each node has an event handler loop:
  switch (msg.type) { case one: … case two: … }

The request handler may send back a result, as appropriate
  Requires that the message include info about who the requestor was and how to return the data
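A minimal sketch of the event handler loop described above, written in Python rather than the slide's switch statement. The message fields (`type`, `key`, `payload`, `reply_to`), the message type names `PUT`, `GET`, and `RESULT`, and the `outbox` list standing in for the network are all assumptions made for illustration; they are not part of any particular DHT library.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Optional, Tuple


@dataclass
class Message:
    type: str                        # custom message type (hypothetical names)
    key: str
    payload: Any = None
    reply_to: Optional[str] = None   # who the requestor was, so a result can go back


class Node:
    def __init__(self, name: str) -> None:
        self.name = name
        self.store: Dict[str, List[Any]] = {}
        # Stand-in for the network: (destination, message) pairs to be sent.
        self.outbox: List[Tuple[str, Message]] = []
        # Dispatch table playing the role of switch (msg.type).
        self.handlers: Dict[str, Callable[[Message], None]] = {
            "PUT": self.on_put,
            "GET": self.on_get,
        }

    def handle(self, msg: Message) -> None:
        handler = self.handlers.get(msg.type)
        if handler is None:
            raise ValueError(f"unknown message type: {msg.type}")
        handler(msg)

    def on_put(self, msg: Message) -> None:
        self.store.setdefault(msg.key, []).append(msg.payload)

    def on_get(self, msg: Message) -> None:
        # A result can only be returned if the request says where to send it.
        if msg.reply_to is not None:
            values = list(self.store.get(msg.key, []))
            self.outbox.append((msg.reply_to, Message("RESULT", msg.key, values)))
```

A real node would loop over an incoming queue or socket and call `handle` on each message; that loop is omitted here.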



Example: How Do We Create a Hash Table (Hash Multiset) Abstraction?

We want the following:
  put (key, value)
  remove (key)
  valueSet = get (key)

How can we use Pastry to do this?
  route()
  deliver()
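One possible answer, sketched as a single-process simulation. The classes `SimNode` and `SimOverlay` are hypothetical stand-ins, not Pastry's actual API: `route()` here simply finds the owner by hashing the key and calls that node's `deliver()` directly, and ownership is simplified to "first node ID at or after the key's hash, wrapping around", whereas Pastry delivers to the node whose ID is numerically closest.

```python
import hashlib
from bisect import bisect_left
from typing import Dict, List, Optional


def hash_id(s: str) -> int:
    """Map a node name or key onto the ID space using SHA-1."""
    return int(hashlib.sha1(s.encode()).hexdigest(), 16)


class SimNode:
    def __init__(self, name: str) -> None:
        self.name = name
        self.id = hash_id(name)
        self.table: Dict[str, List[str]] = {}   # key -> multiset of values

    def deliver(self, op: str, key: str, value: Optional[str] = None):
        """Called on the node that owns the key."""
        if op == "put":
            self.table.setdefault(key, []).append(value)
            return None
        if op == "remove":
            self.table.pop(key, None)
            return None
        if op == "get":
            return list(self.table.get(key, []))
        raise ValueError(f"unknown operation: {op}")


class SimOverlay:
    def __init__(self, names: List[str]) -> None:
        self.nodes = sorted((SimNode(n) for n in names), key=lambda n: n.id)
        self.ids = [n.id for n in self.nodes]

    def owner(self, key: str) -> SimNode:
        # Simplified ownership rule (assumption): successor on the ring.
        i = bisect_left(self.ids, hash_id(key)) % len(self.nodes)
        return self.nodes[i]

    def route(self, key: str, op: str, value: Optional[str] = None):
        # A real overlay forwards hop by hop; here we jump straight to the owner.
        return self.owner(key).deliver(op, key, value)


def put(overlay: SimOverlay, key: str, value: str) -> None:
    overlay.route(key, "put", value)


def remove(overlay: SimOverlay, key: str) -> None:
    overlay.route(key, "remove")


def get(overlay: SimOverlay, key: str) -> List[str]:
    return overlay.route(key, "get")
```

In a real deployment `get` cannot return synchronously: the owner would send the value set back in a separate message addressed to the requestor.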

An Alternate Programming Abstraction: GFS + MapReduce

Abstraction: Instead of sending messages, different pieces of code communicate through files
  Code is going to take a very “stylized” form; at each stage each machine will get input from files, send output to files

Files are generally persistent, name-able (in contrast to DHT messages, which are transient)

Files consist of blocks, which are the basic unit of partitioning (in contrast to object / data item IDs)


Background: Distributed Filesystems

Many distributed filesystems have been developed:
  NFS, SMB are the most prevalent today
  Andrew FileSystem (AFS) was also fairly


Hundreds of other research filesystems, e.g., Coda, Sprite, … with different properties


NFS in a Nutshell

(Single) server, multi-client architecture
  Server is stateless, so clients must send all context (including position to read from) in each request

Plugs into VFS APIs, mostly mimics UNIX semantics
  Opening a file requires opening each dir along the way
  fd = open(“/x/y/z.txt”) will do a
    lookup for x from the root handle
    lookup for y from x’s handle
    lookup for z.txt from y’s handle

Server must commit writes immediately

Client does heavy caching, which requires frequent polling for validity and/or use of an external locking service
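The component-by-component lookup can be sketched as follows. `StatelessServer` is a hypothetical in-memory stand-in, not the NFS wire protocol; integer handles and the hard-coded directory tree are assumptions for illustration. Note that `read` takes the offset as an argument, since a stateless server keeps no per-client file position.

```python
from typing import Dict, List


class StatelessServer:
    """In-memory stand-in for a stateless file server (illustrative only)."""

    def __init__(self) -> None:
        # Directory handle -> {entry name: child handle}; handle 0 is the root.
        self.dirs: Dict[int, Dict[str, int]] = {
            0: {"x": 1},
            1: {"y": 2},
            2: {"z.txt": 3},
        }
        self.data: Dict[int, bytes] = {3: b"hello world"}
        self.lookups: List[str] = []   # records each lookup request received

    def lookup(self, dir_handle: int, name: str) -> int:
        self.lookups.append(name)
        return self.dirs[dir_handle][name]

    def read(self, handle: int, offset: int, count: int) -> bytes:
        # The client supplies the position on every request.
        return self.data[handle][offset:offset + count]


def open_path(server: StatelessServer, path: str, root_handle: int = 0) -> int:
    """Resolve a path one component at a time, starting from the root handle."""
    handle = root_handle
    for component in path.strip("/").split("/"):
        handle = server.lookup(handle, component)
    return handle
```

Opening `/x/y/z.txt` therefore costs three lookup round trips before any data is read, which is one reason clients cache so heavily.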


The Google File System (GFS)

Goals:
  Support millions of huge (many-TB) files
  Partition & replicate data across thousands of unreliable machines, in multiple racks (and even data centers)

Willing to make some compromises to get there:
  Modified APIs – doesn’t plug into POSIX APIs
    In fact, relies on being built over Linux file system
  Doesn’t provide transparent consistency to apps!
    App must detect duplicate or bad records, support checkpoints
  Performance is only good with a particular class of apps:
    Stream-based reads
    Atomic record appends


GFS Basic Architecture & Lookups

Files broken into 64MB “chunks”

Master stores metadata; 3 chunkservers store each chunk
  A single “flat” file namespace maps to chunks + replicas

As with Napster, actual data transfer from chunkservers to client

No client-side caching!
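With fixed 64MB chunks, a client can work out which chunk a byte offset falls in before asking the master where that chunk lives. A minimal sketch of that arithmetic; the function names are hypothetical, and only the chunk size comes from the slide.

```python
from typing import List

CHUNK_SIZE = 64 * 1024 * 1024   # 64MB chunks, as stated on the slide


def chunk_index(offset: int) -> int:
    """Index of the chunk that contains the given byte offset."""
    return offset // CHUNK_SIZE


def chunks_for_range(offset: int, length: int) -> List[int]:
    """Indexes of all chunks touched by reading `length` bytes from `offset`."""
    if length <= 0:
        return []
    first = offset // CHUNK_SIZE
    last = (offset + length - 1) // CHUNK_SIZE
    return list(range(first, last + 1))
```

A read that straddles a chunk boundary touches two chunks, so the client may need to contact two different sets of chunkservers.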

The Master: Metadata and Versions

Controls (and locks as appropriate):
  Mapping from files -> chunks within each namespace
  Controls reallocation, garbage collection of chunks

Maintains a log (replicated to backups) of all mutations to the above

Also knows mapping from chunk ID -> <version, {machines}>
  Doesn’t have persistent knowledge of what’s on chunkservers
  Instead, during startup, it polls them
  … Or when one joins, it registers



Each chunkserver holds replicas of some of the chunks

For a given write operation, one of the owners of the chunk gets a lease: it becomes the primary, and all others become secondaries
  Receives requests for mutations
  Assigns an order
  Notifies the secondary nodes
  Waits for all to say they received the message
  Responds with a write-succeeded message

Failure results in inconsistent data!!


A Write Operation

1. Client asks Master for lease-owning chunkserver

2. Master gives ID of primary, secondary chunkservers; client caches

3. Client sends its data to all replicas, in any order

4. Once client gets ACK, it requests primary to do a write of those data items. Primary assigns serial numbers to these operations.

5. Primary forwards write to secondaries (in a chain).

6. Secondaries reply “SUCCESS”

7. Primary replies to client
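The steps above can be sketched as a single-process simulation. The classes `Replica` and `Primary` and the `data_id` handle are hypothetical; the master lookup (steps 1 and 2) is skipped, and the primary contacts each secondary directly rather than forwarding along a chain. The point is only to show data being pushed first and ordered second.

```python
from typing import Dict, List, Tuple


class Replica:
    def __init__(self, name: str) -> None:
        self.name = name
        self.buffer: Dict[str, bytes] = {}            # pushed data, not yet applied
        self.applied: List[Tuple[int, bytes]] = []    # (serial number, data) in order

    def push(self, data_id: str, data: bytes) -> bool:
        """Step 3: the client sends data to every replica; returns an ACK."""
        self.buffer[data_id] = data
        return True

    def apply(self, serial: int, data_id: str) -> str:
        """Apply previously pushed data under the serial number the primary chose."""
        self.applied.append((serial, self.buffer.pop(data_id)))
        return "SUCCESS"


class Primary(Replica):
    def __init__(self, name: str, secondaries: List[Replica]) -> None:
        super().__init__(name)
        self.secondaries = secondaries
        self.next_serial = 0

    def write(self, data_id: str) -> bool:
        """Steps 4 to 7: assign a serial number, apply locally, forward, reply."""
        serial = self.next_serial
        self.next_serial += 1
        self.apply(serial, data_id)
        replies = [s.apply(serial, data_id) for s in self.secondaries]
        return all(r == "SUCCESS" for r in replies)


def client_write(primary: Primary, data_id: str, data: bytes) -> bool:
    replicas: List[Replica] = [primary] + primary.secondaries
    if not all(r.push(data_id, data) for r in replicas):
        return False
    return primary.write(data_id)
```

Because the primary alone picks the serial numbers, every replica applies concurrent writes in the same order. This sketch does not model failures partway through, which is where the inconsistency mentioned earlier comes from.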



GFS supports atomic append that multiple machines can use at the same time

Primary will interleave the requests in any order
  Will be written “at least once”!

Primary determines a position for the write, forwards this to the secondaries


Failures and the Client

If there is a failure in a record write or append, the client will generally retry
  If there was “partial success” in a previous append, there might be more than one copy on some nodes, and inconsistency

Client must handle this through checksums, record IDs, and periodic checkpointing
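A minimal sketch of the reader-side filtering this implies, assuming each record carries an application-chosen record ID and a CRC-32 checksum of its payload. The record layout and function names are assumptions for illustration, not the format GFS applications actually use; checkpointing is not shown.

```python
import zlib
from typing import Iterable, Iterator, Tuple

# Hypothetical record layout: (record_id, payload, checksum)
Record = Tuple[str, bytes, int]


def make_record(record_id: str, payload: bytes) -> Record:
    return (record_id, payload, zlib.crc32(payload))


def valid_unique(records: Iterable[Record]) -> Iterator[Tuple[str, bytes]]:
    """Skip records with a bad checksum and records already seen by ID."""
    seen = set()
    for record_id, payload, checksum in records:
        if zlib.crc32(payload) != checksum:
            continue          # bad or partially written record
        if record_id in seen:
            continue          # duplicate left behind by a retried append
        seen.add(record_id)
        yield record_id, payload
```

Keeping every seen ID in memory is fine for a sketch; a long-running reader would bound that state, for example by resetting it at each checkpoint.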


GFS Performance

Many performance numbers in the paper
  Not enough context here to discuss them in much detail; would need to see how they compare with other approaches!

But: validate high scalability in terms of concurrent reads, concurrent appends, with data partitioned and replicated across many machines

Also show fast recovery from failed nodes

Not the only approach to many of these problems, but one shown to work at industrial-strength!


A Popular Distributed Programming Model: MapReduce

In many circles, considered the key building block for much of Google’s data analysis

A programming language built on it: Sawzall, http://labs.google.com/papers/sawzall.html
  “… Sawzall has become one of the most widely used programming languages at Google. … [O]n one dedicated Workqueue cluster with 1500 Xeon CPUs, there were 32,580 Sawzall jobs launched, using an average of 220 machines each. While running those jobs, 18,636 failures occurred (application failure, network outage, system crash, etc.) that triggered rerunning some portion of the job. The jobs read a total of 3.2x10^15 bytes of data (2.8PB) and wrote 9.9x10^12 bytes (9.3TB).”

Other similar languages: Yahoo’s Pig Latin and Pig; Microsoft’s Dryad

Cloned in open source: Hadoop, http://hadoop.apache.org/core/

So what is it? What’s it good for?


MapReduce: Simple Distributed Functional Programming Primitives

Modeled after Lisp primitives: map (apply function to all items in a collection) and reduce (apply function to set of items with a common key)

We start with:
  A user-defined function to be applied to all data,
    map: (key, value) -> (key, value)
  Another user-specified operation
    reduce: (key, {set of values}) -> result
  A set of n nodes, each with data

All nodes run map on all of their data, producing new data with keys
  This data is collected by key, then shuffled, reduced
  Dataflow is through temp files on GFS
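A minimal single-process sketch of these two primitives, using word counting as the user-defined pair. The runner `run_mapreduce` is hypothetical: it groups values in memory instead of shuffling through temp files, and the list of partitions stands in for the n nodes.

```python
from collections import defaultdict
from typing import Any, Callable, Dict, Iterable, List, Tuple

KV = Tuple[Any, Any]


def run_mapreduce(
    partitions: List[List[KV]],
    map_fn: Callable[[Any, Any], Iterable[KV]],
    reduce_fn: Callable[[Any, List[Any]], Any],
) -> Dict[Any, Any]:
    groups: Dict[Any, List[Any]] = defaultdict(list)
    # Map phase: every "node" runs map_fn over all of its own data.
    for partition in partitions:
        for key, value in partition:
            for out_key, out_value in map_fn(key, value):
                groups[out_key].append(out_value)   # collect by output key
    # Reduce phase: one call per distinct key, with all of that key's values.
    return {key: reduce_fn(key, groups[key]) for key in sorted(groups)}


def wc_map(doc_id: str, text: str) -> Iterable[KV]:
    for word in text.split():
        yield word.lower(), 1


def wc_reduce(word: str, counts: List[int]) -> int:
    return sum(counts)
```

For example, `run_mapreduce([[("d1", "a b a")], [("d2", "b c")]], wc_map, wc_reduce)` counts words across both partitions.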


Some Example Tasks

Count word occurrences
  Map: output word with count 1
  Reduce: sum the counts

Distributed grep – all lines matching a pattern
  Map: filter by pattern
  Reduce: output set

Count URL access frequency
  Map: output each URL as key, with count 1
  Reduce: sum the counts

For each IP address, get the document with the most in-links

Number of queries by IP address (requires multiple steps)
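Two of the single-step tasks above, sketched against a tiny in-memory runner. The runner `mapreduce`, the log line format (`METHOD URL STATUS`), and the choice to key grep output by the pattern are all assumptions for illustration.

```python
from collections import defaultdict
from typing import Any, Callable, Dict, Iterable, List, Tuple

KV = Tuple[Any, Any]


def mapreduce(
    records: Iterable[KV],
    map_fn: Callable[[Any, Any], Iterable[KV]],
    reduce_fn: Callable[[Any, List[Any]], Any],
) -> Dict[Any, Any]:
    """Tiny in-memory runner: map every record, group by key, reduce each group."""
    groups: Dict[Any, List[Any]] = defaultdict(list)
    for key, value in records:
        for out_key, out_value in map_fn(key, value):
            groups[out_key].append(out_value)
    return {key: reduce_fn(key, values) for key, values in groups.items()}


def make_grep_map(pattern: str) -> Callable[[int, str], Iterable[KV]]:
    """Distributed grep, map side: keep only lines containing the pattern."""
    def grep_map(line_no: int, line: str) -> Iterable[KV]:
        if pattern in line:
            yield pattern, line
    return grep_map


def grep_reduce(pattern: str, lines: List[str]) -> List[str]:
    """Distributed grep, reduce side: output the set of matching lines."""
    return sorted(set(lines))


def url_map(line_no: int, line: str) -> Iterable[KV]:
    """URL access frequency, map side: emit each URL with count 1."""
    yield line.split()[1], 1


def url_reduce(url: str, counts: List[int]) -> int:
    """URL access frequency, reduce side: sum the counts."""
    return sum(counts)
```

The URL job has the same shape as word count; only the key extraction differs. Tasks marked as needing multiple steps would chain two such jobs, feeding the first job's output to the second.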


MapReduce Dataflow Diagram (Default MapReduce Uses Filesystem)

[Figure: data partitions by key -> map computation partitions -> redistribution by output’s key -> reduce computation partitions]


Some Details

Fewer computation partitions than data partitions
  All data is accessible via a distributed filesystem with replication
  Worker nodes produce data in key order (makes it easy to merge)
  The master is responsible for scheduling, keeping all nodes busy
  The master knows how many data partitions there are, which have completed – atomic commits to disk

Fault tolerance: master triggers re-execution of work originally performed by failed nodes – to make their data available again

Locality: master tries to do work on nodes that have replicas of the data
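A rough sketch of the locality and fault-tolerance rules just described. The `Master` class, its method names, and the tie-breaking rule (alphabetically first worker) are hypothetical; load balancing, timeouts, and failure detection are left out.

```python
from typing import Dict, List, Set


class Master:
    def __init__(self, replicas: Dict[int, Set[str]]) -> None:
        self.replicas = replicas            # partition -> workers holding a replica
        self.assigned: Dict[int, str] = {}  # partition -> worker running its task
        self.done: Set[int] = set()         # partitions whose task has completed

    def assign(self, partition: int, live_workers: List[str]) -> str:
        """Locality: prefer a live worker that already holds the data."""
        local = sorted(self.replicas[partition] & set(live_workers))
        worker = local[0] if local else sorted(live_workers)[0]
        self.assigned[partition] = worker
        return worker

    def complete(self, partition: int) -> None:
        self.done.add(partition)

    def worker_failed(self, worker: str, live_workers: List[str]) -> List[int]:
        """Fault tolerance: re-execute everything the failed worker was given.

        Completed tasks are redone too, because their output lived on the
        failed worker and is no longer available.
        """
        redo = [p for p, w in self.assigned.items() if w == worker]
        for partition in redo:
            self.done.discard(partition)
            self.assign(partition, live_workers)
        return redo
```

Re-running completed work is safe here because map and reduce tasks are deterministic functions of their input files.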


Hadoop: A “Modern” Open-Source “Clone” of MapReduce + GFS

Underlying Hadoop: HDFS, a page-level replicating filesystem
  Modeled in part after GFS
  Supports “streaming” page access from each site

Master/Slave: “Namenode” vs “Datanodes”

Source: Hadoop HDFS architecture documentation

Hadoop HDFS + MapReduce


Source: “Meet Hadoop”, Devaraj Das, Yahoo Bangalore & Apache

Hadoop MapReduce Architecture

“Jobtracker” (Master):
  Accepts jobs submitted by users
  Gives tasks to Tasktrackers – makes scheduling decisions, co-locates tasks to data
  Monitors task, tracker status, re-executes tasks if needed

“Tasktrackers” (Slaves):
  Run Map and Reduce tasks
  Manage storage, transmission of intermediate



How Does this Relate to DHTs?

Consider replacing the filesystem with the DHT…


What Does MapReduce Do Well?

What are its strengths?

What about weaknesses?


MapReduce is a Particular Programming Model

… But it’s not especially general (though things like Pig Latin improve it)

Suppose we have autonomous application components that wish to communicate

We’ve already seen a few strategies:
  Request/response from client to server
    HTTP itself
  Asynchronous messages
    Router “gossip” protocols
    P2P “finger tables”, etc.

Are there general mechanisms and principles? (Of course!)

… Let’s first look at what happens if we need in-order messaging



Message-Queuing Model (1)

Four combinations for loosely-coupled communications using queues.



Message-Queuing Model (2)

Basic interface to a queue in a message-queuing system.

Primitive   Meaning
Put         Append a message to a specified queue
Get         Block until the specified queue is nonempty, and remove the first message
Poll        Check a specified queue for messages, and remove the first. Never block.
Notify      Install a handler to be called when a message is put into the specified queue.
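The four primitives can be sketched as an in-process queue built on Python's `threading.Condition`. This is an illustration of the interface only, not how any particular message-queuing product implements it; the class name `MessageQueue` and the choice to leave a message in the queue after notifying handlers are assumptions.

```python
import threading
from collections import deque
from typing import Any, Callable, Deque, List, Optional


class MessageQueue:
    def __init__(self) -> None:
        self._items: Deque[Any] = deque()
        self._cond = threading.Condition()
        self._handlers: List[Callable[[Any], None]] = []

    def put(self, msg: Any) -> None:
        """Append a message to the queue and wake one blocked get()."""
        with self._cond:
            self._items.append(msg)
            handlers = list(self._handlers)
            self._cond.notify()
        for handler in handlers:
            handler(msg)   # run outside the lock so handlers may use the queue

    def get(self) -> Any:
        """Block until the queue is nonempty, then remove the first message."""
        with self._cond:
            while not self._items:
                self._cond.wait()
            return self._items.popleft()

    def poll(self) -> Optional[Any]:
        """Remove and return the first message, or None if empty. Never blocks."""
        with self._cond:
            return self._items.popleft() if self._items else None

    def notify(self, handler: Callable[[Any], None]) -> None:
        """Install a handler to be called whenever a message is put."""
        with self._cond:
            self._handlers.append(handler)
```

`get` gives the synchronous (blocking) style, while `poll` and `notify` give the two asynchronous styles: polling and event-driven.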


General Architecture of a Message-Queuing System (1)

The relationship between queue-level addressing and network-level addressing.


General Architecture of a Message-Queuing System (2)

The general organization of a message-queuing system with routers.



Benefits of Message Queueing

Allows both synchronous (blocking) and asynchronous (polling or event-driven) communication

Ensures messages are delivered (or at least readable) in the order received

The basis of many transactional systems
  e.g., Microsoft Message Queuing (MSMQ), IBM MQSeries, etc.
