Transcript
Page 1: azure track -04- azure storage deep dive

Storage ServicesYves Goeleven

Page 2: azure track -04- azure storage deep dive

Still alive?

Page 3: azure track -04- azure storage deep dive

Yves GoelevenFounder of MessageHandler.net

• Windows Azure MVP

Page 4: azure track -04- azure storage deep dive

Exhibition theater @ kinepolis

Page 5: azure track -04- azure storage deep dive
Page 6: azure track -04- azure storage deep dive

www.messagehandler.net

Page 7: azure track -04- azure storage deep dive

What you might have expectedAzure Storage Services

• VM (Disk persistence)• Azure Files• CDN• StoreSimple• Backup & replica’s• Application storage• Much much more...

Page 8: azure track -04- azure storage deep dive

I assume you know what storage services does!

Page 9: azure track -04- azure storage deep dive

The real agendaDeep Dive Architecture

• Global Namespace• Front End Layer• Partition Layer• Stream Layer

Page 10: azure track -04- azure storage deep dive

WARNING!

Page 11: azure track -04- azure storage deep dive

Level 666

Page 12: azure track -04- azure storage deep dive
Page 13: azure track -04- azure storage deep dive

Which discusses howStorage defies the rules of the CAP theorem

• Distributed & scalable system

• Highly available• Strongly Consistent• Partition tolerant

• All at the same time!

Page 14: azure track -04- azure storage deep dive

That can’t be done, right?

Page 15: azure track -04- azure storage deep dive

Let’s dive in

Page 16: azure track -04- azure storage deep dive

Global namespace

Page 17: azure track -04- azure storage deep dive

Global namespaceEvery request gets sent to

• http(s)://AccountName.Service.core.windows.net/PartitionName/ObjectName

Page 18: azure track -04- azure storage deep dive

Global namespaceDNS based service location

• http(s)://AccountName.Service.core.windows.net/PartitionName/ObjectName

• DNS resolution• Datacenter (f.e. Western Europe)• Virtual IP of a Stamp

Page 19: azure track -04- azure storage deep dive

StampScale unit

• A windows azure deployment• 3 layers

• Typically 20 racks each• Roughly 360 XL instances

• With disks attached• 2 Petabyte (before june 2012)• 30 Petabyte

LB

Front End Layer

Partition Layer

Stream Layer

Intra-stamp replication

Page 20: azure track -04- azure storage deep dive

Incoming read requestArrives at Front End

• Routed to partition layer• http(s)://AccountName.Service.core.windows.net/PartitionName/ObjectName • Reads usually from memory

Partition Layer

Front End LayerLB

FE FE FE FE FE FE FE FE FE

PS PS PS PS PS PS PS PS PS

Page 21: azure track -04- azure storage deep dive

Incoming read request2 nodes per request

• Blazingly fast• Flat “Quantum 10” network topology• 50 Tbps of bandwidth

• 10 Gbps / node• Faster than typical ssd drive Switch Switch

Switch Switch

Page 22: azure track -04- azure storage deep dive

Incoming Write requestWritten to stream layer

• Synchronous replication • 1 Primary Extend Node• 2 Secondary Extend Nodes

Front End Layer

Stream Layer

Partition Layer

Primary SecondarySecondary

PS PS PS PS PS PS PS PS PS

EN EN EN EN EN EN EN

Page 23: azure track -04- azure storage deep dive

Geo replication

LB

Front End Layer

Partition Layer

Stream Layer

Intra-stamp replication

LB

Front End Layer

Partition Layer

Stream Layer

Intra-stamp replication

Inter-stamp replication

Location Service

Page 24: azure track -04- azure storage deep dive

Location ServiceStamp Allocation

• Manages DNS records• Determines active/passive stamp configuration

• Assigns accounts to stamps• Geo Replication• Disaster Recovery• Load Balancing• Account Migration

Page 25: azure track -04- azure storage deep dive

Front end layer

Page 26: azure track -04- azure storage deep dive

Front End NodesStateless role instances

• Authentication & Authorization• Shared Key (storage key)• Shared Access Signatures

• Routing to Partition Servers • Based on PartitionMap (in memory cache)

• Versioning • x-ms-version: 2011-08-18

• Throttling

Front End Layer

FE FE FE FE FE FE FE FE FE

Page 27: azure track -04- azure storage deep dive

Partition MapDetermine partition server

• Used By Front End Nodes• Managed by Partition Manager (Partition Layer)• Ordered Dictionary

• Key: http(s)://AccountName.Service.core.windows.net/PartitionName/ObjectName

• Value: Partition Server• Split across partition servers

• Range Partition

Page 28: azure track -04- azure storage deep dive

Practical examplesBlob

• Each blob is a partition• Blob uri determines range• Range only matters for listing• Same container: More likely same range

• https://myaccount.blobs.core.windows.net/mycontainer/subdir/file1.jpg• https://myaccount.blobs.core.windows.net/mycontainer/subdir/file2.jpg

• Other container: Less likely• https://myaccount.blobs.core.windows.net/othercontainer/subdir/file1.jpg

PS PS PS

Page 29: azure track -04- azure storage deep dive

Practical examplesQueue

• Each queue is a partition• Queue name determines range• All messages same partition• Range only matters for listing queues

PS PS PS

Page 30: azure track -04- azure storage deep dive

Practical examplesTable

• Each group of rows with same PartitionKey is a partition• Ordering within set by RowKey• Range matters a lot! • Expect continuation tokens!

PS PS PS

Page 31: azure track -04- azure storage deep dive

Continuation tokensRequests across partitions

• All on same partition server• 1 result

• Across multiple ranges (or to many results)• Sequential queries required• 1 result + continuation token• Repeat query with token until done

• Will almost never occur in development!• As dataset is typically small

PS PS PS

Page 32: azure track -04- azure storage deep dive

Partition layer

Page 33: azure track -04- azure storage deep dive

Partition layerDynamic Scalable Object Index

• Object Tables• System defined data structures• Account, Entity, Blob, Message, Schema, PartitionMap

• Dynamic Load Balancing• Monitor load to each part of the index to determine hot spots• Index is dynamically split into thousands of Index Range Partitions

based on load• Index Range Partitions are automatically load balanced across

servers to quickly adapt to changes in load• Updates every 15 seconds

Page 34: azure track -04- azure storage deep dive

Consists of

• Partition Master• Distributes Object Tables• Partitions across Partition Servers• Dynamic Load Balancing

• Partition Server• Serves data for exactly 1 Range Partition• Manages Streams

• Lock Server• Master Election• Partition Lock

Partition layer

Partition Layer

PS PS PS

PM PM

LS LS

Page 35: azure track -04- azure storage deep dive

How it works

Dynamic Scalable Object Index

Partition Layer

PS PS PS

PM PM

LS LS

Front End Layer

FE FE FE FE

Account

Container

Blob

Aaaa Aaaa Aaaa

Harry Pictures Sunrise

Harry Pictures Sunset

Richard Images Soccer

Richard Images Tennis

Zzzz Zzzz zzzz

A - H

H’- R

R’- Z

Map

Page 36: azure track -04- azure storage deep dive

Not in the critical path!

Dynamic Load Balancing

Stream Layer

Partition Layer

Primary SecondarySecondary

PS PS PS PS PS PS PS PS PS

EN EN EN EN EN EN EN

PM PM

LS LS

Page 37: azure track -04- azure storage deep dive

Architecture

• Log Structured Merge-Tree• Easy replication

• Append only persisted streams• Commit log• Metadata log• Row Data (Snapshots)• Blob Data (Actual blob data)

• Caches in memory• Memory table (= Commit log)• Row Data• Index (snapshot location)• Bloom filter• Load Metrics

Partition Server

Memory Data

Persistent Data

Commit Log Stream

Metadata Log Stream

Row Data Stream

Blob Data Stream

MemoryTable

RowCache

IndexCache

BloomFilters

Load Metrics

Page 38: azure track -04- azure storage deep dive

HOT Data

• Completely in memory

Incoming Read request

Memory Data

Persistent DataCommit Log

StreamMetadata Log

Stream

Row Data Stream

Blob Data Stream

MemoryTable

RowCache Index

CacheBloomFilters

Load Metrics

Page 39: azure track -04- azure storage deep dive

COLD Data

• Partially Memory

Incoming Read request

Memory Data

Persistent DataCommit Log

StreamMetadata Log

Stream

Row Data StreamRow Data Stream

Blob Data Stream

MemoryTable

RowCache Index

CacheBloomFilters

Load Metrics

Page 40: azure track -04- azure storage deep dive

Persistent First

• Followed by memory

Incoming Write Request

Memory Data

Persistent DataCommit Log

StreamCommit Log

StreamMetadata Log

Stream

Row Data Stream

Blob Data Stream

MemoryTable

RowCache Index

CacheBloomFilters

Load Metrics

Page 41: azure track -04- azure storage deep dive

Not in critical path!

Checkpoint process

Memory Data

Persistent DataCommit Log Stream

Commit Log Stream

Metadata Log Stream

Row Data Stream

Blob Data Stream

MemoryTable

RowCache Index

CacheBloomFilters

Load Metrics

Row Data StreamCommit Log Stream

Page 42: azure track -04- azure storage deep dive

Not in critical path!

Geo replication process

Memory Data

Persistent Data

Commit Log Stream

Commit Log Stream

Metadata Log Stream

Row Data Stream

Blob Data Stream

MemoryTable

RowCache

IndexCache

BloomFilters

Load MetricsMemory Data

Persistent Data

Commit Log Stream

Commit Log Stream

Metadata Log Stream

Row Data Stream

Blob Data Stream

MemoryTable

RowCache

IndexCache

BloomFilters

Load Metrics

Row Data Stream

Page 43: azure track -04- azure storage deep dive

Stream layer

Page 44: azure track -04- azure storage deep dive

Append-Only Distributed File System

• Streams are very large files• File system like directory namespace

• Stream Operations• Open, Close, Delete Streams• Rename Streams• Concatenate Streams together• Append for writing• Random reads

Stream layer

Page 45: azure track -04- azure storage deep dive

Concepts

• Block• Min unit of write/read• Checksum• Up to N bytes (e.g. 4MB)

• Extent• Unit of replication• Sequence of blocks• Size limit (e.g. 1GB)• Sealed/unsealed

• Stream• Hierarchical namespace• Ordered list of pointers to extents• Append/Concatenate

Stream layer

Extent

Stream //foo/myfile.data

Extent Extent Extent

Pointer 1 Pointer 2 Pointer 3

Blo

ck

Blo

ck

Blo

ck

Blo

ck

Blo

ck

Blo

ck

Blo

ck

Blo

ck

Blo

ck

Blo

ck

Blo

ck

Page 46: azure track -04- azure storage deep dive

Consists of

• Paxos System• Distributed consensus protocol• Stream Master

• Extend Nodes• Attached Disks• Manages Extents• Optimization Routines

Stream layer

Stream Layer

EN EN EN EN

SMSM

SM

Paxos

Page 47: azure track -04- azure storage deep dive

Allocation process

• Partition Server• Request extent / stream creation

• Stream Master• Chooses 3 random extend nodes based

on available capacity• Chooses Primary & Secondary's• Allocates replica set

• Random allocation• To reduce Mean Time To Recovery

Extent creation

Stream Layer

EN EN EN

SM

Partition LayerPS

PrimarySecondarySecondary

Page 48: azure track -04- azure storage deep dive

Bitwise equality on all nodes

• Synchronous• Append request to primary • Written to disk, checksum• Offset and data forwarded to secondaries• Ack

Replication Process

Journaling

• Dedicated disk (spindle/ssd)• For writes• Simultaneously copied to data disk

(from memory or journal)

Stream Layer

EN EN EN

SM

Partition LayerPS

Primary Secondary Secondary

Page 49: azure track -04- azure storage deep dive

Ack from primary lost going back to partition layer

• Retry from partition layer • can cause multiple blocks to be appended (duplicate records)

Dealing with failure

Unresponsive/Unreachable Extent Node (EN)

• Seal the failed extent & allocate a new extent and append immediately

ExtentExtent

Stream //foo/myfile.data

Extent Extent

Pointer 1 Pointer 2 Pointer 3

Blo

ck

Blo

ck

Blo

ck

Blo

ck

Blo

ck

Blo

ck

Blo

ck

Blo

ck

Blo

ck

Blo

ck

Blo

ck

Blo

ck

Extent

Blo

ck

Blo

ck

Blo

ck

Extent

Blo

ck

Page 50: azure track -04- azure storage deep dive

Replication Failures (Scenario 1)

Stream Layer

EN EN EN

SM

Partition Layer

PS

Primary

Blo

ck

Blo

ck

Blo

ck

Secondary

Blo

ck

Blo

ck

Blo

ck

Secondary

Blo

ck

Blo

ck

Blo

ck

Blo

ck

Blo

ck

Seal

120120

Primary

Blo

ck

Blo

ck

Blo

ck

Secondary

Blo

ck

Blo

ck

Blo

ck

Blo

ck

Blo

ck

Secondary

Blo

ck

Blo

ck

Blo

ck

Blo

ck

Page 51: azure track -04- azure storage deep dive

Replication Failures (Scenario 2)

Stream Layer

EN EN EN

SM

Partition Layer

PS

Primary

Blo

ck

Blo

ck

Blo

ck

Secondary

Blo

ck

Blo

ck

Blo

ck

Secondary

Blo

ck

Blo

ck

Blo

ck

Blo

ck

Seal

120100

Primary

Blo

ck

Blo

ck

Blo

ck

Secondary

Blo

ck

Blo

ck

Blo

ck

Secondary

Blo

ck

Blo

ck

Blo

ck?

Page 52: azure track -04- azure storage deep dive

Data Stream

• Partition Layer only reads from offsets returned from successful appends

• Committed on all replicas• Row and Blob Data Streams• Offset valid on any replica• Safe to read

Network partitioning during read

Stream Layer

EN EN EN

SM

Partition Layer

PS

Primary

Blo

ck

Blo

ck

Blo

ck

Secondary

Blo

ck

Blo

ck

Blo

ck

Secondary

Blo

ck

Blo

ck

Blo

ck

Blo

ck

Blo

ck

Page 53: azure track -04- azure storage deep dive

Log Stream

• Logs are used on partition load• Check commit length first• Only read from

• Unsealed replica if all replicas have the same commit length

• A sealed replica

Network partitioning during read

Stream Layer

EN EN EN

SM

Partition Layer

PS

Primary

Blo

ck

Blo

ck

Blo

ck

Secondary

Blo

ck

Blo

ck

Blo

ck

Secondary

Blo

ck

Blo

ck

Blo

ck

Blo

ck

1, 2

Seal

Secondary

Blo

ck

Blo

ck

Blo

ck

Blo

ck

Blo

ck

Secondary

Blo

ck

Blo

ck

Blo

ck

Blo

ck

Page 54: azure track -04- azure storage deep dive

Downside to this architecture

• Lot’s of wasted disk space• ‘Forgotten’ streams• ‘Forgotten’ extends• Unused blocks in extends after write errors

• Garbage collection required• Partition servers mark used streams• Stream manager cleans up unreferenced streams• Stream manager cleans up unreferenced extends• Stream manager performs erasure coding

• Reed Solomon algorithm• Allows recreation of cold streams after deleting extends

Garbage Collection

Page 55: azure track -04- azure storage deep dive

Wrapup

Page 56: azure track -04- azure storage deep dive

Layering & co-design

• Stream Layer• Availability with Partition/failure tolerance• For Consistency, replicas are bit-wise identical up to the commit length

• Partition Layer• Consistency with Partition/failure tolerance• For Availability, RangePartitions can be served by any partition server • Are moved to available servers if a partition server fails

• Designed for specific classes of partitioning/failures seen in practice• Process to Disk to Node to Rack failures/unresponsiveness• Node to Rack level network partitioning

Beating the CAP theorem

Page 58: azure track -04- azure storage deep dive

Still alive?

Page 59: azure track -04- azure storage deep dive

And take home the Lumia 1320

Present your feedback form when you exit the last session & go for the drink

Give Me Feedback

Page 60: azure track -04- azure storage deep dive

Follow Technet Belgium@technetbelux

Subscribe to the TechNet newsletteraka.ms/benews

Be the first to know

Page 61: azure track -04- azure storage deep dive

Belgiums’ biggest IT PRO Conference