OceanStore: An Infrastructure for
Global-Scale Persistent Storage
John Kubiatowicz, David Bindel, Yan Chen, Steven
Czerwinski,
Patrick Eaton, Dennis Geels, Ramakrishna
Gummadi, Sean Rhea,
Hakim Weatherspoon, Westley Weimer, Chris Wells,
Ben Zhao
A few slides have been borrowed from the authors’ presentations
Vision
• What is Oceanstore?
• “a utility infrastructure to span the globe and provide continuous access to persistent information”
Source: Berkeley OceanStore Website
Vision
• What is Oceanstore?
• “a utility infrastructure to span the globe and
provide continuous access to persistent
information”
• data
• all kinds of information
• desktop, laptop, palmtop
• cars, cellular phones, other devices
• futuristic: embedded in environment
Vision
• What is Oceanstore?
• “a utility infrastructure to span the globe and
provide continuous access to persistent
information”
• persistence
• devices can be rebooted, lost, replaced
• reliable, durable data (“deep archival” will last
forever)
• Automatic maintenance
Vision
• What is Oceanstore?
• “a utility infrastructure to span the globe and
provide continuous access to persistent information”
• connectivity
• even to tiniest devices, possibly intermittent
• variable bandwidth, latency
• availability
• uniform access, comparable to LAN-based networked
storage
• fault-tolerant, DoS-tolerant
Vision
• What is Oceanstore?
• “a utility infrastructure to span the globe and
provide continuous access to persistent
information”
• scale
• geographically distributed
• 10^10 users
• 10^14 files / objects
Questions about information:
• Where is persistent information stored?
• 20th-century tie between location and content outdated
• In world-scale system, locality is key
• How is it protected?
• Can disgruntled employee of ISP sell your secrets?
• Can’t trust anyone (how paranoid are you?)
• Can we make it indestructible?
• Want our data to survive “the big one”!
• Highly resistant to hackers (denial of service)
• Wide-scale disaster recovery
• Is it hard to manage?
• Worst failures are human-related
• Want automatic (introspective) diagnosis and repair
First Observation: Want Utility Infrastructure
• Mark Weiser from Xerox: Transparent computing is the
ultimate goal. Computers should disappear into the background
• In the context of storage:
• Don’t want to worry about backup
• Don’t want to worry about obsolescence
• Need lots of resources to make data secure and highly
available, BUT don’t want to own them
• Outsourcing of storage already becoming popular
• Pay monthly fee and your “data is out there”
• Service provided by confederation of companies
• Monthly fee paid to one service provider
• Companies buy and sell capacity from each other
Utility-based Infrastructure
[Figure: OceanStore as a utility built from a confederation of providers; labels shown: Pac Bell, Sprint, IBM, AT&T, Canadian OceanStore]
Target applications
Group calendar, contacts
Distributed design tools
Computer Supported Cooperative Work
Digital libraries
Distributed/shared repositories
Assumptions
• Untrusted infrastructure
• A small number of servers may crash or leak information
• most of the servers function correctly
• a financially “responsible party” for the servers ensures integrity
• but only clients are trusted with cleartext
• Nomadic data
• data divorced from location
• flows freely within the storage infrastructure
• promiscuous caching: “anywhere, anytime”
• location important for performance
• dynamic system tuning through introspection
System overview
• persistent object
• GUID: 160-bit SHA-1 hash
• secure identification – globally unique and unforgeable
• 2^80 unique objects before collisions (birthday paradox)
• floating object replicas: independent of location
• encrypted data
• read
• try fast probabilistic replica search (Bloom filter)
• fallback to slower deterministic search (Tapestry)
• write
• update with predicates [as in Bayou – what is Bayou?]
• creates new version
What is Bayou
The Bayou System (Xerox PARC) is a
platform of replicated, highly-available,
variable-consistency, databases on which
collaborative applications can be built.
It caters to portable devices having
intermittent connections.
System overview
• application interface
• sessions: sequence of read/writes
• session guarantees [Bayou]
• loose consistency levels, ACID
• active and archival forms
• active: latest version, with update handle
• archive: erasure coded read-only version
• dynamic optimization
• object location
• degree of replication
naming
• self-certifying path names (Mazières)
• object GUID = hash of owner key and readable name
• create hierarchies using “directory” objects
• read restriction
• through client encryption of data
• write restriction, access control
• associate ACLs with each object, respected by
servers
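A minimal sketch of the naming rule above (GUID = hash of owner key and readable name), using SHA-1 from Python's standard library. The key bytes, the example name, and the length-prefixed concatenation are assumptions made for illustration; the slides do not specify the exact encoding.

```python
import hashlib

def object_guid(owner_public_key: bytes, name: str) -> bytes:
    """Return a 160-bit GUID: SHA-1 over the owner's key and a readable name."""
    h = hashlib.sha1()
    # Length prefix keeps (key, name) pairs unambiguous (an assumption,
    # not something the slides specify).
    h.update(len(owner_public_key).to_bytes(4, "big"))
    h.update(owner_public_key)
    h.update(name.encode("utf-8"))
    return h.digest()

# Hypothetical owner key and object name.
guid = object_guid(b"hypothetical-owner-key", "calendar/2001")
print(guid.hex(), len(guid) * 8, "bits")
```

Because the GUID depends on the owner's key, two owners can use the same readable name without colliding, and anyone holding the key and name can recompute the GUID to check it.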
addressing
• address an object by its GUID
• message: GUID, random number, small predicate
• route to closest GUID replica matching predicate
• combines data location and routing:
• no central name service to attack
• save one round-trip for location discovery
• routing
• fast, probabilistic search algorithm
• slow, deterministic search algorithm
routing
• fast, probabilistic search algorithm
• Bloom filter
• probabilistic set membership test using bit vector
• n-bit vector generated from n hashes of each set
element
• filter is union (OR) of all bit vectors
• attenuated Bloom filter
• array of d Bloom filters
• i-th Bloom filter is union of all <i-hop nodes
• slow, deterministic algorithm
• Tapestry
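The Bloom filter and attenuated Bloom filter described above can be sketched as follows. This is an illustration only: the filter size, the number of hashes, the use of salted SHA-1 as the hash family, and the helper name `first_matching_level` are all assumptions. The level index is treated simply as a hop distance, since the exact hop convention is not clear from the slide.

```python
import hashlib

class BloomFilter:
    """Probabilistic set membership over a fixed-size bit vector."""

    def __init__(self, num_bits: int = 64, num_hashes: int = 3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as the bit vector

    def _positions(self, item: bytes):
        # Derive several hash functions by salting SHA-1 with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha1(bytes([i]) + item).digest()
            yield int.from_bytes(digest, "big") % self.num_bits

    def add(self, item: bytes) -> None:
        for p in self._positions(item):
            self.bits |= 1 << p

    def might_contain(self, item: bytes) -> bool:
        # False means "definitely absent"; True means "possibly present".
        return all((self.bits >> p) & 1 for p in self._positions(item))

    def union(self, other: "BloomFilter") -> "BloomFilter":
        merged = BloomFilter(self.num_bits, self.num_hashes)
        merged.bits = self.bits | other.bits  # union is a bitwise OR
        return merged

def first_matching_level(attenuated, guid: bytes):
    """Return the smallest level whose filter may hold guid, else None."""
    for level, bloom in enumerate(attenuated):
        if bloom.might_contain(guid):
            return level
    return None
```

A node would keep one such array per outgoing link and forward a query along the link whose array matches at the smallest level. A `None` result everywhere is the cue to fall back to the slow, deterministic search.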
updates
• Updates based on versioning and conflict
resolution
• i.e. no locking
• update: actions with predicates
• commit – apply action of first true predicate
• abort – no true predicates
• conflict resolution on encrypted data
• possible predicates:
• compare-version, compare-size, compare-block, search
• possible actions:
• replace-block, insert-block, delete-block, append
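The commit/abort rule above can be sketched as an ordered list of (predicate, action) pairs applied to an object version. The class and helper names here are hypothetical, and blocks are treated as opaque bytes to stand in for ciphertext. Only two of the listed predicates and two of the listed actions are shown.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class ObjectState:
    version: int
    blocks: List[bytes]  # opaque (encrypted) blocks

Predicate = Callable[[ObjectState], bool]
Action = Callable[[List[bytes]], List[bytes]]

def compare_version(expected: int) -> Predicate:
    return lambda state: state.version == expected

def compare_size(expected: int) -> Predicate:
    return lambda state: len(state.blocks) == expected

def append(block: bytes) -> Action:
    return lambda blocks: blocks + [block]

def replace_block(index: int, block: bytes) -> Action:
    def action(blocks: List[bytes]) -> List[bytes]:
        updated = list(blocks)
        updated[index] = block
        return updated
    return action

def apply_update(state: ObjectState,
                 clauses: List[Tuple[Predicate, Action]]) -> Tuple[ObjectState, bool]:
    """Commit the action of the first true predicate; abort if none is true."""
    for predicate, action in clauses:
        if predicate(state):
            # A commit creates a new version; the old one is left untouched.
            return ObjectState(state.version + 1, action(list(state.blocks))), True
    return state, False

v1 = ObjectState(version=1, blocks=[b"blk0"])
v2, committed = apply_update(v1, [(compare_version(1), append(b"blk1"))])
print(committed, v2.version, len(v2.blocks))
```

No locks are taken: a writer that raced with another update simply sees its predicate fail and the update aborts.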
archival
• produced when objects idle
• use erasure codes (redundant fragmentation)
• simplest example: parity bit
• need any (n-1) out of n fragments
• interleaved Reed-Solomon codes, Tornado codes
• fragmentation improves reliability
• “deep archival storage”
• sweeper processes ensure replication
sustained over time
• fragmentation improves performance
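The parity example above (any n-1 of n fragments suffice) can be sketched with a single XOR parity fragment. This is only the simplest case, not the Reed-Solomon or Tornado codes the slide names. The function names are hypothetical and the data fragments are assumed to have equal length.

```python
from functools import reduce
from typing import List, Optional

def xor_bytes(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def encode(data_fragments: List[bytes]) -> List[bytes]:
    """Append one parity fragment: the XOR of all data fragments."""
    parity = reduce(xor_bytes, data_fragments)
    return list(data_fragments) + [parity]

def recover(fragments: List[Optional[bytes]]) -> List[bytes]:
    """Rebuild the single missing fragment (marked None) from the other n-1."""
    missing = fragments.index(None)
    present = [f for f in fragments if f is not None]
    restored = list(fragments)
    # XOR of all surviving fragments equals the lost one.
    restored[missing] = reduce(xor_bytes, present)
    return restored

stored = encode([b"ab", b"cd", b"ef"])
damaged = [stored[0], None, stored[2], stored[3]]
print(recover(damaged) == stored)
```

A single parity fragment tolerates only one loss. The codes named on the slide generalize this so that a larger number of fragments can be lost.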
Floating Replica and Deep Archival Coding
[Figure: several floating replicas, each holding a full copy, conflict-resolution logs, and a version list (Ver1: 0x34243, Ver2: 0x49873, Ver3: …), alongside erasure-coded fragments for deep archival]
dynamic optimization (introspection)
• observation modules
• collect and summarize information
• incrementally update system database
• optimization modules
• periodically process the observation database
• cluster recognition: group related objects
• replica management: maintain replica number and
location
• periodic migration: work-home-work-home…
• maintenance: routing, dissemination, availability,
durability