CSC 536 Lecture 8: Reactive Streams and Akka Streams; Case study: Google infrastructure (part I)


  • Slide 1
  • CSC 536 Lecture 8
  • Slide 2
  • Outline: Reactive Streams (streams, reactive streams, Akka streams); Case study: Google infrastructure (part I)
  • Slide 3
  • Reactive Streams
  • Slide 4
  • Streams: a stream is a process involving data flow and transformation, over data of possibly unbounded size; the focus is on describing the transformation. Examples: bulk data transfer, real-time data sources, batch processing of large data sets, monitoring and analytics.
  • Slide 5
  • Needed: asynchrony. For fault tolerance: encapsulation and isolation. For scalability: distribution across nodes and across cores. Problem: managing data flow across an async boundary.
  • Slide 6
  • Types of async boundaries: between different applications, between network nodes, between CPUs, between threads, between actors.
  • Slide 7
  • Possible solutions. Traditional way: synchronous/blocking (possibly remote) method calls; does not scale.
  • Slide 8
  • Possible solutions. Traditional way: synchronous/blocking (possibly remote) method calls; does not scale. Push way: asynchronous/non-blocking message passing; scales, but raises the problem of message buffering and message dropping.
  • Slide 9
  • Supply and demand. Traditional way: synchronous/blocking (possibly remote) method calls; does not scale. Push way: asynchronous/non-blocking message passing; scales, but raises the problem of message buffering and message dropping. Reactive way: non-blocking and non-dropping.
  • Slide 10
  • Reactive way: see slides 24-55 of http://www.slideshare.net/ktoso/reactive-streams-akka-streams-geecon-prague-2014
  • Slide 11
  • Supply and demand: data items flow downstream; demand flows upstream; data items flow only when there is demand; the recipient is in control of the incoming data rate; data in flight is bounded by signaled demand.
  • Slide 12
  • Dynamic push-pull: push behavior when the consumer is faster; pull behavior when the producer is faster; the stream switches automatically between the two; batching demand allows batching data.
  • Slide 13
  • Tailored flow control: splitting the data means merging the demand.
  • Slide 14
  • Tailored flow control: merging the data means splitting the demand.
  • Slide 15
  • Reactive Streams: back-pressured asynchronous stream processing. Asynchronous, non-blocking data flow; asynchronous, non-blocking demand flow. Goal: minimal coordination and contention. Message passing allows for distribution across applications, across nodes, across CPUs, across threads, and across actors.
  • Slide 16
  • Reactive Streams projects: a standard implemented by many libraries, developed by engineers from Netflix, Oracle, Red Hat, Twitter, and Typesafe. See http://reactive-streams.org
  • Slide 17
  • Reactive Streams: all participants had the same basic problem; all are building tools for their community; a common solution benefits everybody. Interoperability to make the best use of efforts: minimal interfaces, a rigorous specification of semantics, a full TCK for verification of implementations, and complete freedom for many idiomatic APIs.
  • Slide 18
  • The underlying (internal) API:

        trait Publisher[T] {
          def subscribe(sub: Subscriber[T]): Unit
        }

        trait Subscription {
          def requestMore(n: Int): Unit
          def cancel(): Unit
        }

        trait Subscriber[T] {
          def onSubscribe(s: Subscription): Unit
          def onNext(elem: T): Unit
          def onError(thr: Throwable): Unit
          def onComplete(): Unit
        }
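  • To make the handshake concrete, here is a minimal sketch (not part of the lecture) of the traits above in use. It is deliberately synchronous and single-threaded for readability, whereas the specification (next slide) requires Subscriber calls to be dispatched asynchronously; RangePublisher and PrintSubscriber are invented names.

        // Toy Publisher that emits the integers 1..count, honoring the demand
        // signaled via requestMore (invented for illustration only).
        class RangePublisher(count: Int) extends Publisher[Int] {
          def subscribe(sub: Subscriber[Int]): Unit = {
            var next = 1
            var done = false
            sub.onSubscribe(new Subscription {
              def requestMore(n: Int): Unit = {
                var remaining = n
                while (remaining > 0 && next <= count && !done) {
                  val elem = next
                  next += 1
                  remaining -= 1
                  sub.onNext(elem)            // data flows only when demand was signaled
                }
                if (next > count && !done) { done = true; sub.onComplete() }
              }
              def cancel(): Unit = done = true
            })
          }
        }

        // Toy Subscriber that requests one element at a time (pull-style demand).
        class PrintSubscriber extends Subscriber[Int] {
          private var subscription: Subscription = _
          def onSubscribe(s: Subscription): Unit = { subscription = s; s.requestMore(1) }
          def onNext(elem: Int): Unit = { println(elem); subscription.requestMore(1) }
          def onError(thr: Throwable): Unit = thr.printStackTrace()
          def onComplete(): Unit = println("done")
        }

        // new RangePublisher(3).subscribe(new PrintSubscriber)   // prints 1 2 3 done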
  • Slide 19
  • The Process
  • Slide 20
  • Reactive Streams rules: all calls on a Subscriber must be dispatched asynchronously; all calls on a Subscription must not block; the Publisher is just there to create Subscriptions.
  • Slide 21
  • Akka Streams: powered by Akka actors; type-safe streaming through actors with bounded buffering. The Akka Streams API is geared towards end users; the implementation uses the Reactive Streams interfaces (Publisher/Subscriber) internally to pass data between the different processing stages.
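  • A minimal Akka Streams sketch (not from the lecture) of the end-user API: a Source is transformed and run into a Sink, with back-pressure handled automatically because the Sink's demand propagates upstream. Exact setup varies by Akka version; older releases also require an explicit ActorMaterializer.

        import akka.actor.ActorSystem
        import akka.stream.scaladsl.{Sink, Source}

        object StreamDemo extends App {
          // The ActorSystem materializes and runs the stream.
          implicit val system: ActorSystem = ActorSystem("demo")
          import system.dispatcher

          Source(1 to 100)                    // emit 1..100, but only as fast as demanded
            .map(_ * 2)
            .filter(_ % 3 == 0)
            .runWith(Sink.foreach(println))   // demand is signaled by the Sink
            .onComplete(_ => system.terminate())
        }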
  • Slide 22
  • Examples: view slides 62-80 of http://www.slideshare.net/ktoso/reactive-streams-akka-streams-geecon-prague-2014 and the files basic.scala, TcpEcho.scala, and WritePrimes.scala.
  • Slide 23
  • Overview of Google's distributed systems
  • Slide 24
  • Original Google search engine architecture
  • Slide 25
  • More than just a search engine
  • Slide 26
  • Organization of Google's physical infrastructure: 40-80 PCs per rack (terabytes of disk space each); 30+ racks per cluster; hundreds of clusters spread across data centers worldwide.
  • Slide 27
  • System architecture requirements: scalability, reliability, performance, and openness (at the beginning, at least).
  • Slide 28
  • Overall Google systems architecture
  • Slide 29
  • Google infrastructure
  • Slide 30
  • Design philosophy. Simplicity: software should do one thing and do it well. Provable performance: every millisecond counts; estimate performance costs (accessing memory and disk, sending a packet over the network, locking and unlocking a mutex, etc.). Testing: "if it ain't broke, you're not trying hard enough"; stringent testing.
  • Slide 31
  • Data and coordination services. Google File System (GFS): broadly similar to NFS and AFS, but optimized for the types of files and data access used by Google. Bigtable: a distributed database that stores (semi-)structured data, with just enough organization and structure for the type of data Google uses. Chubby: a locking service (and more) for GFS and Bigtable.
  • Slide 32
  • GFS requirements: must run reliably on the physical platform and tolerate failures of individual components, so that application-level services can rely on the file system. Optimized for Google's usage patterns: huge files (100+MB, up to 1GB), a relatively small number of files, accesses dominated by sequential reads and appends, and appends done concurrently. Meets the requirements of the whole Google infrastructure: scalable, reliable, high-performance, open. Important: throughput has higher priority than latency.
  • Slide 33
  • GFS architecture: files are stored in 64MB chunks in a cluster with a master node (whose operations log is replicated on remote machines) and hundreds of chunk servers; each chunk is replicated 3 times.
  • Slide 34
  • Reading and writing. When the client wants to access a particular offset in a file, the GFS client translates this to a (file name, chunk index) pair and sends it to the master. When the master receives the (file name, chunk index) pair, it replies with the chunk identifier and the replica locations. The client then accesses the closest chunk replica directly. There is no client-side caching: caching would not help for the type of (streaming) access GFS sees.
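  • A hypothetical sketch of that read path, to make the offset-to-chunk translation concrete; the types and method names (MasterClient, ChunkServerClient, lookup, read) are invented for illustration and are not the actual GFS API.

        object GfsReadSketch {
          val ChunkSize: Long = 64L * 1024 * 1024   // 64MB chunks

          case class ChunkLocation(chunkId: Long, replicas: Seq[String])

          trait MasterClient {
            // (file name, chunk index) -> chunk identifier + replica locations
            def lookup(fileName: String, chunkIndex: Long): ChunkLocation
          }

          trait ChunkServerClient {
            def read(replica: String, chunkId: Long, offsetInChunk: Long, length: Int): Array[Byte]
          }

          def read(master: MasterClient, chunkServers: ChunkServerClient,
                   fileName: String, offset: Long, length: Int): Array[Byte] = {
            val chunkIndex    = offset / ChunkSize    // translate the file offset to a chunk index
            val offsetInChunk = offset % ChunkSize
            val location      = master.lookup(fileName, chunkIndex)
            // Read from the "closest" replica; here we simply take the first one.
            chunkServers.read(location.replicas.head, location.chunkId, offsetInChunk, length)
          }
        }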
  • Slide 35
  • Keeping chunk replicas consistent
  • Slide 36
  • When the master receives a mutation request from a client, it grants one chunk replica a lease (that replica becomes the primary) and returns the identity of the primary and of the other replicas to the client. The client sends the mutation data directly to all the replicas; the replicas cache the mutation and acknowledge receipt. The client then sends a write request to the primary. The primary orders the mutations and applies them accordingly, then requests that the other replicas apply the mutations in the same order. When all the replicas have acknowledged success, the primary reports an ack to the client. What consistency model does this seem to implement?
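  • A hypothetical sketch of that mutation path; as above, all names (Master, ChunkServer, grantLease, pushData, commitAtPrimary) are invented for illustration and are not the actual GFS API.

        object GfsWriteSketch {
          case class Lease(primary: String, secondaries: Seq[String])

          trait Master {
            def grantLease(chunkId: Long): Lease                                    // step 1: lease designates the primary
          }

          trait ChunkServer {
            def pushData(replica: String, chunkId: Long, data: Array[Byte]): Unit   // step 2: data cached at every replica
            def commitAtPrimary(primary: String, chunkId: Long): Boolean            // steps 3-5: primary orders, forwards, acks
          }

          def write(master: Master, servers: ChunkServer, chunkId: Long, data: Array[Byte]): Boolean = {
            val lease = master.grantLease(chunkId)
            // The client pushes the mutation data to every replica; each caches it and acks.
            (lease.primary +: lease.secondaries).foreach(r => servers.pushData(r, chunkId, data))
            // The client asks the primary to commit: the primary picks an order, applies it,
            // forwards the same order to the secondaries, and acks the client only after
            // all replicas have acknowledged success.
            servers.commitAtPrimary(lease.primary, chunkId)
          }
        }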
  • Slide 37
  • GFS (non-)guarantees. Writes (at a file offset) are not atomic: concurrent writes to the same location may corrupt replicated chunks, and if any replica is left inconsistent, the write fails (and is retried a few times). Appends are executed atomically at least once: the offset is chosen by the primary, and replicas may end up non-identical, with some having duplicate appends. GFS does not guarantee that the replicas are identical; it only guarantees that some file regions are consistent across replicas. When needed, GFS relies on an external locking service (Chubby), as well as a leader-election service (also Chubby) to select the primary replica.
  • Slide 38
  • Bigtable. GFS provides raw data storage. Also needed: storage for structured data, optimized to handle the needs of Google's apps, that is reliable, scalable, high-performance, open, etc.
  • Slide 39
  • Examples of structured data. URLs: content, crawl metadata, links, anchors, PageRank, ... Per-user data: user preference settings, recent queries/search results, ... Geographic locations: physical entities (shops, restaurants, etc.), roads, satellite image data, user annotations, ...
  • Slide 40
  • Commercial DB. Why not use a commercial database? Not scalable enough; too expensive; a full-featured relational database is not required; low-level optimizations may be needed.
  • Slide 41
  • Bigtable table. Implementation: a sparse, distributed, multi-dimensional map: (row, column, timestamp) -> cell contents.
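  • This data model can be pictured as a nested sorted map; the sketch below (an illustration only, not how Bigtable is actually implemented) shows the shape of the mapping in Scala.

        import scala.collection.immutable.TreeMap

        object BigtableModelSketch {
          type Timestamp = Long
          type Cell      = String                          // cell contents (uninterpreted bytes in Bigtable)
          type Table     = TreeMap[String,                 // row key, kept in lexicographic order
                             TreeMap[String,               // column key
                               TreeMap[Timestamp, Cell]]]  // versions by timestamp

          // (row, column, timestamp) -> cell contents
          def lookup(t: Table, row: String, col: String, ts: Timestamp): Option[Cell] =
            t.get(row).flatMap(_.get(col)).flatMap(_.get(ts))
        }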
  • Slide 42
  • Rows. Each row has a key: a string up to 64KB in size. Access to data in a row is atomic. Rows are ordered lexicographically, and rows that are close together lexicographically reside on the same machine or on nearby machines (locality); see the small example below.
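  • A small invented example of exploiting that locality: storing URL row keys with the domain reversed (as in the com.cnn.www key on the next slide) makes pages from the same site adjacent in the lexicographic order, so they tend to land on the same or nearby machines. The concrete keys here are hypothetical.

        object RowKeyLocality {
          // Reversed-domain row keys sort so that pages of the same site are adjacent.
          val keys: List[String] = List(
            "com.cnn.www/index.html",
            "com.cnn.www/sports",
            "com.cnn.money/markets"
          ).sorted
          // sorted: com.cnn.money/markets, com.cnn.www/index.html, com.cnn.www/sports
        }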
  • Slide 43
  • Columns (example): com.cnn.www contents: CNN Spo