Building A Scalable Big Data System for the Internet of Things (IoT)


The Internet of Things

Company

Vinay Nathan, CEO
15 years of varied experience across sales, marketing, engineering and PM. Most recently, VP Sales at Persistent Systems.

Yogesh Kulkarni, COO
16 years of product engineering experience in global product companies. Most recently, Director of Product Development at BMC Software.

Ranjit Nair, CTO
16 years of software architecture and engineering experience. Most recently, Engineering Manager at Amazon.

About Altizon

And this is what we do

My Motivation

IoT

IoT is the integration of the physical world into the computing world

IoT is a Big-Data problem

• Massive amounts of data
  • Machines are commonly sampled for data at millisecond intervals
  • Volume, variety and velocity
• That needs to be analyzed in real time
  • Condition-based monitoring
  • Anomaly detection
• That needs to be analyzed for actionable insights
  • Efficiency, utilization, machine health
• That needs supervised and unsupervised machine learning
  • Predictive maintenance, proactive support

Topology

[diagram: a central Cloud connected to several Cloud Edge nodes, each fronting many Edge devices]

The Edge

Sensors

• Network and connectivity
  • WiFi, BLE, Zigbee, 6LoWPAN
• Protocols
  • MQTT, CoAP, AMQP
• Low complexity
• Security
• Upgrades
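MQTT, listed above, routes telemetry by hierarchical topics with `+` (one level) and `#` (rest of the tree) wildcards. A minimal Python sketch of that topic-matching rule (the function name and the factory/line topics are illustrative, not part of the MQTT libraries themselves):

```python
def topic_matches(topic_filter: str, topic: str) -> bool:
    """Return True if an MQTT-style topic filter matches a concrete topic.

    '+' matches exactly one topic level; '#' matches all remaining levels.
    """
    f_levels = topic_filter.split("/")
    t_levels = topic.split("/")
    for i, level in enumerate(f_levels):
        if level == "#":          # multi-level wildcard: match everything below
            return True
        if i >= len(t_levels):    # filter is deeper than the topic
            return False
        if level != "+" and level != t_levels[i]:
            return False
    return len(f_levels) == len(t_levels)
```

A subscriber to `factory/+/temperature` would then receive `factory/line1/temperature` but not `factory/line1/pressure`, while `factory/#` receives everything under `factory`.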

Edge

• Network Protocols

• Higher Complexity

• Bidirectional communication

• Integration

• Device context

• Security

Be cloud agnostic

The Cloud Edge

• Protocol Adapters
  • Edge to cloud protocols
• Filtering rules and aggregations
• Batching
• Local controller
• Highly available
• Load balanced
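The batching bullet above is the usual trade-off between upload frequency and bandwidth: buffer readings at the cloud edge and ship them upstream when the batch is full or stale. A minimal sketch of that idea (the `EdgeBatcher` class and its parameters are hypothetical, not part of the platform described here):

```python
import time

class EdgeBatcher:
    """Buffer telemetry readings and flush them in batches, either when the
    batch is full or when it has aged past flush_interval seconds."""

    def __init__(self, batch_size=100, flush_interval=5.0, sink=print):
        self.batch_size = batch_size
        self.flush_interval = flush_interval
        self.sink = sink                      # callable that ships a batch upstream
        self.buffer = []
        self.last_flush = time.monotonic()

    def add(self, reading):
        self.buffer.append(reading)
        full = len(self.buffer) >= self.batch_size
        stale = time.monotonic() - self.last_flush >= self.flush_interval
        if full or stale:
            self.flush()

    def flush(self):
        if self.buffer:
            self.sink(self.buffer)
            self.buffer = []
        self.last_flush = time.monotonic()
```

In a real deployment the sink would publish to the event-ingestion endpoint; filtering and aggregation rules would run before `add`.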

Event Ingestion at Scale

• Device auto-discovery
  • Metadata-driven device discovery
• Device telemetry data
  • Time-series data
  • Which can be out of sequence
• Alerts and logs
• Event validation
• Bandwidth and backpressure
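Out-of-sequence telemetry, noted above, is typically handled by buffering events briefly and releasing them once a watermark (newest timestamp seen minus an allowed lateness) has passed them. A small Python sketch of that reordering buffer (the class and its `allowed_lateness` parameter are illustrative assumptions):

```python
import heapq

class ReorderBuffer:
    """Re-sequence events that arrive slightly out of order."""

    def __init__(self, allowed_lateness):
        self.allowed_lateness = allowed_lateness
        self.heap = []                 # min-heap of (timestamp, event)
        self.max_ts = float("-inf")

    def push(self, ts, event):
        """Accept one event; return events now safe to emit, in timestamp order."""
        heapq.heappush(self.heap, (ts, event))
        self.max_ts = max(self.max_ts, ts)
        watermark = self.max_ts - self.allowed_lateness
        out = []
        while self.heap and self.heap[0][0] <= watermark:
            out.append(heapq.heappop(self.heap))
        return out
```

Events arriving later than `allowed_lateness` would be emitted out of order (or dropped, by policy); that bound is also what keeps the buffer, and hence backpressure, in check.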

• Portable deployment of applications as a single object versus process sandboxing
• Application-centric versus machine/server-centric
• Support for automatic container builds
• Built-in version tracking
• Reusable components
• Public registry for sharing containers
• A growing tools ecosystem from the published API

https://www.docker.com/what-docker

backend datonis-events
    balance source
    server event1 event1.datonis.io:80 check
    server event2 event2.datonis.io:80 check

backend datonis-api
    balance roundrobin
    mode http
    server api1 api1.datonis.io:80 check
    server api2 api2.datonis.io:80 check

frontend http
    bind *:80
    mode http
    acl events path_beg /event
    use_backend datonis-events if events
    default_backend datonis-api

HAProxy

• Entities
  • Broker, Topic, Producer, Consumer
• A sharded write-ahead log
  • Contiguous memory allocation
  • Index and offset
• Messages are not deleted on read
  • But based on a retention SLA
  • Data reloads
• Log replication for fault tolerance
  • Making reads faster
• Kafka-Spark consumer
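The log semantics above are what make Kafka different from a queue: consumers read by offset without deleting anything, and old messages are dropped only by retention. A toy Python model of one partition, to make those semantics concrete (the `PartitionLog` class is an illustration, not Kafka's actual implementation):

```python
class PartitionLog:
    """Toy model of one Kafka partition: an append-only log addressed by offset."""

    def __init__(self):
        self.base_offset = 0     # offset of the first retained message
        self.messages = []

    def append(self, msg):
        """Producer side: append and return the message's offset."""
        self.messages.append(msg)
        return self.base_offset + len(self.messages) - 1

    def read(self, offset, max_count=10):
        """Consumer side: read from an offset; nothing is removed."""
        start = offset - self.base_offset
        return self.messages[start:start + max_count]

    def truncate_before(self, offset):
        """Retention: drop messages older than `offset` (past the SLA)."""
        drop = offset - self.base_offset
        if drop > 0:
            self.messages = self.messages[drop:]
            self.base_offset = offset
```

Because reads are non-destructive, a Kafka-Spark consumer can replay a range of offsets after a failure, and multiple consumer groups can read the same data independently.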

• Caching
  • Redis can be used in the same manner as memcache
• Counting stuff
  • Atomic counters
• Show latest items
  • This is a live in-memory cache and is very fast
• Deletion and filtering
  • If a cached article is deleted it can be removed from the cache
• Leaderboards and related problems
• Implement expires on items
• Unique N items in a given amount of time
• Pub/Sub
• Queues
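The "show latest items" use case above is commonly built in Redis as an LPUSH followed by an LTRIM to cap the list length. The same shape can be sketched with Python's stdlib `deque`, no Redis server needed (the `LatestItems` class is an illustrative stand-in, not a Redis client):

```python
from collections import deque

class LatestItems:
    """Keep only the N most recent items, newest first --
    the same pattern as Redis LPUSH + LTRIM on a capped list."""

    def __init__(self, capacity=5):
        self.items = deque(maxlen=capacity)   # old items fall off the far end

    def push(self, item):
        """Like LPUSH: newest item goes to the front."""
        self.items.appendleft(item)

    def latest(self, n):
        """Like LRANGE 0 n-1: the n most recent items."""
        return list(self.items)[:n]
```

In Redis itself the trim keeps memory bounded the same way `maxlen` does here, and the read stays O(n) in the page size rather than the history size.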

Real time CEP

• Apache Spark
  • Unified stream, batch processing and machine learning
• RDDs
  • Immutable, resilient, distributed collection of records
• DStreams
  • A continuous sequence of RDDs

val textFile = sc.textFile("hdfs://...")
val counts = textFile.flatMap(line => line.split(" "))
                     .map(word => (word, 1))
                     .reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")

val ssc = new StreamingContext(sparkConf, Seconds(1))

val lines = ssc.socketTextStream(args(0), args(1).toInt)
val words = lines.flatMap(_.split(" "))
val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
wordCounts.print()
ssc.start()
ssc.awaitTermination()

Spark

Spark Streaming

Actually this is CEP

Why do we love Spark?

• Common logic for stream and batch processing
  • No separate architectures and approaches
  • Storm would have been appropriate for absolute real-time
• A hit with data scientists
  • Rapid iterations on large data sets
• Language support
  • Python, Java, Scala and R
  • R syntax is extremely baffling (or maybe I’m just too old)
• Spark MLlib
  • Statistics, classification, filtering, clustering, feature extraction
  • The list is constantly growing

Persistence

• Why Mongo?
  • Concerns around separate databases for transactional data and event data
  • Premature optimization
• Path
  • Started with 2.x. Collection-level locking
  • Now at 3.2. Document-level locking
  • WiredTiger storage engine. 5x with snappy compression
• Extreme convenience for configuration objects
• Design patterns for time-series data
• Great toolsets
  • Shout out to Mongoid
• Easy data migration
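One widely used design pattern for time-series data in MongoDB is bucketing: fold many readings into one document per device per hour, with pre-computed count and sum for cheap aggregates. A sketch of that pattern in Python, with a dict standing in for the collection (the function and field names are illustrative; in real Mongo this would be an upsert with `$push` and `$inc`):

```python
from datetime import datetime

def bucket_reading(buckets, device_id, ts, value):
    """Fold one sensor reading into an hourly bucket document."""
    hour = ts.replace(minute=0, second=0, microsecond=0)
    key = (device_id, hour)
    doc = buckets.setdefault(key, {
        "device_id": device_id,
        "hour": hour,
        "count": 0,       # pre-aggregated, so averages need no scan
        "sum": 0.0,
        "readings": [],   # raw samples for drill-down
    })
    doc["readings"].append({"ts": ts, "value": value})
    doc["count"] += 1
    doc["sum"] += value
    return doc
```

Bucketing turns millions of tiny per-sample documents into far fewer, larger ones, which helps both index size and range queries over a device's recent history.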

Replica Sets

https://docs.mongodb.org/manual/core/replication-introduction/

• Multiple copies on servers
  • Provides fault tolerance
• All writes to primary
  • Secondaries replicate the primary’s oplog
  • Asynchronous replication
• Improved read performance
  • You can specify reading from a replica
• Automatic failover
  • Election if the primary goes down

Sharding

https://docs.mongodb.org/manual/core/sharding-introduction/

• Horizontal scaling
  • Divides and distributes data over shards
• Entities
  • Shards store data
  • Query routers route requests to shards
  • Config servers. Metadata about the shards
• Shard keys
  • Range-based sharding. Efficient querying
  • Hash-based sharding. Efficient distribution
• Maintenance
  • Splitting and balancer
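The range-versus-hash trade-off above can be made concrete with two toy routing functions (illustrative sketches, not MongoDB's actual chunk logic): range sharding keeps nearby keys together, so range queries touch few shards but hot key ranges skew load; hash sharding spreads even sequential keys evenly, at the cost of scattering range queries to every shard.

```python
import hashlib

def range_shard(key, boundaries):
    """Range-based sharding: shard i holds keys below boundaries[i];
    the last shard holds everything else."""
    for i, upper in enumerate(boundaries):
        if key < upper:
            return i
    return len(boundaries)

def hash_shard(key, num_shards):
    """Hash-based sharding: hash the key, then take it modulo the
    shard count for an even spread."""
    digest = hashlib.md5(str(key).encode()).hexdigest()
    return int(digest, 16) % num_shards
```

With monotonically increasing keys (timestamps, sequential device IDs), `range_shard` sends every new write to the last shard, while `hash_shard` spreads them out, which is exactly why hashed shard keys are popular for telemetry.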


Questions

Email: ranjit@altizon.com

Altizon is hiring
