Building a Scalable Big Data System for the Internet of Things (IoT)


The Internet of Things


My Motivation


IoT is the integration of the physical world into the computing world

IoT is a Big-Data problem

• Massive amounts of data

• Machines are commonly sampled for data at millisecond intervals.

• Volume, variety and velocity.

• That need to be analyzed in real time

• Condition based monitoring

• Anomaly detection

• That need to be analyzed for actionable insights

• Efficiency, utilization, machine health

• That need supervised and unsupervised machine learning

• Predictive maintenance, proactive support
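As a rough illustration of the anomaly detection bullet above, here is a minimal sketch that flags a reading when it sits far from the mean of the readings just before it. The `Reading` type, the window size and the `k` threshold are assumptions made for this example; the slides do not say which detection method is actually used.

```scala
object AnomalySketch {
  /** One telemetry sample; the field names are illustrative. */
  final case class Reading(timestampMs: Long, value: Double)

  /** Flags every reading that lies more than `k` standard deviations away
    * from the mean of the `window` readings immediately before it. */
  def anomalies(readings: Seq[Reading], window: Int, k: Double): Vector[Reading] =
    readings.sliding(window + 1).collect {
      // Skip the partial window produced when there are too few readings.
      case w if w.size == window + 1 && isOutlier(w.init.map(_.value), w.last.value, k) =>
        w.last
    }.toVector

  private def isOutlier(history: Seq[Double], x: Double, k: Double): Boolean = {
    val mean = history.sum / history.size
    val variance = history.map(v => (v - mean) * (v - mean)).sum / history.size
    val std = math.sqrt(variance)
    // A perfectly flat history has no spread to compare against, so nothing is flagged.
    std > 0 && math.abs(x - mean) > k * std
  }
}
```

A simple rolling check like this suits condition based monitoring on a single signal; predictive maintenance would normally need a trained model instead.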


The Edge


• Network and connectivity

• Wi-Fi, BLE, Zigbee, 6LoWPAN

• Protocols


• Low Complexity

• Security

• Upgrades


• Network Protocols

• Higher Complexity

• Bidirectional communication

• Integration

• Device context

• Security

Be cloud agnostic

The Cloud Edge

• Protocol Adapters
• Edge to cloud protocols

• Filtering rules and aggregations
• Batching
• Local controller
• Highly available
• Load balanced
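The filtering and batching bullets above could look something like the sketch below. It assumes a deadband rule (drop a reading unless it moved by at least `delta` since the last kept one) and fixed-size batches; both choices, and all names, are hypothetical and only meant to show the shape of the idea for a single sensor.

```scala
object EdgeSketch {
  /** One sensor reading; the field names are illustrative. */
  final case class Reading(sensor: String, timestampMs: Long, value: Double)

  /** Deadband filter: keep a reading only if it differs from the last
    * kept reading by at least `delta`. The first reading is always kept. */
  def deadband(readings: Seq[Reading], delta: Double): Vector[Reading] =
    readings.foldLeft(Vector.empty[Reading]) { (kept, r) =>
      if (kept.isEmpty || math.abs(r.value - kept.last.value) >= delta) kept :+ r
      else kept
    }

  /** Groups readings into batches of at most `batchSize` for upload. */
  def batches(readings: Seq[Reading], batchSize: Int): List[Seq[Reading]] =
    readings.grouped(batchSize).toList
}
```

Filtering before batching keeps the uplink quiet when a machine is idle, which matters on constrained networks.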

Event Ingestion at Scale

• Device auto-discovery
• Metadata driven device discovery
• Device Telemetry Data
• Time-series data
• Which can be out of sequence
• Alerts and logs
• Event validation
• Bandwidth and backpressure
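Since telemetry can arrive out of sequence, one common way to handle it is to hold events briefly and release them in timestamp order. The sketch below assumes a fixed lateness allowance and some made-up validation rules; the `Event` type, `isValid` and `reorder` are hypothetical names, not part of any platform described in these slides.

```scala
import scala.collection.mutable

object IngestSketch {
  /** One telemetry event; the field names are illustrative. */
  final case class Event(deviceId: String, timestampMs: Long, value: Double)

  /** Hypothetical validation rules: non-empty device id, positive timestamp, numeric value. */
  def isValid(e: Event): Boolean =
    e.deviceId.nonEmpty && e.timestampMs > 0 && !e.value.isNaN

  /** Takes events in arrival order and returns (ordered, tooLate).
    * An event is held until the newest timestamp seen is at least
    * `allowedLatenessMs` ahead of it, then emitted in timestamp order.
    * Events older than something already emitted are reported as too late. */
  def reorder(arrivals: Seq[Event], allowedLatenessMs: Long): (Vector[Event], Vector[Event]) = {
    // Reversed ordering turns the max-heap into a min-heap on timestamp.
    val buffer = mutable.PriorityQueue.empty[Event](
      Ordering.by[Event, Long](_.timestampMs).reverse)
    val emitted = Vector.newBuilder[Event]
    val tooLate = Vector.newBuilder[Event]
    var maxSeen = Long.MinValue
    var lastEmitted = Long.MinValue

    for (e <- arrivals if isValid(e)) {
      if (e.timestampMs < lastEmitted) tooLate += e
      else {
        buffer.enqueue(e)
        maxSeen = math.max(maxSeen, e.timestampMs)
        val watermark = maxSeen - allowedLatenessMs
        while (buffer.nonEmpty && buffer.head.timestampMs <= watermark) {
          val next = buffer.dequeue()
          lastEmitted = next.timestampMs
          emitted += next
        }
      }
    }
    // End of input: flush whatever is still buffered, oldest first.
    while (buffer.nonEmpty) emitted += buffer.dequeue()
    (emitted.result(), tooLate.result())
  }
}
```

A larger allowance tolerates slower links at the cost of added latency, so it is a tuning knob rather than a fixed value.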

• Portable deployment of applications as a single object versus process sandboxing
• Application-centric versus machine/server-centric
• Support for automatic container builds
• Built-in version tracking
• Reusable components
• Public registry for sharing containers
• A growing tools ecosystem from the published


backend datonis-events
    balance source
    server event1 check
    server event2 check

backend datonis-api
    balance roundrobin
    mode http
    server api1 check
    server api2 check

frontend http
    bind *:80
    mode http
    acl events path_beg /event
    use_backend datonis-events if events
    default_backend datonis-api


• Entities
• Broker, Topic, Producer, Consumer
• A sharded write-ahead log
• Contiguous memory allocation
• Index and offset
• Messages are not deleted on read
• But on an SLA
• Data reloads
• Log replication for fault tolerance
• Making reads faster
• Kafka-Spark consumer
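The log model in the bullets above (sharded, offset-addressed, not deleted on read) can be sketched in a few lines of plain Scala. This is an in-memory toy, not Kafka's API: `Partition`, `Topic` and `Consumer` are hypothetical classes, the key hashing is a stand-in for a real partitioner, and retention and replication are left out.

```scala
import scala.collection.mutable.ArrayBuffer

object LogSketch {
  /** One shard of the log: append-only, a message's offset is its position. */
  final class Partition {
    private val messages = ArrayBuffer.empty[String]
    def append(msg: String): Long = { messages += msg; messages.size - 1L }
    def read(fromOffset: Long, max: Int): List[String] =
      messages.slice(fromOffset.toInt, fromOffset.toInt + max).toList
  }

  /** A topic is a fixed set of partitions; the key picks the partition. */
  final class Topic(numPartitions: Int) {
    private val partitions = Vector.fill(numPartitions)(new Partition)
    def partitionFor(key: String): Int = Math.floorMod(key.hashCode, numPartitions)
    def send(key: String, msg: String): (Int, Long) = {
      val p = partitionFor(key)
      (p, partitions(p).append(msg))
    }
    def partition(id: Int): Partition = partitions(id)
  }

  /** Each consumer tracks its own offset, so reading never removes data. */
  final class Consumer(topic: Topic, partitionId: Int) {
    private var offset = 0L
    def poll(max: Int): List[String] = {
      val batch = topic.partition(partitionId).read(offset, max)
      offset += batch.size
      batch
    }
    /** Rewinding the offset replays old messages (the "data reloads" case). */
    def seek(newOffset: Long): Unit = { offset = newOffset }
  }
}
```

Because the consumer owns its offset, a second consumer or a rewound one sees the same messages again, which is what makes reloads cheap.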

• Caching.

• Redis can be used in the same manner as memcache
• Counting stuff. Atomic counters
• Show latest items.
• This is a live in-memory cache and is very fast.
• Deletion and filtering.

• If a cached article is deleted it can be removed from the cache using.

• Leaderboards and related problems.
• Implement expires on items.
• Unique N items in a given amount of time.
• Pub/Sub.
• Queues.
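Two of the patterns above, atomic counters and "show latest items", are easy to picture with an in-process analogue. The sketch below is plain Scala and does not talk to Redis; `Counters` mirrors what the Redis INCR command does, and `LatestItems` mirrors a list kept short with LPUSH followed by LTRIM. Both class names are hypothetical.

```scala
import java.util.concurrent.atomic.AtomicLong
import scala.collection.concurrent.TrieMap

object CacheSketch {
  /** Named counters that can be incremented atomically. */
  final class Counters {
    private val counts = TrieMap.empty[String, AtomicLong]
    def incr(key: String): Long =
      counts.getOrElseUpdate(key, new AtomicLong(0L)).incrementAndGet()
  }

  /** Keeps only the newest `capacity` items, newest first. */
  final class LatestItems[A](capacity: Int) {
    private var items = List.empty[A]
    def push(item: A): Unit = synchronized { items = (item :: items).take(capacity) }
    /** Drops an item by value, e.g. when the underlying record is deleted. */
    def remove(item: A): Unit = synchronized { items = items.filterNot(_ == item) }
    def latest(n: Int): List[A] = synchronized { items.take(n) }
  }
}
```

The point of doing this in Redis rather than in process is that every application server then shares one view of the counters and lists.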

Real time CEP

• Apache Spark
• Unified stream, batch processing and machine learning
• RDDs
• Immutable, resilient, distributed collection of records.
• DStreams
• A continuous sequence of RDDs

val textFile = sc.textFile("hdfs://...")
val counts = textFile.flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")

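import org.apache.spark.streaming.{Seconds, StreamingContext}

// sparkConf is an existing SparkConf; args(0) is the host, args(1) the port.
val ssc = new StreamingContext(sparkConf, Seconds(1))
val lines = ssc.socketTextStream(args(0), args(1).toInt)
val words = lines.flatMap(_.split(" "))
val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
wordCounts.print()
ssc.start()
ssc.awaitTermination()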


Spark Streaming

Actually this is CEP

Why do we love Spark

• Common logic for stream and batch processing
• No separate architectures and approaches
• Storm would have been appropriate for absolute real-time
• A hit with data-scientists
• Rapid iterations on large data sets
• Language support
• Python, Java, Scala and R
• R syntax is extremely baffling (or maybe I’m just too old)

• Spark MLlib
• Statistics, classification, filtering, clustering, feature extraction
• The list is constantly growing


• Why Mongo?
• Concerns around separate databases for transactional data and event data
• Premature optimization
• Path
• Started with 2.x. Collection level locking
• Now at 3.2. Document level locking
• WiredTiger storage engine. 5x with snappy compression.
• Extreme convenience for configuration objects
• Design patterns for time-series data
• Great toolsets

• Shout out to Mongoid

• Easy data migration
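One widely described design pattern for time-series data in a document store is bucketing: instead of one document per sample, store one document per device per time window, with a few pre-computed aggregates. The slides do not say which pattern is used here, so the sketch below is only an assumption, with hypothetical `Sample` and `Bucket` types and an hourly window.

```scala
object BucketSketch {
  private val HourMs = 3600000L

  /** A raw sample as it arrives; the field names are illustrative. */
  final case class Sample(deviceId: String, timestampMs: Long, value: Double)

  /** What one stored document might hold: an hour of readings for one device,
    * each reading keyed by its offset from the start of the hour. */
  final case class Bucket(
    deviceId: String,
    hourStartMs: Long,
    readings: Vector[(Long, Double)],
    count: Int,
    sum: Double)

  /** The (device, hour) pair that identifies the bucket a sample belongs to. */
  def bucketKey(s: Sample): (String, Long) =
    (s.deviceId, s.timestampMs - Math.floorMod(s.timestampMs, HourMs))

  /** Groups samples into hourly buckets with readings in time order. */
  def toBuckets(samples: Seq[Sample]): Map[(String, Long), Bucket] =
    samples.groupBy(bucketKey).map { case ((device, hour), group) =>
      val sorted = group.sortBy(_.timestampMs)
      val readings = sorted.map(s => (s.timestampMs - hour, s.value)).toVector
      (device, hour) -> Bucket(device, hour, readings, sorted.size, sorted.map(_.value).sum)
    }
}
```

Fewer, larger documents mean fewer index entries, and the stored count and sum let an average be read without scanning every sample.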

Replica Sets

• Multiple copies on servers
• Provides fault tolerance
• All writes to primary
• Secondaries replicate the primary's oplog
• Asynchronous replication
• Improved read performance
• You can specify reading from a replica
• Automatic failover
• Election if the primary goes down


• Horizontal scaling
• Divides and distributes data over shards
• Entities
• Shards store data
• Query routers route requests to shards
• Config servers. Metadata about the shards.
• Shard keys
• Range-based sharding. Efficient querying
• Hash-based sharding. Efficient distribution
• Maintenance
• Splitting and balancer
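The trade-off between the two shard key strategies can be seen in a small sketch. The functions below are illustrative only and do not reproduce MongoDB's internals: `rangeShard` assumes a sorted list of chunk upper bounds, and `hashShard` uses Scala's MurmurHash3 as a stand-in hash.

```scala
import scala.util.hashing.MurmurHash3

object ShardSketch {
  /** Range-based: `upperBounds` are sorted, exclusive upper limits of each
    * chunk; keys at or above the last bound go to one extra, final shard. */
  def rangeShard(key: Long, upperBounds: Vector[Long]): Int = {
    val i = upperBounds.indexWhere(bound => key < bound)
    if (i == -1) upperBounds.size else i
  }

  /** Hash-based: neighbouring keys are scattered across shards. */
  def hashShard(key: Long, numShards: Int): Int =
    Math.floorMod(MurmurHash3.stringHash(key.toString), numShards)
}
```

With range sharding a query over neighbouring keys touches one shard, but monotonically increasing keys such as timestamps pile onto the last shard; hashing spreads the writes and gives up that locality.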



