Introduction to Apache NiFi & Storm
Jungtaek Lim
WHO AM I?
• Staff Software Engineer @ Hortonworks
• Remote worker
• Open source prosumer
• Committer of Jedis
• PMC member of Apache Storm
• Contributor to Apache (Spark, Zeppelin, Ambari, Calcite), Redis, and so on.
• Contact
• Twitter / LinkedIn / Github / Facebook
• @heartsavior
[Figure: Sources, Regional Infrastructure, Core Infrastructure. Arrow labels: constrained, high-latency, localized context; hybrid (cloud/on-premises), low-latency, global context]
DATA IN MOTION IN HORTONWORKS DATAFLOW (HDF)
Source: http://ko.hortonworks.com/products/data-center/hdf/
What is Apache NiFi?
An easy to use, powerful, and reliable system to process and distribute data.
History of Apache NiFi
• Created by the United States National Security Agency (NSA)
• Originally named NiagaraFiles
• In 2014 the NSA submitted the source code to the Apache Software Foundation via the NSA Technology Transfer Program; the project entered incubation in December 2014
• Development of Apache NiFi continued at Onyara, Inc., a startup company
• Became Apache Top-Level Project in July 2015
• Hortonworks acquired Onyara, Inc. in August 2015
Role of Apache NiFi
• Data acquisition and delivery
• Simple transformation and data routing
• Simple event processing
• End to end provenance
• Edge intelligence and bi-directional comms.
NOT intended to REPLACE ‘distributed computation engines’
(a.k.a. stream processing frameworks)
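As a rough illustration of "simple transformation and data routing", the sketch below mimics in plain Python what an EvaluateJsonPath step followed by a RouteOnAttribute step might do: copy one field of a JSON payload into an attribute, then pick a relationship based on that attribute. This is not NiFi code. The FlowFile class, the "level" field, the "log.level" attribute, and the relationship names are all hypothetical.

```python
import json
from dataclasses import dataclass, field


@dataclass
class FlowFile:
    """Toy stand-in for a NiFi FlowFile: content bytes plus string attributes."""
    content: bytes
    attributes: dict = field(default_factory=dict)


def evaluate_json_field(flow_file, attribute, key):
    """Copy one top-level JSON field of the content into an attribute."""
    payload = json.loads(flow_file.content)
    flow_file.attributes[attribute] = str(payload.get(key, ""))
    return flow_file


def route_on_attribute(flow_file, rules):
    """Return the name of the first matching rule, or 'unmatched'."""
    for relationship, predicate in rules.items():
        if predicate(flow_file.attributes):
            return relationship
    return "unmatched"


# Hypothetical routing rules: errors go one way, other non-empty levels another.
RULES = {
    "errors": lambda attrs: attrs.get("log.level") == "ERROR",
    "others": lambda attrs: attrs.get("log.level") not in ("", None),
}

ff = evaluate_json_field(
    FlowFile(b'{"level": "ERROR", "msg": "disk full"}'), "log.level", "level"
)
destination = route_on_attribute(ff, RULES)
```

In an actual flow, each relationship would be wired to a different downstream processor, for example a PutHDFS or PublishKafka step.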
Features of Apache NiFi
Highly configurable
• Loss tolerant vs guaranteed delivery
• Low latency vs high throughput
• Dynamic prioritization
• Flow can be modified at runtime
• Back pressure
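Back pressure and prioritization can be pictured with a small model of a connection between two processors. The sketch below is plain Python, not NiFi code: the Connection class, its max_objects threshold, and the priority function are hypothetical names chosen for illustration. In NiFi itself the thresholds and prioritizers are configured on the connection, and the upstream processor is simply not scheduled while the threshold is exceeded.

```python
import heapq
import itertools


class Connection:
    """Toy queue between two processors with an object-count threshold."""

    def __init__(self, max_objects, priority_of=lambda item: 0):
        self.max_objects = max_objects
        self.priority_of = priority_of   # lower value is dequeued first
        self._heap = []
        self._order = itertools.count()  # tie-breaker keeps arrival order

    def is_full(self):
        return len(self._heap) >= self.max_objects

    def offer(self, item):
        """Enqueue if below the threshold; otherwise signal back pressure."""
        if self.is_full():
            return False
        heapq.heappush(self._heap, (self.priority_of(item), next(self._order), item))
        return True

    def poll(self):
        """Dequeue the highest-priority item, or None if the queue is empty."""
        return heapq.heappop(self._heap)[2] if self._heap else None


# The upstream side has five items, but the connection only accepts three.
conn = Connection(max_objects=3, priority_of=lambda item: item["priority"])
pending = [{"id": i, "priority": p} for i, p in enumerate([5, 1, 3, 2, 4])]
held_back = [item["id"] for item in pending if not conn.offer(item)]

# The downstream side drains the queue in priority order, not arrival order.
drained = [conn.poll()["id"] for _ in range(3)]
```

Here the last two items are held back until the downstream side frees capacity, which is the essence of back pressure. The priority function stands in for a configurable prioritizer.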
More…
• Designed for extension
• Build your own processors and more
• Secure
• SSL, SSH, HTTPS, encrypted content, etc.
• Multi-tenant authorization and internal authorization/policy management
• MiNiFi subproject
• Reduce footprint to ~ 40 MB
What is Apache Storm?
A free and open source distributed realtime computation system.
History of Apache Storm
Source: http://hortonworks.com/blog/brief-history-apache-storm/
Concepts of Apache Storm
• Spout: a source of streams in a topology
• Bolt: a processing component; sinks are also written as bolts
• Stream: an unbounded sequence of tuples, defined with a schema
• Stream groupings: define how a stream should be partitioned among a bolt's tasks
• Topology: the logic of a realtime application, represented as a DAG
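The concepts above can be sketched as a tiny word-count pipeline. This is a single-process Python model, not the Storm API: the function and class names are hypothetical, the spout is finite so the example terminates, and CRC32 stands in for whatever hash the real fields grouping uses. The point is only to show how a fields grouping sends equal values to the same bolt task.

```python
import zlib
from collections import Counter


def sentence_spout():
    """Spout: the source of the stream (finite here so the example ends)."""
    for sentence in ["the cow jumped", "the moon", "the cow"]:
        yield (sentence,)


def split_bolt(tup):
    """Bolt: emits one (word,) tuple per word in the incoming sentence."""
    (sentence,) = tup
    return [(word,) for word in sentence.split()]


class CountBoltTask:
    """One task of the counting bolt; it counts only the words routed to it."""

    def __init__(self):
        self.counts = Counter()

    def execute(self, tup):
        (word,) = tup
        self.counts[word] += 1


def fields_grouping(tup, num_tasks):
    """Choose a task from the grouping field so equal words share a task."""
    return zlib.crc32(tup[0].encode("utf-8")) % num_tasks


def run_topology(num_count_tasks=2):
    """Topology: spout -> split bolt -> count bolt, wired as a small DAG."""
    tasks = [CountBoltTask() for _ in range(num_count_tasks)]
    for sentence_tuple in sentence_spout():
        for word_tuple in split_bolt(sentence_tuple):
            target = fields_grouping(word_tuple, num_count_tasks)
            tasks[target].execute(word_tuple)
    return tasks


tasks = run_topology()
totals = sum((task.counts for task in tasks), Counter())
```

Because each word always lands on the same task, per-task counts are complete for the words that task owns. A shuffle grouping, by contrast, would spread tuples evenly but split a word's count across tasks.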
Core vs Trident
                     Core                     Trident
Computation unit     Record (tuple)           Micro-batch
Latency              Very low (sub-second)    High (up to batch size), similar to Spark Streaming
Delivery guarantee   At least once            Exactly once
API                  Compositional            Declarative
Stateful operator    Supported from v1.0.0    Core feature (exactly-once)
Windowing            Time (processing time, event time), count; tumbling window, sliding window
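The difference between tumbling and sliding windows is easy to show with count-based windows over a short list of events. The sketch below is plain Python rather than Storm's windowing API, and it simplifies by emitting only full windows. The function names and the sample events are hypothetical.

```python
def tumbling_windows(events, size):
    """Non-overlapping windows of `size` events; a trailing partial window is dropped."""
    return [events[i:i + size] for i in range(0, len(events) - size + 1, size)]


def sliding_windows(events, size, slide):
    """Windows of `size` events that advance by `slide` events, so they may overlap."""
    return [events[i:i + size] for i in range(0, len(events) - size + 1, slide)]


events = [1, 2, 3, 4, 5, 6]

# Each event belongs to exactly one tumbling window.
tumbling_sums = [sum(window) for window in tumbling_windows(events, 3)]

# With a slide smaller than the size, an event appears in several windows.
sliding_sums = [sum(window) for window in sliding_windows(events, 3, 1)]
```

A tumbling window is the special case of a sliding window whose slide equals its size. Time-based windows follow the same idea, with timestamps in place of event counts.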
Features of Apache Storm
• Supports a number of connectors (17 connectors in master branch)
• Automatic back-pressure
• Distributed Cache
• Flux (constructing topology via yaml)
• Distributed Log Search
• Dynamic Worker Profiling
• Dynamic Log Levels
• Topology Event Inspector
• Resource Aware Scheduler
• SQL (Experimental)
Future of Apache Storm
Apache Storm 2.0 and beyond
• Clojure to Java translation
• Unified Stream API with exactly-once support
• Rework Metrics feature
• Apache Beam runner
• Streaming SQL with Apache Calcite
• And more…
• Performance
• Usability
THANKS!
Any questions?
Appendix A. Apache NiFi
NiFi EvaluateJsonPath / RouteOnAttribute configuration
NiFi PutHDFS / PublishKafka configuration
NiFi Queue options – Status History
NiFi Queue options – List queue
NiFi Data Provenance
Appendix B. Apache Storm
Distributed Log Search
Dynamic Worker Profiling
Dynamic Log Levels
Topology Event Inspector
Resource Aware Scheduler
Source: Resource Aware Scheduling in Apache Storm, Hadoop Summit San Jose 2016