View
266
Download
6
Category
Preview:
Citation preview
Dataflow with Apache NiFiAldrin Piri - @aldrinpiriApache NiFi Crash CourseHadoop Summit 2016 – San Jose
29 June 2016
2 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Key: 'Apache NiFi’ Value: 'PMC Member'Key: 'Work’ Value: ’Sr. Member of Technical Staff @ Hortonworks'Key: 'Working with NiFi Since’ Value: '2010’
3 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
AgendaWhat is dataflow and what are the challenges?
Apache NiFi
Architecture
Live Demo
Community
4 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
AgendaWhat is dataflow and what are the challenges?
Apache NiFi
Architecture
Live Demo
Community
5 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Let’s Connect A to BProducers A.K.A Things
AnythingAND
Everything
Internet!
Consumers• User• Storage• System• …More Things
6 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Moving data effectively is hard
Standards: http://xkcd.com/927/
7 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Why is moving data effectively hard?
Standards Formats “Exactly Once” Delivery Protocols Veracity of Information Validity of Information Ensuring Security Overcoming Security
Compliance Schemas Consumers Change Credential Management “That [person|team|group]” Network “Exactly Once” Delivery
8 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Let’s Connect Lots of As to Bs to As to Cs to Bs to Δs to Cs to ϕsLet’s consider the needs of a courier service
Physical Store
Gateway Server
Mobile Devices
Registers
Server Cluster
Distribution Center Core Data Center at HQ
Server Cluster
On Delivery Routes
Trucks Deliverers
Delivery Truck: Creative Stall, https://thenounproject.com/creativestall/Deliverer: Rigo Peter, https://thenounproject.com/rigo/Cash Register: Sergey Patutin, https://thenounproject.com/bdesign.by/Hand Scanner: Eric Pearson, https://thenounproject.com/epearson001/
9 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Great! I am collecting all this data! Let’s use it!Finding our needles in the haystack
Physical Store
Gateway Server
Mobile Devices
Registers
Server Cluster
Distribution Center
Kafka
Core Data Center at HQ
Server Cluster
Others
Storm / Spark / Flink / Apex
Kafka
Storm / Spark / Flink / Apex
On Delivery Routes
Trucks Deliverers
Delivery Truck: Creative Stall, https://thenounproject.com/creativestall/Deliverer: Rigo Peter, https://thenounproject.com/rigo/Cash Register: Sergey Patutin, https://thenounproject.com/bdesign.by/Hand Scanner: Eric Pearson, https://thenounproject.com/epearson001/
10 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Why is moving data effectively hard when scoped internally?
Standards Formats “Exactly Once” Delivery Protocols Veracity of Information Validity of Information Ensuring Security Overcoming Security
Compliance Schemas Consumers Change Credential Management “That [person|team|group]” Network “Exactly Once” Delivery
11 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Let’s Connect Lots of As to Bs to As to Cs to Bs to Δs to Cs to ϕsOh, that courier service is global
12 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Why is moving data effectively hard when scoped globally?
Standards Formats “Exactly Once” Delivery Protocols Veracity of Information Validity of Information Ensuring Security Overcoming Security
Compliance Schemas Consumers Change Credential Management “That [person|team|group]” Network “Exactly Once” Delivery
13 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
The Unassuming Line: A Case StudyWe’ve seen a few lines show up in the wild thus far
Internet! Inter- & Intra- connections inour global courier enterprise
Spotlight: Arthur Lacôte, https://thenounproject.com/turo/
14 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Dataflow Line Anatomy 101Let’s dissect what this line typically represents
Fig 1. Lineus Worldwidewebus. Common Name: Internet!
Script or Application
Script or Application
Data Data
Disparate TransportMechanisms
15 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Dataflow Line Anatomy 201Sometimes that transport is just more lines
Fig 1. Lineus Worldwidewebus. Common Name: Internet!
Script or Application
Script or Application
Line Inception
Data Data
16 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Dataflow Line Anatomy 301But those lines could also have components…
Fig 1. Lineus Worldwidewebus. Common Name: Internet! Fig 2. Good Recursion Joke
NoSuchJokeException
footage not found
17 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
AgendaWhat is dataflow and what are the challenges?
Apache NiFi
Architecture
Live Demo
Community
18 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Apache NiFiKey Features
• Guaranteed delivery• Data buffering
- Backpressure- Pressure release
• Prioritized queuing• Flow specific QoS
- Latency vs. throughput- Loss tolerance
• Data provenance• Supports push and pull
models
• Recovery/recording a rolling log of fine-grained history
• Visual command and control
• Flow templates• Pluggable/multi-role
security• Designed for extension• Clustering
19 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Apache NiFi Subproject: MiNiFi
Let me get the key parts of NiFi close to where data begins and provide bidrectional communication
NiFi lives in the data center. Give it an enterprise server or a cluster of them. MiNiFi lives as close to where data is born and is a guest on that device or system
20 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Let’s revisit our courier service from the perspective of NiFi
Physical Store
Gateway Server
Mobile Devices
Registers
Server Cluster
Distribution Center
Kafka
Core Data Center at HQ
Server Cluster
Others
Storm / Spark / Flink / Apex
Kafka
Storm / Spark / Flink / Apex
On Delivery Routes
Trucks Deliverers
Delivery Truck: Creative Stall, https://thenounproject.com/creativestall/Deliverer: Rigo Peter, https://thenounproject.com/rigo/Cash Register: Sergey Patutin, https://thenounproject.com/bdesign.by/Hand Scanner: Eric Pearson, https://thenounproject.com/epearson001/
Client Libraries
Client Libraries
MiNiFi
MiNiFiNiFi NiFi NiFi NiFi NiFi NiFi
Client Libraries
21 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Apache NiFi Managed DataflowSOURCES REGIONAL
INFRASTRUCTURECORE
INFRASTRUCTURE
22 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
NiFi is based on Flow Based Programming (FBP)FBP Term NiFi Term DescriptionInformation Packet
FlowFile Each object moving through the system.
Black Box FlowFile Processor
Performs the work, doing some combination of data routing, transformation, or mediation between systems.
Bounded Buffer
Connection The linkage between processors, acting as queues and allowing various processes to interact at differing rates.
Scheduler Flow Controller
Maintains the knowledge of how processes are connected, and manages the threads and allocations thereof which all processes use.
Subnet Process Group
A set of processes and their connections, which can receive and send data via ports. A process group allows creation of entirely new component simply by composition of its components.
23 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
FlowFiles & Data Agnosticism
NiFi is data agnostic! But, NiFi was designed understanding that users
can care about specifics and provides tooling
to interact with specific formats, protocols, etc.
ISO 8601 - http://xkcd.com/1179/
Robustness principle
Be conservative in what you do, be liberal in what you accept from others“
24 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
FlowFiles are like HTTP dataHTTP Data FlowFile
HTTP/1.1 200 OKDate: Sun, 10 Oct 2010 23:26:07 GMTServer: Apache/2.2.8 (CentOS) OpenSSL/0.9.8gLast-Modified: Sun, 26 Sep 2010 22:04:35 GMTETag: "45b6-834-49130cc1182c0"Accept-Ranges: bytesContent-Length: 13Connection: closeContent-Type: text/html
Hello world!
Standard FlowFile AttributesKey: 'entryDate’ Value: 'Fri Jun 17 17:15:04 EDT 2016'Key: 'lineageStartDate’ Value: 'Fri Jun 17 17:15:04 EDT 2016'Key: 'fileSize’ Value: '23609'FlowFile Attribute Map ContentKey: 'filename’Value: '15650246997242'Key: 'path’ Value: './’
Binary Content *
Header
Content
25 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
AgendaWhat is dataflow and what are the challenges?
Apache NiFi
Architecture
Live Demo
Community
26 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Extension / Integration PointsNiFi Term DescriptionFlow File Processor
Push/Pull behavior. Custom UI
Reporting Task
Used to push data from NiFi to some external service (metrics, provenance, etc..)
Controller Service
Used to enable reusable components / shared services throughout the flow
REST API Allows clients to connect to pull information, change behavior, etc..
27 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
OS/Host
JVM
Flow Controller
Web Server
Processor 1 Extension N
FlowFileRepository
ContentRepository
ProvenanceRepository
Local Storage
OS/Host
JVM
Flow Controller
Web Server
Processor 1 Extension N
FlowFileRepository
ContentRepository
ProvenanceRepository
Local Storage
Architecture*
OS/Host
JVM
Flow Controller
Web Server
Processor 1 Extension N
FlowFileRepository
ContentRepository
ProvenanceRepository
Local Storage
OS/Host
JVM
NiFi Cluster Manger – Request Replicator
Web Server
MasterNiFi Cluster Manager (NCM)
OS/Host
JVM
Flow Controller
Web Server
Processor 1 Extension N
FlowFileRepository
ContentRepository
ProvenanceRepository
Local Storage
SlavesNiFi Nodes
28 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
NiFi Architecture – Repositories - Pass by reference
FlowFile Content Provenance
F1 C1 C1 P1 F1
Excerpt of demo flow… What’s happening inside the repositories…
BEFORE
AFTER
F2 C1 C1 P3 F2 – Clone (F1)
F1 C1 P2 F1 – Route
P1 F1 – Create
29 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
NiFi Architecture – Repositories – Copy on Write
FlowFile Content Provenance
F1 C1 C1 P1 F1 - CREATE
Excerpt of demo flow… What’s happening inside the repositories…
BEFORE
AFTER
F1 C1
F1.1 C2C2 (encrypted)
C1 (plaintext)
P2 F1.1 - MODIFY
P1 F1 - CREATE
30 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
AgendaWhat is dataflow and what are the challenges?
Apache NiFi
Architecture
Demo
Community
31 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Learn, Share at Birds of a Feather Streaming, DataFlow & Cybersecurity
Thursday June 306:30 pm, Ballroom C
32 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Why NiFi?
Moving data is multifaceted in its challenges and these are present in different contexts at varying scopes– Think of our courier example and organizations like it: inter vs intra, domestically, internationally
Provide common tooling and extensions that are commonly needed but be flexible for extension– Leverage existing libraries and expansive Java ecosystem for functionality– Allow organizations to integrate with their existing infrastructure
Empower folks managing your infrastructure to make changes and reason about issues that are occurring– Data Provenance to show context and data’s journey– User Interface/Experience a key component
33 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Learn more and join us!
Apache NiFi sitehttp://nifi.apache.org
Subproject MiNiFi sitehttp://nifi.apache.org/minifi/
Subscribe to and collaborate atdev@nifi.apache.orgusers@nifi.apache.org
Submit Ideas or Issueshttps://issues.apache.org/jira/browse/NIFI
Follow us on Twitter@apachenifi
34 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Our Lab for Today
We will be exploring some examples to work through creating a dataflow with Apache NiFi
Use Case: An urban planning board is evaluating the need for a new highway, dependent on current traffic patterns, particularly as other roadwork initiatives are under way. Integrating live data poses a problem because traffic analysis has traditionally been done using historical, aggregated traffic counts. To improve traffic analysis, the city planner wants to leverage real-time data to get a deeper understanding of traffic patterns. NiFi was selected for for this real-time data integration.
Labs are available at http://tinyurl.com/nificrashcourse
35 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Thank You
Recommended