You Can’t Search Without Data
Bryan Bende – Staff Software Engineer @ Hortonworks
NYC Solr/Lucene Meetup – December 7th 2017
© Hortonworks Inc. 2011 – 2016. All Rights Reserved
Agenda
• The Problem
• Apache NiFi Overview
• Integration between NiFi & Solr
• Recent & Future Work
• Demo Cool Stuff!
• Q&A
About Me
• Staff Software Engineer @ Hortonworks
• Apache NiFi PMC & Committer
• Contributed Solr processors in March 2015 – https://issues.apache.org/jira/browse/NIFI-461
• [email protected] / Twitter @bbende / bryanbende.com
The Problem
It starts out so simple…
[Diagram: Team 1 and Team 2 exchanging the messages below]
"Hey! We have some important data to send you!"
"Cool! Your data is really important to us!"
This should be easy, right?...
But what about formats & protocols?
[Diagram: Team 1 and Team 2 exchanging the messages below]
"We can publish Avro records to a Kafka topic, does that work?"
"Oh, well we have a REST service that accepts JSON…"
And what about security & authentication?
[Diagram: Team 1 and Team 2 exchanging the messages below]
"Hmm, what about security? We can authenticate via Kerberos."
"Sorry, we only support 2-Way TLS with certificates."
And what about all these devices at the edge?
Team 2: "We also need to grab data from all these devices, how are we going to do that?"
Wouldn’t it be nice if there was a tool that could help these teams?
Enter Apache NiFi…
Apache NiFi
• Created to address the challenges of global enterprise dataflow
• Key features:
  – Visual Command and Control
  – Data Lineage (Provenance)
  – Data Prioritization
  – Data Buffering / Back-Pressure
  – Control Latency vs. Throughput
  – Secure Control Plane / Data Plane
  – Scale Out Clustering
  – Extensibility
NiFi Core Concepts
FBP Term | NiFi Term | Description
Information Packet | FlowFile | Each object moving through the system.
Black Box | FlowFile Processor | Performs the work, doing some combination of data routing, transformation, or mediation between systems.
Bounded Buffer | Connection | The linkage between processors, acting as queues and allowing various processes to interact at differing rates.
Scheduler | Flow Controller | Maintains the knowledge of how processes are connected, and manages the threads and allocations thereof which all processes use.
Subnet | Process Group | A set of processes and their connections, which can receive and send data via ports. A process group allows creation of an entirely new component simply by composition of its components.
Visual Command & Control
• Drag & drop processors to build a flow
• Start, stop, & configure components in real-time
• View errors & corresponding messages
• View statistics & health of the dataflow
• Create shareable templates of common flows
Provenance / Lineage
• Tracks data at each point as it flows through the system
• Records, indexes, and makes events available for display
• Handles fan-in/fan-out, i.e. merging and splitting data
• View attributes and content at given points in time
Prioritization
• Configure a prioritizer per connection
• Determine what is important for your data – time based, arrival order, importance of a data set
• Funnel many connections down to a single connection to prioritize across data sets
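To make the per-connection prioritizer idea concrete, here is a minimal Python sketch. It is not NiFi code: the `FlowFile` and `PrioritizedConnection` classes and the `priority` attribute name are hypothetical, chosen only to show a queue that releases items by priority first and arrival order second.

```python
import heapq
import itertools
from dataclasses import dataclass, field


@dataclass
class FlowFile:
    # Hypothetical stand-in for a FlowFile: a name plus string attributes.
    name: str
    attributes: dict = field(default_factory=dict)


class PrioritizedConnection:
    """Queue between two processors that releases items by priority,
    falling back to arrival order when priorities are equal."""

    def __init__(self, priority_of):
        self._priority_of = priority_of      # function: FlowFile -> sortable value
        self._heap = []
        self._arrival = itertools.count()    # unique tie-breaker, preserves arrival order

    def offer(self, flowfile):
        entry = (self._priority_of(flowfile), next(self._arrival), flowfile)
        heapq.heappush(self._heap, entry)

    def poll(self):
        return heapq.heappop(self._heap)[2] if self._heap else None


# Lower number means more important; items without the attribute default to 10.
conn = PrioritizedConnection(lambda ff: int(ff.attributes.get("priority", 10)))
conn.offer(FlowFile("bulk.csv", {"priority": "5"}))
conn.offer(FlowFile("alert.json", {"priority": "1"}))
conn.offer(FlowFile("other.csv"))
order = [conn.poll().name for _ in range(3)]
```

Funnelling several upstream connections into one such queue is what lets a single prioritizer rank data sets against each other.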
Back-Pressure
• Configure back-pressure per connection
• Based on number of FlowFiles or total size of FlowFiles
• Upstream processor no longer scheduled to run until below threshold
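The back-pressure rule can be sketched in a few lines of Python. This is an illustration under assumptions, not the framework's implementation: the `Connection` class, the `run_upstream_once` helper, and the threshold values are all hypothetical.

```python
from collections import deque


class Connection:
    """Queue with two back-pressure thresholds: object count and total bytes."""

    def __init__(self, max_count, max_bytes):
        self.max_count = max_count
        self.max_bytes = max_bytes
        self._queue = deque()
        self._queued_bytes = 0

    def is_full(self):
        # Back-pressure applies as soon as either threshold is reached.
        return (len(self._queue) >= self.max_count
                or self._queued_bytes >= self.max_bytes)

    def put(self, content):
        self._queue.append(content)
        self._queued_bytes += len(content)

    def take(self):
        content = self._queue.popleft()
        self._queued_bytes -= len(content)
        return content


def run_upstream_once(connection, produce):
    """Scheduler-side check: skip the upstream processor while the
    downstream connection is at or above a threshold."""
    if connection.is_full():
        return False
    connection.put(produce())
    return True


conn = Connection(max_count=2, max_bytes=100)
attempts = [run_upstream_once(conn, lambda: b"x" * 10) for _ in range(3)]
conn.take()  # downstream consumes one item, dropping below the threshold
resumed = run_upstream_once(conn, lambda: b"x" * 10)
```

The upstream component is never told to slow down; it is simply not scheduled until the queue drains.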
Latency vs. Throughput
• Choose between lower latency, or higher throughput on each processor
• Higher throughput allows framework to batch together all operations for the selected amount of time for improved performance
• Processor developer determines whether to support this by using @SupportsBatching annotation
Architecture - Standalone
[Diagram: a single OS/Host running one JVM that contains the Web Server and the Flow Controller (Processor 1 … Extension N), backed by the FlowFile Repository, Content Repository, and Provenance Repository on Local Storage]
• FlowFile Repository
  – Write Ahead Log
  – State of every FlowFile
  – Pointers to content repository (pass-by-reference)
• Content Repository
  – FlowFile content
  – Copy-on-write
• Provenance Repository
  – Write Ahead Log + Lucene Indexes
  – Store & search lineage events
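The pass-by-reference and copy-on-write ideas can be illustrated with a small in-memory Python sketch. The class and function names here are hypothetical and the real repositories persist to disk; the point is only that a FlowFile holds a pointer to content, and that changing content writes a new claim instead of overwriting the old one.

```python
from dataclasses import dataclass, field, replace


class ContentRepository:
    """In-memory stand-in: stores content under claim ids and never rewrites it."""

    def __init__(self):
        self._claims = {}
        self._next_id = 0

    def write(self, data):
        claim_id = self._next_id
        self._claims[claim_id] = bytes(data)
        self._next_id += 1
        return claim_id

    def read(self, claim_id):
        return self._claims[claim_id]


@dataclass(frozen=True)
class FlowFile:
    # The FlowFile carries attributes plus a pointer to its content,
    # not the content itself (pass-by-reference).
    claim_id: int
    attributes: dict = field(default_factory=dict)


def transform(repo, flowfile, func):
    """Copy-on-write: write the modified content as a new claim and
    return a FlowFile that points at it; the old claim is untouched."""
    new_claim = repo.write(func(repo.read(flowfile.claim_id)))
    return replace(flowfile, claim_id=new_claim)


repo = ContentRepository()
original = FlowFile(repo.write(b"hello"), {"filename": "a.txt"})
updated = transform(repo, original, bytes.upper)
```

Because the earlier claim survives, content can still be viewed as it was at an earlier point in the flow.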
Architecture - Cluster
[Diagram: three OS/Host nodes, each running a JVM with a Web Server, a Flow Controller (Processor 1 … Extension N), and FlowFile, Content, and Provenance Repositories on Local Storage, all coordinated through ZooKeeper]
• Same dataflow on each node, data partitioned across cluster
• Access the UI from any node
• ZooKeeper for auto-election of Cluster Coordinator & Primary Node
• Cluster Coordinator receives heartbeats from other nodes, manages joining/disconnecting
• Primary Node for scheduling processors on a single node
NiFi & Solr
NiFi Solr Processors
• Support Solr Cloud and stand-alone Solr instances
• Leverage SolrJ (CloudSolrClient & HttpSolrClient)
• GetSolr – Extract new documents
• PutSolrContentStream – Stream FlowFile content to an update handler
PutSolrContentStream
• Choose Solr Type – Cloud or Standard
• Specify ZooKeeper hosts, or the Solr URL with core
• Specify the Solr path for the update handler
• Dynamic Properties sent as key/value pairs on request
• Relationships for success, failure, and connection failure
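As a rough picture of how these settings combine for a stand-alone Solr, the Python sketch below assembles a request URL from a Solr URL with core, an update handler path, and a set of dynamic properties. The helper name, host, core name, and property values are assumptions for illustration; the processor itself goes through SolrJ rather than building URLs by hand.

```python
from urllib.parse import urlencode


def build_update_url(solr_url, update_path, dynamic_properties):
    """Join the Solr URL (including core) with the update handler path and
    append each dynamic property as a key/value request parameter."""
    base = solr_url.rstrip("/") + "/" + update_path.lstrip("/")
    query = urlencode(sorted(dynamic_properties.items()))
    return f"{base}?{query}" if query else base


# Hypothetical values: a local core named "users" and one request parameter.
url = build_update_url(
    "http://localhost:8983/solr/users",
    "/update/json/docs",
    {"commitWithin": "1000"},
)
```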
GetSolr
• Incrementally extract new documents
• Main query is *:*, Solr Query is optional filter query
• Date Field used as filter query, from last execution or initial value
• Sorted by date field and unique key
• Cursor mark used behind the scenes
• Specify return fields, or all if blank
• Output Solr XML, or Records
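The bullets above map onto standard Solr request parameters (`q`, `fq`, `sort`, `cursorMark`, `fl`, `rows`). The Python sketch below shows one plausible way to assemble them; it is not the processor's source. The function name, field names, timestamps, and the exclusive lower bound on the date range are assumptions.

```python
from urllib.parse import urlencode


def build_incremental_query(date_field, unique_key, last_run,
                            filter_query=None, return_fields=None,
                            cursor_mark="*", rows=100):
    """Build Solr parameters for fetching documents newer than last_run."""
    params = [
        ("q", "*:*"),
        # Assumed range syntax: exclusive lower bound, open upper bound.
        ("fq", date_field + ":{" + last_run + " TO *]"),
    ]
    if filter_query:
        params.append(("fq", filter_query))          # optional user filter
    # Cursor paging requires a sort that includes the unique key.
    params.append(("sort", f"{date_field} asc,{unique_key} asc"))
    params.append(("cursorMark", cursor_mark))       # "*" starts a new cursor
    params.append(("rows", str(rows)))
    if return_fields:
        params.append(("fl", ",".join(return_fields)))  # omit to return all fields
    return params


params = build_incremental_query(
    "registered", "id", "2017-12-01T00:00:00Z",
    filter_query="gender:MALE", return_fields=["id", "email"],
)
query_string = urlencode(params)
```

On each following page, the cursor value returned by Solr would replace `"*"`, and the stored last-execution timestamp would advance once the run completes.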
Interacting With a Secure Solr
• Basic Auth
  – Provide username/password
• Kerberos
  – Set JAAS system property in bootstrap.conf
  – Provide name of JAAS entry for processor to use
• TLS/SSL
  – Provide an SSLContextService
  – One-way TLS with Truststore only
  – Two-way TLS with Keystore + Truststore
Recent & Future Work
Problem – Conversion Between Data Formats
• Specialized processors to operate on different data types
• Sometimes missing conversions
• Sometimes missing a specific function for a data type
• Sometimes implemented with different libraries causing inconsistencies
Solution – Record Processing
• Introduce the concept of a “record”
  – Released in Apache NiFi 1.2.0 (May 2017), improvements in 1.3.0 and 1.4.0
• Centralize the logic for reading/writing records into controller services
  – Readers/Writers for CSV, JSON, Avro, etc.
• Provide standard processors that operate on records
  – ConvertRecord, QueryRecord, PartitionRecord, UpdateRecord, etc.
• Provide integration with schema registries
  – Local Schema Registry, Hortonworks Schema Registry, Confluent Schema Registry
• Can still handle arbitrary data, but process records when appropriate
Problem – Variable Handling
• Need to parametrize values in the flow per environment
  – Connection strings, URLs, file system paths, etc.
• Can set variables in bootstrap.conf
  – -Dmy.var=foo
• Can set a properties file in nifi.properties
  – nifi.variable.registry.properties=production.properties
• Both require command line access
• Both require restart to pick up changes
Solution – First Class Variable Registry
• Variables associated with a process group, released in 1.4.0
• Right-click on canvas to view variables for current group
• Hierarchical order of precedence, resolve closest reference to component
• Editing variables automatically restarts any components referencing the variables
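The "closest reference wins" rule can be pictured as a walk up the process group hierarchy. The following Python sketch assumes a hypothetical `ProcessGroup` class and made-up variable names; it models only the lookup order, not NiFi's actual variable registry.

```python
class ProcessGroup:
    """Hypothetical process group holding its own variables and a parent link."""

    def __init__(self, name, parent=None, variables=None):
        self.name = name
        self.parent = parent
        self.variables = dict(variables or {})

    def resolve(self, key):
        """Return the value defined closest to this group, walking up
        through parent groups; None if no group defines the key."""
        group = self
        while group is not None:
            if key in group.variables:
                return group.variables[key]
            group = group.parent
        return None


root = ProcessGroup("root", variables={
    "solr.collection": "users-dev",
    "zk.hosts": "localhost:2181",
})
ingest = ProcessGroup("ingest", parent=root,
                      variables={"solr.collection": "users-prod"})

collection = ingest.resolve("solr.collection")  # defined on the child group
zk_hosts = ingest.resolve("zk.hosts")           # inherited from the root group
```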
Problem – How do I deploy my flow?
• Most organizations want the classic development lifecycle (dev -> int -> prod)
• Can copy flow.xml.gz between environments
  – Requires copying entire dataflow
  – Can’t tell what changed, hard to diff if you put in version control
  – Requires all environments use the same encryption key for sensitive properties
• Can make templates for portions of the flow
  – Script creation of template and deployment to next environment
  – Requires stopping flow and removing components, then re-instantiating template
  – No easy way to see changes, hard to roll back
Solution – NiFi Registry
• DISCLAIMER - UNDER DEVELOPMENT & NOT RELEASED YET!
• Complementary application, sub-project of Apache NiFi
  – https://github.com/apache/nifi-registry
  – https://issues.apache.org/jira/projects/NIFIREG
• Central location for storage/management of shared resources across NiFi instances
• Initial capability to store and retrieve “versioned flows”
• A versioned flow is a snapshot of a process group at a given point in time
• Potentially store extensions, shared data sets, and more in the future
DEMO!!
Example Scenario
• User data – https://randomuser.me
• Initially in CSV format
  – name.title,name.first,name.last,email,registered
  – mr,dennis,reyes,[email protected],2012-04-10 01:54:19
  – miss,carole,gomez,[email protected],2002-12-17 22:15:49
• Requirements
  – Convert CSV to JSON
  – Add a full_name field with first name + last name
  – Add a gender field based on title (i.e. if title == mr then MALE)
  – Ingest to different Solr collections depending on environment
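The first three requirements can be prototyped outside NiFi in plain Python to make the expected output concrete. This is a sketch, not the demo flow, which would presumably use the record processors. The email addresses are placeholders, and every title mapping other than mr -> MALE is an assumption.

```python
import csv
import io
import json

# Sample rows in the scenario's CSV layout; emails are placeholder values.
SAMPLE_CSV = """name.title,name.first,name.last,email,registered
mr,dennis,reyes,dennis.reyes@example.com,2012-04-10 01:54:19
miss,carole,gomez,carole.gomez@example.com,2002-12-17 22:15:49
"""

# Only mr -> MALE comes from the scenario; the other entries are assumed.
TITLE_TO_GENDER = {"mr": "MALE", "miss": "FEMALE", "mrs": "FEMALE", "ms": "FEMALE"}


def enrich(row):
    """Copy a CSV row and add the derived full_name and gender fields."""
    record = dict(row)
    record["full_name"] = f"{row['name.first']} {row['name.last']}"
    record["gender"] = TITLE_TO_GENDER.get(row["name.title"], "UNKNOWN")
    return record


def csv_to_json(text):
    """Parse CSV text and return a JSON array of enriched records."""
    reader = csv.DictReader(io.StringIO(text))
    return json.dumps([enrich(row) for row in reader], indent=2)


records = json.loads(csv_to_json(SAMPLE_CSV))
```

The last requirement, choosing a Solr collection per environment, would come from configuration rather than from this transformation step.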
Questions?
Learn more and join us!
Apache NiFi site: http://nifi.apache.org
Subscribe to and collaborate at: [email protected]@nifi.apache.org
Submit Ideas or Issues: https://issues.apache.org/jira/browse/NIFI
Follow us on Twitter: @apachenifi
Thank you!