You Can’t Search Without Data
Bryan Bende – Staff Software Engineer @Hortonworks
NYC Solr/Lucene Meetup – December 7th, 2017

© Hortonworks Inc. 2011–2016. All Rights Reserved

Agenda

• The Problem
• Apache NiFi Overview
• Integration between NiFi & Solr
• Recent & Future Work
• Demo Cool Stuff!
• Q&A

About Me

• Staff Software Engineer @Hortonworks
• Apache NiFi PMC & Committer
• Contributed Solr processors in March 2015 – https://issues.apache.org/jira/browse/NIFI-461
• [email protected] / Twitter @bbende / bryanbende.com

The Problem

It starts out so simple…

Team 1 and Team 2:

– "Hey! We have some important data to send you!"
– "Cool! Your data is really important to us!"

This should be easy, right?...

But what about formats & protocols?

Team 1 and Team 2:

– "We can publish Avro records to a Kafka topic, does that work?"
– "Oh, well we have a REST service that accepts JSON…"

And what about security & authentication?

Team 1 and Team 2:

– "Hmm, what about security? We can authenticate via Kerberos"
– "Sorry, we only support 2-Way TLS with certificates"

And what about all these devices at the edge?

Team 2:

– "We also need to grab data from all these devices, how are we going to do that?"

Wouldn’t it be nice if there was a tool that could help these teams?

Enter Apache NiFi…

Apache NiFi

• Created to address the challenges of global enterprise dataflow
• Key features:
– Visual Command and Control
– Data Lineage (Provenance)
– Data Prioritization
– Data Buffering / Back-Pressure
– Control Latency vs. Throughput
– Secure Control Plane / Data Plane
– Scale Out Clustering
– Extensibility

NiFi Core Concepts

FBP Term | NiFi Term | Description
Information Packet | FlowFile | Each object moving through the system.
Black Box | FlowFile Processor | Performs the work, doing some combination of data routing, transformation, or mediation between systems.
Bounded Buffer | Connection | The linkage between processors, acting as queues and allowing various processes to interact at differing rates.
Scheduler | Flow Controller | Maintains the knowledge of how processes are connected, and manages the threads and allocations thereof which all processes use.
Subnet | Process Group | A set of processes and their connections, which can receive and send data via ports. A process group allows creation of entirely new components simply by composition of its components.

Visual Command & Control

• Drag & drop processors to build a flow
• Start, stop, & configure components in real-time
• View errors & corresponding messages
• View statistics & health of the dataflow
• Create shareable templates of common flows

Provenance / Lineage

• Tracks data at each point as it flows through the system
• Records, indexes, and makes events available for display
• Handles fan-in / fan-out, i.e. merging and splitting data
• View attributes and content at given points in time

Prioritization

• Configure a prioritizer per connection
• Determine what is important for your data – time based, arrival order, importance of a data set
• Funnel many connections down to a single connection to prioritize across data sets
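
The idea of a per-connection prioritizer can be sketched in a few lines of Python. This is a toy model, not NiFi's implementation: the `PrioritizedConnection` class, the `by_priority_attribute` function, and the `priority` attribute name are all hypothetical, chosen only to show a queue that hands out the most important item first and falls back to arrival order on ties.

```python
import heapq
import itertools


class PrioritizedConnection:
    """Toy connection queue: the lowest prioritizer key is dequeued first."""

    def __init__(self, prioritizer):
        self._prioritizer = prioritizer      # maps a flowfile dict to a sortable key
        self._heap = []
        self._arrival = itertools.count()    # tie-breaker that preserves arrival order

    def enqueue(self, flowfile):
        key = self._prioritizer(flowfile)
        heapq.heappush(self._heap, (key, next(self._arrival), flowfile))

    def dequeue(self):
        return heapq.heappop(self._heap)[2]

    def __len__(self):
        return len(self._heap)


def by_priority_attribute(flowfile):
    # Hypothetical "priority" attribute; flowfiles without it sort last.
    return int(flowfile.get("priority", 9999))


# Several upstream sources funneled into one prioritized connection.
conn = PrioritizedConnection(by_priority_attribute)
conn.enqueue({"name": "bulk-logs", "priority": "5"})
conn.enqueue({"name": "alerts", "priority": "1"})
conn.enqueue({"name": "metrics", "priority": "5"})

order = [conn.dequeue()["name"] for _ in range(len(conn))]
print(order)
```

Swapping in a different prioritizer function (for example one keyed on a timestamp) changes the ordering without touching the queue itself.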

Back-Pressure

• Configure back-pressure per connection
• Based on number of FlowFiles or total size of FlowFiles
• Upstream processor no longer scheduled to run until below threshold
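
A minimal sketch of the back-pressure rule, assuming a connection with two thresholds (object count and total bytes). The `Connection` class, `run_upstream` function, and the threshold values are illustrative only and do not mirror NiFi's internal classes.

```python
from collections import deque


class Connection:
    """Toy queue with count-based and size-based back-pressure thresholds."""

    def __init__(self, max_count, max_bytes):
        self.max_count = max_count
        self.max_bytes = max_bytes
        self.queue = deque()
        self.queued_bytes = 0

    def is_full(self):
        # Back-pressure applies once either threshold is reached.
        return (len(self.queue) >= self.max_count
                or self.queued_bytes >= self.max_bytes)

    def put(self, content):
        self.queue.append(content)
        self.queued_bytes += len(content)

    def take(self):
        content = self.queue.popleft()
        self.queued_bytes -= len(content)
        return content


def run_upstream(conn, produce):
    """Run the upstream processor once, unless back-pressure is applied."""
    if conn.is_full():
        return False          # not scheduled this round
    conn.put(produce())
    return True


conn = Connection(max_count=2, max_bytes=1024)
results = [run_upstream(conn, lambda: b"x" * 10) for _ in range(3)]
conn.take()                                       # downstream drains one item
resumed = run_upstream(conn, lambda: b"x" * 10)   # upstream may run again
print(results, resumed)
```

The upstream side never blocks or drops data here; it is simply skipped until the downstream side has drained the queue below the threshold.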

Latency vs. Throughput

• Choose between lower latency, or higher throughput on each processor
• Higher throughput allows framework to batch together all operations for the selected amount of time for improved performance
• Processor developer determines whether to support this by using @SupportsBatching annotation

Architecture - Standalone

[Diagram: OS/Host; JVM; Flow Controller; Web Server; Processor 1 … Extension N; FlowFile Repository; Content Repository; Provenance Repository; Local Storage]

• FlowFile Repository
– Write Ahead Log
– State of every FlowFile
– Pointers to content repository (pass-by-reference)
• Content Repository
– FlowFile content
– Copy-on-write
• Provenance Repository
– Write Ahead Log + Lucene Indexes
– Store & search lineage events
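
Pass-by-reference and copy-on-write can be illustrated with a small sketch. Everything here is a simplification under stated assumptions: `ContentRepository`, `clone`, and `update_content` are hypothetical names, and a FlowFile is modeled as a plain dict holding attributes plus a pointer (`claim`) into the content store.

```python
import itertools


class ContentRepository:
    """Toy content store: claims are written once and never modified."""

    def __init__(self):
        self._claims = {}
        self._ids = itertools.count(1)

    def write(self, data):
        claim_id = next(self._ids)
        self._claims[claim_id] = bytes(data)
        return claim_id

    def read(self, claim_id):
        return self._claims[claim_id]


def clone(flowfile):
    # Pass-by-reference: the copy shares the same content claim.
    return dict(flowfile)


def update_content(repo, flowfile, new_data):
    # Copy-on-write: new content goes to a new claim; the old claim is untouched.
    updated = dict(flowfile)
    updated["claim"] = repo.write(new_data)
    return updated


repo = ContentRepository()
original = {"filename": "users.csv", "claim": repo.write(b"a,b,c")}
copy = clone(original)
modified = update_content(repo, copy, b'{"a": 1}')

print(repo.read(original["claim"]), repo.read(modified["claim"]))
```

Cloning costs no content I/O, and a transformation never disturbs content that other FlowFiles still point to.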

Architecture - Cluster

[Diagram: three nodes, each with OS/Host; JVM; Flow Controller; Web Server; Processor 1 … Extension N; FlowFile Repository; Content Repository; Provenance Repository; Local Storage; plus ZooKeeper]

• Same dataflow on each node, data partitioned across cluster
• Access the UI from any node
• ZooKeeper for auto-election of Cluster Coordinator & Primary Node
• Cluster Coordinator receives heartbeats from other nodes, manages joining/disconnecting
• Primary Node for scheduling processors on a single node

NiFi & Solr

NiFi Solr Processors

• Support SolrCloud and stand-alone Solr instances
• Leverage SolrJ (CloudSolrClient & HttpSolrClient)
• GetSolr – Extract new documents
• PutSolrContentStream – Stream FlowFile content to an update handler

PutSolrContentStream

• Choose Solr Type – Cloud or Standard
• Specify ZooKeeper hosts, or the Solr URL with core
• Specify the Solr path for the update handler
• Dynamic Properties sent as key/value pairs on request
• Relationships for success, failure, and connection failure

GetSolr

• Incrementally extract new documents
• Main query is *:*, Solr Query is optional filter query
• Date Field used as filter query, from last execution or initial value
• Sorted by date field and unique key
• Cursor mark used behind the scenes
• Specify return fields, or all if blank
• Output Solr XML, or Records
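
The query described above can be sketched as a set of standard Solr request parameters (`q`, `fq`, `sort`, `cursorMark`, `fl`, `rows`). This is not the processor's code: `build_getsolr_params` is a hypothetical helper, and the exclusive lower bound, the ascending sort direction, the field names, and the `rows` value are assumptions made for illustration.

```python
def build_getsolr_params(date_field, unique_key, last_run,
                         filter_query=None, return_fields=None,
                         cursor_mark="*", rows=100):
    """Assemble Solr query parameters for an incremental, cursor-based pull."""
    params = {
        "q": "*:*",                                        # main query matches everything
        "fq": [f"{date_field}:{{{last_run} TO *]"],        # only documents newer than last run
        "sort": f"{date_field} asc, {unique_key} asc",     # cursors need the unique key in the sort
        "cursorMark": cursor_mark,                         # "*" starts a new cursor
        "rows": str(rows),
    }
    if filter_query:
        params["fq"].append(filter_query)                  # optional user-supplied filter
    if return_fields:
        params["fl"] = ",".join(return_fields)             # omitted means all fields
    return params


params = build_getsolr_params(
    date_field="created",            # hypothetical field names
    unique_key="id",
    last_run="2017-12-07T00:00:00Z",
    filter_query="type:user",
    return_fields=["id", "created"],
)
print(params)
```

On each page, the `nextCursorMark` value returned by Solr would be passed back in as `cursor_mark` until it stops changing.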

Interacting With a Secure Solr

• Basic Auth
– Provide username/password
• Kerberos
– Set JAAS system property in bootstrap.conf
– Provide name of JAAS entry for processor to use
• TLS/SSL
– Provide an SSLContextService
– One-way TLS with Truststore only
– Two-way TLS with Keystore + Truststore

Recent & Future Work

Problem – Conversion Between Data Formats

• Specialized processors to operate on different data types
• Sometimes missing conversions
• Sometimes missing a specific function for a data type
• Sometimes implemented with different libraries causing inconsistencies

Solution – Record Processing

• Introduce the concept of a "record"
– Released in Apache NiFi 1.2.0 (May 2017), improvements in 1.3.0 and 1.4.0
• Centralize the logic for reading/writing records into controller services
– Readers/Writers for CSV, JSON, Avro, etc.
• Provide standard processors that operate on records
– ConvertRecord, QueryRecord, PartitionRecord, UpdateRecord, etc.
• Provide integration with schema registries
– Local Schema Registry, Hortonworks Schema Registry, Confluent Schema Registry
• Can still handle arbitrary data, but process records when appropriate

Problem – Variable Handling

• Need to parametrize values in the flow per environment
– Connection strings, URLs, file system paths, etc.
• Can set variables in bootstrap.conf
– -Dmy.var=foo
• Can set a properties file in nifi.properties
– nifi.variable.registry.properties=production.properties
• Both require command line access
• Both require restart to pick up changes

Solution – First Class Variable Registry

• Variables associated with a process group, released in 1.4.0
• Right-click on canvas to view variables for current group
• Hierarchical order of precedence, resolve closest reference to component
• Editing variables automatically restarts any components referencing the variables
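
The hierarchical order of precedence can be sketched as a walk from the component's own process group up through its parents, returning the first match. The `ProcessGroup` class, its `resolve` method, and the variable names below are hypothetical and only model the lookup rule, not NiFi's variable registry API.

```python
class ProcessGroup:
    """Toy process group holding variables and a link to its parent group."""

    def __init__(self, name, parent=None, variables=None):
        self.name = name
        self.parent = parent
        self.variables = dict(variables or {})

    def resolve(self, key):
        # Closest definition wins: check this group, then each ancestor in turn.
        group = self
        while group is not None:
            if key in group.variables:
                return group.variables[key]
            group = group.parent
        raise KeyError(key)


root = ProcessGroup("root", variables={
    "solr.collection": "users_default",   # hypothetical variable names and values
    "zk.hosts": "localhost:2181",
})
ingest = ProcessGroup("ingest", parent=root,
                      variables={"solr.collection": "users_dev"})

print(ingest.resolve("solr.collection"))  # defined on the group itself
print(ingest.resolve("zk.hosts"))         # inherited from the parent
```

A component in `ingest` sees the override for `solr.collection` but still inherits `zk.hosts` from the root group.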

Problem – How do I deploy my flow?

• Most organizations want the classic development lifecycle (dev -> int -> prod)
• Can copy flow.xml.gz between environments
– Requires copying entire dataflow
– Can’t tell what changed, hard to diff if you put in version control
– Requires all environments use the same encryption key for sensitive properties
• Can make templates for portions of the flow
– Script creation of template and deployment to next environment
– Requires stopping flow and removing components, then re-instantiating template
– No easy way to see changes, hard to roll back

Solution – NiFi Registry

• DISCLAIMER - UNDER DEVELOPMENT & NOT RELEASED YET!
• Complementary application, sub-project of Apache NiFi
– https://github.com/apache/nifi-registry
– https://issues.apache.org/jira/projects/NIFIREG
• Central location for storage/management of shared resources across NiFi instances
• Initial capability to store and retrieve "versioned flows"
• A versioned flow is a snapshot of a process group at a given point in time
• Potentially store extensions, shared data sets, and more in the future

DEMO!!

Example Scenario

• User data
– https://randomuser.me
• Initially in CSV format
– name.title,name.first,name.last,email,registered
– mr,dennis,reyes,[email protected],2012-04-10 01:54:19
– miss,carole,gomez,[email protected],2002-12-17 22:15:49
• Requirements
– Convert CSV to JSON
– Add a full_name field with first name + last name
– Add a gender field based on title (i.e. if title == mr then MALE)
– Ingest to different Solr collections depending on environment
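
Outside of NiFi, the first three requirements can be sketched in plain Python to make the target output concrete. This is not the demo flow itself, which uses record processors. The email values are redacted in the source and kept as-is; the title-to-gender mapping beyond `mr` and the `UNKNOWN` fallback are assumptions, and `enrich` and `csv_to_records` are hypothetical helper names.

```python
import csv
import io
import json

# Sample rows from the scenario; email values appear redacted in the source.
SAMPLE_CSV = """name.title,name.first,name.last,email,registered
mr,dennis,reyes,[email protected],2012-04-10 01:54:19
miss,carole,gomez,[email protected],2002-12-17 22:15:49
"""

# Only "mr -> MALE" comes from the scenario; the other entries are assumptions.
TITLE_TO_GENDER = {"mr": "MALE", "miss": "FEMALE", "mrs": "FEMALE", "ms": "FEMALE"}


def enrich(row):
    """Add the full_name and gender fields to one parsed CSV row."""
    record = dict(row)
    record["full_name"] = f'{row["name.first"]} {row["name.last"]}'
    record["gender"] = TITLE_TO_GENDER.get(row["name.title"].lower(), "UNKNOWN")
    return record


def csv_to_records(text):
    """Parse CSV text with a header line into a list of enriched dicts."""
    reader = csv.DictReader(io.StringIO(text))
    return [enrich(row) for row in reader]


records = csv_to_records(SAMPLE_CSV)
print(json.dumps(records, indent=2))
```

Choosing the Solr collection per environment is left out here, since that part depends on how each environment is configured.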

Questions?

Learn more and join us!

Apache NiFi site: http://nifi.apache.org

Subscribe to and collaborate [email protected]@nifi.apache.org

Submit Ideas or Issues: https://issues.apache.org/jira/browse/NIFI

Follow us on Twitter: @apachenifi

Thank you!