SplunkLive Sydney Scaling and best practice for Splunk on premise and in the cloud

Splunk scaling & best practice

Nico van der Walt, Client Architect, Splunk

Copyright © 2016 Splunk Inc.

AGENDA

• Introduction
• 3 Tier Approach
• Forwarding Architecture
• Indexing Architecture
• Search Architecture
• Sizing Recap
• Sizing Examples
• Monitoring
• Q&A

3 Tier Approach

Sizing Considerations

Vital Info
• Amount of incoming data
• Amount of indexed (stored) data
• Number of concurrent users
• Number of saved searches
• Types of searches
• Specific Splunk apps

http://docs.splunk.com/Documentation/Splunk/latest/Installation/Performancechecklist

Splunk 3 Tier Architecture

Enterprise-class Scale, Resilience and Interoperability

Send data from thousands of servers using any combination of Splunk forwarders

Auto load-balanced forwarding to Splunk Indexers

Offload search load to Splunk Search Heads

Reference Hardware

All instances x64, CPU > 2 GHz per core

Indexer
• Core Splunk*: 12 CPU cores, 12 GB of RAM, 800 IOPS per indexer, RAID 1+0, data ingest: 150-250 GB/day
• Enterprise Security (ES)†: 16 CPU cores, 32 GB of RAM, 800 IOPS per indexer, RAID 1+0, data ingest: 100 GB/day

Search Head
• Core Splunk*: 16 CPU cores, 12 GB of RAM, 2 x 300 GB 10k rpm SAS in RAID 1
• Enterprise Security (ES)†: 16 CPU cores, 32 GB of RAM, 2 x 300 GB 10k rpm SAS in RAID 1

* http://docs.splunk.com/Documentation/Splunk/latest/Capacity/Referencehardware
† http://docs.splunk.com/Documentation/ES/latest/Install/DeploymentPlanning

Required Reading

Distributed Deployment Manual
• http://docs.splunk.com/Documentation/Splunk/latest/Deploy/Distributedoverview

Highlights
• Reference hardware specs
• How searches affect performance
  • Dense/Rare/Sparse
• App considerations
• Summary table

Forwarding Architecture

Forwarding Tier

Design Factors
• Syslog Collectors (HA)
• DB Connect Inputs
  • E.g. McAfee EPO data
• TA Inputs
  • E.g. Check Point
• Assorted Inputs
  • Microsoft AD logs
  • Microsoft Exchange Server
  • Microsoft SharePoint logs
  • Log4j, Linux, IIS
  • …

Syslog Collectors

• Best practice to use dedicated syslog servers
• Syslog-NG/rsyslog recommended
• Syslog can write events to dedicated log files, allowing for easy sourcetype classification on inputs

Syslog Collectors

• Using a Load Balancer/VIP with Linux Heartbeat to provide failover for the syslog listener
• Syslog-NG PE client-side failover

[Diagram: Router (Physical), two Load Balancers and two Syslog-NG Servers; Syslog on 514/tcp & 514/udp]

Forwarder for TAs

• TA-McAfee requires DB Connect to pull endpoint events
• TA-Checkpoint uses the LEA Client to retrieve Firewall log events
• Not an HA design, but could be hosted on a VM to standby or failover

[Diagram: Heavy Forwarder (Linux) running TA-McAfee (DB Connect) and TA-Checkpoint; ePO Database; Checkpoint Server; Firewall]

Deployment Server

• Deployment Server to manage Linux and Windows forwarders
• Not an HA design, but could be hosted on a VM to standby or failover

[Diagram: Splunk Forwarders get apps from the Deployment Server at splkds.internal.door2door.com:8089]

Forwarding Tier

[Diagram: the complete forwarding tier. Components shown: Forwarders (Linux), Forwarders (Windows), Windows SharePoint Server, Windows AD Server, two Syslog-NG Servers receiving Syslog on 514/tcp & 514/udp via the Router (Physical) and two Load Balancers, a Heavy Forwarder (Linux) running TA-McAfee (DB Connect) and TA-Checkpoint with the ePO Database, Checkpoint Server and Firewall, the Deployment Server, and the Indexers. Splunk AutoLB to splkidx.internal.door2door.com:9997. Splunk Forwarders get apps from splkds.internal.door2door.com:8089.]

Forwarding Tier Design Best Practices

• Use a Syslog Server for Syslog data
• Be careful with Intermediate forwarders
  • They can introduce bottlenecks
  • Reduce the distribution of events across Indexers
• May need to increase UF thruput setting for high velocity sources
  • [thruput]
  • maxKBps
• AutoLB will spread over all available indexers, but don’t assume evenly!
  • Enable forceTimebasedAutoLB for UF

Data Distribution Imbalance

Even data distribution is crucial in parallel computing.

Ways to improve data distribution:
• Enable parallel pipelines on heavy forwarders (in server.conf)
• Route directly from Universal forwarders where possible
• Make the following changes to forwarders’ outputs.conf:
  • forceTimebasedAutoLB = true
  • autoLBFrequency = x

Examine saved search time windows. The example below has many searches over a 5 minute window and some searches over a 1 minute window. autoLBFrequency multiplied by the number of indexers should divide evenly into 5 minutes, or into 1 minute if possible.

| tstats summariesonly=t count WHERE index="*" by splunk_server _time | timechart span=5m sum(count) by splunk_server

[Chart] 6 Indexers; autoLBFrequency=30: uneven distribution of workload over 5 minute periods. Unpredictable workload variation.

[Chart] 6 Indexers; autoLBFrequency=15: better distribution over 5 minutes. autoLBFrequency=10 would be even better as there are 6 indexers.
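The divisibility rule above can be checked mechanically. The following is a minimal Python sketch, not a Splunk tool. It assumes an idealized forwarder that visits each indexer once per rotation (real AutoLB picks targets randomly), and the function names and the 5 to 60 second search range are illustrative choices rather than Splunk settings.

```python
def rotation_seconds(auto_lb_frequency: int, indexer_count: int) -> int:
    """Seconds for one idealized pass over every indexer."""
    if auto_lb_frequency <= 0 or indexer_count <= 0:
        raise ValueError("frequency and indexer count must be positive")
    return auto_lb_frequency * indexer_count


def fits_window(auto_lb_frequency: int, indexer_count: int, window_seconds: int) -> bool:
    """True when the search window holds a whole number of full rotations."""
    return window_seconds % rotation_seconds(auto_lb_frequency, indexer_count) == 0


def candidate_frequencies(indexer_count: int, window_seconds: int,
                          lowest: int = 5, highest: int = 60) -> list:
    """All autoLBFrequency values in [lowest, highest] that fit the window."""
    return [f for f in range(lowest, highest + 1)
            if fits_window(f, indexer_count, window_seconds)]


if __name__ == "__main__":
    # 6 indexers, saved searches over 5 minute (300 s) and 1 minute (60 s) windows
    print(candidate_frequencies(6, 300))
    print(candidate_frequencies(6, 60))
```

For 6 indexers and a 300 second window, a setting of 10 passes the check while 15 and 30 do not; 15 leaves a smaller remainder than 30, which is consistent with the chart captions above.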

Data Imbalance - Troubleshoot

Troubleshooting:
• Validate firewall rules are in place
• Check that all forwarders have the correct outputs
• Ensure indexers are all listening on the proper port
• Does splunkd.log have anything to say?
• Use the Indexing Overview and Configuration Overview (btool saves the day)

Other Causes:
• Simple misconfiguration
• Data processing queues filling up and forwarders timing out and jumping to next indexer
  • Check Distributed Indexing Performance in the DMC for queue filling - typical sign of disk performance issues
• Indexer affinity - the forwarders get stuck to one indexer because EOF never met
  • forceTimebasedAutoLB can help! http://blogs.splunk.com/2014/03/18/time-based-load-balancing/

How Many Deployment Servers?

Rule of thumb says: 1 per 10k clients @ 10-15 min polling period
• Adjust polling period to increase total clients supported
• Small deployments can share the same instance as other management instances (License Master, Cluster Master, etc.)
• Low requirement for disk performance (good candidate for virtualization)
• Or use something other than deployment server
  • puppet, SCCM, cfengine, chef…
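A back-of-envelope sketch of the rule of thumb above. The names are illustrative, and the linear scaling of capacity with polling period is an assumption made for this example; the slide only says that a longer polling period increases the number of clients supported.

```python
import math

BASELINE_CLIENTS = 10_000      # rule-of-thumb clients per deployment server
BASELINE_POLL_SECONDS = 600    # lower end of the 10-15 minute polling period


def deployment_servers_needed(clients: int,
                              poll_seconds: int = BASELINE_POLL_SECONDS) -> int:
    """Estimate deployment server count for a given client population."""
    if clients < 0 or poll_seconds <= 0:
        raise ValueError("clients must be >= 0 and poll_seconds must be > 0")
    # Assumption: supported clients grow in proportion to the polling period.
    capacity = BASELINE_CLIENTS * poll_seconds / BASELINE_POLL_SECONDS
    return max(1, math.ceil(clients / capacity))


if __name__ == "__main__":
    print(deployment_servers_needed(25_000))                     # 10 minute polling
    print(deployment_servers_needed(25_000, poll_seconds=1800))  # 30 minute polling
```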

Indexing Architecture

Indexing Tier

Design Factors
• Peak ingest volume
• High Availability - Indexer Replication
• 10% Disk Space Contingency
• Data retention

Cluster Sizing Calculator: http://splunk-sizing.appspot.com

How Many Indexers?

Rule of thumb says: 1 indexer per 150-250 GB/day
• 80-100 GB with Enterprise Security

Leave room for:
• Daily peaks

Need more indexers for:
• Heavy reporting
• More users
• Slower disks, slower CPUs, fewer CPUs
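As a rough illustration of the rule of thumb above, the sketch below divides daily ingest by a per-indexer capacity. The function name, the default of 150 GB/day (the conservative end of the range) and the `peak_headroom` parameter are assumptions made for this example. It ignores search load, user count and disk speed, all of which can push the number up.

```python
import math


def indexers_needed(daily_ingest_gb: float,
                    per_indexer_gb_per_day: float = 150,
                    peak_headroom: float = 0.0) -> int:
    """Estimate indexer count from daily ingest volume.

    peak_headroom is a fraction added for daily peaks, e.g. 0.25 for 25%.
    """
    if daily_ingest_gb < 0 or per_indexer_gb_per_day <= 0 or peak_headroom < 0:
        raise ValueError("inputs must be non-negative and capacity positive")
    sized_volume = daily_ingest_gb * (1 + peak_headroom)
    return max(1, math.ceil(sized_volume / per_indexer_gb_per_day))


if __name__ == "__main__":
    print(indexers_needed(1000))                            # core Splunk, conservative
    print(indexers_needed(1000, per_indexer_gb_per_day=80)) # Enterprise Security, conservative
```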

Storage Calculations

RAID Configuration
• Amount of raw disk
• Fault tolerance
• Available IOPS

Filesystem Overhead
• inodes consume space

Wiggle room
• Additional replicated buckets when a node fails
• Unbalanced replicated buckets

Splunk internal logs, Summary Indexes, Report Acceleration, Accelerated Data Models

Storage Types

• Local vs Direct Attached vs SAN vs NAS
• SSD/Flash vs Spinning Disk
  • SSDs offer much higher IOPS with very low latency
  • Significant performance increases with Sparse Searches

Index Replication (aka Index Clustering)

What is it?
• Data is replicated to 1 or more indexers based on indexes
• Splunk Cluster Master controlled

Basics
• Master Node (manages indexing and searching location)
• Horizontal Scaling

HA vs DR
• HA - Data is made available on 1 or more indexers in one location
• DR - Multisite clustering. All data exists in multiple locations

Benefits of Clustering

• Data redundancy
• Data availability
• Indexer resiliency
• Simpler management of indexers
• Simpler setup of distributed search
• Multi-site clustering allows site-specific search to reduce WAN traffic

Index Clustering Sizing

Replication factor
✓ Determine the number of rebuildable copies of data to maintain

Search factor
✓ Determine the number of searchable copies of the data

Data Retention equation based on syslog data
✓ Total disk usage across cluster in GB = (RepFactor * 0.15 + SearchFactor * 0.35) * DatasetSizeGB

Increase in I/O, CPU, and disk requirement
• Means daily indexing volume per server will be lower

Search factor increases disk usage by ~30% (raw data + tsidx)
Replication factor increases disk usage by ~10% (only raw data)
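The retention equation above can be written as a small function. This is a sketch under stated assumptions: the function and parameter names are invented for illustration, `DatasetSizeGB` is taken to mean daily ingest multiplied by retention days, and the 0.15 and 0.35 coefficients are the syslog-based figures from the slide, which may not hold for other data types.

```python
def dataset_size_gb(daily_ingest_gb: float, retention_days: int) -> float:
    """Raw data volume kept for the retention period (assumed meaning of DatasetSizeGB)."""
    return daily_ingest_gb * retention_days


def cluster_disk_usage_gb(dataset_gb: float, rep_factor: int, search_factor: int) -> float:
    """Total disk usage across the cluster, per the slide's syslog-based equation."""
    # A cluster cannot keep more searchable copies than it keeps copies in total.
    if not 1 <= search_factor <= rep_factor:
        raise ValueError("search factor must be between 1 and the replication factor")
    return (rep_factor * 0.15 + search_factor * 0.35) * dataset_gb


if __name__ == "__main__":
    dataset = dataset_size_gb(daily_ingest_gb=100, retention_days=90)
    total = cluster_disk_usage_gb(dataset, rep_factor=3, search_factor=2)
    print(f"{dataset:.0f} GB raw -> about {total:.0f} GB on disk across the cluster")
```

With these inputs (100 GB/day, 90 days, replication factor 3, search factor 2) the equation gives roughly 10,350 GB across the cluster, before any disk space contingency.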

Cluster Master Server

• Indexer Apps are deployed via CM
• Not an HA design, but could be hosted on a VM to standby or failover

Indexing Tier

[Diagram: Master Cluster Node]

Search Architecture

Search Tier

Design Factors
• High Availability
• Search Head Clustering
• # users
• # concurrent searches
• Forward all data to indexers

Search Head Clustering

What is it?
• Group search heads into a cluster as a single entity
• Provides HA at the Search Head layer
• Search Head Captain controlled
• RAFT protocol to pick captain

Basics
• A captain gets elected dynamically (pre 6.3) or can be defined manually (6.3)
• Knowledge objects and search artifacts are replicated
• Search workload distribution
• Replication using local storage, NOT over NFS

SHC & Deployer

• Search Head Cluster Apps need to be installed by the Deployer
• A minimum of 3 Search Heads are required for a SHC
• No Exchange, no DBX with SHC
• ES will still require a separate Search Head or dedicated SHC
• Use LDAP/AD/SSO for user Authentication
• Load Balancer configured for sticky sessions

Search Tier

[Diagram: three Search Heads, Load Balancer, Deployer and License Server]

How Many Search Heads?

Rule of thumb says: 1 per 20-40 concurrent queries
• Limit is concurrent queries
• Search Query normally uses up to 1 CPU core
  • 6.3 Parallelization can leverage more
• Don’t add search heads; add indexers: indexers do most work
  • Unless you need HA/Search Clustering
• Scale vertically if infrastructure allows it. Add CPU, add memory.
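A rough sketch of the rule of thumb above, using hypothetical function and parameter names. It defaults to the conservative 20 queries per search head and applies the three-member minimum that a search head cluster requires. Since indexers do most of the work, treat the result as a starting point only.

```python
import math

SHC_MINIMUM_MEMBERS = 3  # minimum number of search heads in a search head cluster


def search_heads_needed(concurrent_queries: int,
                        queries_per_search_head: int = 20,
                        clustered: bool = False) -> int:
    """Estimate search head count from peak concurrent queries."""
    if concurrent_queries < 0 or queries_per_search_head <= 0:
        raise ValueError("queries must be >= 0 and per-head capacity > 0")
    needed = max(1, math.ceil(concurrent_queries / queries_per_search_head))
    if clustered:
        needed = max(needed, SHC_MINIMUM_MEMBERS)
    return needed


if __name__ == "__main__":
    print(search_heads_needed(100))                 # standalone, conservative
    print(search_heads_needed(30, clustered=True))  # small load, but clustered
```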

Sizing Examples

Real World Examples: Cisco Unified Computing System (UCS)

• Search Head:
  • UCS C220 M4
  • 24 cores
• Indexer:
  • UCS C240 M4
  • 24 cores

Cisco Validated Design (CVD) for UCS
• 267 page Reference Manual for deploying 1 TB/day on UCS
• Validated and Benchmarked by Cisco and Splunk

Distributed Deployment - Common Components

Search-Head: 3 X Cisco UCS C220-M4 Rack Servers, each with:
▫ CPU: 2 X E5-2680v3 (24 cores)
▫ Memory: 256 GB
▫ Cisco 12 Gbps SAS modular RAID controller (2 GB FBWC cache)
▫ Cisco VIC 1227
▫ 2 X 600 GB 15K SFF SAS drives (RAID 1)

Admin/Master Nodes: 2 X Cisco UCS C220-M4 Rack Servers, each with:
▫ 2 X E5-2620v3 (12 cores)
▫ Memory: 256 GB
▫ Cisco 12 Gbps SAS modular RAID controller (2 GB FBWC cache)
▫ Cisco VIC 1227
▫ 2 X 600 GB 15K SFF SAS drives (RAID 1)

Network Fabric: 2 X Cisco UCS 6248UP 48-Port Fabric Interconnects

Distributed Deployment - Retention vs. Performance

Distributed Deployment with High Capacity
• Indexer: 16 X C240-M4 rack servers, each with:
  ▫ CPU: 2 X E5-2680v3 (24 cores)
  ▫ Memory: 256 GB
  ▫ Cisco 12 Gbps SAS modular RAID controller (2 GB FBWC cache)
  ▫ Cisco VIC 1227
  ▫ 24 X 1.2 TB 10K SAS in RAID 10
  ▫ 2 X 120 GB SSD in RAID 1 for OS
• Retention Capability: >1 TB/Day w/ 1 year+ retention
• Indexing Capacity: 4 TB/Day
• Indexing Capacity w/ Replication: 2 TB/Day
• Raw Index Capacity: 236 TB
• Expected Data Capacity: at 2:1 compression, 472 TB
• Key Use-Cases: Enterprises requiring larger data retention
• Servers Count: 21 (37 RU)
• Scalability:
  ▫ Additional Search-Head(s)
  ▫ 1 to 16 additional Indexers (refer to High Capacity Indexer configuration)

Distributed Deployment with High Performance
• Indexer: 16 X C220-M4 rack servers, each with:
  ▫ CPU: 2 X E5-2680v3 (24 cores)
  ▫ Memory: 256 GB
  ▫ Cisco 12 Gbps SAS modular RAID controller (2 GB FBWC cache)
  ▫ Cisco VIC 1227
  ▫ 6 X 800 GB SSD-EP in RAID 5
  ▫ 2 X 600 GB 10K SFF SAS HDD w/ RAID 1 for OS
• Retention Capability: >1.25 TB/Day w/ 90 day retention
• Indexing Capacity: 8 TB/Day
• Indexing Capacity w/ Replication: 4 TB/Day
• Raw Index Capacity: 64 TB
• Expected Data Capacity: at 2:1 compression, 128 TB
• Key Use-Cases: Ability to support large number of concurrent users that require faster response time
• Servers Count: 21 (21 RU)
• Scalability:
  ▫ Additional Search-Head(s)
  ▫ 1 to 16 additional Indexers (refer to High Performance Indexer configuration)

Cloud Deployments

Cloud Considerations
• Authentication restrictions
• Data transfer costs
• Security - SSL Tunnel
• Zones
• Hybrid deployments

VMware: http://www.splunk.com/web_assets/pdfs/secure/Splunk_and_VMware_VMs_Tech_Brief.pdf

AWS: https://www.splunk.com/pdfs/technical-briefs/deploying-splunk-enterprise-on-amazon-web-services-technical-brief.pdf

Azure: http://www.splunk.com/pdfs/technical-briefs/deploying-splunk-enterprise-on-microsoft-azure.pdf

Real World Examples: Amazon Web Services EC2

• Search Head:
  • c4.4xlarge + EBS storage
  • c4.8xlarge + EBS storage
• Indexer:
  • c4.4xlarge + EBS storage
  • c4.8xlarge + EBS storage
  • d2.4xlarge (IR)

Splunk Cloud Overview

Full Featured, Enterprise Ready, Easy

What We Built
• Full feature set of Splunk Enterprise
• Access to apps
• High availability across Indexers & Search Heads
• Multiple AWS availability zones
• Dedicated Cloud environments
• Secure
• 10x Bursting
• Splunk Cloud fully monitored using Splunk Enterprise

Built for 100% Uptime

What You Do
• Forward data
• Search
• Monitor
• Get value fast

What We Do
• Hardware setup
• Storage
• Scaling
• Monitoring

Hybrid Search

[Diagram: Search Head(s) and Indexer(s) across On Premises, Private Cloud and Public Cloud]

Single Pane of Glass Visibility

Sizing Recap

Top 5 Things To Consider

• Indexer Storage requirements
• Minimum buy-in for a SHC is 3
• Use VMs for CM/LS/DS/Deployer if possible
• Consider a dedicated SH for management
  • Distributed Management Console
  • Splunk Health Check Overview App
  • Search Activity App
• When in doubt, add another Indexer

More Is Better?

CPUs
• 8, 12, 16, 24, 32, etc.
• Pipelines - New 6.3 feature for parallelization!
• Indexing can handle higher bursts with multiple index pipeline sets
• Certain searches can be improved with multiple search pipeline sets
  • Historical batch - return the data without worrying about time order (… | stats count)
  • Indexers still need to do the heavy lifting (search exists on indexer AND search head)

Memory
• Good for search heads and indexers (16+ GB)
  • Benefits from extra RAM used by OS for caching

Disks
• Faster is better - 10k-15k rpm strongly recommended, SSD preferred
• More disks in RAID 1+0 = Faster
• RAID 5+1 or 6 can be good for Cold buckets
• SSDs can also provide benefit for rare term searches and many concurrent jobs

Putting It All Together

Monitoring

Monitoring Tools

So what’s out there and what’s the difference?

Distributed Management Console (DMC) - Built in and only available on v6.2+
• http://docs.splunk.com/Documentation/Splunk/latest/Admin/ConfiguretheMonitoringConsole
• Splunk supported and focuses on all facets of the deployment

Fire Brigade
• https://splunkbase.splunk.com/app/1632/
• Detailed look at index/bucket activity and capacity

SoS (Splunk on Splunk)
• https://splunkbase.splunk.com/app/748/
• Legacy Splunk troubleshooting tool

Splunk Health Overview
• https://splunkbase.splunk.com/app/1919/
• Combination of views found to be helpful in the field

Note: Deployment monitor app is deprecated - try to stay away from it. Many of these app functionalities are being rolled into the DMC.

How are things, overall?

High level environment status - quick view of what’s up/down/not reporting:
• Forwarder health - finding forwarders that we haven’t seen for a while
• Data source health - how are our data feeds doing?
• REST endpoints (| rest /services/server/info) - looking at system information, possibly under provisioned ones

Spotting warnings and errors within Splunk _internal:
• index=_internal sourcetype=splunkd (log_level=ERROR OR log_level=WARN) | cluster showcount=t | table cluster_count host log_level message | sort - cluster_count | rename cluster_count AS count, log_level AS level
• index=_internal sourcetype=splunkd log_level!=INFO | timechart count by component

Track resource usage:
• Say hello to _introspection (Splunk 6.1+)
• Captures disk and other resource metrics (by default on full installs)
• http://docs.splunk.com/Documentation/Splunk/latest/Troubleshooting/Abouttheplatforminstrumentationframework

Dashboards to help save the day:
• Health Status - Splunk Health Overview
• Instance - Distributed Management Console
• Indexing Performance - Distributed Management Console
• Resource Usage - Splunk Health Overview
• License Usage - Splunk Health Overview

Environment Overview

What are we reporting on?
• _internal
• _introspection
• metadata and using tstats: http://docs.splunk.com/Documentation/Splunk/latest/SearchReference/Tstats
• REST endpoints
  • | rest /services/server/info
  • | rest /services/server/roles
  • | rest /services/server/status/resource-usage
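The `| rest` searches above read endpoints that splunkd exposes on its management port. As a sketch, the helper below builds the corresponding HTTPS URLs for use outside the search bar. The host name is a placeholder, port 8089 is assumed to be the management port (the default), and `output_mode=json` asks for a JSON response. The code only builds strings; it makes no network calls and handles no authentication.

```python
from urllib.parse import urlencode, urlunsplit

# Endpoint paths taken from the slide's "| rest" examples.
ENDPOINTS = (
    "/services/server/info",
    "/services/server/roles",
    "/services/server/status/resource-usage",
)


def rest_url(host: str, endpoint: str, port: int = 8089,
             output_mode: str = "json") -> str:
    """Build the management-port URL for one REST endpoint."""
    if not endpoint.startswith("/"):
        raise ValueError("endpoint must start with '/'")
    query = urlencode({"output_mode": output_mode})
    return urlunsplit(("https", f"{host}:{port}", endpoint, query, ""))


if __name__ == "__main__":
    # "splunk.example.com" is a hypothetical host used only for illustration.
    for path in ENDPOINTS:
        print(rest_url("splunk.example.com", path))
```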

How to use the tools available to check overall health…

Q&A
