© Verizon 2016 All Rights Reserved. Information contained herein is provided AS IS and subject to change without notice. All trademarks used herein are property of their respective owners.

Near real-time network anomaly detection and traffic analysis
Pankaj Rastogi, Tech Manager
Debasish Das, Data Scientist

Near Real-Time Network Anomaly Detection and Traffic Analysis using Spark based Lambda Architecture

Agenda

• Network data overview

• DDoS as network anomaly

• Design challenges

• Trapezium overview

• Results

• Q&A

Network: Aggregated data overview

• Simple Network Management Protocol (SNMP): a network management console collects statistics from network devices (routers, bridges, intelligent hubs)

• Data collection: aggregated per router interface

• Inbound and outbound traffic statistics sampled at a regular interval:
  - Bits per second (bps)
  - Packets per second (pps)
  - CPU
  - Memory

[Figure: an SNMP manager polls routers over the SNMP protocol and collects SNMP statistics]
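Routers usually expose traffic statistics as cumulative interface counters (for example the IF-MIB `ifInOctets`/`ifOutOctets` Counter32 objects), so a bps value has to be derived from two successive polls. A minimal sketch, assuming 32-bit counters that wrap at most once between polls; the function name and sample values are illustrative and not part of the original pipeline.

```python
COUNTER32_MODULUS = 2**32  # Counter32 values wrap back to 0 after 2^32 - 1


def octets_to_bps(prev_octets: int, curr_octets: int, interval_s: float) -> float:
    """Convert two successive octet-counter samples into bits per second.

    Assumes at most one counter wrap occurred between the two polls.
    """
    if interval_s <= 0:
        raise ValueError("sampling interval must be positive")
    delta = curr_octets - prev_octets
    if delta < 0:
        # The counter wrapped around: add the modulus back.
        delta += COUNTER32_MODULUS
    return delta * 8 / interval_s


# Two polls 300 s apart (a 5-minute sampling interval).
print(octets_to_bps(1_000_000, 4_750_000, 300))  # 100000.0
# A wrap: the counter went from just below 2^32 back to a small value.
print(octets_to_bps(2**32 - 1_000, 2_000, 60))   # 400.0
```

On fast interfaces a 32-bit octet counter can wrap more than once per polling interval; the 64-bit `ifHCInOctets`/`ifHCOutOctets` counters avoid that.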

Network: Flow data overview

[Figure: a web browser (192.168.1.10) opens a TCP connection to a web server (10.1.2.3); request flow #1 goes from browser to server, response flow #2 from server to browser]

• Flow #1
  - Source address 192.168.1.10
  - Destination address 10.1.2.3
  - Source port 1025
  - Destination port 80
  - Protocol TCP

• Flow #2
  - Source address 10.1.2.3
  - Destination address 192.168.1.10
  - Source port 80
  - Destination port 1025
  - Protocol TCP

• A single flow may consist of several packets and many bytes

• A TCP connection consists of two flows
  - Each flow mirrors the other
  - TCP flags can be used to determine the client and the server

• ICMP, UDP and other IP protocol streams may contain one or two flows
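The mirroring rule can be made concrete: swapping source and destination address and port turns the key of flow #1 into the key of flow #2, and a direction-independent key lets both flows be grouped as one conversation. A small sketch; the class and function names are illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FlowKey:
    """The five fields used above to identify a unidirectional flow."""
    src_addr: str
    dst_addr: str
    src_port: int
    dst_port: int
    protocol: str

    def mirror(self) -> "FlowKey":
        """Key of the opposite-direction flow of the same connection."""
        return FlowKey(self.dst_addr, self.src_addr, self.dst_port, self.src_port, self.protocol)


def conversation_key(flow: FlowKey) -> tuple:
    """Direction-independent key: both flows of one connection map to it."""
    a = (flow.src_addr, flow.src_port)
    b = (flow.dst_addr, flow.dst_port)
    low, high = sorted([a, b])
    return (low, high, flow.protocol)


request = FlowKey("192.168.1.10", "10.1.2.3", 1025, 80, "TCP")
response = request.mirror()
print(response)
print(conversation_key(request) == conversation_key(response))  # True
```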

DDoS as network anomaly

[Figure: an attacker uses remote command & control to direct bots at a customer behind a router. Labels: attacker + bots + customer locations; NetFlow: attacker + bots + customer IPs; SNMP: customer + volumetric attack magnitude]

SNMP

Anomaly detection on time series

Nonparametric models for SNMP DDoS detection
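The slides do not name the nonparametric model used. As one common nonparametric choice, swapped in here purely for illustration, a sample can be flagged when it lies more than k median absolute deviations (MAD) from the median of a trailing window. Window size, threshold and names are assumptions.

```python
from statistics import median


def mad_anomalies(series, window=12, k=5.0):
    """Return indices whose value deviates from the trailing-window median
    by more than k * MAD (median absolute deviation).

    Median and MAD make no distributional assumption, which is what makes
    this detector nonparametric.
    """
    anomalies = []
    for i in range(window, len(series)):
        past = series[i - window:i]
        med = median(past)
        mad = median(abs(x - med) for x in past)
        if mad == 0:
            # Flat history: treat any change as anomalous.
            if series[i] != med:
                anomalies.append(i)
            continue
        if abs(series[i] - med) / mad > k:
            anomalies.append(i)
    return anomalies


# Inbound bps for one interface, one sample per interval (illustrative values).
bps = [100, 104, 98, 101, 99, 103, 97, 102, 100, 98, 101, 99, 950, 100, 102]
print(mad_anomalies(bps))  # [12]
```

Because median and MAD are barely moved by a single spike, the samples right after the burst are not flagged.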

SNMP

Network analysis on SNMP
• Usage of each router/interface
• Find routers with high packet flow
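Both SNMP queries reduce to simple aggregations. A sketch, assuming hypothetical per-poll records with `router`, `interface`, `ts`, `in_bps`, `out_bps`, `in_pps` and `out_pps` fields. "High packet flow" is read here as the highest peak pps summed over a router's interfaces, which is one possible interpretation.

```python
from collections import defaultdict

# Hypothetical SNMP samples: one record per (router, interface, poll time).
samples = [
    {"router": "r1", "interface": "ge-0/0/0", "ts": 0, "in_bps": 8e6, "out_bps": 2e6, "in_pps": 900, "out_pps": 300},
    {"router": "r1", "interface": "ge-0/0/1", "ts": 0, "in_bps": 1e6, "out_bps": 1e6, "in_pps": 100, "out_pps": 100},
    {"router": "r2", "interface": "ge-0/0/0", "ts": 0, "in_bps": 4e6, "out_bps": 4e6, "in_pps": 500, "out_pps": 500},
    {"router": "r1", "interface": "ge-0/0/0", "ts": 300, "in_bps": 6e6, "out_bps": 2e6, "in_pps": 700, "out_pps": 300},
]


def usage_by_interface(samples):
    """Average total (inbound + outbound) bps per (router, interface)."""
    totals = defaultdict(lambda: [0.0, 0])  # key -> [sum of bps, sample count]
    for s in samples:
        acc = totals[(s["router"], s["interface"])]
        acc[0] += s["in_bps"] + s["out_bps"]
        acc[1] += 1
    return {key: total / count for key, (total, count) in totals.items()}


def top_routers_by_pps(samples, n=3):
    """Rank routers by peak pps, summing all interfaces at each poll time."""
    per_poll = defaultdict(float)  # (router, ts) -> pps over all interfaces
    for s in samples:
        per_poll[(s["router"], s["ts"])] += s["in_pps"] + s["out_pps"]
    peak = defaultdict(float)
    for (router, _ts), pps in per_poll.items():
        peak[router] = max(peak[router], pps)
    return sorted(peak.items(), key=lambda kv: kv[1], reverse=True)[:n]


print(usage_by_interface(samples))
print(top_routers_by_pps(samples))  # [('r1', 1400.0), ('r2', 1000.0)]
```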

Anomaly detection on high-frequency data

Parametric models for NetFlow DDoS detection

• Generate customer-IP-focused features based on the DDoS definition

NetFlow

[Chart: NetFlow flow count over time, 9/14/15 0:00 to 3:59 at one-minute resolution; y-axis: flow (0 to 300,000); x-axis: time]
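The slides do not specify the parametric model either. A minimal sketch, assuming a per-customer-IP Gaussian model of one feature (for example, flows per minute toward the customer), updated online with Welford's algorithm and flagging values more than 3 standard deviations above the mean. The threshold, the choice not to learn from flagged points, and all names are assumptions.

```python
import math
from dataclasses import dataclass


@dataclass
class GaussianModel:
    """Running mean/variance of one feature for one customer IP."""
    count: int = 0
    mean: float = 0.0
    m2: float = 0.0  # sum of squared deviations from the mean

    def update(self, x: float) -> None:
        # Welford's online update: one pass, numerically stable.
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)

    @property
    def std(self) -> float:
        return math.sqrt(self.m2 / (self.count - 1)) if self.count > 1 else 0.0

    def is_anomaly(self, x: float, z: float = 3.0) -> bool:
        # Only upward deviations matter for a volumetric attack.
        if self.count < 2 or self.std == 0.0:
            return False
        return (x - self.mean) / self.std > z


def score_stream(features, z=3.0):
    """features: iterable of (customer_ip, value). Returns (models, anomalies).

    Flagged points are reported but not folded into the model, so an
    ongoing attack does not inflate the baseline (a design assumption).
    """
    models = {}
    anomalies = []
    for ip, value in features:
        model = models.setdefault(ip, GaussianModel())
        if model.is_anomaly(value, z):
            anomalies.append((ip, value))
        else:
            model.update(value)
    return models, anomalies


feats = [("203.0.113.7", v) for v in [100, 110, 90, 105, 95, 5000, 104]]
models, anomalies = score_stream(feats)
print(anomalies)  # [('203.0.113.7', 5000)]
```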

NetFlow

Network analysis on NetFlow
• Find the customer with the maximum upload bytes
• Find the customer with the maximum download bytes
• Find peak usage for a given customer
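A small sketch of the three NetFlow queries, assuming flow records with `ts`, `src`, `dst` and `bytes` fields and a known set of customer IPs. "Upload" is taken to mean bytes sent by a customer and "download" bytes received. In Spark these would be groupBy/aggregate queries; plain Python keeps the sketch self-contained.

```python
from collections import Counter


def upload_download(flows, customers):
    """Sum bytes sent (upload) and received (download) per customer IP."""
    up, down = Counter(), Counter()
    for f in flows:
        if f["src"] in customers:
            up[f["src"]] += f["bytes"]
        if f["dst"] in customers:
            down[f["dst"]] += f["bytes"]
    return up, down


def peak_usage(flows, customer, bucket_s=60):
    """(bucket_start, bytes) of the busiest time bucket for one customer."""
    per_bucket = Counter()
    for f in flows:
        if customer in (f["src"], f["dst"]):
            per_bucket[f["ts"] - f["ts"] % bucket_s] += f["bytes"]
    return max(per_bucket.items(), key=lambda kv: kv[1]) if per_bucket else None


customers = {"198.51.100.1", "198.51.100.2"}
flows = [
    {"ts": 0, "src": "198.51.100.1", "dst": "203.0.113.9", "bytes": 5_000},
    {"ts": 30, "src": "203.0.113.9", "dst": "198.51.100.1", "bytes": 80_000},
    {"ts": 70, "src": "198.51.100.2", "dst": "203.0.113.9", "bytes": 20_000},
    {"ts": 90, "src": "203.0.113.9", "dst": "198.51.100.2", "bytes": 10_000},
    {"ts": 95, "src": "203.0.113.9", "dst": "198.51.100.1", "bytes": 1_000},
]
up, down = upload_download(flows, customers)
print(up.most_common(1))                    # [('198.51.100.2', 20000)]
print(down.most_common(1))                  # [('198.51.100.1', 81000)]
print(peak_usage(flows, "198.51.100.1"))    # (0, 85000)
```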

Why we chose Apache Spark

• Good support for machine learning algorithms

• Spark's micro-batching capabilities are sufficient for our streaming requirements

• Vibrant Spark community

• Excellent talent availability within our group

Lessons learned -- Spark

• Coalesce partitions when writing to HDFS

• A seemingly harmless action such as take(1) can be very costly

• Multiple actions on a DataFrame/DStream result in multiple jobs

• Spark DStream checkpointing with RDD models

• spark.sql.parquet.compression.codec: snappy

• spark.sql.shuffle.partitions: 2000+ when the partition block size crosses 2 GB
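The 2 GB rule of thumb can be turned into back-of-the-envelope arithmetic. A sketch, assuming shuffle data is spread evenly across partitions and using a hypothetical 512 MiB target per partition. The starting value of 200 is Spark's default for `spark.sql.shuffle.partitions`; the helper itself is illustrative.

```python
import math

GiB = 1024**3
BLOCK_LIMIT = 2 * GiB   # the 2 GB threshold from the slide
SLIDE_MINIMUM = 2000    # the slide's "2000+" rule of thumb


def suggest_shuffle_partitions(total_shuffle_bytes: int, current: int = 200,
                               target_bytes: int = GiB // 2) -> int:
    """Suggest a value for spark.sql.shuffle.partitions.

    Assumes shuffle data is spread evenly over partitions; skewed keys make
    real blocks larger, hence the conservative 512 MiB target.
    """
    if total_shuffle_bytes / current <= BLOCK_LIMIT:
        return current  # blocks already fit under the limit
    return max(SLIDE_MINIMUM, math.ceil(total_shuffle_bytes / target_bytes))


print(suggest_shuffle_partitions(100 * GiB))   # 200  (512 MiB per partition)
print(suggest_shuffle_partitions(500 * GiB))   # 2000 (2.5 GiB per partition at 200)
print(suggest_shuffle_partitions(1024 * GiB))  # 2048
```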

Design challenges

• Data source? (e.g., NFS/GFS)
• Algorithms?
• Persistence?

Design challenges -- SNMP

Near real-time model updates required a Lambda architecture
• Batch job MUST process data at a fixed interval (e.g., 15 min)
• Stream job MUST
  > Handle hot starts (e.g., 90 days of data)
  > Analyze data and generate anomalies
  > Update the model every sampling interval
  > Start from the last model timestamp on restart

Coordination between batch and stream processes NEEDED
• Batch job updates a ZooKeeper node at a fixed interval (e.g., 15 min)
• Stream job uses the same ZooKeeper node to load features
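The slide does not show how the ZooKeeper node is read or written. The sketch below replaces ZooKeeper with a hypothetical in-memory node (`InMemoryZNode`, not a real client API) so it runs anywhere: the batch job publishes the timestamp of the latest feature snapshot, and the stream job reloads features only when the node's version changes, which also covers resuming from the last timestamp after a restart. A real deployment would use a ZooKeeper client with a watch or poll instead.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class InMemoryZNode:
    """Hypothetical stand-in for a ZooKeeper node: data plus a version that
    increases on every write (-1 here means "never written")."""
    data: bytes = b""
    version: int = -1

    def set(self, data: bytes) -> None:
        self.data = data
        self.version += 1


def batch_publish(node: InMemoryZNode, feature_timestamp: int) -> None:
    """Batch job: after persisting features for an interval, publish its timestamp."""
    node.set(str(feature_timestamp).encode())


@dataclass
class StreamJob:
    node: InMemoryZNode
    seen_version: int = -1
    loaded_timestamp: Optional[int] = None
    reloads: List[int] = field(default_factory=list)

    def on_micro_batch(self) -> None:
        """Reload features only when the batch job has published a new version.

        A freshly started job has seen_version == -1, so its first call picks up
        the last published timestamp: this is how it resumes after a restart.
        """
        if self.node.version != self.seen_version:
            self.seen_version = self.node.version
            self.loaded_timestamp = int(self.node.data.decode())
            self.reloads.append(self.loaded_timestamp)


node = InMemoryZNode()
stream = StreamJob(node)
batch_publish(node, 1442188800)  # 2015-09-14 00:00 UTC
stream.on_micro_batch()
stream.on_micro_batch()          # nothing new: no reload
batch_publish(node, 1442189700)  # 15 minutes later
stream.on_micro_batch()
print(stream.reloads)            # [1442188800, 1442189700]

restarted = StreamJob(node)      # simulate a stream-job restart
restarted.on_micro_batch()
print(restarted.loaded_timestamp)  # 1442189700
```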

Design challenges -- NetFlow

Seed the model with good parameter estimates
• Batch job populates the initial model parameters
• Stream job hot-starts with the model and detects anomalies
• Stream job updates the model and persists it to Cassandra

Model maintained in Cassandra
• Stream job reads the model from Cassandra into Spark partitions
• Each Spark partition updates the model
• Each Spark partition generates anomalies
• Models across partitions are combined using Spark
• Anomalies are persisted to Cassandra
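If the per-partition model is a running mean and variance (an assumption; the slides do not say what the model contains), partial models can be combined exactly with the pairwise formula of Chan et al. In Spark, a merge function like this one could be passed to `RDD.reduce`; the sketch below uses `functools.reduce` so it runs without a cluster. Names and data are illustrative.

```python
import math
from dataclasses import dataclass
from functools import reduce


@dataclass(frozen=True)
class Stats:
    """Partial model for one feature: count, mean and sum of squared deviations (M2)."""
    count: int
    mean: float
    m2: float

    @staticmethod
    def of(values):
        n = len(values)
        if n == 0:
            return Stats(0, 0.0, 0.0)
        mean = sum(values) / n
        return Stats(n, mean, sum((v - mean) ** 2 for v in values))

    def merge(self, other: "Stats") -> "Stats":
        """Combine two partial models exactly (Chan et al.'s pairwise update)."""
        if self.count == 0:
            return other
        if other.count == 0:
            return self
        n = self.count + other.count
        delta = other.mean - self.mean
        mean = self.mean + delta * other.count / n
        m2 = self.m2 + other.m2 + delta * delta * self.count * other.count / n
        return Stats(n, mean, m2)

    @property
    def variance(self) -> float:
        return self.m2 / (self.count - 1) if self.count > 1 else 0.0


# Flows per minute for one customer IP, as seen by three Spark partitions.
partitions = [[1.0, 2.0], [10.0], [4.0, 6.0, 3.0]]
merged = reduce(Stats.merge, (Stats.of(p) for p in partitions))
direct = Stats.of([v for p in partitions for v in p])
print(merged.count, math.isclose(merged.mean, direct.mean), math.isclose(merged.m2, direct.m2))  # 6 True True
```

The merge is associative, so the order in which Spark combines partitions does not change the result beyond floating-point rounding.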

Network analysis
• Find peak usage for a given customer
• Find the customer with the highest network usage
• Find the number of distinct source IPs connected to a destination IP
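Counting distinct source IPs per destination IP is a grouped distinct count. A sudden jump for one customer is consistent with the bot fan-in pictured in the DDoS figure. A plain-Python sketch with illustrative data; in Spark SQL the same query could use `countDistinct`, or `approx_count_distinct` when an estimate is acceptable at high volume.

```python
from collections import defaultdict


def distinct_sources(flows):
    """Map each destination IP to the number of distinct source IPs seen for it."""
    sources = defaultdict(set)
    for src_ip, dst_ip in flows:
        sources[dst_ip].add(src_ip)
    return {dst_ip: len(srcs) for dst_ip, srcs in sources.items()}


flows = [
    ("203.0.113.5", "198.51.100.1"),
    ("203.0.113.6", "198.51.100.1"),
    ("203.0.113.5", "198.51.100.1"),  # repeated source: counted once
    ("203.0.113.7", "198.51.100.2"),
]
print(distinct_sources(flows))  # {'198.51.100.1': 2, '198.51.100.2': 1}
```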

Network anomaly flow design

Design challenges – multiple applications

Trapezium

What is Trapezium?
• Ability to read data
  > From multiple data sources, e.g., HDFS, NFS, Kafka
  > In batch and streaming modes to support the Lambda architecture
• Ability to write data
  > To multiple data sources, e.g., HDFS, NFS, Kafka
• Plug-and-play architecture
  > Evaluate multiple algorithms
  > Evaluate different features of the same algorithm
• Break down a complex analytics problem into Transactions
• Build a workflow pipeline combining different Transactions
• Validation and filtering of input data
• Embedded ZooKeeper, Kafka, C*, HBase, etc. available for unit tests
• Real-time query processing capability
  > Akka HTTP server provides Spark as a Service

Trapezium architecture

[Figure: Trapezium reads data sources D1, D2 and D3, validates them (V1), passes them through various transactions, and writes outputs O1, O2 and O3]

Workflow

  hdfsFileBatch = {
    batchTime = 5
    batchInfo = [{
      name = "hdfs_source"
      dataDirectory = { prod = "/prod/data/files" }
    }]
  }
  transactions = [{
    transactionName = "com.verizon.bda.DataAggregator"
    inputData = [{ name = "hdfs_source" }]
    persistDataName = "aggregatedOutput"
  }, {
    transactionName = "com.verizon.bda.DataAligner"
    inputData = [{ name = "aggregatedOutput" }]
    persistDataName = "alignedOutput"
  }, {
    transactionName = "com.verizon.bda.AnomalyFinder"
    inputData = [{ name = "aggregatedOutput" }, { name = "alignedOutput" }]
    persistDataName = "anomalyOutput"
  }]

• Workflow is a collection of transactions in batch or streaming mode

• Each transaction can take multiple data sources as input

• Output of one transaction can be input to another transaction

• Output of each transaction could be persisted or kept only in memory

• Single place to handle exceptions and raise failure events
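Because each transaction names its inputs and its persisted output, the workflow above forms a dependency graph whose run order can be derived automatically. A sketch of that idea using Python's `graphlib`; this is an illustration of the concept, not Trapezium's own scheduler, and the dict form of the config is hypothetical.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# The workflow above, as a Python dict (hypothetical in-memory form).
workflow = {
    "sources": ["hdfs_source"],
    "transactions": [
        {"transactionName": "com.verizon.bda.DataAggregator",
         "inputData": ["hdfs_source"], "persistDataName": "aggregatedOutput"},
        {"transactionName": "com.verizon.bda.DataAligner",
         "inputData": ["aggregatedOutput"], "persistDataName": "alignedOutput"},
        {"transactionName": "com.verizon.bda.AnomalyFinder",
         "inputData": ["aggregatedOutput", "alignedOutput"], "persistDataName": "anomalyOutput"},
    ],
}


def execution_order(workflow):
    """Order transactions so every input is produced before it is consumed."""
    producers = {t["persistDataName"]: t["transactionName"] for t in workflow["transactions"]}
    graph = {}  # transaction -> set of transactions it depends on
    for t in workflow["transactions"]:
        deps = set()
        for name in t["inputData"]:
            if name in producers:
                deps.add(producers[name])
            elif name not in workflow["sources"]:
                raise ValueError(f"{t['transactionName']} reads unknown input {name!r}")
        graph[t["transactionName"]] = deps
    # static_order() raises graphlib.CycleError if dependencies form a cycle.
    return list(TopologicalSorter(graph).static_order())


for name in execution_order(workflow):
    print(name)
```

Validating inputs up front matches the "single place to handle exceptions" idea: a misspelled data name fails before any Spark job starts.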

Transaction Traits

Supported data sources
• Trapezium can read data from HDFS, Kafka, NFS and GFS
• Config entry for reading data from HDFS/NFS/GFS:

  dataSource = "HDFS"
  dataDirectory = {
    local = "/local/data/files"
    dev = "/dev/data/files"
    prod = "/prod/data/files"
  }

• Config entry for defining the protocol (one of):

  fileSystemPrefix = "hdfs://"
  fileSystemPrefix = "file://"
  fileSystemPrefix = "s3://"

• Trapezium can read data in various formats, including text, gzip, JSON, Avro and Parquet
• Config entry for reading from Kafka topics:

  kafkaTopicInfo = {
    consumerGroup = "KafkaStreamGroup"
    maxRatePerPartition = 970
    batchTime = "5"
    streamsInfo = [{
      name = "queries"
      topicName = "deviceanalyzer"
    }]
  }

• Config entry for the file format (one of):

  fileFormat = "avro"
  fileFormat = "json"
  fileFormat = "parquet"
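Two pieces of arithmetic follow from these entries. The input path is the protocol prefix joined to the environment's directory, and the Kafka rate caps how many records one micro-batch can pull. The sketch assumes `maxRatePerPartition` means records per second per Kafka partition, as Spark's `spark.streaming.kafka.maxRatePerPartition` does. The helper names and the 12-partition topic are hypothetical.

```python
def resolve_input_path(config: dict, env: str) -> str:
    """Combine fileSystemPrefix with the dataDirectory entry for one environment."""
    directories = config["dataDirectory"]
    if env not in directories:
        raise KeyError(f"no dataDirectory entry for environment {env!r}")
    return config["fileSystemPrefix"] + directories[env]


def max_records_per_batch(kafka_info: dict, num_partitions: int) -> int:
    """Upper bound on records per micro-batch, assuming maxRatePerPartition is
    records per second per Kafka partition."""
    batch_seconds = int(kafka_info["batchTime"])  # quoted as a string in the config
    return kafka_info["maxRatePerPartition"] * batch_seconds * num_partitions


hdfs_config = {
    "fileSystemPrefix": "hdfs://",
    "dataDirectory": {"local": "/local/data/files", "dev": "/dev/data/files", "prod": "/prod/data/files"},
}
kafka_topic_info = {"consumerGroup": "KafkaStreamGroup", "maxRatePerPartition": 970, "batchTime": "5"}

print(resolve_input_path(hdfs_config, "prod"))                      # hdfs:///prod/data/files
print(max_records_per_batch(kafka_topic_info, num_partitions=12))  # 58200
```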

Run modes
• Trapezium supports reading data in batch as well as streaming mode
• Config entry for reading in batch mode: runMode="BATCH" batchTime=5
• Config entry for reading in stream mode: runMode="STREAM" batchTime=5
• Read data by timestamp: offset=2
• Process historical data as a sequence of smaller data sets: fileSplit=true
• Process the same data multiple times: oneTime=true
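The slide does not document how `fileSplit` chunks the history. One plausible reading, sketched here with hypothetical names, is to order files by modification time and hand them to the batch workflow a few at a time, oldest first. That way a large backlog, such as a 90-day hot start, does not have to fit into a single Spark job.

```python
from dataclasses import dataclass
from typing import Iterator, List


@dataclass(frozen=True)
class DataFile:
    path: str
    modified: int  # modification time, epoch seconds


def split_history(files: List[DataFile], max_files: int) -> Iterator[List[DataFile]]:
    """Yield historical files oldest-first, at most max_files per chunk."""
    if max_files < 1:
        raise ValueError("max_files must be at least 1")
    ordered = sorted(files, key=lambda f: f.modified)
    for start in range(0, len(ordered), max_files):
        yield ordered[start:start + max_files]


backlog = [DataFile(f"/prod/data/files/part-{t}", t) for t in (300, 100, 500, 200, 400)]
for chunk in split_history(backlog, max_files=2):
    # In a real job each chunk would become one batch run of the workflow.
    print([f.path for f in chunk])
```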

Data validation
• Validates data at the source
• Filters out all invalid rows
• Validates the schema of the input data
• Config entry for data validation:

  validation = {
    columns = ["name", "age", "birthday", "location"]
    datatypes = ["String", "Int", "Timestamp", "String"]
    dateFormat = "yyyy-MM-dd HH:mm:ss"
    delimiter = "|"
    minimumColumn = 4
    rules = {
      name = [maxLength(30), minLength(1)]
      age = [maxValue(100), minValue(1)]
    }
  }
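The config implies a simple row filter. The sketch below is one reading of its semantics, not Trapezium's implementation. It splits each row on the delimiter, rejects rows with fewer than `minimumColumn` fields, parses each column by its declared type, and applies the per-column rules. The Java-style `dateFormat` is rewritten in Python's `strptime` notation, and the rule encoding as tuples is hypothetical.

```python
from datetime import datetime

VALIDATION = {
    "columns": ["name", "age", "birthday", "location"],
    "datatypes": ["String", "Int", "Timestamp", "String"],
    "dateFormat": "%Y-%m-%d %H:%M:%S",  # Python spelling of yyyy-MM-dd HH:mm:ss
    "delimiter": "|",
    "minimumColumn": 4,
    "rules": {
        "name": [("maxLength", 30), ("minLength", 1)],
        "age": [("maxValue", 100), ("minValue", 1)],
    },
}

CHECKS = {
    "maxLength": lambda value, limit: len(value) <= limit,
    "minLength": lambda value, limit: len(value) >= limit,
    "maxValue": lambda value, limit: value <= limit,
    "minValue": lambda value, limit: value >= limit,
}


def parse(raw, datatype, date_format):
    """Convert a raw string field to its declared type (raises ValueError)."""
    if datatype == "Int":
        return int(raw)
    if datatype == "Timestamp":
        return datetime.strptime(raw, date_format)
    return raw


def validate_row(line, cfg=VALIDATION):
    """Return a dict of typed fields, or None if the row is invalid."""
    fields = line.split(cfg["delimiter"])
    if len(fields) < cfg["minimumColumn"]:
        return None
    record = {}
    for name, datatype, raw in zip(cfg["columns"], cfg["datatypes"], fields):
        try:
            value = parse(raw, datatype, cfg["dateFormat"])
        except ValueError:
            return None  # schema violation: wrong type or date format
        for rule, limit in cfg["rules"].get(name, []):
            if not CHECKS[rule](value, limit):
                return None
        record[name] = value
    return record


def filter_valid(lines, cfg=VALIDATION):
    """Keep only rows that pass validation."""
    return [r for r in (validate_row(line, cfg) for line in lines) if r is not None]


rows = [
    "alice|34|1982-03-01 08:00:00|NYC",
    "bob|150|1980-01-01 00:00:00|LA",     # age above maxValue
    "carol|abc|1990-05-05 10:00:00|SF",   # age is not an Int
    "|20|1995-01-01 00:00:00|DC",         # empty name fails minLength
    "dave|40|1976-07-04 12:00:00",        # fewer than minimumColumn fields
]
print(filter_valid(rows))
```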

Plug-and-play capability
• Any transaction can be added or removed by modifying the workflow config file

• Output from multiple algorithms can be compared in real time

• Multiple features can be evaluated in different transactions

• Data sources can be switched with config change

• Model training can be done on different time windows to achieve best results

Trapezium: GitHub URL

https://github.com/Verizon/trapezium

Version: 1.0.0-SNAPSHOT
Release: 14-Oct-2016

Results

SNMP: Spark runtime with Hive/C* read/write

Data volume: 10 routers, 2.2 MB per 5 min, 650 MB per day

Compute: 10 executors, 4 cores

Memory: 16 GB per executor, 4 GB driver

With a sampling rate of 2 min:
• 2 nodes with 20 cores each for 10 routers
• 200 nodes for 1000 routers

With a sampling rate of 4 min:
• 2 nodes can process 20 routers
• 100 nodes for 1000 routers
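The node counts above follow from scaling linearly with the number of routers. That arithmetic is sketched below, together with a check that 2.2 MB per 5-minute poll is consistent with roughly 650 MB per day. Linear scaling is the slide's own extrapolation, not a measurement at 1000 routers; the helper name is illustrative.

```python
import math


def nodes_needed(routers: int, routers_per_two_nodes: int) -> int:
    """Linear extrapolation: every `routers_per_two_nodes` routers need 2 nodes."""
    return 2 * math.ceil(routers / routers_per_two_nodes)


print(nodes_needed(1000, routers_per_two_nodes=10))  # 2-min sampling -> 200
print(nodes_needed(1000, routers_per_two_nodes=20))  # 4-min sampling -> 100

polls_per_day = 24 * 60 // 5            # 288 five-minute polls
print(round(2.2 * polls_per_day, 1))     # 633.6 MB per day, i.e. about 650 MB
```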

SNMP: Spark shuffle read/write

Data volume: 10 routers, 2.2 MB per 5 min, 650 MB per day

Compute: 10 executors, 4 cores

Memory: 16 GB per executor, 4 GB driver

NetFlow: Spark + C* read/write runtime

Data volume: 2 routers, 50 MB per min, 70 GB per day

Compute: 10 executors, 4 cores

Memory: 16 GB per executor, 4 GB driver

• Due to the parametric model, run time is better than for SNMP
• NetFlow data is X times more than SNMP data

Run time by number of routers:
Routers        2    4    8    16    32
Run time (s)   16   18   32   47    94.8
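A straight-line fit to the five points gives a rough per-router cost. This is only a coarse reading of five measurements, but the 32-router run (94.8 s) is close to twice the 16-router run (47 s), which suggests near-linear scaling at the high end. A plain least-squares sketch using the table values:

```python
routers = [2, 4, 8, 16, 32]
runtime_s = [16, 18, 32, 47, 94.8]


def least_squares(xs, ys):
    """Ordinary least-squares line y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


slope, intercept = least_squares(routers, runtime_s)
print(f"~{slope:.2f} s per additional router, ~{intercept:.1f} s fixed overhead")
```

With these numbers the fit comes out at roughly 2.6 s per additional router on top of about 9 s of fixed overhead.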

NetFlow: Spark + C* shuffle write

Shuffle write (MB) by number of routers:
Routers     2      4      8      16     32
Spark       71.2   150.5  275.7  612.1  1261.4
Cassandra   30.2   64.4   115.6  263.7  545.1

[Chart: shuffle write (MB) vs. number of routers for Spark and Cassandra]

Summary
• Reuse code across multiple applications
• Improve developer efficiency
• Encourage standard coding practices
• Provide a unit-test framework for better code coverage
• Decouple ETL, analytics and algorithms into different Transactions
• Distribute query processing using Spark as a Service
• Easy integration provided by a configuration-driven architecture

Thank you