PCAP Graphs for Cybersecurity and System Tuning


© Cloudera, Inc. All rights reserved.

Mirko Kämpf, Solutions Architect

@semanpix, mirko@cloudera.com

Network Traffic Analysis of Hadoop Clusters

Understand the common usage patterns and identify typical / atypical workloads

Marton Balassi, Solutions Architect

@MartonBalassi, mbalassi@cloudera.com


Outline

• Motivation
• PCAP data capture
• Data Analysis with CDH
• Data Analysis with Gephi
• Summary


Understand the network load of a Hadoop cluster

• Network communication is often the limiting factor in distributed computing

• Storing files on DFS, heartbeats, data processing all have a footprint

• The current standard visual tools aggregate data on the host level

• Intrusion detection is critical in enterprise systems (Apache Spot)


PCAP Data Capture

• Packet capture (PCAP): the standard API for capturing network traffic
• Implementations: libpcap for UNIX, WinPcap for Windows
• Multiple analysis tools: tcpdump, nmap, Wireshark, and Snort, among others

Our approach:
• Used pcapy, the Python pcap extension, for capturing
• The capturing is initiated on the individual machines
• The captured data is written to the local fs in Avro format while the capturing is active
• Focus on network structure; packet data is ignored

Avro schema for PCAP data
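The Avro schema itself is not reproduced in this transcript. As a minimal sketch of the "structure only, no packet data" idea, the following Python function reduces a raw Ethernet frame to a flat record and drops the payload. The function name and the field names (`ts`, `src_ip`, `src_port`, and so on) are assumptions for illustration, not the schema used in the talk, and the sketch only handles untagged Ethernet frames carrying IPv4 with TCP or UDP.

```python
import socket
import struct

def frame_to_record(ts, frame):
    """Reduce a raw Ethernet frame to a flat record; return None if unsupported.

    Field names are hypothetical, not the schema from the talk.
    """
    if len(frame) < 34:                       # 14 B Ethernet + 20 B minimal IPv4
        return None
    ethertype = struct.unpack("!H", frame[12:14])[0]
    if ethertype != 0x0800:                   # IPv4 only in this sketch
        return None
    ihl = (frame[14] & 0x0F) * 4              # IPv4 header length in bytes
    proto = frame[23]                         # 6 = TCP, 17 = UDP
    if proto not in (6, 17):
        return None
    l4 = 14 + ihl                             # offset of the TCP/UDP header
    if len(frame) < l4 + 4:
        return None
    src_port, dst_port = struct.unpack("!HH", frame[l4:l4 + 4])
    return {
        "ts": ts,
        "src_ip": socket.inet_ntoa(frame[26:30]),
        "dst_ip": socket.inet_ntoa(frame[30:34]),
        "src_port": src_port,
        "dst_port": dst_port,
        "proto": proto,
        "length": len(frame),                 # size only; payload is not stored
    }
```

The raw bytes that the capture library delivers for each packet would be passed in as `frame`; records of this shape could then be appended to an Avro file while the capture is running.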


Apache Spot

• Formerly known as ONI
• Initiative of Cloudera, Intel, and partners
• Focus on cybersecurity for the Hadoop domain
• Common data formats for advanced analytics
• Reliable and robust data (ingestion) pipelines
• Repeatable and reliable analysis and modeling procedures
• Apache Spot uses a topic-model (LDA) approach to classify traffic

• In this talk, we focus on clustering and visualization of typical workloads instead.


Our Activities

We implemented multiple ‘typical workloads’ and observed their behavior.

• Create reference data sets (PCAP data):
• Scenario A: TeraSort (big batch workload)
• Scenario B: HDFS PUT, GET; HUE (interactive workload)
• Scenario C: Idle cluster (vacation time)
• Scenario D: Kafka => Spark => HDFS (realistic production workload)
• Scenario E: Twitter => Spark => HDFS (realistic production workload)


How it Works ...

• We collect raw data in Avro format, using the Snaffer (pcapy) script.
• We transform the events to networks, using Hive (SQL API on Hadoop).
• We analyze and visualize the networks using Gephi (open graph viz platform).
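The "events to networks" step amounts to grouping packet events by source and target and counting them. The talk does this in Hive; the following is a minimal Python sketch of the same aggregation at host level, assuming each event is a dict with hypothetical `src_ip` and `dst_ip` keys. The output uses `Source`, `Target`, and `Weight` columns, which Gephi's spreadsheet import reads as an edge list.

```python
import csv
from collections import Counter

def events_to_edges(events):
    """Aggregate packet events into weighted, directed host-to-host edges.

    Mirrors a SQL GROUP BY on (source, target) with COUNT(*) as the weight.
    """
    weights = Counter()
    for e in events:
        weights[(e["src_ip"], e["dst_ip"])] += 1
    return weights

def write_edge_csv(weights, out):
    """Write an edge list with Source/Target/Weight columns to a file object."""
    writer = csv.writer(out)
    writer.writerow(["Source", "Target", "Weight"])
    for (source, target), weight in sorted(weights.items()):
        writer.writerow([source, target, weight])
```

Counting packets is only one possible weight; summing packet lengths would work the same way if byte volume matters more than packet count.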


Initial Results: TeraSort


Initial Results: Twitter Collect


Let’s have a look inside ...

• Use a higher resolution: include ports in addition to hosts
• Use time-dependent analysis: track timestamps per packet
• Combine time series analysis and graph analysis: use Gephi and Apache Spark
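The higher-resolution, time-dependent view can be sketched as a packet count per channel and time bin, where a channel is a directed pair of host:port endpoints. This is a minimal Python sketch, assuming each event is a dict with hypothetical `ts`, `src_ip`, `src_port`, `dst_ip`, and `dst_port` keys; the bin width of 10 seconds is an arbitrary choice, not a value from the talk.

```python
from collections import defaultdict

def packets_per_bin(events, bin_seconds=10):
    """Count packets per (channel, time bin).

    A channel is the directed pair of host:port endpoints.
    Returns {channel: {bin_index: packet_count}}.
    """
    series = defaultdict(lambda: defaultdict(int))
    for e in events:
        channel = (
            f'{e["src_ip"]}:{e["src_port"]}',
            f'{e["dst_ip"]}:{e["dst_port"]}',
        )
        series[channel][int(e["ts"] // bin_seconds)] += 1
    return {channel: dict(bins) for channel, bins in series.items()}
```

Each channel then yields a small time series, while the set of channels active in one bin forms the graph for that time slice.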


???

All ports on all hosts, used during an experiment …


Hosts & Ports in a 5 Node Hadoop Cluster

Static network:
• 1,535 nodes
• 2,997 edges

Network clusters represent communication ports on individual hosts (bigger nodes in the center of the star) forming a Hadoop cluster.

This static view shows all potential communication endpoints; no activity yet.
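One way to build such a star-shaped static view is to link every observed host:port endpoint to its host. The following Python sketch assumes each event is a dict with hypothetical `src_ip`, `src_port`, `dst_ip`, and `dst_port` keys; it is not the authors' construction and is not expected to reproduce the node and edge counts on the slide.

```python
def static_network(events):
    """Return (nodes, edges) linking every observed endpoint to its host.

    Nodes are host addresses and host:port endpoints; each edge is an
    (endpoint, host) membership link and carries no traffic weight.
    """
    nodes, edges = set(), set()
    for e in events:
        for ip, port in ((e["src_ip"], e["src_port"]), (e["dst_ip"], e["dst_port"])):
            endpoint = f"{ip}:{port}"
            nodes.update((ip, endpoint))
            edges.add((endpoint, ip))
    return nodes, edges
```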


Weighted Communication Links during TeraSort

Communication network:
• 1,535 nodes
• 34,351 edges

Communication links represent real communication between ports on individual hosts in a Hadoop cluster.

This dynamic view shows all real communication endpoints and allows a topological analysis.


PageRank & Eigenvector Centrality

Topological Properties

Node sizes represent PageRank of a node based on Communication links.

Node colors still reflect the host on which the communication endpoints are active.

Node sizes represent Eigenvector Centrality based on Communication links.

Node colors still reflect the host on which the communication endpoints are active.
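Gephi computes these measures itself; the sketch below only illustrates what the PageRank-based node sizes encode. It is a plain power-iteration PageRank over weighted, directed communication links in Python. The function name, the damping factor of 0.85, and the iteration count are assumptions, and the input is assumed to map `(source, target)` pairs to positive packet counts.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Weighted PageRank by power iteration.

    edges maps (source, target) -> positive weight. Nodes without
    outgoing links spread their rank uniformly over all nodes.
    """
    nodes = sorted({n for edge in edges for n in edge})
    out_weight = {n: 0.0 for n in nodes}
    for (source, _), w in edges.items():
        out_weight[source] += w
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        dangling = sum(rank[n] for n in nodes if out_weight[n] == 0)
        base = (1 - damping + damping * dangling) / len(nodes)
        new_rank = {n: base for n in nodes}
        for (source, target), w in edges.items():
            # Each node passes on its rank in proportion to link weight.
            new_rank[target] += damping * rank[source] * w / out_weight[source]
        rank = new_rank
    return rank
```

An endpoint that receives traffic from many other well-connected endpoints ends up with a high rank, which is why central services stand out in this view.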


Which are the most central nodes?

NameNode

ResourceManager(internal) t.

Interesting ports:

23000, 23001

7051, 8022

Most active Server:

172.28.209.73


Time Evolution of Dynamic Communication Processes

Host centric: Host = Server

Cluster centric: Cluster = Functional Layer


Re-organization: Segregation by Components

• Communication components are distributed across servers.
• Server-centric analysis doesn’t help.
• Communication layers can be interdependent.
• Dependencies are not visible in the event data set.

• Our approach:
• (1) Reconstruct the communication structure.
• (2) Segregate the communication activity by component / subsystem.
• (3) Finally, reconstruct the functional network of interacting components.

• This allows a dependency analysis for components, and hopefully also system tuning.
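The slides do not say how endpoints are assigned to components. A minimal Python sketch, assuming components can be identified by well-known service ports, could look as follows. The lookup table is hypothetical and would have to match the actual cluster configuration; events are assumed to be dicts with `src_port` and `dst_port` keys.

```python
from collections import Counter

# Hypothetical port-to-component table; adjust to the cluster's configuration.
PORT_TO_COMPONENT = {
    8020: "NameNode",
    8032: "ResourceManager",
    50010: "DataNode",
}

def component_of(port):
    """Map a port to a component name; unknown ports fall into 'other'."""
    return PORT_TO_COMPONENT.get(port, "other")

def component_network(events):
    """Aggregate packet events into weighted links between components."""
    links = Counter()
    for e in events:
        links[(component_of(e["src_port"]), component_of(e["dst_port"]))] += 1
    return links
```

The result is a small weighted graph whose nodes are components rather than servers, which is the level at which dependencies between subsystems become visible.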


!!! WARNING !!!

Absolute values can be misleading.

Component Centric View

• port <=> host links removed

• Temporal networks lead to dynamic clusters


Central vs. External Components

Impact of the Selected Layout Algorithm


Two Experiments: TeraSort & Twitter Collect

(Figure; axis label: Number of packets)


5 Selected Channels during TeraSort

(Figure; axis label: Number of packets. Channel labels: NameNode, NodeManager, YARN App Containers (three channels). Annotations: Job A, Job B, Job C)

Replication factor: Job A: 3, Job B: 1, Job C: 5


5 Selected Channels during Twitter Collect

(Figure; axis label: Number of packets. Annotations: Active, Idle)


Observations

• The two experiments differ fundamentally:
  • Only one active component vs. multiple competing communication channels.

• Common observations:
  • Background activity of an idle cluster shows periodic spikes (no surprise).
  • Fluctuation levels differ between channels.
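Both common observations can be quantified from a binned packet-count series. The following Python sketch is one possible approach, not the analysis used in the talk: it measures the fluctuation level as the coefficient of variation and estimates the spike period as the lag with the highest autocorrelation. Both function names are assumptions, and `counts` is expected to be a non-empty list of packet counts per time bin.

```python
from statistics import mean, pstdev

def fluctuation_level(counts):
    """Coefficient of variation: standard deviation relative to the mean."""
    m = mean(counts)
    return pstdev(counts) / m if m else 0.0

def dominant_period(counts, max_lag=None):
    """Lag (in bins) with the highest positive autocorrelation.

    Returns None for a flat series or when no lag correlates positively.
    """
    n = len(counts)
    m = mean(counts)
    centered = [c - m for c in counts]
    variance = sum(x * x for x in centered)
    if variance == 0:
        return None
    max_lag = max_lag or n // 2
    best_lag, best_score = None, 0.0
    for lag in range(1, max_lag + 1):
        score = sum(centered[i] * centered[i + lag] for i in range(n - lag)) / variance
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag
```

Because the coefficient of variation is relative to the mean, it allows channels with very different absolute packet counts to be compared.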


What’s next?

More experiments & data collection:
• Ideal scenarios
• Realistic workloads

Helpful visualization:
• Provide a real-time view of ongoing network activity using the Gephi streaming plugin (as shown in the Twitter Streaming demo).

Better analysis:
• Classify the components automatically …
  • Requires studying activity time series, e.g., using neural networks or non-linear statistics.
• Understand the component structure and behavior over time …
  • Allows us to find anomalies in the component structure and behavioral patterns.


Big Thanks To

Clouderans supporting the project ...

Alexander Bartfeld

Anton Vukovic

Rafael Arana

Zoltan Kiss

Nehme Tohme


Thank you

@semanpix, mirko@cloudera.com

@MartonBalassi, mbalassi@cloudera.com
