Security’s in your DNA: Genomics for InfoSec Rob Bird @conduit242


DESCRIPTION

Releases the blar.py tool which creates a genomic encoding from text files. This encoding results in a lossy, highly compressible representation of the original file that can be used for rapid anomaly detection and forensic analysis.


Page 1: Security's in your DNA: Genomics for InfoSec

Security’s in your DNA: Genomics for InfoSec

Rob Bird @conduit242

Page 2: Security's in your DNA: Genomics for InfoSec

What is the most efficient way to analyze a sequence of events?

Page 3: Security's in your DNA: Genomics for InfoSec

What’s a genome?

• The genetic material of an organism
• A redundant encoding of instructions
• A big sequence of letters

Page 4: Security's in your DNA: Genomics for InfoSec

HIVtggatgggttaatttactccaagcaaagacaagatatccttgatctgtgggtctaccacacacaaggctacttccctgattggcagaattacacaccagggccaggagtcagatacccactaacatttggatggtgcttcaagctagtaccagttgatccagatgaagtagagaaggatactgagggagagaacaacagcctattacaccctatatgccaacatggaatggatgatgaggagaaagaagtattaaggtggaaatttgacagccgcctggcactaaaacacagagcccaagagatgcatccggagttctacaaagactgctgacacagaagttgctgacagggactttccgctgggactttccaggggaggtgtggtttgggcggagttggggagtggccaaccctcagatgctgcatataagcagctgcttttcgcttgtactgggtctctctaggtagaccagatccgagcctgggagctctctggctatctggggaacccactgcttaagcctcaataaagcttgccttgagtgctctaagtagtgtgtgcccgtctgttgtgtgactctggtaactagagatccctcagaccactctagactgagtaaaaatctctagcagtggcgcccgaacagggactcgaaagcgaaagtaagaccagagaagttctctcgacgcaggactcggcttgctgaggtgcacacagcaagaggcgagagcggcgactggtgagtacgccaatttttgactagcggaggctagaaggagagagatgggtgcgagagcgtcagtattaagcgggggaaaattagatgcatgggagagaattcggttaaggccagggggaaagaaaaaatatagaatgaaacatctagtatgggcaagcagggagctggaaagatttgcacttaaccctggcctgttagaaacaacagaaggatgtcaacaaataatagaacagttacaaccagctctcaagacaggaacagaagaacttagatcattatttaatacagtagtaaccctctattgtgtacatcaacggatagaggtaaaagacaccaaggaagctctagataaaatagaggaaatacaaaataagagcaagcaaaagacacaacaggcagcagctgccacaggaaacagcagcaatgtcagccaaaattaccctatagtgcaaaatgcacaagggcaaatggtacaccaggctgtatcacctaggacattgaatgcatgggtgaaggtaatagaagaaaaggctttcagcccagaagtaatacccatgttctcagcattgtcagaaggagccaccccacaagatttaaatatgatgctaaacatagtggggggacaccaggcagctatgcagatgttgaaagataccatcaatgaggaagctgcagaatgggacaggttacatccagtacaggcagggcctattccaccaggccaattgagagaaccaaggggaagtgacatagcaggaactactagtacccctcaagaacaaataggatggatgacaggcaacccacctattccagtgggagacatctataaaagatggataatcctgggattaaataaaatagtaagaatgtatagccctgttagcattttggacgtaaaacaagggccaaaagaacccttcagagactatgtagataggttctttaaaattctcagagctgagcaagctacacaggaggtaaaaggttggatgacagaaaccttgctggtccaaaatgcaaatccagattgtaagtccattttaagagcactaggaacaggagctacattagaagaaatgatgacagcatgccagggagtgggaggacccggccataaagcaagggttttggctgaggcaatgagtcaagtacaacatacaaacataatgatgcagagaggcaattttaggggtcagagaaggatgattaaatgtttcaattgtggcaaagaaggacacctagccagaaattgc
agagcccctaggaaaaagggctgttggaaatgtgggaaagagggacaccaaatgaaggactgcactgaaagacaggctaattttttagggaaaatttggccttccagcaaggggaggccaggaaactttccccagagcaggccagagccaacagccccaccagcagagctctttgggatggaggaagaaaaaacctccgctctgaagcaggagcagaaggacaggaaacaggacccacctttagtttccctcaaatcactctttggcaacgaccccttgtcacagtaaaagtagggggacagctaaaagaagctctattagatacaggagcagatgacacagtattagaagatataaatttgccaggaaaatggaaaccaagaatgatagggggaattggaggttttatcaaagtaaaacagtatgatcagatacttatagaaatttgtggaaaaaaggctataggtacagtattagtaggacccacacctgtcaacataattggaaggaatatgttgacccagattggatgtactttaaatttcccaattagtcctattgagactgtgccagtaaaattaaagccaggaatggatggcccaaaggttaaacaatggccattgacagaagaaaaaataaaagcattaacagaaatttgtacagatatggaaaaggaaggaaaaatttcaagaattgggcctgaaaatccatacaatactccaatatttgctataaagaaaaaagacagcactaaatggaggaaactagtagatttcagagagctcaataaaagaacacaagacttttgggaagttcaattgggaataccgcatccagcgggcctaaaaaagaaaaaatcagtaacagtactagatgtgggggacgcatatttttcagttcctttagatgaaagctttagaaagtatactgcgttcaccatacctagtacaaataatgagacaccaggaatcaggtatcaatacaatgtgctgccacagggatggaaaggatcaccggcaatattccagagtagcatgacaaaaatcttagagccctatagatcaaagaatccagaaataattatctatcaatacatggatgacttgtatgtaggatctgatttagaaatagggcagcatagaacaaaaatagaggagttgagagctcatctattgagctggggatttactacaccagacaaaaagcatcaaaaagaacctccatttctttggatggggtatgaactccatcctgacaaatggacagtacagcctatacaactgccagaaaaggatagctggactgtcaatgatatacagaagttggtggggaaactgaattgggcaagtcaaatttatgcagggattaaagtaaagcaactgtgcaaactcctcaggggagccaaagcactaacagaggtagtaactctgactgaggaagcagaattagaattggcagagaacagggaaattctaaaagaccctgtgcatggagtatattatgacccatcaaaagaattaatagcagaaatacagaaacaagggcaagaccaatggacatatcaaatttatcaagagccatttaaaaatctaaaaacaggaaaatatgcaagaaaaaggtctgctcacactaatgatgtaaagcaattagcagaagtggtgcaaaaggtggtcatggagagcatagtaatatggggaaagactcctaaatttaaactacccatacaaaaagagacatgggaaacatggtggatggactattggcaggctacctggattcctgaatgggagtttgtcaatacccctcccctagtaaaattgtggtaccagttagagaaagaccctatagcaggagcagaaactttctatgtagatggggcagccaatagggagactaagctaggaaaagcagggtatgtaactgacagaggaagacaaaaggttgtttccctaactgagacaacaaatcaaaagactgaactacatgcaatccatctagcc
ttacaggattcaggatcagaagtaaacatagtaacggactcacagtatgcattaggaatcattcaggcacaaccagacaggagtgaatcagaattagtcaatctaataatagaggagctaatagaaaaggacaaggtctacctgtcatgggtaccagcacacaaaggaattggaggaaatgaacaagtagataaattagtcagttccggaattaggaaggtgctgtttttagatgggatagataaagctcaagaagaacatgaaagatatcacagcaattggaaagcaatggctagtgattttaatctgccacctatagtagcaaaggaaatagtagccagctgtgataaatgccaactaaaaggagaagccatgcatggacaggtagactgtagtccaggaatatggcaattagattgcacacatctagaaggaaaagtaatcctggtagcagtccatgtagccagtggttatatagaagcagaagttatcccagcagaaacaggacaagagacagcatactttctactaaaattagcaggaagatggccagtaaaagtagtacacacagacaatggaggcaatttcaccagtgctgcagttaaagcagcctgttggtgggcaaatatccaacaggaatttgggattccctacaatccccaaagtcaaggagtagtggaatctatgaataaagaattaaagaaaatcatagggcaggtaagagatcaagctgaacatcttaagacagcagtacaaatggcagtattcattcacaattttaaaagaaaaggggggattggggggtacagtgcaggggaaaggataatagacataatagcaacagacatgcaaactaaagaattacaaaaacaaattacaaaaattcaaaattttcgggtttattacagggacagcagagatccaatttggaaaggaccagcaaaactactctggaaaggtgaaggggcagtagtaatacaggacaatagtgatatcaaggtagtaccaagaagaaaagcaaagatcattagggattatggaaaacagatggcaggtgatgattgtgtggcaggtagacaggatgaggattagaacatggaacagtttagtaaaatatcatatgtatgtctcaaagaaagctcgaaagtggctctatagacatcactatgatagcaggcatccaaaagtaagttcagaagtacacatcccactaggggatgctagattagtagtaagaacatattggggtctgcatacaggagaaaaagactggcaattgggtcacggggtctccatagaatggaggctaagaagatatagcacacaaatagatcctgacctagcagaccaactaattcatctgcattattttgactgtttttcagaatctgccataaggagagccatattaggacaagtagttagccctaggtgtgtatatccaacaggacataaccaggtaggatccctacaatatctagcactgaaggcattagtaacaccaataaagacaagaccacctttgcctagtgttaagatattaacagaggatagatggaacaagccccagaagaccaggggccacagagggaaccatacaatgaatggatgttagaactgttagaagatcttaaacatgaagcagttagacactttcctagaccatgggctaggacaacatatatataacacctatggggatacttgggaaggagtcgaagctatagtaagaattttgcaacaactactgtttgttcatttcagaattgggtgccaacatagcagaataggcattattcaagggagaagagtcagaaatggagccggtagatcctaacttagagccctggaaccatccgggaagtcagcctacaactgcttgtaccaagtgttactgtaaaaagtgttgctatcattgcctagtttgctttctgaacaaaggcttaggcatctcctatggcaggaagaagcggagcaagcgacgacgaactcctcacagc
agtaaggatcatcaaaatcctataccaaagcagtaagtatcagtaattagtatatgtaatgagtcctttagaaatctgtgcaatagtaggattgatagtagcgctaatcatagcaatagttgtgtggactatagtaggtatagaatataagagattgttaaagcaaaggaaaatagacaggttaattaagaaaatacgagaaagagcagaagacagtggcaatgagagtgatggggacatggatgaattggcaaaacttgtggagagggggaactatgatcttggggatgttaatgatctgtagtactgcagaaaacttgtgggttactgtctactatggggtacctgtgtggaaagatgcagaaaccaccttattttgtgcatcagatgctaaagcatacgacacagaggcgcataatgtctgggctacacatgcctgtgtacccacagaccccaacccacaagaaatatatttggaaaatgtgacagaagagtttaacatgtggaaaaataacatggtagagcagatgcatacagatataatcagtctatgggatcaaagcctaaagccatgtgtacagttaacccctctctgcgttactttaaattgtaataacatcaccatcaataacatcaccaccaacatcactgaggacatgagaggagaaataaaaaactgctcgtacaatatgaccacagtattaagggataagagaaggaaagtgtattcacttttttatagacttgatatagtaccacttgatgaggggaataataactctgctgggagtagtgactatagattaataaattgtaatacctcaaccataacacaagcctgtccaaaggtctcttttgacccaattcctatacattattgtgctccagctggttttgcgattctaaaatgtaaggatccagatttcaatggaacagggccatgcaagaatgtcagcacagtacaatgcacacatggaatcaagccagtagtatcaactcaactgctgttaaatggcagtctagcagaaggaaaggtaagaattagatctgaaaatattacaaacaatgccaaaaacataatagtacaacttgtcaagcctgtaaaaattaattgtgtcagacctaacaacaatacaagaacaagtgtacgtataggaccaggacaaacattctatgcaacaggtgaaataataggggatataagacaagcattttgtactgtcaatgaatcagaatggaatgaaactttacaacaggtagctacgcaattaagagaacactttgagaacaaaacaataaaatttactaactcctcaggaggggatttagaaattacaacacatagctttaattgtggaggagaatttttctattgtaatacatcaggcctgtttaatagcacctggaataataataataccagggagaagataaatggtacagagtcaaatagcactataactctccattgcagaataaagcaaattataaataggtggcaggaagtaggacaagcaatgtatgcccctcccatcccaggagtaataaattgtagatcaaacattacaggactaatattaacaagagatggtggggatggggataacaatacggaaatcttcagacctggaggaggaaatatgaaggacaattggagaagtgaattatataagtataaagtagtaaaaattgaaccactgggagtagcacccaccagggctaagagaagagtggtggagagagcaaaaagagcagttggaataggagctgttttccttgggttcttaggagcagcaggaagcactatgggcgcggcgtcaataacgctgacggtacaggccagacaattattgtctggcatagtgcaacagcaaagcaatttgctgagggctatagaggctcaacaacatctgttgaaactcacggtctggggcattaaacagctccaggcaagagtccttgctgtggaaagatacctgcaggatcaacagctcctagga
atttggggctgctctggaaaactcatctgcaccactaatgtgccctggaactctagttggagtaataaatctcagagtgagatatgggagaacatgacctggctgcaatgggataaagaaattagcagttacacaggcataatatataaactaattgaagaatcgcagaaccagcaggaaaagaatgaacaagacttattggcattggacaagtgggcaagtctatggaattggtttgaaatatcaaagtggctgtggtatataaaaatatttataatgatagtaggaggattaataggattaagaatagtttttgctgtgctttctataatcaatagagttaggcagggatactcacctttgtcatttcagacccacaccccaaacccaagggaacccgacaggcccgaaagaatcgaagaagaaggtggagagcaaggcagagacagatcgatacgcttagtgagcggattcttagcacttgcctgggacgacctacggagcctgtgccttttcagctaccaccgcttgagagacttcatcttgattgcagcgaggactgtggaacttctgggacacagcagtctcaaggggttgagactggggtgggaaagcctcaagtatctggggaatcttctgctatattggagtcaggaactaaaaattagtgctgttaatttagttgataccatagcaatagcagtagctggctggacagataggattatagaaacaggacaaagattttgtagagctcttctcaacgtacctagaagaatcagacaaggatttgaaagggctctgctataacatgggtggcaagtggtcaaaaagtagcatagtgggatggcctgagattagggaaagaatgaggcgtgctcctccagcagcaaaaggagtaggagcagtatctcaagatttagataaatttggagcagttacaagcagtaatatgaatcaccctagttgcgtctggctggaagcacaagaggaaacggaggtaggctttccagtcaggccacaagtacctctaaggccaatgacttacaagggagcagtggatctcagccattttttaaaagaaaaggggggactggaagggttaatttactccaagcaaagacaagatatccttgatctgtgggtctaccacacacaaggctacttccctgattggcagaattacacaccagggccaggagtcagatacccactaacatttggatggtgcttcaagctagtaccagttgatccagatgaagtagagaaggatactgagggagagaacaacagcctattacaccctatatgccaacatggaatggatgatgaggagaaagaagtattaaggtggaaatttgacagccgcctggcactaaaacacagagcccaagagatgcatccggagttctacaaagactgctgacacagaagttgctgacagggactttccgctgggactttccaggggaggtgtggtttgggcggagttggggagtggccaaccctcagatgctgcatataagcagctgcttttcgcttgtactgggtctctctaggtagaccagatccgagcctgggagctctctggctatctggggaacccactgcttaagcctcaataaagcttgccttgagtgcb

Page 5: Security's in your DNA: Genomics for InfoSec

Basics

• Letters (nucleotides)
  – 4 in DNA: A, G, C, T

• Codons
  – Triplets of nucleotides, e.g. GAA

• Genomes have coding regions (proteins) & non-coding regions (other)

• One strand can be read forward, the other in reverse

Page 6: Security's in your DNA: Genomics for InfoSec

It’s all about the Codons

• The Genetic Code is a dictionary of Codons

• 64 entries (4^3)
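The count of 64 comes from enumerating every ordered triplet over the 4-letter alphabet. A minimal sketch (the variable names are illustrative, not from the talk):

```python
from itertools import product

NUCLEOTIDES = "ACGT"

# Every ordered triplet of nucleotides is one codon: 4 * 4 * 4 = 64.
codons = ["".join(triplet) for triplet in product(NUCLEOTIDES, repeat=3)]

print(len(codons))  # 64
print(codons[:4])   # ['AAA', 'AAC', 'AAG', 'AAT']
```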

Page 7: Security's in your DNA: Genomics for InfoSec
Page 8: Security's in your DNA: Genomics for InfoSec

Analyzing Genomes

• Compare them to each other
  – Alignments (e.g. Smith-Waterman, etc.)
  – Distances
    • Levenshtein (edit) distance (metric)
    • Longest Common Subsequence distance (metric)
    • Normalized Compression Distance (metric)
  – Optimal Grammars
    • Pisa.c: Optimal sequence grammar search using hyperstring encodings
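Two of the distances listed above can be sketched in a few lines of Python. This is an illustration under assumptions, not code from the talk: the function names are made up here, and zlib stands in as the compressor for Normalized Compression Distance (any real compressor only approximates the ideal one).

```python
import zlib

def levenshtein(a, b):
    """Edit distance via the classic dynamic-programming recurrence."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def ncd(x, y):
    """Normalized Compression Distance between two byte strings (zlib-based)."""
    cx = len(zlib.compress(x))
    cy = len(zlib.compress(y))
    cxy = len(zlib.compress(x + y))
    return (cxy - min(cx, cy)) / max(cx, cy)

print(levenshtein("kitten", "sitting"))  # 3
```

NCD needs no alignment at all: two sequences are close when compressing them together costs little more than compressing the larger one alone.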

Page 9: Security's in your DNA: Genomics for InfoSec

Analyzing Genomes

• Look for interesting regions
  – Information gain (Kullback-Leibler Div)
  – Coding Costs (Kolmogorov Complexity)
  – Decaying Coding Costs (Lossy Kolmogorov Complexity)
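As a rough illustration of the information-gain idea, the sketch below scores a window by the Kullback-Leibler divergence between its letter distribution and that of the whole sequence. The function name, the add-one smoothing, and the requirement that every letter belongs to `alphabet` are assumptions made for this example.

```python
import math
from collections import Counter

def kl_divergence(window, background, alphabet):
    """D_KL(window || background) in bits, with add-one smoothing.

    Assumes every letter in both strings belongs to `alphabet`.
    """
    w_counts, b_counts = Counter(window), Counter(background)
    w_total = len(window) + len(alphabet)
    b_total = len(background) + len(alphabet)
    total = 0.0
    for letter in alphabet:
        p = (w_counts[letter] + 1) / w_total
        q = (b_counts[letter] + 1) / b_total
        total += p * math.log2(p / q)
    return total

background = "acgt" * 50
typical = kl_divergence("acgtacgt", background, "acgt")
unusual = kl_divergence("aaaaaaaa", background, "acgt")
print(typical < unusual)  # True: the all-'a' window stands out
```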

Page 10: Security's in your DNA: Genomics for InfoSec

Rule 1: Size doesn’t matter

Page 11: Security's in your DNA: Genomics for InfoSec

Smallest (almost)

• Mycoplasma genitalium
• 580,000 bp

Page 12: Security's in your DNA: Genomics for InfoSec

Largest

• Polychaos dubium
• 670 billion bp

Page 13: Security's in your DNA: Genomics for InfoSec

Rule 2: Repetition matters

Page 14: Security's in your DNA: Genomics for InfoSec

Don’t say that again

• Sections of DNA that do not repeat are the most important

• Protein coding genes and RNA coding genes are non-repetitive

• The genomes of higher-order creatures are largely repetitive

Page 15: Security's in your DNA: Genomics for InfoSec

Rule 3: Compression is hard

Page 16: Security's in your DNA: Genomics for InfoSec

Putting the squeeze on

• Normal compressors do about as well as plain 2-bit codes
• Special genetic compressors exist
• Compressibility equates to sequence predictability for the model in use
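The link between compressibility and predictability can be checked empirically. The sketch below is an illustration only: it uses zlib as the "normal" compressor and synthetic sequences rather than real genomes. A uniformly random 4-letter sequence cannot be squeezed below about 2 bits per letter, while a perfectly repetitive one compresses to almost nothing.

```python
import random
import zlib

def bits_per_letter(sequence):
    """Compressed size in bits divided by the sequence length."""
    compressed = zlib.compress(sequence.encode("ascii"), 9)
    return 8 * len(compressed) / len(sequence)

rng = random.Random(0)  # fixed seed so runs are repeatable
unpredictable = "".join(rng.choice("acgt") for _ in range(20000))
repetitive = "acgt" * 5000

print(round(bits_per_letter(unpredictable), 2))
print(round(bits_per_letter(repetitive), 2))
```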

Page 17: Security's in your DNA: Genomics for InfoSec

So what does this have to do with security???

Page 18: Security's in your DNA: Genomics for InfoSec

A Question

If we could convert sequences of logs, packets, etc. to a genomic encoding, could we use genomic analysis to dramatically speed up & improve forensics, incident response and anomaly detection?

Page 19: Security's in your DNA: Genomics for InfoSec

YES

Page 20: Security's in your DNA: Genomics for InfoSec

How?

• Step 1: Convert events into alphabet
• Step 2: Convert stream into string of letters
• Step 3: Money bath

Page 21: Security's in your DNA: Genomics for InfoSec

A Naïve Solution

• Step 1: Hash each input, use hash value as a letter

• Step 2: Create stream of hash values
• Step 3: #fail

Why?

Page 22: Security's in your DNA: Genomics for InfoSec

Answer

• The alphabet is too big
• The stream will need at least 2^(2^<hash_key_size>) examples
• Stream is virtually unpredictable
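A small sketch of why the naive encoding fails: an ordinary hash gives every distinct input its own unrelated "letter", so near-identical events share nothing. MD5 from the standard library is used here purely as an example hash; the talk does not say which hash the naive approach would use.

```python
import hashlib

def naive_letter(line):
    """Naive encoding: the full hash digest of a line is its 'letter'."""
    return hashlib.md5(line.encode("utf-8")).hexdigest()

lines = [
    "Mary had a little lamb whose fleece was white as snow.",
    "Mary had a little lamb whose fleece was white as snow!",  # one char changed
    "Gary had a little hand whose hair was as white as blow.",
]
letters = [naive_letter(line) for line in lines]

# Three lines, three different letters, even though two lines differ by one character.
print(len(set(letters)))  # 3
```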

Page 23: Security's in your DNA: Genomics for InfoSec

Enter blar.py

Page 24: Security's in your DNA: Genomics for InfoSec

WTF is a ‘blarp’?

• Let’s ask Google
• The sound a fat person makes being fat
• The sound of taking big fat data and making it useful & efficient small data

• A cool little python tool for creating and analyzing genomic encodings

• The last two will not be found on Google…yet

Page 25: Security's in your DNA: Genomics for InfoSec

Idea

• We want similar events to be represented by a single letter

• Hashes are random projections
• Let’s use geometry instead

Page 26: Security's in your DNA: Genomics for InfoSec

Position in space

• To precisely locate something in a D-dimensional space, you need its distance to n = D+1 reference points

• Key notion: to get something’s general area, you can use n << D+1 reference points

Page 27: Security's in your DNA: Genomics for InfoSec

Locality-Sensitive Hashing

• Introduced by Indyk and Motwani in the late 90’s
• Used within indexing for text lookups on massive data sets
• Many hashes; data-type dependent
• Question: What if you thought about it as a ‘general area’ hash instead?

Page 28: Security's in your DNA: Genomics for InfoSec

How it works

• Basic type: Random Projection
• Given a numeric vector (e.g. 1, 15, 3, 14.8), calculate its dot product vs. a random vector
• If result is positive, call it a ‘1’
• If negative, call it a ‘0’
• Repeat
• Concatenate binary together, result is LSH
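The steps above can be sketched in plain Python. This is a minimal illustration, not the blar.py implementation: the function name, the Gaussian random vectors, and the fixed seed are assumptions, and a dot product of exactly zero is treated as a ‘1’.

```python
import random

def random_projection_lsh(vector, planes):
    """One bit per random vector: '1' if the dot product is >= 0, else '0'."""
    bits = []
    for plane in planes:
        projection = sum(p * x for p, x in zip(plane, vector))
        bits.append("1" if projection >= 0 else "0")
    return "".join(bits)

rng = random.Random(42)       # fixed seed so runs are repeatable
vector = [1, 15, 3, 14.8]     # the example vector from the slide
planes = [[rng.gauss(0, 1) for _ in vector] for _ in range(4)]  # 4-bit hash

scaled = [2 * x for x in vector]
print(random_projection_lsh(vector, planes))
print(random_projection_lsh(scaled, planes))  # same bits as the line above
```

Scaling a vector by a positive factor never changes the sign of a dot product, which is why the two printed hashes match: the hash captures direction (the ‘general area’), not exact position.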

Page 29: Security's in your DNA: Genomics for InfoSec

Blar.py Pipeline

Vectorize Input → Find Locality Sensitive Hash → Convert to UTF-16 char → Output stream of UTF-16 → Analyze sliding window over genome stream → Score → Chart stuff

Page 30: Security's in your DNA: Genomics for InfoSec

Vectorizing

• Idea: Count things that matter, take measurements, etc. and create an array to hold that information

• Where the rubber meets the road
• Lots of chances for domain expertise

Page 31: Security's in your DNA: Genomics for InfoSec

Basic Vectorizing in Blar.py

• Basic model: character n-grams
• Also known as Markov chains or Bag of Letters
• Counts up sliding windows of text
• E.g. 2-grams for ‘sassyfrassy’:
  sa: 1 as: 2 ss: 2 sy: 2 yf: 1 fr: 1 ra: 1
• For a 256^2 length array: (1,0…0,2,0…0,2,0…
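The 2-gram counts for ‘sassyfrassy’ can be reproduced with a sliding window and a counter. A minimal sketch (the helper name `char_ngrams` is illustrative):

```python
from collections import Counter

def char_ngrams(text, n):
    """Count every length-n sliding window of characters in text."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

counts = char_ngrams("sassyfrassy", 2)
print(dict(counts))
# {'sa': 1, 'as': 2, 'ss': 2, 'sy': 2, 'yf': 1, 'fr': 1, 'ra': 1}
```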

Page 32: Security's in your DNA: Genomics for InfoSec

Let’s Vectorize Better

• Use feature hashing, otherwise known as the hashing trick

• For each model pattern, find its hash mod the vector length and increment that counter

• Permits lossy counting with graceful random collisions

• Blar.py uses length 64 by default and xxHash

Page 33: Security's in your DNA: Genomics for InfoSec

Blar.py code

import numpy
import xxhash

def feature_hash_string(s, window, dim):
    # Generate window-char Markov chains & create feature hashes
    # (xxh32 takes bytes; intdigest() gives the hash as an integer)
    chains = [xxhash.xxh32(s[i:i+window].encode('utf-8')).intdigest() % dim
              for i in range(len(s) - (window - 1))]

    # Initialize counter array
    counters = numpy.zeros(dim)

    # Count instances of feature hashes
    for h in chains:
        counters[h] += 1

    # Return feature hash count vector
    return counters

Page 34: Security's in your DNA: Genomics for InfoSec

Now let’s find the LSH

import numpy

# Use random projection for LSH and output a UTF char for the locality-sensitive hash
# COMP_VECTORS is a module-level collection of random vectors, one per hash bit (not shown on the slide)
def locality_hash_vector(v, width):
    bits = numpy.zeros(width, dtype=int)
    # One bit per random vector, covering all `width` bits
    for x in range(width):
        projection = numpy.dot(COMP_VECTORS[x], v)
        if projection < 0:
            bits[x] = 0
        else:
            bits[x] = 1

    # Return unicode char equal to the LSH
    return chr(int(''.join(map(str, bits)), 2))
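Putting the two steps together, the sketch below turns a list of log lines into a genome string, one letter per line. It is a standard-library approximation under stated assumptions, not blar.py itself: `zlib.crc32` stands in for xxHash, `comp_vectors` is a hypothetical stand-in for `COMP_VECTORS`, and letters are offset into ‘A’..‘P’ so they print cleanly, where the tool emits raw Unicode characters.

```python
import random
import zlib

DIM, WINDOW, WIDTH = 64, 4, 4  # the defaults quoted in the talk

rng = random.Random(0)  # fixed seed so the encoding is repeatable
# Hypothetical stand-in for COMP_VECTORS: one random vector per hash bit.
comp_vectors = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(WIDTH)]

def feature_hash(line):
    """Count hashed character n-grams into a DIM-length vector."""
    counters = [0] * DIM
    for i in range(len(line) - (WINDOW - 1)):
        gram = line[i:i + WINDOW].encode("utf-8")
        counters[zlib.crc32(gram) % DIM] += 1  # crc32 used in place of xxHash
    return counters

def letter_for(line):
    """Random-projection LSH of the feature vector, mapped to one letter."""
    vector = feature_hash(line)
    bits = "".join(
        "1" if sum(c * x for c, x in zip(plane, vector)) >= 0 else "0"
        for plane in comp_vectors
    )
    return chr(ord("A") + int(bits, 2))  # printable offset: an illustrative choice

def genome(lines):
    return "".join(letter_for(line) for line in lines)

log = ["some more strings"] * 3 + ["FOO BAR BAS"]
print(genome(log))
```

Identical lines always map to the same letter, and lines with similar n-gram counts tend to, which is what keeps the alphabet small and the stream predictable.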

Page 35: Security's in your DNA: Genomics for InfoSec

Blar.py analysis

• Analyzes 4-character sequences and assigns a decaying version of the optimal coding cost to each line

• Tells you how interesting a certain event is relative to everything else in the genome, accounting for ordering

• Blar.py genomes are extremely compressible, especially using bzip
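The talk does not spell out the scoring formula, so the following is only one plausible reading of a "decaying coding cost": each 4-character window is charged -log2 of its smoothed frequency so far, and older observations fade by a constant factor. The function name, the smoothing, and the decay value of 0.99 are all assumptions.

```python
import math

def decaying_coding_costs(genome, window=4, decay=0.99):
    """Cost in bits of each window under an adaptive, exponentially decaying model."""
    counts = {}   # decayed count per window seen so far
    total = 0.0   # decayed count of all windows seen so far
    costs = []
    for i in range(len(genome) - window + 1):
        gram = genome[i:i + window]
        # Smoothed probability: unseen windows get a small, non-zero share
        p = (counts.get(gram, 0.0) + 1.0) / (total + len(counts) + 1.0)
        costs.append(-math.log2(p))
        # Fade old evidence, then record the new observation
        for key in counts:
            counts[key] *= decay
        total = total * decay + 1.0
        counts[gram] = counts.get(gram, 0.0) + 1.0
    return costs

stream = "A" * 50 + "Z" + "A" * 50
costs = decaying_coding_costs(stream)
peak = costs.index(max(costs))
print(peak)  # lands on one of the windows that contain the 'Z'
```

Routine stretches cost a fraction of a bit per window, while the windows touching the odd letter cost several bits, which is the "how interesting is this event" signal described above.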

Page 36: Security's in your DNA: Genomics for InfoSec

Blar.py defaults (ATM)

• 4-character sliding windows
• 4-bit hashes
• 64d feature hashes
• Outputs a list of the most interesting scores
• Outputs a few bad charts

Page 37: Security's in your DNA: Genomics for InfoSec

Blar.py vs. Toy File

1. Mary had a little lamb whose fleece was white as snow.
2. Mary had a little lamb whose fleece was white as snow.
3. Mary had a little lamb whose fleece was white as snow.
4. Mary had a little lamb whose fleece was white as snow.
5. Mary had a little lamb whose fleece was white as snow.
6. Gary had a little hand whose hair was as white as blow.
7. some more strings
8. some more strings
9. some more strings
10. some more strings
11. some more strings
12. John McAfee was the keynote for Skytalks.
13. John McAfee was the keynote for Skytalks.
14. John McAfee was the keynote for Skytalks.
15. some more strings
16. some more strings
17. some more strings
18. John McAfee was the keynote for Skytalks.
19. John McAfee was the keynote for Skytalks.
20. FOO BAR BAS

Page 38: Security's in your DNA: Genomics for InfoSec

Blar.py vs. Toy File

Page 39: Security's in your DNA: Genomics for InfoSec

Blar.py vs. Toy File (Look Raffy, I’m using the completely inappropriate chart type)

Page 40: Security's in your DNA: Genomics for InfoSec

Blar.py vs. BlueGene/L

• From the Usenix Computer Failure Data Repository

• 1.2GB combined log file from 131,072 processors for six months

• 119MB compressed with gzip
• 9.4MB blar.py genome
• Blar.py ~1000 lines/sec

Page 41: Security's in your DNA: Genomics for InfoSec

Blar.py vs. BlueGene/L

Page 42: Security's in your DNA: Genomics for InfoSec

Blar.py vs. BlueGene/L

Page 43: Security's in your DNA: Genomics for InfoSec

TL;DR

• Fast, accurate, free: Blar.py genomic encoding tool provides very fast, low noise anomaly detection

• Stop searching in a crisis: Great way to quickly explore data for IR, forensics, etc., especially from unknown sources

• Want it? Follow me @conduit242 for the GitHub posting announcement