Dmk blackops2006 ccc

Black Ops 2006: Viz Edition

CCC 2006

Dan Kaminsky

Director Of Penetration Testing

IOActive

Thanks and No Thanks

• Thank You To Swissotel Amsterdam, who provided a net connection with which I could actually finish these slides

• No Thanks to Delta Hotel of Amsterdam, which put a TV on a really weak shelf.
– I suppose it’s my fault I put my laptop underneath.
– The “Star System” is officially meaningless

Who Am I?

• Coauthor of several book series
– Hack Proofing Your Network
– Stealing The Network

• Formerly of Cisco and Avaya
– Presently partnering with IOActive
– One of the “Blue Hat Hackers” that has been auditing Windows Vista

• Been doing talks for six years now
– TCP/IP, DNS, MD5, SSH, etc.

What Are We Here To Do?

• Break TCP/IP A Little More
– Not in the documentation
– It’s for a good cause ;)

• Analyze Data Linguistically

• Make Pretty Pretty Pictures!

For Various Definitions Of Pretty: Visual Bindiff

The Ancient Tongue: TCP/IP

• Can’t all be about pretty pictures

• A new problem has popped up: Network oligopolies are threatening to install firewalls that limit or eliminate bandwidth on a per-company basis
– Their own media services might be fast, others will be slow
– Their own VPN services might be fast, others will be slow

• Question: Is it possible to detect and locate devices violating network neutrality?

What’s The Closest Tool We Have?

• Firewalk
– Mike Schiffman’s Firewall Analysis Tool
– Packets elicit an ICMP Time Exceeded error if they reach a router with TTL=0
  • TTL is decremented by one for each hop, so if you start low and count up, you can trace the route to a host
– A firewalled packet won’t live long enough to reach TTL=0
– So you can locate the firewall, and divine things about its ruleset, based on when your packets stop getting ICMP Time Exceeded
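
The TTL logic above can be sketched as a small offline simulation. This is not Firewalk itself: the path, the hop names, and the rule that a filtering hop drops a probe before its TTL can expire are all assumptions made for illustration, and no packets are sent.

```python
# Toy model of TTL-limited probing (no real network traffic).
# Each hop is (name, set_of_blocked_ports); names and ports are made up.
EXAMPLE_PATH = [("r1", set()), ("r2", set()), ("fw", {23}), ("r4", set())]

def probe(path, ttl, port):
    """Return the hop that would answer ICMP Time Exceeded, or None if the
    probe was filtered (or survived past the last hop)."""
    for name, blocked_ports in path:
        if port in blocked_ports:
            return None          # assumption: filtered before TTL expiry
        ttl -= 1
        if ttl == 0:
            return name          # TTL hit zero here
    return None

def firewalk(path, port):
    """Raise the TTL hop by hop; the first silent TTL marks the filter."""
    for ttl in range(1, len(path) + 1):
        if probe(path, ttl, port) is None:
            return ttl
    return None                  # every hop answered: nothing filtered

print(firewalk(EXAMPLE_PATH, 23))
print(firewalk(EXAMPLE_PATH, 80))
```

In this model a blocked port goes silent at the filtering hop, while an allowed port keeps drawing replies all the way along the path.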

Limitations of Firewalking

• But Firewalk tells us what, not who, is blocked…and it tells us nothing about who is allowed to go fast, and who is made to go slow
– Suddenly, we devolve to a much older question: Is it possible to find out that a target firewall is, or is not, blocking against or accepting traffic from an arbitrary IP address?

TCP Does Speed Measurement

• TCP speed analysis done blindly
– Endpoints do not negotiate with one another
– Everyone sends their packets, routers route what they will. Endpoints need to adjust to what the routers are willing to pass.

• Routers communicate with endpoints by dropping their packets

• Can we combine this router backchannel w/ Firewalk?

In From The Side

• What causes packets to drop?
– Too many packets

• What are we going to do?
– Send too many packets

• Two channels are set up
– A primary channel, which drops packets at some known rate
– A secondary channel, whose purpose it is to interfere (or not) with the primary channel
  • When the secondary interferes with the primary, we get feedback via the primary channel
– The traffic composing the secondary channel can come from anywhere, be composed of anything, and can be TTL’d just like in a normal firewalk.

The TTL Channel

• Normally, you don’t know which router along a path is dropping your packets

• If you are the source of the drop-inducing packets, you can control how far your noise goes out – thus, you can discover which router is hitting its limit / censoring your net connection
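
A rough way to picture the TTL channel is a toy queueing model. Everything here is an assumption for illustration: routers are reduced to a single capacity number, an overloaded router drops all flows proportionally, and the noise is assumed to arrive at full rate at every hop its TTL lets it reach.

```python
def primary_loss(capacities, primary_rate, noise_rate, noise_ttl):
    """Fraction of primary packets lost when noise travels noise_ttl hops."""
    surviving = primary_rate
    for hop, capacity in enumerate(capacities, start=1):
        noise = noise_rate if hop <= noise_ttl else 0
        offered = surviving + noise
        if offered > capacity:
            # Overloaded hop: every flow keeps capacity/offered of its packets.
            surviving *= capacity / offered
    return 1 - surviving / primary_rate

def locate_bottleneck(capacities, primary_rate, noise_rate):
    """Smallest noise TTL at which the primary channel starts losing more."""
    baseline = primary_loss(capacities, primary_rate, noise_rate, 0)
    for ttl in range(1, len(capacities) + 1):
        loss = primary_loss(capacities, primary_rate, noise_rate, ttl)
        if loss > baseline + 1e-9:
            return ttl
    return None

# Hypothetical path: the third hop is the narrow one.
print(locate_bottleneck([100, 100, 30, 100], primary_rate=20, noise_rate=20))
```

The point of the sketch is only the search loop: extend the noise one hop at a time and watch the primary channel for the first change.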

Scorchmarking

• Why Scorchmarking?
– Routers are burning packets…those that get through might have a scorch mark or two

• Basic Model
– Client downloads a file from a site, at some given speed negotiated via TCP.
– At the same time, traffic is injected from different IP addresses. This should cause drops.
  • If it doesn’t, the network is either penalizing the primary channel (easy to drop against) or rewarding the secondary channel (resilient to drops)

Advanced Scorchmarking [0]

• Having to depend on a client is lame
– Wouldn’t it be nice if we could scan the Internet for these servers?

• What fundamental service is a receiving client providing?
– It is acknowledging our traffic, letting us know how much it received, and how many milliseconds it took to receive it

• Aren’t there other ways we could extract the same data from hosts?


• What else will acknowledge receiving traffic from us?
– TCP Servers
  • Sting, from Stefan Savage, used this to great effect
– DNS Servers
– Routers
  • Supposedly, routers won’t send more than a certain number of ICMP Time Exceeded packets per second
  • In reality, they seem to ICMP Time Exceeded ACK however much you throw at them
  • Even if they didn’t, you could use the difference in ICMP Time Exceeded rates between Primary and Secondary channel, to determine whether interference was showing up.

• Everyone’s got a NAT – so you can query everyone for whether certain sorts of traffic are being blocked to them


• So, yes.
– You can scan for violations of Network Neutrality
– You can find networks that are blocking or passing particular IP ranges

• It’s not exactly efficient though

• Neutrality violations are easier to find than the standard FW case
– Firewalls are normally between the WAN and the LAN (Slow Net -> FW -> Fast Net)
– Neutrality violators are mid-WAN (Slow Net -> FW -> Slow Net -> Fast Net)
– Easier to overload the slow net after the firewall
  • Boxes with max TTL rates override this

Speed Limits

• Fundamental Problem: Have to max out bandwidth on the link to trigger the backchannel
– No packets dropping, no data
– Means you have to DoS a link: not scalable/legal

• Potential Solution: Find capped acknowledgers
– The mythical ICMP Time Exceeded rate limit works well
  • Primary and Secondary channel both eliciting ITE’s
  • When secondary channel gets a packet through, it takes up a slot on the primary channel’s
  • ITE is perfect, since you can TTL limit any packet
  • Depends on the firewall passing the primary’s ITE’s
  • Maybe Linux / NATs actually implement rate limits?
– Another option: What if we have code on the client?

Windows Media Player: More Than Just DRM. Really!

• Bulk Transfer: RTP
– Runs over Unicast UDP
– Yes, the same Unicast UDP that penetrates NAT so well!

• Flow Control / Quality Monitoring: RTCP

• No technical reason RTCP needs to go back to the same address that RTP stream is coming from
– So: We pretend to provide media streams from all sorts of sites, and use WMP to collect traffic stats for us

• It might work…

Symbols

• But this is not to be a talk on TCP/IP hackery…

SSH’s Hex Problem

• $ ssh dan@blah
The authenticity of host 'blah (1.2.3.4)' can't be established.
RSA key fingerprint is 09:a9:b1:99:84:17:7d:ba:c6:55:46:5a:17:f8:83:01.
Are you sure you want to continue connecting (yes/no)?

• 09:a9:b1…am I supposed to do something with this?
– Yes. According to SSH’s design, you’re supposed to reject the proposed fingerprint if it looks unfamiliar. (Seriously.)

• The “Two Billion SSH Key” attack (by ADM) just comes up with 2B keys and emits the visibly closest key. It works.

Hex sucks. A better mapping must be possible…

Cryptomnemonics

• There are three classes of memory, at least to the degree that is useful in cryptography
– Rejection: “I’ve never seen that before”
– Recognition: “It’s that one, not that other one”
– Recollection: “Let me describe it to you.”

• SSH just requires rejection: “What? That’s new.”

• Hex domain clearly does not work. What else is available?
– To restate the problem: Humans do not operate on hexadecimal symbols effectively. Are there any other symbol sets we can use?

Alternative Symbolic Domains

• Abstract Art via déjà vu
• Calculated faces via Passfaces
• Both have attempted to address limited capacity for recollection by moving authentication to a recognition problem

• But recognition offers only a limited number of bits: 9^5=59049 < 2^16
– This is OK, since Passfaces is online and thus can lock a user out before 59K attempts are up
– We are not online, but we only need to reject, not recognize and certainly not recollect

The Nymic Domain: Names Are Identity Symbols

• Humans don’t remember arbitrary bits, but we do remember stories.

• Stories change (the bits shift over time), but names stay the same

• Can we map the 160 bits SSH needs us to accept or reject, to names?
– Take 512 male names: 9 bits of info per male name
– Take 1024 female names: 10 bits of info per female name
– Take 8192 last names: 13 bits of info per last name
– 9+10+13=32. 5 couples = 160 bits
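
The bit-splitting above can be sketched as follows. The name tables are synthetic placeholders (an assumption; real use needs curated lists of exactly 512, 1024 and 8192 names), and the bit order within each 32-bit chunk is my own choice, since the slides do not specify one.

```python
# Placeholder tables: stand-ins for real lists of names.
MALE = [f"m{i:03d}" for i in range(512)]      # 9 bits
FEMALE = [f"f{i:04d}" for i in range(1024)]   # 10 bits
LAST = [f"l{i:04d}" for i in range(8192)]     # 13 bits

def fingerprint_to_couples(fp: bytes):
    """Map a 160-bit value to five 'male and female lastname' couples."""
    if len(fp) != 20:
        raise ValueError("expected a 160-bit (20-byte) fingerprint")
    couples = []
    for i in range(0, 20, 4):
        word = int.from_bytes(fp[i:i + 4], "big")   # one 32-bit chunk
        male = MALE[word >> 23]                     # top 9 bits
        female = FEMALE[(word >> 13) & 0x3FF]       # next 10 bits
        last = LAST[word & 0x1FFF]                  # low 13 bits
        couples.append(f"{male} and {female} {last}")
    return couples

for line in fingerprint_to_couples(bytes(range(20))):
    print(line)
```

Because each field is a fixed-width slice, the mapping is reversible: the same tables turn typed names back into bits.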

Demo

• $ ssh dan@blah
Key Data:
julio and epifania dezzutti
luther and rolande doornbos
manual and twyla imbesi
dirk and cuc kolopajlo
omar and jeana hymel
The authenticity of host 'blah (1.2.3.4)' can't be established.
Are you sure you want to continue connecting (yes/no)?

• It is critical that the Key Data be shown every time there’s a connection. The user must become familiar with the “characters” in the “story”.
– This actually seems to work.

What about Bubble Babble?

• $ ssh-keygen.exe -B -f id_dsa.pub
1024 xegoz-tosys-vusik-masar-cifyc-cyled-kikih-zukuf-nypok-sezyt-noxax id_dsa.pub

• Problem: Humans do not remember arbitrary sequences of syllables well

• Names are special sequences: sharing with pre-existing language logic should improve retention
– Still, names are arbitrary (Boutros Boutros-Ghali); could merge approaches:
Xegoz and Tosys Vusik
Masar and Cifyc Cyled
Kikih and Zukuf Nypok
Sezyt Noxax
– Requires testing

Inverting The Symbol Flow: Passnyms

• Suppose you have 8 slots, each holding one of 64 characters.
– aI7$13nM
– 64 == 2^6, so (2^6)^8 == 2^48: 48 bits
– “Lowercase A, lowercase l, seven, dollar sign, one, three, lower case n, upper case M”
  • This is twenty three syllables!

• What if, instead, you typed:
– dirk and cuc kolopajlo
omar and jeana hymel
– 64 bits of entropy, 14 syllables, can be spell checked as user types it in
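
The entropy comparison is simple arithmetic, sketched here with hypothetical helper names. The table sizes are the 512 / 1024 / 8192 figures from the Nymic Domain slide.

```python
import math

def slot_bits(alphabet_size, slots):
    """Bits in a string of independent slots drawn from one alphabet."""
    return slots * math.log2(alphabet_size)

def couple_bits(males=512, females=1024, lasts=8192):
    """Bits carried by one 'male and female lastname' couple."""
    return math.log2(males) + math.log2(females) + math.log2(lasts)

print(slot_bits(64, 8))     # the 8-character password
print(2 * couple_bits())    # a two-couple passnym
```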

It Is Easier To Interface With Systems When Symbols Align

• Hacking is a form of interfacing

• We can break things with garbage symbols
– “Dumb Fuzzing”: Take a file, flip some bits, see what happens

• We can break more things with meaningful symbols used in unexpected ways
– “Smart Fuzzing”: Take a file, understand its internal structure, fuzz the structure, see what happens

• Dumb fuzzing is very easy.
• Smart fuzzing is very labor intensive…requires smart people, maybe specifications.
• Is there any way we can automatically discover symbol sets?
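
To show how little “dumb fuzzing” needs, here is a minimal bit-flipper. It is a generic sketch, not any particular tool; the function name and the seeded random generator are my own choices.

```python
import random

def flip_bits(data: bytes, n_flips: int, seed: int = 0) -> bytes:
    """Return a copy of data with n_flips randomly chosen bits inverted."""
    if not data:
        return data
    rng = random.Random(seed)        # seeded so a crash can be reproduced
    out = bytearray(data)
    for _ in range(n_flips):
        pos = rng.randrange(len(out) * 8)
        out[pos // 8] ^= 1 << (pos % 8)
    return bytes(out)

print(flip_bits(b"GIF89a", 3))
```

Nothing here knows anything about the file format, which is exactly the limitation the smart-fuzzing bullets describe.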

File Formats Are Languages

• Kids don’t get documentation when they learn new languages. They just pick ’em up.
– They can do this because they actually design all sorts of internal structure and redundancy into them.

• Children make languages.
• Adults make working languages.
• Programmers make barely working languages.
– Let’s autodiscover them!

N’est-ce pas Non Sequitur

• Sequitur: Linear Time Pattern Finder
– Creates hierarchical Context Free Grammars from arbitrary input

• Compression Algorithm in which you can “look under the covers” to see what’s going on

• Created by Craig Nevill-Manning as his PhD thesis a decade ago
– He’s now Chief Research Scientist at Google

Syntax Highlighting For Hex Dumps

• Trivial Algorithm: In a hierarchical grammar, each byte requires traversing to a certain depth in order to recover the raw literal.

• Color each byte by how deep in the tree you have to go.
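
A minimal sketch of the depth calculation, assuming a grammar stored as a dict of rule id to body, where an int refers to another rule and a bytes object is literal data. That representation is an assumption for illustration, not Sequitur’s actual output format.

```python
def byte_depths(grammar, start=0):
    """Expand the grammar from the start rule and record, for every output
    byte, how many rules deep its literal sits."""
    data, depths = bytearray(), []

    def expand(rule_id, depth):
        for item in grammar[rule_id]:
            if isinstance(item, int):          # reference to another rule
                expand(item, depth + 1)
            else:                              # literal bytes
                data.extend(item)
                depths.extend([depth] * len(item))

    expand(start, 1)
    return bytes(data), depths

# Hypothetical grammar: 0 -> 1 x 1 y, 1 -> 2 c, 2 -> a b
GRAMMAR = {0: [1, b"x", 1, b"y"], 1: [2, b"c"], 2: [b"a", b"b"]}
print(byte_depths(GRAMMAR))
```

Feeding each depth into a color ramp gives the highlighted hex dump: bytes buried in heavily reused rules get one color, one-off literals another.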

BLUR-O-VISION

What’s Actually Going On?

• (0) -> … (73),b4,(73),ca,(73),e6,(73),02,(74),18,(74),2c,(74),4a,(74),5c,(74),6e,(74),80,(74),98,(74),b0,(74),c8,(74),e8,(74),fc,(74),10,(75),20,(75),30,(75),40,(75),50,(75),64,(75),82,(75),90,(75),9e,(75)…(84),d6,(84),ee,(84),0c,(85),28,(85),3c,(85),4e,(85),66,(85),7e,(85),8c,(85),9e,(85),ac,(85),be,(85),ca,(85),ea,(85),08,(86),26,(86),44,(86),56,(86),6a,(86),7c,(86),8a,(86),a6,(86),b6,(86),cc,(86),de,(86),02,(87)

• Repeated sequence, single byte literal. Repeated sequence, single byte literal. Rinse, lather, repeat.

Intersymbol Link Discovery

• Turns code on left into symbolic set on right; it’s easy then to link the symbols together as per the graph.

• This works for non-textual data
• Sequitur imputes meaningful symbols from arbitrary input data

Context Free Grammar Fuzzer: THE CFG9000

• Reduce input data to a stream of symbols
• Fuzz data at the symbol level, rather than at pure bytes
– Shuffle
– Drop
– Repeat
– Uniform Corrupt
  • Consistently corrupt all instances of a given symbol
  • <HEAD> -> <FOOBAR>

• Sequitur is not necessarily the best way to generate a grammar.
– Doesn’t handle recursion, common in genomic data
– Suffix trees may yield better output
– Sequitur may scale better (100MB input not an issue)
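
The four symbol-level operations might look like this once the input has been reduced to a list of symbols. This is a sketch of the idea, not the CFG9000 source; the function names and the list-of-bytes symbol stream are assumptions.

```python
import random

def shuffle(symbols, rng):
    """Reorder the whole symbol stream."""
    out = list(symbols)
    rng.shuffle(out)
    return out

def drop(symbols, rng):
    """Remove one randomly chosen symbol."""
    symbols = list(symbols)
    i = rng.randrange(len(symbols))
    return symbols[:i] + symbols[i + 1:]

def repeat(symbols, rng, times=2):
    """Emit one randomly chosen symbol `times` times in a row."""
    symbols = list(symbols)
    i = rng.randrange(len(symbols))
    return symbols[:i] + [symbols[i]] * times + symbols[i + 1:]

def uniform_corrupt(symbols, target, replacement):
    """Replace every instance of one symbol with the same corruption."""
    return [replacement if s == target else s for s in symbols]

stream = [b"<HEAD>", b"title", b"</HEAD>", b"<HEAD>"]
print(b"".join(uniform_corrupt(stream, b"<HEAD>", b"<FOOBAR>")))
```

Joining the mutated symbols back into bytes gives the test case to hand to the parser.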

Sample CFG9000 Output

• calculate_rule_usage(p->rulep->rulep->rulep->rulep->rulep->rulep->rulep->rulep->rulep->rulep->rulep->rulep->rulep->rule() }

• calculate_rule_usage(calculate_rule_usage(calculate_rule_usage(calculate_rule_usage(calculate_rule_usage(calculate_rule_usage(calculate_rule_usage(calculate_rule_usage(calculate_rule_usage(calculate_rule_usage(calculate_rule_usage(calculate_rule_usage(calculate_rule_usage(calculate_rule_usage(calculate_rule_usage(calculate_rule_usage(calculate_rule_usage(calculate_rule_usage(p->rule());

Slashdot Fuzzed

Slashdot Fuzzed (2)

It’s Not The Best CFG Fuzzing Ever…

• Many physicists would agree that, had it not been for congestion control, the evaluation of web browsers might never have occurred. In fact, few hackers worldwide would disagree with the essential unification of voice-over-IP and public private key pair. In order to solve this riddle, we confirm that SMPs can be made stochastic, cacheable, and interposable.
– Rooter: A Methodology for the Typical Unification of Access Points and Redundancy
– By A Context-Free Grammar Generating CompSci Papers

• Authors handcoded “meaningful symbols” in CompSci speak. The eventual goal is the autogeneration of symbol and inter-symbol patterns.

Symbolic Discovery Is Inevitable

• “An early inference procedure was described by Chomsky and Miller (1957a), as reported in Solomonoff (1959). Chomsky proposed a method for detecting loops in finite state languages. The approach requires a set of valid sentences, and an oracle that determines whether a sentence is in the language.

The algorithm proceeds by deleting part of a valid sentence and asking the oracle whether the sentence is still valid. If it is, the deleted part is reinserted into the sequence and repeated, so that it appears twice. If the sentence is still in the language, a cycle has been detected.”
– Inferring Sequential Structure, Craig Nevill-Manning, 1996
– This couldn’t POSSIBLY be useful for building a structure for a dumb fuzzer to operate against.
  • Instead of seeing if the parser crashes, just see if it considers the input valid
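
The quoted delete-then-double procedure is short enough to sketch directly. The oracle below is a stand-in regular expression (an assumption for the demo); in the fuzzing setting it would be “does the target parser accept this input”. Trying every substring is the brute-force version, not an optimized one.

```python
import re

def find_loops(sentence, is_valid):
    """Return (start, end) spans that can be deleted, and also doubled,
    while the oracle still accepts the sentence."""
    loops = []
    n = len(sentence)
    for start in range(n):
        for end in range(start + 1, n + 1):
            part = sentence[start:end]
            deleted = sentence[:start] + sentence[end:]
            doubled = sentence[:start] + part + part + sentence[end:]
            if is_valid(deleted) and is_valid(doubled):
                loops.append((start, end))
    return loops

def demo_oracle(s):
    """Hypothetical language: an 'a', any number of 'bc' pairs, then 'd'."""
    return re.fullmatch(r"a(bc)*d", s) is not None

print(find_loops("abcd", demo_oracle))
```

Each span found this way is a repeatable unit of the format, which is the kind of structure a fuzzer can then target.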

TODO

• “Requitur”: Sequitur implementation optimized for fuzzer use
– Generate larger symbols
  • No two byte symbols please; we’re not trying to compress, we’re trying to elucidate structure
– Eliminate redundant symbols
  • Kieffer-Yang optimization in ~2001: If symbol (x) == symbol (y), then delete (y) and set all instances of (y) to (x)
  • Need to do this to actually consistently fuzz all instances of a particular trope
– Possibly remove in-memory grammar requirement
  • Use mechanisms from Ray, an out-of-memory variant
– Add foreign grammar capability
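
The “eliminate redundant symbols” item can be sketched as a merge of rules with identical bodies. This follows the one-line description on the slide rather than the published Kieffer-Yang construction, and it assumes a grammar stored as a dict of rule id to body (ints are rule references, bytes are literals).

```python
def merge_duplicate_rules(grammar):
    """Delete any rule whose body equals an earlier rule's body and point
    its references at the survivor; repeat until nothing changes."""
    grammar = {k: list(v) for k, v in grammar.items()}
    while True:
        seen, alias = {}, {}
        for rule_id in sorted(grammar):
            body = tuple(grammar[rule_id])
            if body in seen:
                alias[rule_id] = seen[body]     # (y) duplicates (x)
            else:
                seen[body] = rule_id
        if not alias:
            return grammar
        for rule_id in alias:
            del grammar[rule_id]
        for rule_id, body in grammar.items():
            grammar[rule_id] = [
                alias.get(item, item) if isinstance(item, int) else item
                for item in body
            ]

print(merge_duplicate_rules({0: [1, b"x", 2], 1: [b"a", b"b"], 2: [b"a", b"b"]}))
```

The loop repeats because merging two rules can make their parents identical in turn.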

What’s Out Now

• 8 Bit Clean – Can Analyze Arbitrary Data

• Mergedot – Can create graph from Sequitur output

How To Think Of Sequitur

• Any time you’re manipulating data as bytes, think of manipulating it as symbols
– Trigram histograms on bytes -> Trigram histograms on symbols
– Bayesian probabilities on characters -> Bayesian probabilities on symbols
– Adapt yourself to more than 256 codes per symbol and reap the benefit
  • If your code is already Unicode aware you might be one step ahead!

Fuzzy Wuzzy Wuz A Symbol

• Symbol analysis systems (language translators, etc) have issues w/ TMTOWTDI (There’s More Than One Way To Do It)
– Very similar messages can be encapsulated in very different ways
– Very similar messages can be encapsulated in very similar, but not identical ways

• Sequitur only handles exact matches; fuzzy grammar imputation doesn’t appear to exist yet
– Are there any systems for analyzing complex, unequal but somewhat related sets of symbols?

Another Approach: DotPlots

• Popular mechanism in bioinformatics for visual analysis of genomes.

• Some attempts to apply dotplots outside of bioinformatics
– Textual analysis
– Audio

• Remembered an old paper, entitled Visualizing Music And Audio Using Self-Similarity
– Jonathan Foote from Xerox

• Brute Force solution: compare songs to themselves, splitting them into tiny chunks and marking light for similar and dark for dissimilar
– Disassociated Studio will do this for you
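
The brute-force comparison can be sketched in a few lines. The similarity metric here (fraction of equal bytes at equal positions) is a deliberately simple stand-in and an assumption on my part; the talk’s tool uses its own metric, and the window size is just a parameter.

```python
def chunks(data, window):
    """Split data into consecutive, non-overlapping windows."""
    return [data[i:i + window] for i in range(0, len(data) - window + 1, window)]

def similarity(a, b):
    """Fraction of positions where two equal-length chunks agree."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def dotplot(a, b, window=32):
    """Matrix of chunk similarities: rows walk through a, columns through b.
    Passing the same data twice gives the self-similarity plot."""
    rows, cols = chunks(a, window), chunks(b, window)
    return [[similarity(r, c) for c in cols] for r in rows]

sample = b"ABCD" * 4 + b"WXYZ" * 4
for row in dotplot(sample, sample, window=4):
    print("".join("#" if cell > 0.5 else "." for cell in row))
```

Rendering each cell as a light or dark pixel instead of a character gives the images shown on the following slides. The cost grows with the square of the number of chunks, hence “brute force”.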

Day Tripper from the Beatles… Music shows internal pattern.


So does MPEG.

What Exactly Are We Doing

• Jonathan Helfman’s “DotPlot Patterns: A Literal Look at Pattern Languages” offers an introduction

• Instead of “to, be, not” etc, we use chunks of data from arbitrary files
– The same similarity metric used to disambiguate names for the SSH hack, is used to measure similarity here

There are so many patterns we might see…

…and no matter how much we’ve learned of this pattern language…

???

So How Might This Be Useful?

• A) Format Identification
– 1) Do different file formats appear different?
– 2) Do different instances of the same file format appear similar?
– 3) Does one format embedded in another make itself apparent?

• B) Fuzzer Guidance
– 1) Can we locate the actual byte offsets where one section ends and another begins?
– 2) Can we visualize and compare fuzzer operations via Dotplots?

Format Identification

• 1) Do different files appear different, and does the appearance reflect the existence of internal structure?

• 2) Do different instances of the same file format appear similar?

• 3) Does one format embedded in another make itself apparent?

Java Class Files

.NET Assemblies

CNN’s Home Page

SMBTorture Traffic (Packets: Note, Stop/Start Is Visible)

Kernel32.dll

Chromosome 22 (This is, after all, a genomics hack)

The Legend Of Zelda


• 1) Do different files appear different, and does the appearance reflect the existence of internal structure?
– Answer: Yes. They do.

• 2) Do different instances of the same file format appear similar?


Books from Project Gutenberg: Consistent

Despite English’s low information content, lack of even mildly related strings causes little self-similarity across symbol clusters

US Code: Moderately Consistent

Legalese is a massively structured dialect. Symbols appear in very distinct patterns that are more reminiscent of machine code than text.

HTML: Consistent

HTML repeats smaller symbols (tags) and larger symbol clusters (via template engines) regularly. This shows up visually as a tightly repeating pattern.

Java Class Files (Compared): Mildly Consistent

Binary code (be it bytecode or x86) tends to be very structured. Still, we are dependent on both the content and the compiler to generate distinct patterns.

x86: Consistent (In Sections)

x86 tends not to be handwritten; as such complex instructions are emitted in a highly structured form.

Exception?

• 64 kilobyte graphical demonstration

• Run through a packer

• Compression removes patterns

NES Games

6502 Assembly Tends To Show Consistent Patterns, But…

Mario Games Look Rather Different.

1) Output is highly dependent on the compiler

2) Output is highly dependent upon the actual content

File formats are merely shells for actual content. You are analyzing the content; the format is just syntactic sugar.



• 2) Do different instances of the same file format appear similar?
– Answer: Somewhat. Similar content looks like itself, but you’re measuring the fundamental entropy of the underlying content, not the format of the content itself.


File Formats Contain Multiple Subformats: Another Look At Kernel32.DLL

These are all different parts of Kernel32.

Quickly Browsing Large Files: Tilt-Shift View

• Instead of measuring absolute Y against absolute X, make X relative
– Advance through the file going down, look back a number of bytes going right
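
One reading of the tilt-shift view, sketched under assumptions: each row is a window of the file, each column compares that window with the one a fixed number of windows earlier. The lookback unit (whole windows rather than bytes) and the positional-match metric are simplifications of mine.

```python
def tilt_shift(data, window=32, max_lookback=8):
    """Rows advance through the file; column k compares a block with the
    block k windows before it (0.0 where no earlier block exists)."""
    blocks = [data[i:i + window] for i in range(0, len(data) - window + 1, window)]
    plot = []
    for y, block in enumerate(blocks):
        row = []
        for back in range(1, max_lookback + 1):
            if y - back < 0:
                row.append(0.0)
            else:
                prev = blocks[y - back]
                row.append(sum(p == q for p, q in zip(block, prev)) / window)
        plot.append(row)
    return plot

sample = (b"A" * 4 + b"B" * 4) * 3
for row in tilt_shift(sample, window=4, max_lookback=2):
    print(row)
```

The output is as tall as the file but only `max_lookback` wide, which is what makes a large file scrollable where the full square dotplot is not.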

Complain All You Want. Hex Still Sucks.



• 2) Do different instances of the same file format appear similar?
– Answer: Somewhat. Similar content looks like itself, but you’re measuring the fundamental entropy of the underlying content, not the format of the content itself.

• 3) Does one format embedded in another make itself apparent?
– Answer: Yes. Multiple, distinct sections are clearly visible in a way that hex cannot show.

Fuzzer Guidance

• 1) Can we locate the actual byte offsets where one section ends and another begins?
– Why would we want to?
  • Fuzzers break parsers.
  • Many subformats to a format, many subparsers to a parser
  • To a rough level of approximation, fuzzing a single subformat lets you stress a single subparser
  • So once we split a file up, we can selectively attack one subparser at a time.

• 2) Can we visualize and compare fuzzer operations via Dotplots?

Simple Math

We select an interesting blob from kernel32.dll. The blob is at pixel offset 507x507, and is a square around 570 pixels wide.

Window size on viz was 32.

507*32 = 16224: the interesting section starts 16224 bytes into the file.

570*32 = 18240: the interesting section is 18240 bytes long.
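
The same arithmetic as a tiny helper (the function name is hypothetical), assuming one pixel per window and non-overlapping windows:

```python
def pixel_to_bytes(pixel_offset, pixel_width, window):
    """Convert a square region of the dotplot into (start, length) in bytes."""
    return pixel_offset * window, pixel_width * window

start, length = pixel_to_bytes(507, 570, window=32)
print(start, length)
```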

What’s The Actual Data?

dd if=kernel32.dll bs=1 skip=16100 | hexdump - | more

Using Hardcorr as a “first knife” to locate interesting-to-fuzz regions

Fuzzer Guidance

• 1) Can we locate the actual byte offsets where one section ends and another begins?
– Answer: Yes. We can quickly route from the image to the byte offset, through basic arithmetic.


Differentials

• Major use of dotplots in bioinformatics is to compare one genome against another
– Autocorrelation: Compare A to A
– Cross-Correlation: Compare A to B

• Most files are sufficiently dissimilar that not very interesting structure shows up
– Notable exception: Different versions of the same binary

Visual Bindiff!

MSVCR70.DLL v. MSVCR71.DLL

Fuzzers: Very Broken Patchers

Mangle.C – Single Bit Differences

CFG9000 – Large Scale Reordering

Fuzzer Guidance

• 1) Can we locate the actual byte offsets where one section ends and another begins?
– Answer: Yes. We can quickly route from the image to the byte offset, through basic arithmetic.

• 2) Can we visualize and compare fuzzer operations via Dotplots?
– Answer: Yes. Visual diffing effectively shows differences between files, including differences introduced by various flavors of fuzzers.
