Practical Attacks Against Encrypted VoIP Communications

Prac%cal A)acks Against Encrypted VoIP Communica%ons

HITBSECCONF2013: Malaysia Shaun Colley & Dominic Chell

@domchell @mdseclabs

Agenda

•  This is a talk about traffic analysis and paHern matching

•  VoIP background •  NLP techniques •  StaNsNcal modeling •  Case studies aka “the cool stuff”

•  VoIP is a popular replacement for tradiNonal copper-‐wire telephone systems

•  Bandwidth efficient and low cost •  Privacy has become an increasing concern •  Generally accepted that encrypNon should be used for end-‐to-‐end security

•  But even if it’s encrypted, is it secure?

Introduc%on

Why?

•  Widespread accusaNons of wiretapping •  Leaked documents allegedly claim NSA & GCHQ have some “capability” against encrypted VoIP

•  “The fact that GCHQ or a 2nd Party partner has a capability against a specific the encrypted used in a class or type of network communica@ons technology. For example, VPNs, IPSec, TLS/SSL, HTTPS, SSH, encrypted chat, encrypted VoIP”.

•  LiHle work has been done by the security community

•  Some interesNng academic research –  Uncovering Spoken Phrases in Encrypted Voice over IP CommunicaNons: Wright, Ballard, Coull, Monrose, Masson

–  Uncovering Spoken Phrases in Encrypted VoIP ConversaNons: Doychev, Feld, Eckhardt, Neumann

•  Not widely publicised •  No proof of concepts

Previous Work

Background: VoIP

•  Similar to tradiNonal digital telephony, VoIP involves signalling, session iniNalisaNon and setup as well as encoding of the voice signal

•  Separated in to two channels that perform these acNons: – Control channel – Data channel

VoIP Communica%ons

•  Operates at the applicaNon-‐layer •  Handles call setup, terminaNon and other essenNal aspects of the call

•  Uses a signalling protocol such as: – Session IniNaNon Protocol (SIP) – Extensible Messaging and Presence Protocol (XMPP)

– H.323 – Skype

Control Channel

Control Channel

•  Handles sensiNve call data such as source and desNnaNon endpoints, and can be used for modifying exisNng calls

•  Typically protected with encrypNon, for example SIPS which adds TLS

•  Ocen used to establish the the direct data connecNon for the voice traffic in the data channel

•  The primary focus of our research •  Used to transmit encoded and compressed voice data

•  Typically over UDP •  Voice data is transported using a transport protocol such as RTP

Data Channels

•  Commonplace for VoIP implementaNons to encrypt the data flow for confidenNality

•  A common implementaNon is Secure Real-‐Time Transport Protocol (SRTP)

•  By default will preserve the original RTP payload size

•  “None of the pre-‐defined encryp@on transforms uses any padding; for these, the RTP and SRTP payload sizes match exactly.”

Data Channels

Background: Codecs

•  Used to convert the analogue voice signal in to a digitally encoded and compressed representaNon

•  Codecs strike a balance between bandwidth limitaNons and voice quality

•  We’re mostly interested in Variable Bit Rate (VBR) codecs

Codecs

•  The codec can dynamically modify the bitrate of the transmiHed stream

•  Codecs like Speex will encode sounds at different bitrates

•  For example, fricaNves may be encoded at lower bitrates than vowels

Variable Bitrate Codecs

•  The primary benefit from VBR is a significantly beHer quality-‐to-‐bandwidth raNo compared to CBR

•  Desirable in low bandwidth environments – Cellular – Slow WiFi

Variable Bitrate Codecs

Background: NLP and Sta%s%cal Analysis

•  Research techniques borrowed from NLP and bioinformaNcs

•  Primarily the use of: – Profile Hidden Markov Models – Dynamic Time Warping

Natural Language Processing

•  StaNsNcal model that assigns probabiliNes to sequences of symbols

•  TransiNons from Begin state (B) to End state (E)

•  Moves from state to state randomly but in line with transiNon distribuNons

•  TransiNons occur independently of any previous choices

Hidden Markov Models

•  The model will conNnue to move between states and output symbols unNl the End state is reached

•  The emiHed symbols consNtute the sequence


Image from hHp://isabel-‐drost.de/hadoop/slides/HMM.pdf

•  A number of possible state paths from B to E •  Best path is the most likely path •  The Viterbi algorithm can be used to discover the most probable path

•  Viterbi, Forward and Backward algorithms can all be used to determine probability that a model produced an output sequence


•  The model can be “trained” by a collecNon of output sequences

•  The Baum-‐Welch algorithm can be used to determine probability of a sequence based on previous sequences

•  In the context of our research, packet lengths can be used as the sequences


•  A variaNon of HMM •  Introduces Insert and Deletes •  Allows the model to idenNfy sequences with Inserts or Deletes

•  ParNcularly relevant to analysis of audio codecs where idenNcal uHerances of the same phrase by the same speaker are unlikely to have idenNcal paHerns

Profile Hidden Markov Models

•  Consider a model trained to recognise: A B C D

•  The model can sNll recognise paHerns with inser&on:

A B X C D

•  Or paHerns with dele&on: A B C

Profile Hidden Markov Models

•  Largely replaced by HMMs •  Measures similarity in sequences that vary in Nme or speed

•  Commonly used in speech recogniNon •  Useful in our research because of the temporal element

•  A packet capture is essenNally a Nme series

Dynamic Time Warping

•  Computes a ‘distance’ between two Nme series – DTW distance

•  Different to Euclidean distance

•  The DTW distance can be used as a metric for ‘closeness’ between the two Nme series

Dynamic Time Warping

Dynamic Time Warping -‐ Example •  Consider the following sequences:

–  0 0 0 4 7 14 26 23 8 3 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 –  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 5 6 13 25 24 9 4 2 0 0 0 0 0

•  IniNal analysis suggests they are very different, if comparing from the entry points.

•  However there are some similar characterisNcs: –  Similar shape –  Peaks at around 25 –  Could represent the same sequence, but at different Nme

offsets?

0

5

10

15

20

25

30

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30

Series1

Series3

Side Channel A)acks

•  Usually connecNons are peer-‐to-‐peer

•  We assume that encrypted VoIP traffic can be captured: –  Man-‐in-‐the-‐middle –  Passive monitoring

•  Not beyond the realms of possibility: –  “GCHQ taps fibre-‐opNc cables” hHp://www.theguardian.com/uk/2013/jun/21/gchq-‐cables-‐secret-‐world-‐communicaNons-‐nsa

–  “China hijacked Internet traffic”hHp://www.zdnet.com/china-‐hijacked-‐uk-‐internet-‐traffic-‐says-‐mcafee-‐3040090910/

Side Channel A)acks

•  But what can we get from just a packet capture?

Side Channel A)acks

•  Source and DesNnaNon endpoints – Educated guess at language being spoken

•  Packet lengths

•  Timestamps

Side Channel A)acks

•  So what?......

•  We now know VBR codecs encode different sounds at variable bit rates

•  We now know some VoIP implementaNons use a length preserving cipher to encrypt voice data

Side Channel A)acks

Variable Bit Rate Codec + Length Preserving Cipher =

Side Channel A)acks

Case Study

•  ConnecNons are peer-‐to-‐peer •  Uses the Opus codec (RFC 6716):

“Opus is more efficient when opera@ng with variable bitrate (VBR) which is the default”

•  Skype uses AES encrypNon in integer counter mode

•  The resulNng packets are not padded up to size boundaries

Skype Case Study

Skype Case Study

•  Although similar phrases will produce a similar paHern, they won’t be idenNcal: – Background noise – Accents – Speed at which they’re spoken

•  Simple substring matching won’t work!

Skype Case Study

•  The two approaches we chose make use of the NLP techniques: – Profile Hidden Markov Models – Dynamic Time Warping

Skype Case Study

•  Both approaches are similar and can be broken down in the following steps: –  Train the model for the target phrase –  Capture the Skype traffic –  “Ask” the model if it’s likely to contain the target phrase

Skype Case Study

•  To “train” the model, a lot of test data is required

•  We used the TIMIT Corpus data

•  Recordings of 630 speakers of eight major dialects of American English

•  Each speaker reads a number of “phoneNcally rich” sentences

Skype Case Study -‐ Training

“Why do we need bigger and beHer bombs?”

Skype Case Study -‐ TIMIT

“He ripped down the cellophane carefully, and laid three dogs on the Nn foil.”


“That worm a murderer?”


•  To collect the data we played each of the phrases over a Skype session and logged the packets using tcpdump

for((a=0;a<400;a++)); do /Applications/VLC.app/Contents/MacOS/VLC --no-repeat -I rc --play-and-exit $a.rif ; echo "$a " ; sleep 5 ; done !


•  PCAP file containing ~400 occurrences of the same spoken phrase

•  “Silence” must be parsed out and removed

•  Fairly easy -‐ generally, silence observed to be less than 80 bytes

•  Unknown spikes to ~100 during silence phases


Skype Case Study -‐ Silence

Short excerpt of Skype traffic of the same recording captured 3 Nmes, each separated by 5 seconds of silence:

Approach to idenNfy and remove the silence:

–  Find sequences of packets below the silence threshold, ~80 bytes

–  Ignore spikes when we’re in a silence phase (i.e. 20 conNnuous packets below the silence threshold)

–  Delete the silence phase –  Insert a marker to separate the speech phases – integer 222, in our case

–  This leaves us with just the speech phases…..



•  Biojava provides a useful open source framework –  Classes for Profile HMM modeling –  BaumWelch for training –  A dynamic matrix programming class (DP) for calling into Viterbi for sequence analysis on the PHMM

•  We chose this library to implement our aHack

Skype Case Study – PHMM A)ack

•  Train the ProfileHMM object using the Baum Welch

•  Query Viterbi to calculate a log-‐odds

•  Compare the log-‐odds score to a threshold

•  If above threshold we have a possible match

•  If not, the packet sequence was probably not the target phrase

Skype Case Study – PHMM A)ack

•  Same training data as PHMM •  Remove silence phases •  Take a prototypical sequence and calculate DTW distance of all training data from it

•  Determine a typical distance threshold •  Calculate DTW distance for test sequence and compare to threshold

•  If the distance is within the threshold then likely match

Skype Case Study – DTW A)ack

PHMM Demonstra%on

Skype Case Study – Pre Tes%ng

Skype Case Study – Post Tes%ng Cypher: “I don’t even see the code. All I see is blonde, bruneHe, red-‐head”

•  Recall rate of approximately 80%

•  False posiNve rate of approximately 20%

•  PhoneNcally richer phrases will yield lower false posiNves

•  TIMIT corpus: “Young children should avoid exposure to contagious diseases”

PHMM Sta%s%cs

DTW Results

•  Similarly to PHMM results, ~80% recall rate

•  False posiNve rate of 20% and under – again, as long as your training data is good.

Silent Circle -‐ Results

•  Not vulnerable – all data payload lengths are 176 bytes in length!

Wrapping up

•  Some guidance in RFC656216

•  Padding the RTP payload can provide a reducNon in informaNon leakage

•  Constant bitrate codecs should be negoNated during session iniNaNon

Preven%on

•  Assess other implementaNons –  Google Talk – Microsoc Lync –  Avaya VoIP phones –  Cisco VoIP phones –  Apple FaceTime

•  According to Wikipedia, uses RTP and SRTP…Vulnerable?

•  Improvements to the algorithms -‐ Apply the Kalman filter?

Further work

•  Variable bitrate codecs are unsafe for sensiNve VoIP transmission

•  It is possible to deduce spoken conversaNons in encrypted VoIP

•  VBR with length preserving encrypted transports like SRTP should be avoided

•  Constant bitrate codecs should be used where possible

Conclusions

@domchell @MDSecLabs

Technology

Practical Attacks Against Encrypted VoIP Communications