Chapter 2 data compression for data security · 2018-06-04

Diffusion and Data compression for

data security

A.J. Han Vinck, University of Duisburg/Essen

April [email protected]

content

• Why is diffusion important?

• Why is data compression important?

• Unicity distance
– Time to discover a secret

• Source coding principle
• How data compression works
• Zipf's law

Han Vinck 2013 2


Diffusion-transposition

HOW:

rearrange the symbols in the data without changing the symbols

i.e. the frequency of symbols remains the same

GOAL:

destroy the relations between symbols

and make it more difficult to analyze!

ANALYSIS:

index of coincidence, finding periods
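The index of coincidence mentioned above depends only on the symbol frequencies, so a pure transposition leaves it unchanged. A minimal sketch, assuming the usual definition (the probability that two letters drawn without replacement are equal); the sample text and the random shuffle standing in for a transposition are illustrative assumptions:

```python
from collections import Counter
import random

def index_of_coincidence(text):
    """Probability that two letters drawn without replacement from text are equal."""
    letters = [ch for ch in text.lower() if ch.isalpha()]
    n = len(letters)
    counts = Counter(letters)
    return sum(c * (c - 1) for c in counts.values()) / (n * (n - 1))

plaintext = "diffusion rearranges the symbols without changing them"

# An arbitrary transposition: same symbols, different order (fixed seed).
shuffled = list(plaintext)
random.Random(1).shuffle(shuffled)
transposed = "".join(shuffled)

print(index_of_coincidence(plaintext))
print(index_of_coincidence(transposed))  # same value: frequencies are unchanged
```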


example of diffusion

A scytale is a tool used to perform a transposition cipher

• http://www.youtube.com/watch?v=VeH0KnZtljY&feature=related
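A scytale can be modelled as writing the message row by row and reading it column by column. The sketch below is one such model, not a historical reconstruction; the function names, the `width` parameter (letters per turn) and the `_` padding symbol are assumptions made for illustration:

```python
def scytale_encrypt(message, width):
    """Write the message in rows of `width` letters, read it out column by column."""
    # Pad so the text fills a full rectangle (the padding symbol is an assumption).
    padded = message + "_" * (-len(message) % width)
    return "".join(padded[col::width] for col in range(width))

def scytale_decrypt(cipher, width):
    """Invert scytale_encrypt for a ciphertext that fills a full rectangle."""
    rows = len(cipher) // width
    return "".join(cipher[row::rows] for row in range(rows))

cipher = scytale_encrypt("HELPMEIAMUNDERATTACK", 5)
print(cipher)
print(scytale_decrypt(cipher, 5))
```

The ciphertext contains exactly the same symbols as the plaintext, only rearranged.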


Confusion and diffusion in AES

substitution → transposition → substitution

General round structure:

• Substitute bytes
• Shift rows
• Mix columns
• Add round key

Same equipment can be used to decipher

http://www.youtube.com/watch?v=mlzxpkdXP58


Data compression


The goal of data compression is to create

- a compact representation of the data to be encrypted

- independent symbols

Decompression gives the original data back!
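The lossless round trip can be illustrated with Python's standard `zlib` module. This is a general-purpose compressor swapped in for illustration, not the enumerative scheme discussed later in these slides; the sample text is an assumption:

```python
import zlib

# A redundant sample text (illustrative assumption).
text = ("data compression creates a compact representation. " * 20).encode("ascii")

packed = zlib.compress(text)        # compact representation
restored = zlib.decompress(packed)  # decompression gives the original data back

print(len(text), "bytes ->", len(packed), "bytes")
print(restored == text)
```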


Source coding in Message encryption (1)

Without source coding: the message is split into Part 1, Part 2, •••, Part n (for example every part 56 bits). Every part is enciphered with the same key, which gives n cryptograms. Deciphering with the key gives back Part 1, Part 2, •••, Part n.

Attacker:

- n cryptograms to analyze for a particular message of n parts

- dependency exists between parts of the message

- dependency exists between cryptograms


Source coding in Message encryption (2)

With source coding: Part 1, Part 2, •••, Part n (for example every part 56 bits) are source encoded n-to-1 and then enciphered with the key into 1 cryptogram. The receiver deciphers with the key and source decodes to get back Part 1, Part 2, •••, Part n.

Attacker:

- 1 cryptogram to analyze for a particular message of n parts

- assume a data compression factor n-to-1

Hence, less material for the same message!


The position of crypto in a Communication model

source → Analogue to digital conversion → compression/reduction → security → error protection → from bit to signal

digital


Source coding

Two principles:

data reduction: remove irrelevant data (lossy, gives errors)

data compression: present data in compact (short) way (lossless)

Transmitter side: original data → remove irrelevance → relevant data → compact description

Receiver side: compact description → „unpack“ → „original data“


Illustration lossless/lossy

[Figure labels: original; ≈ original]

What do we want (need)?


All data symbols to be enciphered must

occur with equal probability

and

be independent of each other


Example:

• suppose we have a dictionary with 30,000 words

• these can be numbered (encoded) with 15 bits

• if the average word length is 5 letters, we need „on the average“ 3 bits per letter
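The arithmetic of this example can be checked in a few lines. The variable names are illustrative; the numbers are the ones from the slide:

```python
from math import ceil, log2

dictionary_size = 30_000
average_word_length = 5  # letters per word

# Smallest number of bits that can number every word: 2**15 = 32768 >= 30000.
bits_per_word = ceil(log2(dictionary_size))
bits_per_letter = bits_per_word / average_word_length

print(bits_per_word, bits_per_letter)
```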

01000100


This can happen


Letter frequency of the Vigenere cipher


How to compress? (binary 1)


source x = (x1, x2, •••, xN ), xi Є {0,1}

- #0‘s = f0 N, #1‘s = f1 N; F = (f0, f1) the composition of x

- Then, the number of different vectors x for a given F is

$$|x|_F = \binom{N}{f_0 N} = \frac{N!}{(f_0 N)!\,(f_1 N)!}$$

and the number of bits/symbol needed to represent x is

$$\frac{1}{N}\log_2 |x|_F \approx -\sum_{i=0}^{1} f_i \log_2 (f_i N) + \log_2 N = -\sum_{i=0}^{1} f_i \log_2 f_i \quad \text{(entropy!)}$$
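The counting formula can be checked by brute force for a small block length. This sketch enumerates all binary vectors of length N = 10 (an arbitrary choice) and compares the number of vectors per composition with N!/((f0 N)!(f1 N)!):

```python
from itertools import product
from math import comb, factorial

N = 10  # small enough to enumerate all 2**N vectors

# counts[k] = number of vectors x with exactly k ones
counts = [0] * (N + 1)
for x in product((0, 1), repeat=N):
    counts[sum(x)] += 1

for ones in range(N + 1):
    zeros = N - ones
    formula = factorial(N) // (factorial(zeros) * factorial(ones))
    assert counts[ones] == formula == comb(N, ones)

print(counts)
```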

en- and decoding


source → x (N letters) → encoder → F (composition) and lexicographical index for x → decoder → x

For large N, $f_i \to p_i$ and thus $-\sum_{i=0}^{1} f_i \log_2 f_i$ is equal to the Shannon entropy!

To transmit the value of F, we need $\frac{1}{N}\log_2 (N+1)$ bits/output letter, which goes to 0 for large N.

Lexicographical en- and decoding is a solved problem in computer science
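One common way to compute a lexicographical index is enumerative coding with binomial coefficients. The sketch below is one possible convention, not necessarily the one intended in the slides: it orders sequences with 0 before 1 and starts counting at index 0, and the function names are hypothetical:

```python
from math import comb

def lex_index(x):
    """Rank of binary sequence x among all sequences of the same length
    and number of ones, in lexicographic order (0 < 1), starting at 0."""
    n = len(x)
    ones_left = sum(x)
    index = 0
    for pos, bit in enumerate(x):
        remaining = n - pos - 1
        if bit == 1:
            # Every sequence with a 0 here (and all ones still to place) ranks lower.
            index += comb(remaining, ones_left)
            ones_left -= 1
    return index

def lex_sequence(index, n, ones):
    """Inverse of lex_index: rebuild the sequence from its index and composition."""
    x = []
    for pos in range(n):
        remaining = n - pos - 1
        with_zero_here = comb(remaining, ones)
        if index < with_zero_here:
            x.append(0)
        else:
            x.append(1)
            index -= with_zero_here
            ones -= 1
    return x

print(lex_index([1, 1, 0, 0]))   # last of the comb(4, 2) = 6 sequences
print(lex_sequence(0, 4, 2))     # first sequence
```

The decoder needs the composition (here `n` and `ones`) in addition to the index, which matches the two encoder outputs on the slide.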

exercise

• For sequences of length 12 with 4 ones and 8 zeros,

– give the lexicographical index for the sequence 1 0 0 1 0 0 1 0 0 1 0 0

– what is the sequence that belongs to the index 512?


Binary entropy

interpretation:

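let a binary sequence of length n contain pn ones, then we can specify each sequence with

$$\log_2 2^{n h(p)} = n\,h(p) \text{ bits}$$

$$\binom{n}{pn} \approx 2^{n h(p)}$$

$$\lim_{n \to \infty} \frac{1}{n} \log_2 \binom{n}{pn} = h(p)$$

Homework: Prove the approximation using ln N! ~ N ln N for N large.

Use also $\log_a x = y \Rightarrow \log_b x = y \log_b a$

The Stirling approximation: $N! \approx \sqrt{2\pi N}\, N^N e^{-N}$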
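The limit can be observed numerically before proving it. A minimal sketch, with p = 0.25 and the block lengths chosen arbitrarily; the helper names are hypothetical:

```python
from math import comb, log2

def h(p):
    """Binary entropy in bits."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * log2(p) - (1 - p) * log2(1 - p)

def bits_per_symbol(n, p):
    """(1/n) * log2 of the number of length-n sequences with p*n ones."""
    return log2(comb(n, int(p * n))) / n

p = 0.25
for n in (8, 80, 800, 8000):
    # The exact value approaches h(p) from below as n grows.
    print(n, round(bits_per_symbol(n, p), 4), round(h(p), 4))
```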


[Figure: plot of the binary entropy h (vertical axis, 0 to 1) against p (horizontal axis, 0 to 1)]

The Binary Entropy: $h(p) = -p \log_2 p - (1-p) \log_2 (1-p)$

Note:

h(p) = h(1-p)


references

• Information theory books
• MPEG, JPEG, …


Application to text: symbols are words


Zipf‘s law

The distribution of words follows the law of Zipf (1935):

Let fn denote the frequency of the n-th most frequent word,

then fn = A/n.

English: A = 0.1

For M = 12366:

$$-\sum_{i=1}^{M} f_i \log_2 f_i = 9.72;$$

the average word length is 4.5 letters;

the number of bits/letter is r = 9.72/4.5 = 2.16
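The figures on this slide can be reproduced numerically. The sketch below takes A = 0.1, M = 12366 and the average word length 4.5 from the slide; the variable names are illustrative. With these values the frequencies sum to very nearly 1, which suggests why this M is used:

```python
from math import log2

A = 0.1                    # Zipf constant for English, from the slide
M = 12366                  # number of words, from the slide
average_word_length = 4.5  # letters per word, from the slide

f = [A / n for n in range(1, M + 1)]

total = sum(f)                                  # close to 1 for this M
entropy = -sum(fi * log2(fi) for fi in f)       # bits per word
bits_per_letter = entropy / average_word_length

print(round(total, 4), round(entropy, 2), round(bits_per_letter, 2))
```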


A web site with many references and applications

http://linkage.rockefeller.edu/wli/zipf/index_ru.html

Web Sites rank ordered by their popularity


Unicity distance (3)

• Idea:

- for a stream cipher after some time L,

the plaintext and keystream can be determined uniquely from the cipher stream

The smallest value where this is possible is called UNICITY DISTANCE U

A necessary condition:

$$|M^L| \times |K| \le |C^L|,$$

where | * | means cardinality (or # of)

(otherwise, when $|M^L| \times |K| > |C^L|$, some plaintexts give the same cipher)

Unicity distance (4)


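from

$$\log|M^L| + \log|K| \le \log|C^L| \le L \log|C^1|$$

we have:

$$\log|K| \le L \left[ \log|C^1| - \frac{\log|M^L|}{L} \right]$$

and, with $R = \frac{1}{L}\log|M^L|$ the rate of the source,

$$U \ge \frac{\log|K|}{\log|M^1| - R}, \quad \text{where } |M^1| = |C^1|$$

IMPORTANT to NOTE: $\log|M^1| - R$ is the maximum redundancy of the source sequence! For low redundancy, U goes to infinity.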


A probabilistic approach (Hellman)

[Figure: messages 1, 2, •••, |M^L| on the left, cryptograms 1, 2, •••, |C^L| on the right; |K| arrows leave every message]

Equally probable messages,

equally probable keys

$$|M^L| \times |K| = \sum_{i=1}^{|C^L|} z(c_i) \quad \text{and} \quad P(c_i) = \frac{z(c_i)}{|M^L| \times |K|},$$

where $z(c_i)$ is the number of arrows entering $c_i$

we used: the # of outgoing arrows = # of incoming

$$\bar{z} = \sum_{i=1}^{|C^L|} z(c_i) P(c_i) = \sum_{i=1}^{|C^L|} \frac{z^2(c_i)}{|M^L| \times |K|} \ge \frac{|M^L| \times |K|}{|C^L|}$$

$\bar{z} = 1$ gives the same result as before (one unique pair M, C)

we used: $\sum_{i=1}^{n} z(c_i) = a \Rightarrow \sum_{i=1}^{n} z^2(c_i) \ge \frac{a^2}{n}$

proof: consider $\sum_{i=1}^{n} \left[ z(c_i) - \frac{a}{n} \right]^2 \ge 0$
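The bound on the average number of incoming arrows can be illustrated with a small simulation. This is a sketch under simplifying assumptions: the set sizes are hypothetical, and every arrow lands on a uniformly random cryptogram, which ignores the fact that a real cipher maps distinct messages to distinct cryptograms under a fixed key. The inequality holds for any arrow counts, so the check passes regardless:

```python
import random

rng = random.Random(0)  # fixed seed for repeatability

# Hypothetical sizes for |M^L|, |K| and |C^L|.
messages, keys, cryptograms = 50, 8, 64

# z[i] = number of arrows entering cryptogram c_i
z = [0] * cryptograms
for _ in range(messages * keys):
    z[rng.randrange(cryptograms)] += 1

a = messages * keys                      # total number of arrows
z_bar = sum(zi * zi for zi in z) / a     # sum of z(c_i) * P(c_i)
bound = a / cryptograms                  # |M^L| x |K| / |C^L|

print(z_bar, ">=", bound)
```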

Examples: Unicity distance (5)


Assume that the German language has a rate R of 2 bits per letter

- Then, for

a substitution cipher with 26! keys or

a permutation cipher with period 26 ( 26! keys )

we have:

$$U \ge \frac{\log_2|K|}{\log_2|M^1| - R} = \frac{\log_2 26!}{\log_2 26 - 2} \approx 32$$

- For a Vigenere cipher of length 80 we have:

$$U \ge \frac{\log_2|K|}{\log_2|M^1| - R} = \frac{\log_2 26^{80}}{\log_2 26 - 2} \approx 140$$

- Try to find U for the DES
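The two examples can be evaluated directly from U = log2|K| / (log2|M| - R). In this sketch the helper name and its parameters are hypothetical; the rate R = 2 bits per letter and the key counts are taken from the slide. The results land close to the rounded values quoted above:

```python
from math import factorial, log2

def unicity_distance(log2_keys, alphabet_size, rate):
    """log2|K| divided by the redundancy log2|M| - R, in letters."""
    return log2_keys / (log2(alphabet_size) - rate)

R = 2.0  # assumed rate of the language in bits per letter (from the slide)

# Substitution cipher, or permutation cipher with period 26: 26! keys.
u_substitution = unicity_distance(log2(factorial(26)), 26, R)

# Vigenere cipher of length 80: 26**80 keys.
u_vigenere = unicity_distance(80 * log2(26), 26, R)

print(round(u_substitution, 1), round(u_vigenere, 1))
```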

Conclusion: Unicity distance (6)


It is important to make the value of R as high as possible for a large U

Hence:

source compression before encryption is important for secure communications

Note added: Given the message to the analyst, the value of R = 0.

Hence, given the ciphertext and plaintext,

$$U \ge \frac{\log|K|}{\log|M^1|}$$


THE MARCONI FELLOWS

1999 - Professor James L. Massey

Marconi Award citation

"For theoretical and practical contributions to cryptography and related coding problems; teacher and mentor to a generation of scientists and technologists"

Professor Massey made significant advances in forward-error-correcting codes, multi-user communications, and cryptographic systems. In addition, Professor Massey is known for his contributions to the field of engineering education. He is currently an Adjunct Professor at the University of Lund, Sweden.

Professor James L. Massey

A GREAT SCIENTIST and TEACHER!

MOTTO: SIMPLE but SOLID

Data compression (M-ary 1)


source x = (x1, x2, •••, xN ), xi Є {1,2, •••,M}

- Suppose that a source generates N independent M-ary symbols

- The frequency of a symbol i is fi and thus fi N symbols i occur in x

- We call F = (f1, f2, •••, fM ) the composition of x

- Then, the number of different vectors x for a given F is

$$|x|_F = \binom{N}{f_1 N} \binom{N - f_1 N}{f_2 N} \cdots \binom{N - f_1 N - \cdots - f_{M-1} N}{f_M N} = \frac{N!}{(f_1 N)!\,(f_2 N)! \cdots (f_M N)!}$$

and the number of bits/symbol needed to represent x is

$$\frac{1}{N}\log_2 |x|_F \approx -\sum_{i=1}^{M} f_i \log_2 (f_i N) + \log_2 N = -\sum_{i=1}^{M} f_i \log_2 f_i \quad \text{(entropy!)}$$
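For a concrete M-ary composition, the exact count and the entropy approximation can be compared. A minimal sketch, with hypothetical helper names and an arbitrarily chosen composition (N = 1000, M = 4):

```python
from math import factorial, log2

def count_sequences(symbol_counts):
    """N! / ((f1 N)! (f2 N)! ... (fM N)!) for the given symbol counts fi*N."""
    total = factorial(sum(symbol_counts))
    for c in symbol_counts:
        total //= factorial(c)  # each division is exact
    return total

def entropy(symbol_counts):
    """-sum fi log2 fi in bits per symbol, with fi = count / N."""
    n = sum(symbol_counts)
    return -sum(c / n * log2(c / n) for c in symbol_counts if c)

counts = [500, 250, 125, 125]  # illustrative composition: f = (1/2, 1/4, 1/8, 1/8)
exact = log2(count_sequences(counts)) / sum(counts)

print(round(exact, 4), entropy(counts))
```

The exact value stays slightly below the entropy, and the gap shrinks as N grows.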

en- and decoding


source → x (N letters) → encoder → F (composition) and lexicographical index for x → decoder → x

For large N, $f_i \to p_i$ and thus $-\sum_{i=1}^{M} f_i \log_2 f_i$ is equal to the Shannon entropy!

To transmit the value of F, we need $\frac{M-1}{N}\log_2 (N+1)$ bits/output letter, which goes to 0 for large N.

Lexicographical en- and decoding is a solved problem in computer science
