Attacking Turkish Texts Encrypted by Homophonic Cipher Sefik Ilkin Serengil Galatasaray University joint work with Murat Akin EHAC Conference, Cambridge, UK February 20, 2011

Page 1: Attacking Turkish Texts Encrypted by Homophonic Cipher

Attacking Turkish Texts Encrypted by Homophonic Cipher

Sefik Ilkin Serengil

Galatasaray University

joint work with Murat Akin

EHAC Conference, Cambridge, UK

February 20, 2011

Sefik Ilkin Serengil EHAC, UK, February 2011 p. 2 / 22

Outline

1. History

2. Homophonic Cipher

3. Key Space

4. Attacking

5. Conclusion

History of Homophonic Cipher

• Earliest known example dates from 1401

• Used by Francesco I Gonzaga, Duke of Mantua (a city in northern Italy)

• In his personal correspondence with Simone de Crema

• Reference: W. T. Penzhorn, 1994

Homophonic Cipher

• "Homophonic" means "same sound"

• An extended version of the substitution cipher

• Each letter is replaced by one of several symbols; the number of symbols assigned to a letter is proportional to its frequency

• One-to-many mapping

• The same plaintext can produce different ciphertexts

• Resistant to simple frequency analysis attacks

Homophonic Function

• Suppose we are working with a language A

• Language A consists of two letters (a, b)

• Letter probabilities in the language: $p_a = 3/4$, $p_b = 1/4$

• Encryption: a → {00, 01, 10}, b → {11}
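
The two-letter example can be sketched in Python as follows. This is a minimal illustration, not code from the talk: the names `HOMOPHONES`, `INVERSE`, `encrypt` and `decrypt` are hypothetical, and picking a homophone uniformly at random is an assumption.

```python
import random

# Homophone table from the example:
# 'a' (p = 3/4) gets three codes, 'b' (p = 1/4) gets one.
HOMOPHONES = {"a": ["00", "01", "10"], "b": ["11"]}

# Inverse table for decryption: every code maps back to exactly one letter.
INVERSE = {code: letter for letter, codes in HOMOPHONES.items() for code in codes}


def encrypt(plaintext, rng=random):
    """Replace each letter with one of its homophones, chosen at random (assumed rule)."""
    return [rng.choice(HOMOPHONES[ch]) for ch in plaintext]


def decrypt(codes):
    """Map each code back to its unique plaintext letter."""
    return "".join(INVERSE[code] for code in codes)


if __name__ == "__main__":
    cipher = encrypt("aaba")
    print(" ".join(cipher))   # varies from run to run
    print(decrypt(cipher))    # always recovers the plaintext
```

With this table each of the four codes is expected to occur with probability 1/4 (3/4 split over three codes for a, 1/4 on one code for b), which is the flattening effect the cipher aims for.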

Homophonic Enciphered Texts

• Letter frequencies cannot be distinguished in the ciphertext

• The frequency distribution is manipulated and smoothed

• Ciphertext symbols have roughly equal frequencies

• Each symbol accounts for about 1% of the ciphertext

• Changing block sizes

• Combines monoalphabetic and polyalphabetic features

Frequency of A Sample Encrypted Text

• X-axis: symbol in ciphertext, Y-axis: frequency

Turkish Unigrams in Homophonic Cipher

• Replacing each letter with different symbols proportional to its frequency

Letter  Freq (%)  Symbols    Letter  Freq (%)  Symbols    Letter  Freq (%)  Symbols
A       11.92     12         I       5.114     5          R       6.722     7
B       2.844     3          İ       8.600     9          S       3.014     3
C       0.963     1          J       0.034     1          Ş       1.78      2
Ç       1.156     1          K       4.683     5          T       3.314     3
D       4.706     5          L       5.922     6          U       3.235     3
E       8.912     9          M       3.752     4          Ü       1.854     2
F       0.461     1          N       7.484     7          V       0.959     1
G       1.253     1          O       2.476     2          Y       3.336     3
Ğ       1.125     1          Ö       0.777     1          Z       1.500     2
H       1.212     1          P       0.886     1
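
The allocation of symbols to letters can be sketched as below. The talk does not state the exact rounding rule; the rule used here (round the percentage to the nearest integer, with a minimum of one symbol) is an assumption, and the names `FREQ`, `symbol_count` and `SYMBOLS` are hypothetical.

```python
# Turkish letter frequencies in percent, copied from the table above.
FREQ = {
    "A": 11.92, "B": 2.844, "C": 0.963, "Ç": 1.156, "D": 4.706,
    "E": 8.912, "F": 0.461, "G": 1.253, "Ğ": 1.125, "H": 1.212,
    "I": 5.114, "İ": 8.600, "J": 0.034, "K": 4.683, "L": 5.922,
    "M": 3.752, "N": 7.484, "O": 2.476, "Ö": 0.777, "P": 0.886,
    "R": 6.722, "S": 3.014, "Ş": 1.78, "T": 3.314, "U": 3.235,
    "Ü": 1.854, "V": 0.959, "Y": 3.336, "Z": 1.500,
}


def symbol_count(freq_percent):
    """Assumed rule: nearest integer, but never fewer than one symbol.

    Python's round() rounds exact halves to the nearest even integer,
    which matters only for Z (1.500 -> 2) in this table.
    """
    return max(1, round(freq_percent))


SYMBOLS = {letter: symbol_count(freq) for letter, freq in FREQ.items()}

if __name__ == "__main__":
    for letter, count in SYMBOLS.items():
        print(letter, count)
    print("total symbols:", sum(SYMBOLS.values()))
```

Under this assumed rule the computed counts agree with the symbol counts listed in the table, including the single symbol given to very rare letters such as J and F.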

Sample Homophonic Encryption

• Plaintext: FLORYA

• Possible ciphertexts (the same plaintext encrypted three times):

010 084 000 035 075 067

010 026 005 042 052 098

010 043 000 029 021 053
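
The FLORYA example can be reproduced with a partial key built only from the symbols visible in the three ciphertexts. This is a sketch under that assumption: the full key used in the talk is not given, and `PARTIAL_KEY`, `encrypt` and `decrypt` are hypothetical names.

```python
import random

# Partial key: only the homophones that appear in the sample ciphertexts.
# A full key would list more symbols for the higher-frequency letters.
PARTIAL_KEY = {
    "F": ["010"],
    "L": ["084", "026", "043"],
    "O": ["000", "005"],
    "R": ["035", "042", "029"],
    "Y": ["075", "052", "021"],
    "A": ["067", "098", "053"],
}

# Each symbol belongs to exactly one letter, so decryption is a plain lookup.
SYMBOL_TO_LETTER = {sym: ch for ch, syms in PARTIAL_KEY.items() for sym in syms}


def encrypt(plaintext, rng=random):
    """Pick one homophone per letter at random (assumed selection rule)."""
    return " ".join(rng.choice(PARTIAL_KEY[ch]) for ch in plaintext)


def decrypt(ciphertext):
    """Look each symbol up in the inverse table."""
    return "".join(SYMBOL_TO_LETTER[sym] for sym in ciphertext.split())


if __name__ == "__main__":
    for _ in range(3):
        print(encrypt("FLORYA"))  # output varies; F is always 010
    print(decrypt("010 026 005 042 052 098"))
```

F has a single homophone, so every encryption starts with 010, while the other positions change from run to run.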

Key Space of Homophonic Cipher

Key Spaces of Common Algorithms

Algorithm            Type       Key Space    Key Space (decimal)
Substitution Cipher  Classical  $29!$        $10^{30}$
Homophonic Cipher    Classical  ~$80!$       $10^{119}$
DES                  Modern     $2^{56}$     $10^{16}$
3DES                 Modern     $2^{112}$    $10^{33}$
IDEA                 Modern     $2^{128}$    $10^{38}$
AES                  Modern     $2^{256}$    $10^{77}$
Camellia             Modern     $2^{256}$    $10^{77}$
Twofish              Modern     $2^{256}$    $10^{77}$
Serpent              Modern     $2^{256}$    $10^{77}$
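
The decimal magnitudes in the table can be checked with a few lines of Python. This is only a verification sketch: the dictionary `KEY_SPACES` and the helper `order_of_magnitude` are hypothetical names, and 80! is the talk's approximation for the homophonic cipher rather than an exact count.

```python
import math

# Key-space sizes as given in the table.
# Camellia, Twofish and Serpent share the 2^256 entry with AES.
KEY_SPACES = {
    "substitution": math.factorial(29),
    "homophonic": math.factorial(80),
    "DES": 2 ** 56,
    "3DES": 2 ** 112,
    "IDEA": 2 ** 128,
    "AES": 2 ** 256,
}


def order_of_magnitude(size):
    """Return log10 of a key-space size (Python ints can be arbitrarily large)."""
    return math.log10(size)


if __name__ == "__main__":
    for name, size in KEY_SPACES.items():
        print(f"{name}: about 10^{order_of_magnitude(size):.1f}")
```

The exponents in the table are order-of-magnitude figures: the computed values are truncated, except for 80!, whose exponent of about 118.9 is rounded up to 119.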

Brute Force Attack

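• The key space of the homophonic cipher is about $10^{119}$

• Larger than that of most modern cryptosystems

• Suppose an attacker could check one candidate key per microsecond ($10^{-6}$ s)

• Brute force then takes more than $10^{105}$ years in the worst case

$(10^{119} \times 10^{-6}) / (60 \times 60 \times 24 \times 365) > 10^{105}$

• For comparison, the age of the universe is given as $10^{14}$ years (C. K. Koc, 2010); current estimates are about $1.4 \times 10^{10}$ years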
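
The worst-case estimate can be reproduced directly. The sketch below uses the figures from this slide; the constant and function names are hypothetical, and one key check per microsecond is the slide's own assumption.

```python
KEY_SPACE = 10 ** 119                   # approximate key space from the slide
SECONDS_PER_CHECK = 1e-6                # assumed: one candidate key per microsecond
SECONDS_PER_YEAR = 60 * 60 * 24 * 365   # 31,536,000 seconds


def brute_force_years(key_space, seconds_per_check):
    """Worst case: every key in the key space is tried once."""
    return key_space * seconds_per_check / SECONDS_PER_YEAR


if __name__ == "__main__":
    years = brute_force_years(KEY_SPACE, SECONDS_PER_CHECK)
    print(f"{years:.2e} years")
```

The result is on the order of 10^105 years, consistent with the bound stated on the slide.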

Is a brute-force attack really needed?

Of course not!

Attacking Approach

• N-gram frequencies should be analyzed

• This helps to detect patterns in the ciphertext

• Look for patterns that are

– Easy to detect (repeating)

– Very likely to appear in the ciphertext (high frequency)

• Good targets are high-frequency n-grams that consist of low-frequency unigrams

• Such an n-gram will most probably appear more than once as the same symbol sequence, because each of its letters has only a few homophones
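
Detecting repeating patterns in a ciphertext can be sketched as a simple n-gram count over the symbol sequence. The function name `repeated_ngrams` and the short ciphertext are made up for illustration; the talk does not show its own implementation.

```python
from collections import Counter


def repeated_ngrams(symbols, n, min_count=2):
    """Return symbol n-grams seen at least min_count times, most frequent first."""
    grams = Counter(tuple(symbols[i:i + n]) for i in range(len(symbols) - n + 1))
    return [(gram, count) for gram, count in grams.most_common() if count >= min_count]


if __name__ == "__main__":
    # Made-up ciphertext in which the pair 006 072 repeats.
    cipher = "006 072 031 044 006 072 017 090 058 006 072".split()
    print(repeated_ngrams(cipher, 2))
    # → [(('006', '072'), 3)]
```

Any symbol sequence that repeats is a candidate for a high-frequency plaintext n-gram whose letters have few homophones.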

Work on Data

• Goal: discover high-frequency n-grams consisting of low-frequency unigrams

• A corpus of 13.4 MB was analyzed

• It consists of 11M characters

• It includes 37 novels by 9 authors

• And 120 articles by the columnist Cetin Altan

• Long processing time (NP-complete problem)
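
The corpus search can be sketched as a filtered n-gram count. This is an illustration under stated assumptions, not the procedure used in the talk: n-grams are counted inside words only, the text is assumed to be already upper-case (Python's default case conversion does not handle the Turkish dotted and dotless I), and "rare" is taken to mean the letters that receive a single symbol in the unigram table. The names `RARE_LETTERS` and `candidate_ngrams` are hypothetical.

```python
from collections import Counter

# Letters that receive exactly one symbol in the unigram table.
RARE_LETTERS = set("CÇFGĞHJÖPV")


def candidate_ngrams(text, n, rare_letters=RARE_LETTERS, top=10):
    """Count n-grams made only of rare letters, most frequent first."""
    counts = Counter()
    for word in text.split():
        for i in range(len(word) - n + 1):
            gram = word[i:i + n]
            if all(ch in rare_letters for ch in gram):
                counts[gram] += 1
    return counts.most_common(top)


if __name__ == "__main__":
    sample = "GÖZ GÖRMEK GÖL ÇOCUK"   # tiny stand-in for the 11M-character corpus
    print(candidate_ngrams(sample, 2))
    # → [('GÖ', 3)]
```

On a real corpus the same function would be run for n = 2, 3, 4 and 5 to rank bigram through pentagram candidates.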

Bigram frequencies in 11M

Trigram frequencies in 11M

Tetragram frequencies in 11M

Pentagram frequencies in 11M

Attacking

• The following bigram occurs with high frequency

• It also consists of low-frequency unigrams

• Suppose that G → 006 and Ö → 072 for encryption

• Then 006 072 would appear more than once in the ciphertext

• A similar approach should be applied to other high-frequency n-grams consisting of low-frequency unigrams

Bigram  Frequency   Symbols
GÖ      25203/11M   1

Letter  Frequency  Symbols
G       1.253%     1
Ö       0.777%     1
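
Why GÖ is a good target can be made concrete by counting how many different ciphertext forms a bigram can take and how often it is expected to occur. The sketch below is illustrative: the function names are hypothetical, the corpus size is approximated as exactly 11,000,000 characters, and the bigram AR is included only as a contrast (its symbol counts come from the unigram table, but the talk does not discuss it).

```python
import math

# Symbols per letter, taken from the unigram table.
SYMBOLS = {"G": 1, "Ö": 1, "A": 12, "R": 7}


def ciphertext_forms(ngram, symbols=SYMBOLS):
    """Number of distinct symbol sequences that can encode the n-gram."""
    return math.prod(symbols[ch] for ch in ngram)


def expected_occurrences(ngram_count, corpus_size, text_length):
    """Rough expected count of the n-gram in a plaintext of text_length characters."""
    return ngram_count / corpus_size * text_length


if __name__ == "__main__":
    print(ciphertext_forms("GÖ"))   # 1: always the same symbol pair
    print(ciphertext_forms("AR"))   # 84: spread over many symbol pairs
    print(expected_occurrences(25203, 11_000_000, 10_000))
```

With a relative frequency of 25203/11M, GÖ is expected only about 23 times in 10,000 characters, which fits the conclusion that long ciphertexts are needed.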

A Sample Attack

• As seen, the bigram is very easy to detect

Conclusion

• No previous work on this historical method for Turkish

• Larger key space than most modern cryptosystems

• Of course, a brute-force attack is not needed

• An n-gram based attack can help to detect patterns

• Long ciphertexts are needed for the attack

• Could be useful for linguistics

Thank you for your attention!