Attacking Turkish Texts Encrypted by Homophonic Cipher
Sefik Ilkin Serengil
Galatasaray University
joint work with Murat Akin
EHAC Conference, Cambridge, UK
February 20, 2011
Sefik Ilkin Serengil EHAC, UK, February 2011 p. 2 / 22
Outline
1. History
2. Homophonic Cipher
3. Key Space
4. Attacking
5. Conclusion
History of Homophonic Cipher
• Earliest example in 1401
• By Francesco I Gonzaga, Duke of Mantua (a place in Northern Italy)
• To Simone de Crema
• In personal correspondence
• Ref.: W. T. Penzhorn, 1994
Homophonic Cipher
• The name means "same sound"
• Extended version of the substitution cipher
• Replaces each letter with one of several symbols; the number of symbols per letter is proportional to the letter's frequency
• One-to-many mapping
• The same plaintext can produce different ciphertexts
• Resistant to frequency analysis attacks
Homophonic Function
• Suppose we are working with a language A
• Language A consists of 2 letters (a, b)
• Probability of the letters in the language: p_a = 3/4, p_b = 1/4
• Encryption: a → {00, 01, 10}, b → {11}
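The two-letter example can be sketched in a few lines of Python. This is only an illustration of the one-to-many mapping on the slide; the function names `encrypt` and `decrypt` and the use of a uniform random choice among homophones are assumptions, not part of the original talk.

```python
import random

# Toy key from the slide: 'a' (p = 3/4) gets three symbols, 'b' (p = 1/4) gets one.
KEY = {"a": ["00", "01", "10"], "b": ["11"]}


def encrypt(plaintext, key=KEY):
    """Replace each letter with one of its homophones, chosen at random (assumed uniform)."""
    return [random.choice(key[ch]) for ch in plaintext]


def decrypt(symbols, key=KEY):
    """Invert the mapping; each symbol belongs to exactly one letter."""
    inverse = {sym: letter for letter, syms in key.items() for sym in syms}
    return "".join(inverse[sym] for sym in symbols)


ciphertext = encrypt("abba")
print(ciphertext, "->", decrypt(ciphertext))
```

Because each of the four symbols is expected to occur about a quarter of the time, the ciphertext no longer reveals that 'a' is three times as frequent as 'b'.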
Homophonic Enciphered Texts
• Letter frequencies can no longer be distinguished
• Frequency distribution is manipulated and smoothed
• Relatively equal frequency values in ciphertext
• Each symbol accounts for about 1% of the ciphertext
• Changing block sizes
• Combines monoalphabetic and polyalphabetic features
Frequency of A Sample Encrypted Text
• X-axis: symbol in ciphertext, Y-axis: frequency
Turkish Unigrams in Homophonic Cipher
• Each letter is replaced by a number of different symbols proportional to its frequency

Letter  Freq (%)  Symbols    Letter  Freq (%)  Symbols    Letter  Freq (%)  Symbols
A       11.92     12         I       5.114     5          R       6.722     7
B       2.844     3          İ       8.600     9          S       3.014     3
C       0.963     1          J       0.034     1          Ş       1.78      2
Ç       1.156     1          K       4.683     5          T       3.314     3
D       4.706     5          L       5.922     6          U       3.235     3
E       8.912     9          M       3.752     4          Ü       1.854     2
F       0.461     1          N       7.484     7          V       0.959     1
G       1.253     1          O       2.476     2          Y       3.336     3
Ğ       1.125     1          Ö       0.777     1          Z       1.500     2
H       1.212     1          P       0.886     1
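The symbol counts in the table can be reproduced with a simple rule: round the percentage to the nearest integer and give every letter at least one symbol. The slides do not state the rule that was actually used, so the sketch below is an assumption that merely happens to match the table; `symbol_count` is a hypothetical helper name.

```python
# Unigram frequencies (percent) copied from the table above.
TURKISH_FREQ = {
    "A": 11.92, "B": 2.844, "C": 0.963, "Ç": 1.156, "D": 4.706,
    "E": 8.912, "F": 0.461, "G": 1.253, "Ğ": 1.125, "H": 1.212,
    "I": 5.114, "İ": 8.600, "J": 0.034, "K": 4.683, "L": 5.922,
    "M": 3.752, "N": 7.484, "O": 2.476, "Ö": 0.777, "P": 0.886,
    "R": 6.722, "S": 3.014, "Ş": 1.78, "T": 3.314, "U": 3.235,
    "Ü": 1.854, "V": 0.959, "Y": 3.336, "Z": 1.500,
}


def symbol_count(freq_percent):
    """Assumed rule: nearest integer to the percentage, but never fewer than one symbol."""
    return max(1, round(freq_percent))


SYMBOLS = {letter: symbol_count(freq) for letter, freq in TURKISH_FREQ.items()}

for letter, count in SYMBOLS.items():
    print(letter, count)
print("cipher alphabet size:", sum(SYMBOLS.values()))
```

Rare letters such as J, F and Ö end up with a single symbol each, which is what the attack later in the talk exploits.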
Sample Homophonic Encryption
• Plaintext: FLORYA
• Ciphertext (three possible encryptions of the same plaintext):
  010 084 000 035 075 067
  010 026 005 042 052 098
  010 043 000 029 021 053
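To see that all three ciphertexts decode to the same word, the sketch below uses only the part of the key that can be read off the sample itself. The full key is not given in the slides, so `PARTIAL_KEY` is an inferred fragment and `decrypt` is a hypothetical helper.

```python
# Key fragment inferred from the three sample ciphertexts (the real key is larger).
PARTIAL_KEY = {
    "F": ["010"],
    "L": ["084", "026", "043"],
    "O": ["000", "005"],
    "R": ["035", "042", "029"],
    "Y": ["075", "052", "021"],
    "A": ["067", "098", "053"],
}

# Each symbol maps back to exactly one letter.
INVERSE = {sym: letter for letter, syms in PARTIAL_KEY.items() for sym in syms}


def decrypt(ciphertext):
    """Decode a space-separated string of three-digit symbols."""
    return "".join(INVERSE[sym] for sym in ciphertext.split())


CIPHERTEXTS = [
    "010 084 000 035 075 067",
    "010 026 005 042 052 098",
    "010 043 000 029 021 053",
]

for ct in CIPHERTEXTS:
    print(ct, "->", decrypt(ct))
```

Note that F, a single-symbol letter, is encrypted as 010 every time, while the frequent letters vary from line to line.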
Key Space of Homophonic Cipher
Key Spaces of Common Algorithms
Algorithm            Type       Key Space   Approx. (power of 10)
Substitution Cipher  Classical  29!         10^30
Homophonic Cipher    Classical  ~80!        10^119
DES                  Modern     2^56        10^16
3DES                 Modern     2^112       10^33
IDEA                 Modern     2^128       10^38
AES                  Modern     2^256       10^77
Camellia             Modern     2^256       10^77
Twofish              Modern     2^256       10^77
Serpent              Modern     2^256       10^77
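The decimal orders of magnitude in the table can be checked with the standard library. This is a small verification sketch; the helper names are assumptions. It also shows that the table's last column is approximate: 29! is about 8.8 x 10^30 and 80! is about 7.2 x 10^118.

```python
import math


def log10_factorial(n):
    """log10(n!) computed via lgamma, since lgamma(n + 1) = ln(n!)."""
    return math.lgamma(n + 1) / math.log(10)


def log10_pow2(bits):
    """log10(2^bits)."""
    return bits * math.log10(2)


KEY_SPACES = {
    "Substitution (29!)": log10_factorial(29),
    "Homophonic (~80!)": log10_factorial(80),
    "DES (2^56)": log10_pow2(56),
    "3DES (2^112)": log10_pow2(112),
    "IDEA (2^128)": log10_pow2(128),
    "AES (2^256)": log10_pow2(256),
}

for name, exponent in KEY_SPACES.items():
    print(f"{name}: about 10^{exponent:.1f}")
```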
Brute Force Attack
• Key space of the homophonic cipher is about 10^119
• Larger than the key space of most modern cryptosystems
• Suppose that the attacker could check one key per microsecond (10^-6 s)
• Brute force takes more than 10^105 years in the worst case
  (10^119 x 10^-6) / (60 x 60 x 24 x 365) > 10^105
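• For comparison, the age of the universe is given as 10^14 years (C. K. Koc, 2010); commonly cited estimates are nearer 1.4 x 10^10 years, which only widens the gap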
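The worst-case estimate is plain arithmetic and can be reproduced with exact integers. The constant names below are illustrative; the rate of one key per microsecond is the assumption stated on the slide.

```python
KEY_SPACE = 10**119                    # approximate size of the homophonic key space
KEYS_PER_SECOND = 10**6                # one key checked per microsecond
SECONDS_PER_YEAR = 60 * 60 * 24 * 365  # 31,536,000

# Worst case: every key has to be tried.
years = KEY_SPACE // (KEYS_PER_SECOND * SECONDS_PER_YEAR)

print(f"about {years:.2e} years")
```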
Do we really need a brute force attack?
Of course not!
Attacking Approach
• N-gram frequencies should be analyzed
• This helps to detect patterns in the ciphertext
• Look for patterns that are
– Easy to detect (repeating)
– Very likely to appear in the ciphertext (high frequency)
• Best candidates: high-frequency n-grams that consist of low-frequency unigrams
• Low-frequency letters have only one symbol each, so such an n-gram will most probably appear more than once as the same symbol sequence
Work on Data
• Goal: discover high-frequency n-grams consisting of low-frequency unigrams
• A corpus of 13.4 MB was analyzed
• Consisting of 11M characters
• Including 37 novels by 9 authors
• And 120 articles by a columnist, Cetin Altan
• Long processing time (NP-complete problem)
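The corpus step can be sketched as counting n-grams and keeping only those whose letters are all rare. This is not the tool used for the talk: the function names, the choice to count n-grams inside words only, and the `max_unigram_share` threshold are assumptions. The input is assumed to be already uppercase, because Python's default case conversion does not handle the Turkish dotted and dotless I correctly.

```python
from collections import Counter


def ngram_counts(text, n):
    """Count letter n-grams inside each whitespace-separated word."""
    counts = Counter()
    for word in text.split():
        letters = "".join(ch for ch in word if ch.isalpha())
        for i in range(len(letters) - n + 1):
            counts[letters[i:i + n]] += 1
    return counts


def candidate_ngrams(text, n, max_unigram_share=0.02, top=10):
    """Most frequent n-grams made up only of letters whose share is at most the threshold."""
    unigrams = ngram_counts(text, 1)
    total = sum(unigrams.values())
    rare = {ch for ch, c in unigrams.items() if c / total <= max_unigram_share}
    ranked = ngram_counts(text, n).most_common()
    return [(gram, c) for gram, c in ranked if all(ch in rare for ch in gram)][:top]


# Tiny made-up sample; a real run would read the corpus files instead.
sample = "GÖL " * 2 + "ANA " * 20
print(candidate_ngrams(sample, 2, max_unigram_share=0.05))
```

On a real corpus the threshold would be chosen so that the "rare" set matches the letters that receive a single symbol in the key.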
[Figure: bigram frequencies in the 11M-character corpus]
[Figure: trigram frequencies in the 11M-character corpus]
[Figure: tetragram frequencies in the 11M-character corpus]
[Figure: pentagram frequencies in the 11M-character corpus]
Attacking
• The bigram GÖ is a high-frequency one
• It also consists of low-frequency unigrams
• Suppose that G → 006 and Ö → 072 for encryption
• Then 006 072 would appear more than once in the ciphertext
• A similar approach should be applied to other high-frequency n-grams consisting of low-frequency unigrams

Bigram  Frequency  Symbols
GÖ      25203/11M  1

Letter  Frequency  Symbols
G       1.253%     1
Ö       0.777%     1
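On the ciphertext side, the attack reduces to finding symbol sequences that repeat. The sketch below is an illustration under assumptions: `repeated_symbol_ngrams` is a hypothetical helper, and the ciphertext is made up, with only the pair 006 072 (G, Ö) taken from the slide.

```python
from collections import Counter


def repeated_symbol_ngrams(symbols, n=2, min_count=2):
    """Return symbol n-grams that occur at least min_count times, most frequent first."""
    counts = Counter(
        tuple(symbols[i:i + n]) for i in range(len(symbols) - n + 1)
    )
    return [(gram, c) for gram, c in counts.most_common() if c >= min_count]


# Made-up ciphertext: every symbol is unique except the G, Ö pair 006 072.
ciphertext = "006 072 031 044 006 072 015 058 023 006 072 090".split()

for gram, count in repeated_symbol_ngrams(ciphertext):
    print(" ".join(gram), "appears", count, "times")
```

A repeated pair found this way is a candidate for a high-frequency bigram of rare letters such as GÖ; longer ciphertexts give more reliable repetitions.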
A Sample Attack
• As the sample shows, the bigram is very easy to detect
Conclusion
• No previous work on this historical method for Turkish
• Wider key space than most modern cryptosystems
• Nevertheless, no brute force attack is needed
• An n-gram based attack can help to detect patterns
• Long ciphertexts are needed for the attack
• Could also be useful for linguistics
Thank you for your attention!