statistic – Cryptography

Cryptanalysis of the Caesar Cipher

If you need a reminder on how the Caesar Cipher works click here.

The Caesar Cipher is very easy to crack because there are only 25 unique keys, so we can test all of them and score how English each result is using either the Chi-Squared Statistic or N-Gram Probability.

Example

Ciphertext of “RCZIOCZXGJXFNOMDFZNORZGQZVOOVXF”

Shift   Decrypted Text                   Chi-Sq Score
1       QBYHNBYWFIWEMNLCEYMNQYFPYUNNUWE  201.327499
2       PAXGMAXVEHVDLMKBDXLMPXEOXTMMTVD  599.489345
3       OZWFLZWUDGUCKLJACWKLOWDNWSLLSUC  267.058510
4       NYVEKYVTCFTBJKIZBVJKNVCMVRKKRTB  325.267580
5       MXUDJXUSBESAIJHYAUIJMUBLUQJJQSA  775.163340
6       LWTCIWTRADRZHIGXZTHILTAKTPIIPRZ  434.880892
7       KVSBHVSQZCQYGHFWYSGHKSZJSOHHOQY  554.916606
8       JURAGURPYBPXFGEVXRFGJRYIRNGGNPX  340.923863
9       ITQZFTQOXAOWEFDUWQEFIQXHQMFFMOW  1012.384679
10      HSPYESPNWZNVDECTVPDEHPWGPLEELNV  115.358434
11      GROXDROMVYMUCDBSUOCDGOVFOKDDKMU  91.670467
12      FQNWCQNLUXLTBCARTNBCFNUENJCCJLT  283.701596
13      EPMVBPMKTWKSABZQSMABEMTDMIBBIKS  194.299832
14      DOLUAOLJSVJRZAYPRLZADLSCLHAAHJR  385.733449
15      CNKTZNKIRUIQYZXOQKYZCKRBKGZZGIQ  1520.292006
16      BMJSYMJHQTHPXYWNPJXYBJQAJFYYFHP  801.523128
17      ALIRXLIGPSGOWXVMOIWXAIPZIEXXEGO  603.683962
18      ZKHQWKHFORFNVWULNHVWZHOYHDWWDFN  280.874579
19      YJGPVJGENQEMUVTKMGUVYGNXGCVVCEM  269.610988
20      XIFOUIFDMPDLTUSJLFTUXFMWFBUUBDL  176.849244
21      WHENTHECLOCKSTRIKESTWELVEATTACK  51.921327
22      VGDMSGDBKNBJRSQHJDRSVDKUDZSSZBJ  460.236803
23      UFCLRFCAJMAIQRPGICQRUCJTCYRRYAI  262.108135
24      TEBKQEBZILZHPQOFHBPQTBISBXQQXZH  1373.411997
25      SDAJPDAYHKYGOPNEGAOPSAHRAWPPWYG  90.715517

As you can see, the lowest Chi-Squared value is 51.921327, which comes from a shift of 21. If you read the decrypted text for a shift of 21 you can indeed see that it is English. Hence the cipher has been broken!
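The brute-force search above can be sketched in a few lines of Python. This is only a minimal sketch: the function names are my own, and the letter frequencies are an assumption recovered from the Expected Count column of the Chi-Squared table on this page (E_i / 31), so other frequency tables will give slightly different scores.

```python
from string import ascii_uppercase as ALPHABET

# English letter frequencies in percent for A..Z (assumption: recovered from
# the Expected Count column of the Chi-Squared table on this page, E_i / 31).
ENGLISH_PERCENT = [
    8.167, 1.492, 2.782, 4.253, 12.702, 2.228, 2.015, 6.094, 6.996,
    0.153, 0.772, 4.025, 2.406, 6.749, 7.507, 1.929, 0.095, 5.987,
    6.327, 9.056, 2.758, 0.978, 2.360, 0.150, 1.974, 0.074,
]

def chi_squared(text):
    """Chi-Squared score of an upper-case A-Z string against English."""
    score = 0.0
    for letter, percent in zip(ALPHABET, ENGLISH_PERCENT):
        expected = percent / 100 * len(text)
        observed = text.count(letter)
        score += (observed - expected) ** 2 / expected
    return score

def shift_back(ciphertext, shift):
    """Move every letter `shift` places back through the alphabet."""
    return "".join(ALPHABET[(ALPHABET.index(c) - shift) % 26] for c in ciphertext)

def crack_caesar(ciphertext):
    """Try shifts 1..25 and return (score, shift, text) for the lowest score."""
    return min(
        (chi_squared(shift_back(ciphertext, s)), s, shift_back(ciphertext, s))
        for s in range(1, 26)
    )

score, shift, plaintext = crack_caesar("RCZIOCZXGJXFNOMDFZNORZGQZVOOVXF")
print(shift, plaintext, round(score, 6))
```

The sketch assumes the ciphertext contains only the upper-case letters A-Z; spaces and punctuation would need to be stripped first.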


Index of Coincidence

Index of Coincidence is the probability that when selecting two letters from a text (without replacement), the two letters are the same. For a random piece of text with every letter having a chance of $\frac{1}{26}$ of appearing, the Index of Coincidence is also $\frac{1}{26}$ ( ${0.0385}$ ).

If the frequencies of the letters are known and they sum to 1, then this formula can be used to calculate the Index of Coincidence for a particular language.

$IoC=\sum_{i=A}^{i=Z}(F_{i})^{2}$
F_i is the frequency, in decimal form (10% = 0.1), of letter i.
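As a quick check of this formula, here is a small Python sketch (the function name is my own) applied to the uniform case mentioned earlier, where every letter has a frequency of 1/26.

```python
def ioc_from_frequencies(frequencies):
    """Sum of squared letter frequencies; the frequencies should sum to 1."""
    return sum(f * f for f in frequencies)

uniform = [1 / 26] * 26  # every letter equally likely
print(round(ioc_from_frequencies(uniform), 4))  # 0.0385
```

Passing in the 26 letter frequencies of a language instead of the uniform list gives that language's expected value; the result depends on how precise the frequency table is.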

For a generic piece of text written in English the Index of Coincidence is 0.0667. It is different for each language because the letter frequencies are different.

Language	Index of Coincidence
English	0.0667
French	0.0694
German	0.0734
Spanish	0.0729
Portuguese	0.0824
Turkish	0.0701
Swedish	0.0681
Polish	0.0607
Danish	0.0672
Icelandic	0.0669
Finnish	0.0699
Czech	0.0510

The values in this table were calculated from the letter frequencies on Wikipedia. They cover the letters A-Z only; accented letters such as ‘á’ or ‘â’ are treated as ‘a’, ‘ü’ or ‘ú’ as ‘u’, and so on.

However, if you want to work out the Index of Coincidence for a particular piece of text, this formula can be used.

$IoC=\frac{\sum_{i=A}^{i=Z}C_{i}(C_{i}-1)}{L(L-1)}$
C_i is the count of a letter in the text.
L is the total number of letters in the text.

If a letter does not appear more than once then it does not need to be included in the calculation, because when C_i is 1 or 0, C_i × (C_i – 1) equals 0.

Example: ‘WHENTHECLOCKSTRIKESTWELVEATTACK’ Text length = 31

Letter	Count (C_i)	C_i(C_i– 1)
A	2	2
C	3	6
E	5	20
H	2	2
K	3	6
L	2	2
M	0	0
S	2	2
T	5	20
W	2	2
Total	31	62

$\frac{62}{31\times30}\approx 0.0667$

This value matches the expected Index of Coincidence of English (0.0667) to four decimal places. It is also much higher than the expected Index of Coincidence of random text (0.0385), suggesting that this text is not random.
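The count-based formula is easy to check in code. The following is a minimal Python sketch (the function name is my own) that ignores anything other than the letters A-Z and assumes the text has at least two letters.

```python
from collections import Counter

def index_of_coincidence(text):
    """Index of Coincidence from letter counts: sum C_i(C_i - 1) / (L(L - 1))."""
    counts = Counter(c for c in text.upper() if "A" <= c <= "Z")
    length = sum(counts.values())
    return sum(n * (n - 1) for n in counts.values()) / (length * (length - 1))

print(round(index_of_coincidence("WHENTHECLOCKSTRIKESTWELVEATTACK"), 4))  # 0.0667
```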

The larger the Index of Coincidence, the more likely it is that there is some sort of language structure behind the text. For example, text encrypted with the Vigenère Cipher has an average Index of Coincidence of 0.042, which is above 0.0385 and suggests that the text is not random, which it is not.

Cryptanalysis of Hill Cipher

If you need a reminder on how the Hill Cipher works click here.

The first thing to note is that when encoding with the Hill Cipher, each row of the key matrix produces one ciphertext letter independently of the rest of the key matrix.

$\begin{bmatrix}21 & 18 & 12 \\9 & 0 & 23 \\8 & 3 & 2 \end{bmatrix}\begin{bmatrix}a \\b \\c \end{bmatrix}=\begin{bmatrix}21a+18b+12 c \\9 a+0b+23c \\8a+3b+2c \end{bmatrix}\bmod 26$

Notice how the top row of the far left matrix is only involved in the top cell of the ciphertext matrix, the middle row is only involved in the middle cell etc.
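To see this independence concretely, here is a small Python sketch. The function name, the second matrix `other` and the example block are all made up for illustration; only `key` comes from the equation above.

```python
def encrypt_block(key, block):
    """Multiply a key matrix by one column of plaintext numbers, mod 26."""
    return [sum(k * p for k, p in zip(row, block)) % 26 for row in key]

key = [[21, 18, 12], [9, 0, 23], [8, 3, 2]]
other = [[21, 18, 12], [1, 2, 3], [4, 5, 6]]  # hypothetical: same top row, different rest
block = [2, 0, 19]                            # "CAT" with A=0 ... Z=25

print(encrypt_block(key, block))    # [10, 13, 2]
print(encrypt_block(other, block))  # [10, 7, 18]
```

The first output number is the same in both cases because it only depends on the top row.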

We can use this fact to dramatically decrease the number of keys we have to test to break the Hill Cipher.

For a square matrix of size N, there are 26^(N×N) possible keys (the number of usable keys is smaller, as not all matrices have an inverse). For N=3, there are 26⁹ ≈ 5.43×10¹² keys; testing all of these is not feasible (I calculated that on my PC it would take ≈ 8 years to test them all).

However, if we test each row individually then there are only 26^N row vectors to test. For N=3 there are 26³ = 17,576, which is a very small number in comparison (it takes 0.5 seconds on my PC!).

With this property of Hill Cipher we can go about cracking it.

First you will need to identify N (the size of the matrix). The text length will be a multiple of N, so N must be a factor of the text length, which narrows it down a lot.

Now you will need to iterate over all the row vectors of size N whose values range from 0 (inclusive) to 26 (exclusive).

For a 3 by 3 matrix there are 17,576 combinations. They will look something like this, with the iteration number on the left:

1/17576         [ 0, 0, 0]
2/17576         [ 0, 0, 1]
3/17576         [ 0, 0, 2] ……
16249/17576     [24, 0, 24]
16250/17576     [24, 0, 25]
16251/17576     [24, 1, 0] ……
17576/17576     [25, 25, 25]
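In Python this enumeration can be sketched with `itertools.product` from the standard library. The variable names are my own; the same snippet also reproduces the key counts quoted earlier.

```python
from itertools import product

N = 3
rows = list(product(range(26), repeat=N))  # every row vector with entries 0..25

print(26 ** (N * N))   # 5429503678976 full keys
print(len(rows))       # 17576 row vectors
print(rows[0], rows[16248], rows[-1])  # iterations 1, 16249 and 17576
```

List indices start at 0, so iteration 16249 in the listing above is `rows[16248]`.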

For each one of these possibilities, assume it is a row of the inverse key matrix and multiply your ciphertext by it. You multiply in blocks of N and get a single letter out for each block.

$\begin{bmatrix}a & b & c \end{bmatrix} \begin{bmatrix}L_{1} \\L_{2} \\L_{3} \end{bmatrix}=\begin{bmatrix}a\times L_{1} + b\times L_{2} + c\times L_{3} \end{bmatrix}\bmod26$

Once you have all the output letters for a particular possibility, score them using the Chi-Squared Statistic. Store the row vectors in order from smallest to largest Chi-Squared value.
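The scoring step could look something like the Python sketch below. It is not necessarily how my own program does it: the function names are invented for this example, the ciphertext is assumed to be upper-case A-Z with a length that is a multiple of N, and the frequencies are an assumption recovered from the Expected Count column of the Chi-Squared table on this page (E_i / 31).

```python
from itertools import product

# English letter frequencies in percent for A..Z (assumption: recovered from
# the Expected Count column of the Chi-Squared table on this page, E_i / 31).
ENGLISH_PERCENT = [
    8.167, 1.492, 2.782, 4.253, 12.702, 2.228, 2.015, 6.094, 6.996,
    0.153, 0.772, 4.025, 2.406, 6.749, 7.507, 1.929, 0.095, 5.987,
    6.327, 9.056, 2.758, 0.978, 2.360, 0.150, 1.974, 0.074,
]

def chi_squared(text):
    """Chi-Squared score of an upper-case A-Z string against English."""
    score = 0.0
    for i, percent in enumerate(ENGLISH_PERCENT):
        expected = percent / 100 * len(text)
        observed = text.count(chr(65 + i))
        score += (observed - expected) ** 2 / expected
    return score

def to_numbers(text):
    """Map A..Z to 0..25."""
    return [ord(c) - 65 for c in text]

def apply_row(row, numbers):
    """Multiply each block of len(row) numbers by the row vector, mod 26."""
    n = len(row)
    return "".join(
        chr(65 + sum(r * x for r, x in zip(row, numbers[i:i + n])) % 26)
        for i in range(0, len(numbers), n)
    )

def rank_rows(ciphertext, n):
    """Return (score, row) pairs for all 26**n rows, lowest Chi-Squared first."""
    numbers = to_numbers(ciphertext)
    scored = [
        (chi_squared(apply_row(row, numbers)), row)
        for row in product(range(26), repeat=n)
    ]
    return sorted(scored)

# Usage (hypothetical): best_rows = rank_rows(my_ciphertext, 3)[:5]
```

Each row only produces every Nth plaintext letter, so the letters it outputs will not read as words, but their frequencies should still look like English when the row is correct.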

Once you have checked all the possibilities, take the best results from the list you have compiled, go through all the permutations that arrange them into an N by N matrix, and check that each matrix has an inverse modulo 26.

Example:

Let’s say you know N=3 and the best row vectors found using this method, with their Chi-Squared values, were as follows (note that in some cases the best N vectors may not be the correct ones, so you may need to try a combination of a few different ones):

[22, 6, 7]    X² = 71.721647
[23, 17, 18]    X² = 50.562860
[25, 0, 6]    X² = 81.987751

Rearrange the rows into every possible order. For R rows there are R! = R×(R-1)×(R-2)×…×1 permutations.

The next 3! = 6 matrices are all the permutations of these row vectors.

$\begin{bmatrix}22 & 6 & 7 \\23 & 17 & 18\\25 & 0 & 6 \end{bmatrix} \begin{bmatrix}22 & 6 & 7 \\25 & 0 & 6\\23 & 17 & 18 \end{bmatrix} \begin{bmatrix}23 & 17 & 18 \\22 & 6 & 7\\25 & 0 & 6\end{bmatrix}$
$\begin{bmatrix}25 & 0 & 6 \\22 & 6 & 7\\23 & 17 & 18 \end{bmatrix}\begin{bmatrix}25 & 0 & 6 \\23 & 17 & 18\\22 & 6 & 7 \end{bmatrix}\begin{bmatrix}23 & 17 & 18 \\25 & 0 & 6\\22 & 6 & 7\end{bmatrix}$

Then encrypt your ciphertext using these matrices (encrypting using the inverse key matrix is the same as decrypting using the key matrix). One of these results should be English, which is your solution. If you wish to find the key matrix, you will need to invert the inverse key matrix modulo 26.
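The permutation step can be sketched in Python as follows. The helper names are my own, and the invertibility test relies on the standard fact that a matrix has an inverse modulo 26 exactly when its determinant shares no factor with 26. The text is assumed to be upper-case A-Z with a length that is a multiple of N.

```python
from itertools import permutations
from math import gcd

def determinant(m):
    """Integer determinant by expansion along the first row (fine for small N)."""
    if len(m) == 1:
        return m[0][0]
    total = 0
    for col, value in enumerate(m[0]):
        minor = [row[:col] + row[col + 1:] for row in m[1:]]
        total += (-1) ** col * value * determinant(minor)
    return total

def invertible_mod_26(m):
    """True when the determinant is coprime to 26."""
    return gcd(determinant(m) % 26, 26) == 1

def apply_matrix(m, text):
    """Multiply each block of len(m) letters by the matrix m, mod 26."""
    n = len(m)
    numbers = [ord(c) - 65 for c in text]
    out = []
    for i in range(0, len(numbers), n):
        block = numbers[i:i + n]
        out.extend(sum(k * x for k, x in zip(row, block)) % 26 for row in m)
    return "".join(chr(65 + v) for v in out)

best_rows = [[22, 6, 7], [23, 17, 18], [25, 0, 6]]
candidates = [list(p) for p in permutations(best_rows) if invertible_mod_26(list(p))]
print(len(candidates))  # 6

# Usage (hypothetical): for m in candidates: print(apply_matrix(m, my_ciphertext))
```

For these three rows every ordering is invertible, so all six matrices remain as candidates and the English-looking output has to be picked out by eye or by Chi-Squared score.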

To Conclude

For larger matrices, 4 by 4 and up, the sheer number of keys makes a brute force attack impossible; I don’t believe anyone has the patience or life expectancy to wait around 64 trillion years to solve one cipher. Other methods like crib dragging require you to guess or make assumptions about large chunks of the plaintext, and a crib of 19+ characters is very hard to come by. The method described above can solve a 4 by 4 Hill Cipher in about 10 seconds, with no known cribs. The only thing it requires is that the text is of a certain length, about 100×(N-1) letters or greater where N is the size of the matrix being tested, so that the statistical properties are not affected by a lack of data.

This same method can be adapted to decrypt ciphertext in other languages; you just need to change the letter frequencies that the Chi-Squared Statistic uses.


Chi-Squared Statistic

The Chi-Squared Statistic is a measure of how much two categorical distributions differ from one another. For two identical distributions the score is 0, and as the distributions begin to differ the score increases. The formula is:

$X^{2}=\sum_{i=A}^{i=Z}\frac{(O_{i}-E_{i})^2}{E_{i}}$
O_i is the observed count of that letter in your text.
E_i is the expected count of that letter in the length of your text.

In words, the Chi-Squared Statistic is “the sum, over each letter, of the squared difference between the observed count and the expected count, divided by the expected count.”

Example: ‘WHENTHECLOCKSTRIKESTWELVEATTACK’ Text length = 31

Letter	Observed Count (O_i)	Frequency in English	Expected Count (E_i)*	(O_i – E_i)²/E_i
A	2	8.17%	2.53177	0.11169
B	0	1.49%	0.46252	0.46252
C	3	2.78%	0.86242	5.29817
D	0	4.25%	1.31843	1.31843
E	5	12.70%	3.93762	0.28663
F	0	2.23%	0.69068	0.69068
G	0	2.02%	0.62465	0.62465
H	2	6.09%	1.88914	0.00651
I	1	7.00%	2.16876	0.62985
J	0	0.15%	0.04743	0.04743
K	3	0.77%	0.23932	31.84587
L	2	4.03%	1.24775	0.45352
M	0	2.41%	0.74586	0.74586
N	1	6.75%	2.09219	0.57016
O	1	7.51%	2.32717	0.75688
P	0	1.93%	0.59799	0.59799
Q	0	0.10%	0.02945	0.02945
R	1	5.99%	1.85597	0.39477
S	2	6.33%	1.96137	0.00076
T	5	9.06%	2.80736	1.71252
U	0	2.76%	0.85498	0.85498
V	1	0.98%	0.30318	1.60155
W	2	2.36%	0.73160	2.19907
X	0	0.15%	0.04650	0.04650
Y	0	1.97%	0.61194	0.61194
Z	0	0.07%	0.02294	0.02294
Total	31	100.03%	31.00899	51.92133

* Expected Count = FREQ / 100 × LEN

For English text a Chi-Squared value of about 150 or less is expected; anything above that most likely does not resemble English.

WHENTHECLOCKSTRIKESTWELVEATTACK, X² = 51.92133
THWKEEVIWTETSCANHKERCTTAKSCLLOE, X² = 51.92133
ZDXPLXTDOWXSWCRSGPWVVOCWEOTTXOK, X² = 425.59631

As you can see, English text scores low. However, the score is independent of letter order, so the anagram on the second line scores exactly the same as the original, while random text gives a much higher value.
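If you would rather use code than a spreadsheet, the calculation can be sketched in Python like this. The function name is my own, and the frequencies are an assumption recovered from the Expected Count column of the table above (E_i / 31), so they carry one more decimal place than the rounded percentages shown.

```python
# English letter frequencies in percent for A..Z (assumption: recovered from
# the Expected Count column of the table above, E_i / 31).
ENGLISH_PERCENT = [
    8.167, 1.492, 2.782, 4.253, 12.702, 2.228, 2.015, 6.094, 6.996,
    0.153, 0.772, 4.025, 2.406, 6.749, 7.507, 1.929, 0.095, 5.987,
    6.327, 9.056, 2.758, 0.978, 2.360, 0.150, 1.974, 0.074,
]

def chi_squared(text):
    """Chi-Squared score of an upper-case A-Z string against English."""
    score = 0.0
    for i, percent in enumerate(ENGLISH_PERCENT):
        expected = percent / 100 * len(text)   # E_i = FREQ / 100 x LEN
        observed = text.count(chr(65 + i))     # O_i, with chr(65) == "A"
        score += (observed - expected) ** 2 / expected
    return score

for text in (
    "WHENTHECLOCKSTRIKESTWELVEATTACK",
    "THWKEEVIWTETSCANHKERCTTAKSCLLOE",
    "ZDXPLXTDOWXSWCRSGPWVVOCWEOTTXOK",
):
    print(text, round(chi_squared(text), 5))
```

Only the letter counts enter the calculation, which is why the first two strings give the same score.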

I have created an Excel spreadsheet that can calculate Chi-Squared when given the frequencies of letters. It does not use macros. Chi-Squared Calculator