coincidence – Cryptography

The Index of Coincidence is the probability that two letters selected at random from a text (without replacement) are the same. For a random piece of text in which each of the 26 letters has a chance of $\frac{1}{26}$ of appearing, the Index of Coincidence is also $\frac{1}{26}$ (approximately $0.0385$).

If the frequencies of the letters are known and they sum to 1, then this formula can be used to calculate the Index of Coincidence for a particular language.

$IoC=\sum_{i=A}^{Z}F_{i}^{2}$
F_i is the relative frequency, in decimal form (10% = 0.1), of letter i in the language.
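The sum-of-squares formula can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `ioc_from_frequencies` is made up for this example, and it assumes the frequencies are given as decimals that sum to 1.

```python
def ioc_from_frequencies(freqs):
    """Return the sum of squared letter frequencies.

    `freqs` is assumed to hold one decimal frequency per letter
    (10% = 0.1) and to sum to 1.
    """
    total = sum(freqs)
    if abs(total - 1.0) > 1e-6:
        raise ValueError("frequencies should sum to 1")
    return sum(f * f for f in freqs)


# Random text: every one of the 26 letters has a chance of 1/26.
uniform = [1 / 26] * 26
print(round(ioc_from_frequencies(uniform), 4))  # 0.0385
```

Passing a language's 26 letter frequencies instead of the uniform list gives that language's expected Index of Coincidence.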

For a generic piece of text written in English the Index of Coincidence is 0.0667. The value differs between languages because their letter frequencies differ:

Language	Index of Coincidence
English	0.0667
French	0.0694
German	0.0734
Spanish	0.0729
Portuguese	0.0824
Turkish	0.0701
Swedish	0.0681
Polish	0.0607
Danish	0.0672
Icelandic	0.0669
Finnish	0.0699
Czech	0.0510

The values in this table were calculated from the letter frequencies on Wikipedia. Only the letters A-Z are counted: accented letters such as ‘á’ or ‘â’ are treated as ‘a’, ‘ü’ or ‘ú’ are treated as ‘u’, and so on.

However, if you want to work out the Index of Coincidence for a particular piece of text, this formula can be used.

$IoC=\frac{\sum_{i=A}^{i=Z}C_{i}(C_{i}-1)}{L(L-1)}$
C_i is the count of letter i in the text.
L is the total number of letters in the text.

If a letter does not appear more than once then it does not need to be included in the calculation, because when C_i is 1 or 0, C_i × (C_i – 1) equals 0.
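As a rough sketch, the count-based formula translates to Python as follows. The function name `index_of_coincidence` is an arbitrary choice for this example, and the code assumes that only the letters A-Z should be counted (anything else, including accented letters, is ignored).

```python
from collections import Counter


def index_of_coincidence(text):
    """Compute sum(C_i * (C_i - 1)) / (L * (L - 1)) over the letters A-Z."""
    # Keep only A-Z; spaces, digits and punctuation are dropped.
    letters = [ch for ch in text.upper() if "A" <= ch <= "Z"]
    length = len(letters)
    if length < 2:
        raise ValueError("need at least two letters")
    counts = Counter(letters)
    # Letters with a count of 0 or 1 contribute nothing to this sum.
    numerator = sum(c * (c - 1) for c in counts.values())
    return numerator / (length * (length - 1))


# 'AABB': A and B each appear twice, so (2 + 2) / (4 * 3).
print(round(index_of_coincidence("AABB"), 4))  # 0.3333
```

Letters that appear only once are left in the sum here because they add 0 anyway, which keeps the code simpler than filtering them out.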

Example: ‘WHENTHECLOCKSTRIKESTWELVEATTACK’ Text length = 31

Letter	Count (C_i)	C_i(C_i– 1)
A	2	2
C	3	6
E	5	20
H	2	2
K	3	6
L	2	2
M	0	0
S	2	2
T	5	20
W	2	2
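Total	31	62

The count total of 31 is the full text length: it includes the five letters that appear only once (I, N, O, R and V), which are left out of the table because they contribute 0.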

$\frac{62}{31\times30}=\frac{62}{930}\approx0.0667$
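The worked example can be checked with a short script. This is only a sketch, and the variable names are illustrative. It rebuilds the rows for letters that appear more than once and then applies the count-based formula.

```python
from collections import Counter

text = "WHENTHECLOCKSTRIKESTWELVEATTACK"
counts = Counter(text)

# One row per letter that appears more than once: (letter, C_i, C_i * (C_i - 1)).
rows = [(letter, c, c * (c - 1)) for letter, c in sorted(counts.items()) if c > 1]
for letter, c, product in rows:
    print(letter, c, product, sep="\t")

length = len(text)                           # 31
numerator = sum(product for _, _, product in rows)  # 62
ioc = numerator / (length * (length - 1))
print(f"{ioc:.4f}")  # 0.0667
```

Rounded to four decimal places the result is 0.0667.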

This value is reasonably close to the expected Index of Coincidence of English (0.0667). It is also much higher than the expected Index of Coincidence of random text (0.0385), suggesting that this text is not random.

The larger the Index of Coincidence, the more likely it is that there is some sort of language structure behind the text. For example, text enciphered with the Vigenère Cipher has an average Index of Coincidence of about 0.042, which is above the value for random text (0.0385) and suggests that the ciphertext is not random, which it is not.
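To experiment with this, the sketch below enciphers a message with a basic Vigenère cipher and prints the Index of Coincidence before and after. The helper names and the key `LEMON` are chosen only for illustration, and the plaintext is assumed to contain only the letters A-Z. For a text this short the ciphertext value can vary a lot with the key, so treat the output as a rough indication, not a reliable measurement.

```python
from collections import Counter


def index_of_coincidence(text):
    """Compute sum(C_i * (C_i - 1)) / (L * (L - 1)) over the letters A-Z."""
    letters = [ch for ch in text.upper() if "A" <= ch <= "Z"]
    length = len(letters)
    if length < 2:
        raise ValueError("need at least two letters")
    counts = Counter(letters)
    return sum(c * (c - 1) for c in counts.values()) / (length * (length - 1))


def vigenere_encrypt(plaintext, key):
    """Shift each letter by the matching key letter (A=0 ... Z=25).

    Assumes plaintext and key contain only the letters A-Z.
    """
    out = []
    for i, ch in enumerate(plaintext.upper()):
        shift = ord(key[i % len(key)].upper()) - ord("A")
        out.append(chr((ord(ch) - ord("A") + shift) % 26 + ord("A")))
    return "".join(out)


plaintext = "WHENTHECLOCKSTRIKESTWELVEATTACK"
ciphertext = vigenere_encrypt(plaintext, "LEMON")  # illustrative key

print("plaintext :", round(index_of_coincidence(plaintext), 4))
print("ciphertext:", round(index_of_coincidence(ciphertext), 4))
```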