Index of Coincidence

The Index of Coincidence is the probability that two letters selected at random from a text (without replacement) are the same. For a random piece of text in which every letter has a \frac{1}{26} chance of appearing, the Index of Coincidence is also \frac{1}{26} (about 0.0385).

If the letter frequencies of a language are known, and they sum to 1, then this formula can be used to calculate the Index of Coincidence for that language.

IoC=\sum_{i=A}^{Z}F_{i}^{2}
Fi is the frequency of letter i, in decimal form (10% = 0.1).
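
As a quick check of this formula, here is a minimal Python sketch. The function name `ioc_from_frequencies` is made up for illustration, and the only input used is the uniform 26-letter case from the start of this post, so no real language data is assumed.

```python
def ioc_from_frequencies(freqs):
    """Sum of squared letter frequencies; the frequencies must sum to 1."""
    total = sum(freqs)
    if abs(total - 1.0) > 1e-9:
        raise ValueError(f"frequencies sum to {total}, expected 1")
    return sum(f * f for f in freqs)


# Uniform alphabet: every letter has frequency 1/26.
uniform = [1 / 26] * 26
print(round(ioc_from_frequencies(uniform), 4))  # 0.0385
```

Feeding in a real language's 26 frequencies (as decimals) in place of `uniform` should give that language's value.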

For a generic piece of text written in English the Index of Coincidence is 0.0667. It is different for each language, as the letter frequencies are different.

Language Index of Coincidence
English 0.0667
French 0.0694
German 0.0734
Spanish 0.0729
Portuguese 0.0824
Turkish 0.0701
Swedish 0.0681
Polish 0.0607
Danish 0.0672
Icelandic 0.0669
Finnish 0.0699
Czech 0.0510

The values in this table were calculated from the letter frequencies on Wikipedia. They cover only the letters A-Z: other letters such as ‘á’ or ‘â’ are counted as ‘a’, ‘ü’ or ‘ú’ are counted as ‘u’, and so on.
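
One possible way to do this kind of folding in Python is sketched below. This is an assumption about how it could be done, not a record of how the table was actually produced, and `fold_to_ascii_letters` is a hypothetical helper name. It relies on Unicode decomposition, so letters with no decomposition (for example ‘ø’ or ‘ł’) are simply dropped and would need their own mapping.

```python
import unicodedata


def fold_to_ascii_letters(text):
    """Strip accents and keep only the letters A-Z, uppercased."""
    # NFD splits an accented letter into base letter + combining mark.
    decomposed = unicodedata.normalize("NFD", text.upper())
    letters = []
    for ch in decomposed:
        if unicodedata.combining(ch):
            continue  # drop the accent, keep the base letter
        if "A" <= ch <= "Z":
            letters.append(ch)
    return "".join(letters)


print(fold_to_ascii_letters("áâ üú"))  # AAUU
```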

However, if you want to work out the Index of Coincidence for a particular piece of text, this formula can be used.

IoC=\frac{\sum_{i=A}^{Z}C_{i}(C_{i}-1)}{L(L-1)}
Ci is the count of letter i in the text.
L is the total number of letters in the text.

If a letter does not appear more than once then it does not need to be included in the calculation, because when Ci is 1 or 0, Ci(Ci – 1) equals 0.

Example: ‘WHENTHECLOCKSTRIKESTWELVEATTACK’ Text length = 31

Letter Count (Ci) Ci(Ci – 1)
A 2 2
C 3 6
E 5 20
H 2 2
K 3 6
L 2 2
M 0 0
S 2 2
T 5 20
W 2 2
Total 31 62
The count total of 31 includes I, N, O, R and V, which each appear once and so contribute 0.

\frac{62}{31\times30}=\frac{62}{930}\approx0.0667

This value is very close to the expected Index of Coincidence of English (0.0667). It is also much higher than the expected Index of Coincidence of random text (0.0385), suggesting that this text is not random.
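
The worked example above can be reproduced with a few lines of Python. This is only a sketch: the function name `index_of_coincidence` is my own, and it assumes that anything outside A-Z (spaces, punctuation, digits) should be ignored.

```python
from collections import Counter


def index_of_coincidence(text):
    """IoC from letter counts: sum(Ci * (Ci - 1)) / (L * (L - 1))."""
    letters = [ch for ch in text.upper() if "A" <= ch <= "Z"]
    length = len(letters)
    if length < 2:
        raise ValueError("need at least two letters")
    counts = Counter(letters)
    # Letters that appear once contribute 1 * 0 = 0, so they drop out.
    numerator = sum(c * (c - 1) for c in counts.values())
    return numerator / (length * (length - 1))


text = "WHENTHECLOCKSTRIKESTWELVEATTACK"
print(round(index_of_coincidence(text), 4))  # 0.0667
```

Letters that never appear are not in the `Counter` at all, which has the same effect as giving them a count of 0.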

The larger the Index of Coincidence, the more likely it is that there is some sort of language structure behind the text. For example, text encrypted with the Vigenère Cipher has an average Index of Coincidence of about 0.042, which is still above the value for random text (0.0385) and suggests that the text is not random, which it is not.
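
To experiment with this, the sketch below encrypts the example text with a Vigenère cipher and prints the Index of Coincidence of each result. The helper names and the keys ‘D’ and ‘LEMON’ are arbitrary choices for illustration, and the code assumes the input is already uppercase A-Z. With a text this short the numbers vary a lot, so treat them as a rough illustration rather than the average quoted above.

```python
from collections import Counter


def index_of_coincidence(text):
    """IoC from letter counts: sum(Ci * (Ci - 1)) / (L * (L - 1))."""
    counts = Counter(text)
    length = len(text)
    return sum(c * (c - 1) for c in counts.values()) / (length * (length - 1))


def vigenere_encrypt(plaintext, key):
    """Shift each letter by the matching key letter (A=0 ... Z=25)."""
    out = []
    for i, ch in enumerate(plaintext):
        shift = ord(key[i % len(key)]) - ord("A")
        out.append(chr((ord(ch) - ord("A") + shift) % 26 + ord("A")))
    return "".join(out)


plain = "WHENTHECLOCKSTRIKESTWELVEATTACK"
print("plain", round(index_of_coincidence(plain), 4))
for key in ("D", "LEMON"):
    cipher = vigenere_encrypt(plain, key)
    print(key, cipher, round(index_of_coincidence(cipher), 4))
```

A one-letter key is just a Caesar shift, which only relabels the letters, so its Index of Coincidence is identical to the plaintext's. Longer keys spread each plaintext letter over several ciphertext letters, which on longer texts typically pulls the value down towards the random figure.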

2 thoughts on “Index of Coincidence”

  1. Hi, thanks for the good explanation!

    One thing – the length of ‘WHENTHECLOCKSTRIKESTWELVEATTACK’ is 31 and not 32 so the coincidence value is exactly 0.06666666666666667

    Thanks again, this really helped a lot!
