N-Gram Log Probability

For starters, an n-gram is a group of n letters. Particular sizes are often referred to by name: 1 is a unigram, 2 a bigram (or digram), 3 a trigram, 4 a quadgram and 5 a quintgram.

In a language there are certain n-grams that are much more common than others: the quadgram “THER” has a much greater probability than “DOXW”. So if we split a text up into all the n-grams that make it up and multiply the probabilities of each n-gram together, we get a measure of how likely that specific piece of text is to be written in that language.

For example, LOOKOUT contains 4 quadgrams: LOOK, OOKO, OKOU and KOUT.
P(LOOKOUT) = P(LOOK) \times P(OOKO) \times P(OKOU) \times P(KOUT)

As the text gets longer, the probability gets ever smaller, so small that numerical underflow occurs: there are so many zeros after the decimal point that the value can no longer be represented accurately as a 64-bit floating-point number. The number basically becomes 0.

To get around this problem we take the logarithm of the probability. Because the probability of the text is the product of the probabilities of each individual n-gram, the log rule \log{(a\times b)}=\log{a}+\log{b} lets you take the log of each individual n-gram probability and add them all together. This makes the numbers more manageable, normally in the range of 0 to -2000.

log(P(LOOKOUT)) = log(P(LOOK)) + log(P(OOKO)) + log(P(OKOU)) + log(P(KOUT))
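
The underflow problem and the log fix can be seen in a few lines of Python. This is only a sketch: the probability 1e-4 and the count of 100 quadgrams are made-up illustration values, not real statistics.

```python
import math

# Hypothetical values: pretend a text has 100 quadgrams, each with probability 1e-4.
probabilities = [1e-4] * 100

# Multiplying directly underflows: the true result (1e-400) is far below the
# smallest positive 64-bit float, so the product collapses to 0.0.
product = 1.0
for p in probabilities:
    product *= p

# Adding the logs instead keeps the score in a comfortable range.
log_score = sum(math.log10(p) for p in probabilities)

print(product)    # 0.0
print(log_score)  # approximately -400
```

Base 10 is used here, but any base works as long as the same base is used for every text being compared.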

First, the probability of each quadgram needs to be determined:

P(ABCD)=\frac{C_{ABCD}}{N} 
C_{ABCD} is the number of times the particular quadgram occurs
N is the total number of quadgrams in the list

You can find lists of quadgram frequency online or create your own using large samples of text. This can have its advantages – if you create your statistics from a sample of text similar to what you are trying to score this can give better results.
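
As a rough sketch of how you might build your own statistics and score a text with them, the Python below counts quadgrams in a sample and sums their log probabilities. The function names are made up for this example, and the "floor" value used for quadgrams that never occur in the sample (the log of 0.01 / N) is an assumption of this sketch, needed because the log of 0 is undefined; other choices are possible.

```python
import math
from collections import Counter

def letters_only(text):
    """Upper-case the text and keep only the letters A-Z."""
    return "".join(ch for ch in text.upper() if "A" <= ch <= "Z")

def build_log_probs(sample, n=4):
    """Return (log10 probability of each n-gram in the sample, floor value)."""
    letters = letters_only(sample)
    counts = Counter(letters[i:i + n] for i in range(len(letters) - n + 1))
    total = sum(counts.values())  # N, the total number of n-grams counted
    if total == 0:
        raise ValueError("sample is shorter than n letters")
    log_probs = {gram: math.log10(c / total) for gram, c in counts.items()}
    # Assumption: unseen n-grams get a small fixed score instead of log(0).
    floor = math.log10(0.01 / total)
    return log_probs, floor

def log_score(text, log_probs, floor, n=4):
    """Sum the log probabilities of every n-gram in the text."""
    letters = letters_only(text)
    return sum(log_probs.get(letters[i:i + n], floor)
               for i in range(len(letters) - n + 1))

# Toy sample only; real statistics need a very large body of text.
log_probs, floor = build_log_probs("LOOKOUT")
print(log_score("LOOKOUT", log_probs, floor))  # higher (closer to 0)
print(log_score("DOXWDOX", log_probs, floor))  # lower (more negative)
```

Scores are only comparable between texts of the same length, since every extra quadgram adds another negative term.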

Index of Coincidence

The Index of Coincidence is the probability that two letters selected from a text (without replacement) are the same. For a random piece of text in which every letter has a \frac{1}{26} chance of appearing, the Index of Coincidence is also \frac{1}{26} (about 0.0385).

If the frequencies of the letters are known and the frequencies sum to 1, then this formula can be used to calculate the Index of Coincidence for a particular language.

IoC=\sum_{i=A}^{i=Z}(F_{i})^{2}
F_{i} is the frequency, in decimal form (10% = 0.1), of each letter.
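
As a quick check of the formula, here is a minimal Python sketch (the function name is made up for this example). It uses the uniform distribution, where every letter has a frequency of 1/26, and recovers the random-text value quoted earlier.

```python
def ioc_from_frequencies(frequencies):
    """Sum of the squared letter frequencies (frequencies should sum to 1)."""
    return sum(f * f for f in frequencies)

# Random text: all 26 letters equally likely.
uniform = [1 / 26] * 26
print(round(ioc_from_frequencies(uniform), 4))  # 0.0385
```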

For a generic piece of text written in English the Index of Coincidence is 0.0667. It is different for each language, as the letter frequencies are different:

Language Index of Coincidence
English 0.0667
French 0.0694
German 0.0734
Spanish 0.0729
Portuguese 0.0824
Turkish 0.0701
Swedish 0.0681
Polish 0.0607
Danish 0.0672
Icelandic 0.0669
Finnish 0.0699
Czech 0.0510

The values in this table were calculated from the letter frequencies on Wikipedia. The values are for the letters A-Z only: other letters such as ‘á’ or ‘â’ are treated as ‘a’, ‘ü’ or ‘ú’ are treated as ‘u’, and so on.

However, if you want to work out the Index of Coincidence for a particular piece of text, this formula can be used instead.

IoC=\frac{\sum_{i=A}^{i=Z}C_{i}(C_{i}-1)}{L(L-1)} 
C_{i} is the count of a letter in the text.
L is the total number of letters in the text.

If a letter does not appear more than once then it does not need to be included in the calculation, because when C_{i} is 1 or 0, C_{i}(C_{i} - 1) equals 0.

Example: ‘WHENTHECLOCKSTRIKESTWELVEATTACK’ Text length = 31

Letter Count (C_i) C_i(C_i - 1)
A 2 2
C 3 6
E 5 20
H 2 2
K 3 6
L 2 2
M 0 0
S 2 2
T 5 20
W 2 2
Total 31 62

\frac{62}{31\times30}\approx 0.0667

This value is very close to the expected Index of Coincidence of English (0.0667). It is also much higher than the expected Index of Coincidence of random text (0.0385), suggesting that this text is not random.
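
The worked example can be reproduced with a short Python sketch. The function name is made up for this example, and it assumes that anything other than the letters A-Z should simply be ignored.

```python
from collections import Counter

def index_of_coincidence(text):
    """Sum of C_i(C_i - 1) over all letters, divided by L(L - 1)."""
    letters = [ch for ch in text.upper() if "A" <= ch <= "Z"]
    length = len(letters)
    if length < 2:
        raise ValueError("need at least two letters")
    counts = Counter(letters)
    # Letters that occur once contribute 1 * 0 = 0, so they drop out naturally.
    return sum(c * (c - 1) for c in counts.values()) / (length * (length - 1))

print(round(index_of_coincidence("WHENTHECLOCKSTRIKESTWELVEATTACK"), 4))  # 0.0667
```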

The larger the Index of Coincidence, the more likely it is that there is some sort of language structure behind the text. For example, text encrypted with the Vigenère Cipher has an average Index of Coincidence of about 0.042. This is above the random value of 0.0385, suggesting that the text is not random, which it is not.

Chi-Squared Statistic

The Chi-Squared Statistic is a measure of how much two categorical distributions differ from one another. For two identical distributions the score is 0, and as the distributions begin to differ the score increases. The formula is:

X^{2}=\sum_{i=A}^{i=Z}\frac{(O_{i}-E_{i})^2}{E_{i}}
O_{i} is the observed count of that letter in your text.
E_{i} is the expected count of that letter in a text of the same length.

In words, the Chi-Squared Statistic is “the sum, over each letter, of the squared difference between the observed count and the expected count, divided by the expected count.”

Example: ‘WHENTHECLOCKSTRIKESTWELVEATTACK’ Text length = 31

Letter Observed Count (O_i) Frequency in English Expected Count (E_i)* (O_i - E_i)^2 / E_i
A 2 8.17% 2.53177 0.11169
B 0 1.49% 0.46252 0.46252
C 3 2.78% 0.86242 5.29817
D 0 4.25% 1.31843 1.31843
E 5 12.70% 3.93762 0.28663
F 0 2.23% 0.69068 0.69068
G 0 2.02% 0.62465 0.62465
H 2 6.09% 1.88914 0.00651
I 1 7.00% 2.16876 0.62985
J 0 0.15% 0.04743 0.04743
K 3 0.77% 0.23932 31.84587
L 2 4.03% 1.24775 0.45352
M 0 2.41% 0.74586 0.74586
N 1 6.75% 2.09219 0.57016
O 1 7.51% 2.32717 0.75688
P 0 1.93% 0.59799 0.59799
Q 0 0.10% 0.02945 0.02945
R 1 5.99% 1.85597 0.39477
S 2 6.33% 1.96137 0.00076
T 5 9.06% 2.80736 1.71252
U 0 2.76% 0.85498 0.85498
V 1 0.98% 0.30318 1.60155
W 2 2.36% 0.73160 2.19907
X 0 0.15% 0.04650 0.04650
Y 0 1.97% 0.61194 0.61194
Z 0 0.07% 0.02294 0.02294
Total 31 100.029% 31.00899 51.92133

* Expected Count = FREQ / 100 × LEN

For English text a Chi-Squared value of about 150 or less is expected; anything above that most likely does not resemble English.

WHENTHECLOCKSTRIKESTWELVEATTACK, X^2 = 51.92133
THWKEEVIWTETSCANHKERCTTAKSCLLOE, X^2 = 51.92133
ZDXPLXTDOWXSWCRSGPWVVOCWEOTTXOK, X^2 = 425.59631

As you can see, English text scores low and random text scores much higher. However, the score is independent of letter order: the scrambled version of the English text scores exactly the same as the original.
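
The three example scores can be checked with a small Python sketch. The function and variable names are made up for this example. It uses the English frequencies as rounded to two decimal places in the table, so the results come out close to, but not exactly equal to, the values quoted, which appear to have been calculated from unrounded frequencies.

```python
from collections import Counter

# English letter frequencies in percent, copied from the table (rounded values).
ENGLISH_PERCENT = {
    "A": 8.17, "B": 1.49, "C": 2.78, "D": 4.25, "E": 12.70, "F": 2.23,
    "G": 2.02, "H": 6.09, "I": 7.00, "J": 0.15, "K": 0.77, "L": 4.03,
    "M": 2.41, "N": 6.75, "O": 7.51, "P": 1.93, "Q": 0.10, "R": 5.99,
    "S": 6.33, "T": 9.06, "U": 2.76, "V": 0.98, "W": 2.36, "X": 0.15,
    "Y": 1.97, "Z": 0.07,
}

def chi_squared(text, percent=ENGLISH_PERCENT):
    """Sum of (O_i - E_i)^2 / E_i over the letters A-Z."""
    letters = [ch for ch in text.upper() if ch in percent]
    length = len(letters)
    if length == 0:
        raise ValueError("text contains no letters")
    observed = Counter(letters)  # missing letters count as 0
    total = 0.0
    for letter, pct in percent.items():
        expected = pct / 100 * length  # Expected Count = FREQ / 100 x LEN
        total += (observed[letter] - expected) ** 2 / expected
    return total

for text in ("WHENTHECLOCKSTRIKESTWELVEATTACK",
             "THWKEEVIWTETSCANHKERCTTAKSCLLOE",
             "ZDXPLXTDOWXSWCRSGPWVVOCWEOTTXOK"):
    print(text, round(chi_squared(text), 2))
```

Because only the letter counts are used, the first two texts give the same score, which is why Chi-Squared alone cannot tell English from an anagram of English.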

I have created an Excel spreadsheet that can calculate Chi-Squared when given the frequencies of letters. It does not use macros. Chi-Squared Calculator