The Chi-Squared Statistic is a measure of how much two categorical distributions differ from one another. For two identical distributions the score is 0, and as the distributions begin to differ the score increases. The formula is

$$\chi^2 = \sum_{i=A}^{Z} \frac{(O_i - E_i)^2}{E_i}$$

$O_i$ is the observed count of letter $i$ in your text.
$E_i$ is the expected count of letter $i$ in a text of the same length as yours.
In words, the Chi-Squared Statistic is “the sum, over each letter, of the squared difference between the observed count and the expected count, divided by the expected count.”
Example: ‘WHENTHECLOCKSTRIKESTWELVEATTACK’ Text length = 31
Letter | Observed Count (Oi) | Frequency in English | Expected Count (Ei)* | (Oi – Ei)²/Ei |
A | 2 | 8.17% | 2.53177 | 0.11169 |
B | 0 | 1.49% | 0.46252 | 0.46252 |
C | 3 | 2.78% | 0.86242 | 5.29817 |
D | 0 | 4.25% | 1.31843 | 1.31843 |
E | 5 | 12.70% | 3.93762 | 0.28663 |
F | 0 | 2.23% | 0.69068 | 0.69068 |
G | 0 | 2.02% | 0.62465 | 0.62465 |
H | 2 | 6.09% | 1.88914 | 0.00651 |
I | 1 | 7.00% | 2.16876 | 0.62985 |
J | 0 | 0.15% | 0.04743 | 0.04743 |
K | 3 | 0.77% | 0.23932 | 31.84587 |
L | 2 | 4.03% | 1.24775 | 0.45352 |
M | 0 | 2.41% | 0.74586 | 0.74586 |
N | 1 | 6.75% | 2.09219 | 0.57016 |
O | 1 | 7.51% | 2.32717 | 0.75688 |
P | 0 | 1.93% | 0.59799 | 0.59799 |
Q | 0 | 0.10% | 0.02945 | 0.02945 |
R | 1 | 5.99% | 1.85597 | 0.39477 |
S | 2 | 6.33% | 1.96137 | 0.00076 |
T | 5 | 9.06% | 2.80736 | 1.71252 |
U | 0 | 2.76% | 0.85498 | 0.85498 |
V | 1 | 0.98% | 0.30318 | 1.60155 |
W | 2 | 2.36% | 0.73160 | 2.19907 |
X | 0 | 0.15% | 0.04650 | 0.04650 |
Y | 0 | 1.97% | 0.61194 | 0.61194 |
Z | 0 | 0.07% | 0.02294 | 0.02294 |
Total | 31 | 100.03% | 31.00899 | 51.92133 |
* Expected Count = FREQ / 100 × LEN
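If you would rather script the table than fill it in by hand, here is a minimal Python sketch of the same calculation. The names `ENGLISH_FREQ` and `chi_squared` are my own, not from any library, and the frequencies are the percentages from the table rounded to two decimals, so the result should land close to the table's total of 51.92133 without matching it digit for digit.

```python
from collections import Counter

# English letter frequencies in percent, copied from the table above.
# Assumption: these are rounded to two decimals, so scores will differ
# slightly from a calculation done with unrounded frequencies.
ENGLISH_FREQ = {
    "A": 8.17, "B": 1.49, "C": 2.78, "D": 4.25, "E": 12.70, "F": 2.23,
    "G": 2.02, "H": 6.09, "I": 7.00, "J": 0.15, "K": 0.77, "L": 4.03,
    "M": 2.41, "N": 6.75, "O": 7.51, "P": 1.93, "Q": 0.10, "R": 5.99,
    "S": 6.33, "T": 9.06, "U": 2.76, "V": 0.98, "W": 2.36, "X": 0.15,
    "Y": 1.97, "Z": 0.07,
}


def chi_squared(text, freq=ENGLISH_FREQ):
    """Return the chi-squared statistic of `text` against `freq` (percentages)."""
    # Keep only characters that appear in the frequency table.
    letters = [c for c in text.upper() if c in freq]
    length = len(letters)
    if length == 0:
        raise ValueError("text contains no letters to score")

    observed = Counter(letters)  # O_i; letters not seen count as 0
    total = 0.0
    for letter, percent in freq.items():
        expected = percent / 100 * length  # E_i = FREQ / 100 * LEN
        total += (observed[letter] - expected) ** 2 / expected
    return total


score = chi_squared("WHENTHECLOCKSTRIKESTWELVEATTACK")
print(round(score, 5))
```

Passing the frequency table as a parameter is just a convenience: the same function could score text against another language by swapping in a different table.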
For English, a Chi-Squared value of about 150 or less is expected; anything above that likely does not resemble English.
WHENTHECLOCKSTRIKESTWELVEATTACK, χ² = 51.92133
THWKEEVIWTETSCANHKERCTTAKSCLLOE, χ² = 51.92133
ZDXPLXTDOWXSWCRSGPWVVOCWEOTTXOK, χ² = 425.59631
As you can see, English text scores low. However, the score is independent of letter order, so the scrambled version of the message scores exactly the same as the original, while the random text scores much higher.
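The reason letter order does not matter is that the statistic only ever looks at letter counts. A quick way to convince yourself, sketched in Python with variable names of my own choosing, is to compare the counts of the first two texts directly:

```python
from collections import Counter

english = "WHENTHECLOCKSTRIKESTWELVEATTACK"
anagram = "THWKEEVIWTETSCANHKERCTTAKSCLLOE"

# Counter compares letter -> count mappings, ignoring the order of letters.
same_counts = Counter(english) == Counter(anagram)
print(same_counts)
```

If the counts are equal, every Oi is equal, and since both texts have the same length every Ei is equal too, so the two Chi-Squared values have to be identical.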
I have created an Excel spreadsheet that can calculate Chi-Squared when given the frequencies of letters. It does not use macros. Chi-Squared Calculator
Best explanation I’ve found. Thanks