Zodiac Discussion Forum

Language Statistics

Zodiac Cipher Mailings & Discussion

Fisherman'sFriend

(@fishermansfriend)

Posts: 132

Estimable Member

Topic starter

Does anyone know of a good, general source for language statistics?

I’ve been having trouble finding an answer to the following question:

What are the average distances between individual occurrences of letters in English?

For example, in this post the average distance between occurrences of the letter "e" is 6.46, or let’s say 6.

Is there a source for these distances (averages, means, etc.)?

Posted : May 21, 2020 1:55 am

Jarlve

(@jarlve)

Posts: 2550

Famed Member

You may want to be careful with how you measure it, because I think there will be a very strong correlation with letter frequency: "e" will have the smallest average distance, "t" the second smallest, and so on down ETAOINSHRDLU. I don’t have my corpus ready at the moment and I’m hoping someone else can assist you. But you may want to think about how you want to measure it, and what you want to get out of it. Function?

EDIT: for example, in your text, there are 313 letters, 52 being "e", 313 / 52 = 6.01.
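The relationship in the EDIT can be checked with a minimal sketch. The function names `mean_gap` and `length_over_count` are made up for illustration, and dropping everything except A-Z before counting is an assumption about how the text is measured. Consecutive gaps sum to the last position minus the first position, so the mean gap stays close to length / count whenever a letter occurs many times across the whole text.

```python
def _letters_only(text):
    # Assumption: keep only A-Z (case-insensitive), drop everything else.
    return [c for c in text.upper() if "A" <= c <= "Z"]


def mean_gap(text, letter):
    """Average distance between consecutive occurrences of `letter`."""
    letters = _letters_only(text)
    target = letter.upper()
    positions = [i for i, c in enumerate(letters) if c == target]
    if len(positions) < 2:
        return None  # fewer than two occurrences: no gap to measure
    # The gaps telescope: their sum is last position minus first position.
    return (positions[-1] - positions[0]) / (len(positions) - 1)


def length_over_count(text, letter):
    """Text length divided by the number of occurrences of `letter`."""
    letters = _letters_only(text)
    count = letters.count(letter.upper())
    return len(letters) / count if count else None


if __name__ == "__main__":
    sample = "the tree"
    print(mean_gap(sample, "e"))           # 2.0
    print(length_over_count(sample, "e"))  # 7 / 3
```

On a short text the two numbers can differ noticeably, because the stretch before the first occurrence and after the last one is counted by length / count but not by the mean gap.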

AZdecrypt

Posted : May 21, 2020 10:11 am

doranchak

(@doranchak)

Posts: 2614

Member Admin

Here are the results from my corpus scanner. My corpus is a mix of reddit comments and Project Gutenberg books, and each sample is taken randomly. Anything that isn’t a letter A-Z is removed.

90 sources. 100,785,142 characters.

Letter distance averages:

E 8.53
T 10.53
O 12.65
A 12.70
I 13.77
N 15.40
S 15.66
R 18.03
H 20.32
L 23.42
D 26.75
U 31.49
M 34.88
C 35.96
Y 42.35
P 44.15
G 44.90
W 45.98
F 50.34
B 59.82
K 90.52
V 95.71
J 407.38
X 425.65
Q 756.62
Z 868.09

We can see that Jarlve’s prediction is correct: The letter distance averages are strongly related to expected letter frequencies.
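For anyone who wants to run a similar measurement on their own text, here is a minimal sketch. It is not the corpus scanner used above: the function name `letter_distance_averages` is hypothetical, and it treats the input as one continuous stream, which is an assumption, since the post does not say how boundaries between samples are handled. It does follow the stated rule that anything other than A-Z is removed.

```python
from collections import defaultdict


def letter_distance_averages(text):
    """Mean distance between consecutive occurrences of each letter A-Z."""
    last_seen = {}                # letter -> position of its latest occurrence
    gap_sum = defaultdict(int)    # letter -> sum of gaps seen so far
    gap_count = defaultdict(int)  # letter -> number of gaps seen so far
    pos = 0                       # position counted over kept letters only
    for ch in text.upper():
        if not ("A" <= ch <= "Z"):
            continue              # anything that is not A-Z is dropped
        if ch in last_seen:
            gap_sum[ch] += pos - last_seen[ch]
            gap_count[ch] += 1
        last_seen[ch] = pos
        pos += 1
    # Letters that occur fewer than twice have no gap and are omitted.
    return {ch: gap_sum[ch] / gap_count[ch] for ch in gap_count}


if __name__ == "__main__":
    averages = letter_distance_averages("banana")
    # Print smallest average distance first, like the table above.
    for ch, avg in sorted(averages.items(), key=lambda kv: kv[1]):
        print(f"{ch} {avg:.2f}")
```

A single pass with a "last seen" position per letter keeps memory constant, which matters for a corpus of this size.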

http://zodiackillerciphers.com

Posted : May 21, 2020 1:19 pm

Fisherman'sFriend

(@fishermansfriend)

Posts: 132

Estimable Member

Topic starter

Thank you both! Ask and ye shall receive!

I guess I have a vague idea that distance would be a useful tool when looking at possible transposition schemes.

I’m looking at an idea that seems to be showing me some matching strings of letters from different parts of the cipher and I’m basically wondering what kind of useful assumptions I could possibly make about them. Hopefully assumptions that could lead to more letters in the strings, a stronger pattern. One thought was looking at distance, not just frequency. Seems like it could eliminate or open up some possibilities. For example, I could make an assumption that while there are some "e’s" in a given string, they are also probably spaced in a certain way, as opposed to all in a row in the middle of a string.
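A rough sketch of how that spacing check might look for a single candidate string. The helper names `letter_gaps` and `longest_run` are hypothetical, and how to judge the resulting numbers (for example, what counts as an implausibly long run) is left open, since that depends on the reference statistics being used.

```python
def letter_gaps(candidate, letter):
    """Distances between consecutive occurrences of `letter` in `candidate`."""
    target = letter.upper()
    positions = [i for i, c in enumerate(candidate.upper()) if c == target]
    return [b - a for a, b in zip(positions, positions[1:])]


def longest_run(candidate, letter):
    """Length of the longest unbroken run of `letter` in `candidate`."""
    target = letter.upper()
    best = run = 0
    for c in candidate.upper():
        run = run + 1 if c == target else 0
        best = max(best, run)
    return best


if __name__ == "__main__":
    print(letter_gaps("ELEVEN", "e"))    # [2, 2]
    print(letter_gaps("THEEEEND", "e"))  # [1, 1, 1]
    print(longest_run("THEEEEND", "e"))  # 4
```

A string whose gaps are all 1 has its occurrences bunched together, which is the "all in a row" case described above.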

Thanks for the info.

Posted : May 22, 2020 3:56 am
