Does anyone know of a good, general source for language statistics?
I’ve been having trouble finding an answer to the following question:
What are the average distances between individual occurrences of letters in English?
For example, in this post the average distance between occurrences of the letter "e" is 6.46, or let’s say 6.
Is there a source for these distances: averages, means, etc.?
You may want to be careful with how you measure it, because I think there will be a very strong correlation with letter frequency: "e" will have the smallest average distance, "t" the second smallest, and so on down the ETAOINSHRDLU order. I don’t have my corpus ready at the moment and I’m hoping someone else can assist you. But you may want to think about how you want to measure it and what you want to get out of it. What function would it serve?
EDIT: For example, in your text there are 313 letters, 52 of them "e", so 313 / 52 ≈ 6.02.
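To make the "how you measure it" point concrete, here is a minimal Python sketch of two possible measures: the mean gap between consecutive occurrences of a letter, and the simpler letters-divided-by-count ratio used in the 313 / 52 calculation. The function names are made up for illustration, and stripping everything except A-Z is an assumption; counting spaces and punctuation as positions would give larger distances.

```python
import string

def letters_only(text):
    """Uppercase the text and keep only A-Z (an assumption about what counts)."""
    return [c for c in text.upper() if c in string.ascii_uppercase]

def mean_gap(text, letter):
    """Mean distance between consecutive occurrences of `letter`.

    Returns None when the letter occurs fewer than two times.
    """
    letters = letters_only(text)
    target = letter.upper()
    positions = [i for i, c in enumerate(letters) if c == target]
    if len(positions) < 2:
        return None
    gaps = [b - a for a, b in zip(positions, positions[1:])]
    return sum(gaps) / len(gaps)

def length_over_count(text, letter):
    """Total number of letters divided by the number of occurrences."""
    letters = letters_only(text)
    count = letters.count(letter.upper())
    return len(letters) / count if count else None
```

The two numbers converge on long texts, since the mean gap is just (last position - first position) / (count - 1), but they can differ noticeably on something as short as a single post.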
Here are the results from my corpus scanner. My corpus is a mix of Reddit comments and Project Gutenberg books, and each sample is taken randomly. Anything that isn’t a letter A-Z is removed.
90 sources. 100,785,142 characters.
Letter distance averages:
E 8.53
T 10.53
O 12.65
A 12.70
I 13.77
N 15.40
S 15.66
R 18.03
H 20.32
L 23.42
D 26.75
U 31.49
M 34.88
C 35.96
Y 42.35
P 44.15
G 44.90
W 45.98
F 50.34
B 59.82
K 90.52
V 95.71
J 407.38
X 425.65
Q 756.62
Z 868.09
We can see that Jarlve’s prediction is correct: The letter distance averages are strongly related to expected letter frequencies.
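One way to see the relationship: if a letter makes up a fraction p of the text, its occurrences are on average about 1/p letters apart, so each distance in the table can be turned back into an implied frequency. The sketch below does that with the figures listed above; the variable and function names are just for illustration.

```python
# Average distances copied from the table above.
AVG_DISTANCE = {
    "E": 8.53, "T": 10.53, "O": 12.65, "A": 12.70, "I": 13.77,
    "N": 15.40, "S": 15.66, "R": 18.03, "H": 20.32, "L": 23.42,
    "D": 26.75, "U": 31.49, "M": 34.88, "C": 35.96, "Y": 42.35,
    "P": 44.15, "G": 44.90, "W": 45.98, "F": 50.34, "B": 59.82,
    "K": 90.52, "V": 95.71, "J": 407.38, "X": 425.65, "Q": 756.62,
    "Z": 868.09,
}

def implied_frequency_percent(distance):
    """Frequency (in percent) implied by an average distance, using p = 1/d."""
    return 100.0 / distance

implied = {k: implied_frequency_percent(d) for k, d in AVG_DISTANCE.items()}

# Print letters from most to least frequent with their implied share.
for letter, pct in sorted(implied.items(), key=lambda kv: -kv[1]):
    print(f"{letter} {pct:5.2f}%")

# If distance is essentially 1/frequency, the implied shares should add
# up to roughly 100%.
print(f"total {sum(implied.values()):.2f}%")
```

For "E" this gives roughly 11.7%, and the implied shares for all 26 letters come out close to 100% in total, which is what you would expect if the distance carries little information beyond the frequency itself.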
Thank you both! Ask and ye shall receive!
I guess I have a vague idea that distance would be a useful tool when looking at possible transposition schemes.
I’m looking at an idea that seems to be showing me some matching strings of letters from different parts of the cipher, and I’m basically wondering what kind of useful assumptions I could possibly make about them. Hopefully assumptions that could lead to more letters in the strings and a stronger pattern. One thought was looking at distance, not just frequency. It seems like it could eliminate or open up some possibilities. For example, I could assume that while there are some "e"s in a given string, they are also probably spaced in a certain way, as opposed to all in a row in the middle of the string.
Thanks for the info.
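A rough sketch of what such a spacing check on a candidate string might look like in Python. Everything here is hypothetical: the function names, the choice of summary values, and the idea of comparing against a corpus average such as the 8.53 reported for "E" above. Short strings will be very noisy, so treat the output as a hint at most.

```python
def letter_gaps(candidate, letter):
    """Distances between consecutive occurrences of `letter` (A-Z only)."""
    letters = [c for c in candidate.upper() if "A" <= c <= "Z"]
    target = letter.upper()
    positions = [i for i, c in enumerate(letters) if c == target]
    return [b - a for a, b in zip(positions, positions[1:])]

def spacing_report(candidate, letter, expected_mean):
    """Summarise how `letter` is spaced, relative to an expected mean gap.

    `expected_mean` is whatever corpus average you trust.
    Returns None if the letter occurs fewer than two times.
    """
    gaps = letter_gaps(candidate, letter)
    if not gaps:
        return None
    mean = sum(gaps) / len(gaps)
    return {
        "gaps": gaps,
        "mean": mean,
        # Well below 1.0 suggests the letter is more clumped than usual.
        "ratio_to_expected": mean / expected_mean,
        # Number of times the letter appears twice in a row.
        "adjacent_pairs": gaps.count(1),
    }

report = spacing_report("THEREEEARE", "E", expected_mean=8.53)
```

A candidate in which every gap is 1 (all the "e"s in a row) would stand out immediately, which is the kind of string this check is meant to flag.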