Z340 another approach

AlanBenjy, Subject: Z340 another approach Fri Mar 15, 2013 12:58 pm

I have been thinking about another approach to the 340 recently and want to get working on it.
I’m not going to discuss the approach at this time, as I don’t want to waste other people’s effort until I think it is worthy after some initial analysis of my own.

However, I could do with some help locating good resources on word counts and word frequencies within a given text space. I’m sure I’ve seen some on the web before but cannot seem to find them at the moment. I’m particularly interested in average word lengths and logical groupings, i.e. how many times ‘the’ or ‘and’ would appear in English text for a given volume of words (within a paragraph, a page, etc.).

Thanks for any initial pointers

Alan :-)

doranchak, Subject: Re: Z340 another approach Fri Mar 15, 2013 2:15 pm

I’ve been using the "all.num.o5" word frequency list from this site:

http://www.kilgarriff.co.uk/bnc-readme.html

It’s based on British corpora, however, which means there might be subtle differences between it and American English.

I’ve also used this one, a list of "common words from Google Books":

http://norvig.com/google-books-common-words.txt

Both files contain simple raw counts of words. But I’m not sure how directly applicable that is towards calculating things like number of times a certain word is expected to appear in a certain space.

doranchak, Subject: Re: Z340 another approach Fri Mar 15, 2013 2:30 pm

Out of curiosity, I took a Project Gutenberg transcription of "A Tale of Two Cities" and used a script to count each word.

It works out to about 138,389 words. 8,053 of them are "THE", and 4,999 of them are "AND". So, in a given paragraph, you’d expect about 5.8% of the words to be "THE", and 3.6% of them to be "AND".
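A count like this can be sketched in a few lines of Python. This is not the script used above, only a minimal illustration; the function name `word_rates` and the tokenization rule (runs of letters, allowing internal apostrophes) are assumptions, and a different rule will shift the totals slightly.

```python
import re
from collections import Counter

def word_rates(text, targets=("THE", "AND")):
    """Return (total word count, {word: (count, share of all words)})."""
    # Assumed tokenization: uppercase everything, then treat runs of
    # letters (optionally joined by straight apostrophes) as words.
    words = re.findall(r"[A-Z]+(?:'[A-Z]+)*", text.upper())
    counts = Counter(words)
    total = len(words)
    rates = {}
    for w in targets:
        share = counts[w] / total if total else 0.0
        rates[w] = (counts[w], share)
    return total, rates

# Tiny stand-in sample; for a real run, read the Gutenberg text file instead.
sample = "It was the best of times, it was the worst of times, and the rest is history."
total, rates = word_rates(sample)
print(total, rates)
```

For a whole book, pass the file contents (for example `open(path, encoding="utf-8").read()`) in place of `sample`, ideally after trimming the Project Gutenberg header and licence text so they are not counted.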

I’ve read that the average English word length is around 5.1 letters. So a 340-character cipher, say, could be expected to contain around 67 words (340 / 5.1). About 4 of them could be "THE", and 2 of them could be "AND".
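The arithmetic behind those estimates, as a minimal sketch. The helper name `expected_counts` is made up for illustration; the rates are the "A Tale of Two Cities" counts quoted above, and the 5.1-letter average is taken as given.

```python
def expected_counts(cipher_length, avg_word_length, rates):
    """Estimate the number of words in a text of the given length and
    the expected number of occurrences of each word."""
    n_words = cipher_length / avg_word_length
    return n_words, {w: n_words * r for w, r in rates.items()}

# Rates from the counts quoted above: 8,053 THE and 4,999 AND in 138,389 words.
rates = {"THE": 8053 / 138389, "AND": 4999 / 138389}
n_words, expected = expected_counts(340, 5.1, rates)

print(round(n_words))             # 67
print(round(expected["THE"], 1))  # 3.9
print(round(expected["AND"], 1))  # 2.4
```

These are averages only; any single 340-character text could easily hold a few more or fewer of each word.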

We could get more accurate "word rates" by scanning a larger corpus, of course.

Going back to all.num.o5 (one of the word frequency lists I mentioned), we find that the corpus it’s based on has 6,187,267 occurrences of "the" among 100,106,029 total words, which works out to about 6.2%. Not too far from our little "A Tale of Two Cities" example.

Hope this helps.

AlanBenjy, Subject: Re: Z340 another approach Fri Mar 15, 2013 3:09 pm

Thanks, Doranchak, that’s a great start, just what I was looking for :-)

Posted : April 7, 2013 2:49 pm

Zodiac Discussion Forum
