Ngram benchmarks

beijinghouse · 2019-01-26T00:34:03Z

Can anyone recommend a set of deciphering tests to judge whether one new ngram file is better than another? I've developed new ngram files and they seem better in my judgement, but it's hard to quantify without a good way to measure them.

Right now, I'm seeing how short a version of the 408 each file can solve:

custom 5-gram
7 line 408 solves instantly
6 line 408 solves pretty ok
4.5 lines 408 solves with seq homophones

custom 6-gram
5 line 408 solves quickly w/ only regular solver
4.7 line 408 solves w/ regular solver after a min
4 lines 408 solves with seq homophones easy
3.5 lines 408 solve w/ seq homophones

"[The sequential homophone solver can now] Solve on a 5 row 408 with 6-grams, AZdecrypt's previous best was 6 rows with 7-grams"
"Solve on a 6 row 408 with 5-grams, a tie with AZdecrypt's previous best which had to use 7-grams to accomplish this"

Even if this is a good test, it's becoming hard to use as I improve my files, because I'm not sure that any language model can be good enough to solve a 3 line 408.

My only other informal benchmark has been solve time for "Smokie treats: Perfect cycles + 3 high count 1:1 substitutes + 4 medium-high count wildcards/polyalphabetic symbols" w/ seq homophones:

custom 5-grams: 55 sec
custom 6-grams: 58 sec
5-gram pc: n/a
5-gram-reddit: 1:39
6-gram-reddit: 1:05

Can anyone suggest other benchmarks? I've skimmed through the "Test Cipher Collection" that doranchak kindly compiled. Nothing jumps out at me as an obvious yardstick, but I don't really know where to look:...

Jarlve

(@jarlve)

Posts: 2547

Famed Member

Hey beijinghouse,

Your n-grams seem to be overtuned/biased towards Zodiac text, so it is not fair to compare them against other n-grams using the 408 (4.5-line solves etc.).

I tested 100 texts including a part of the 408:

Your n-grams: average score of all texts (22154) with the 408 having the highest score (25018). Highest non-Zodiac plaintext (23110).
Practical cryptography n-grams: average score of all texts (24676) with some other text having the highest score (26152) and the 408 scoring (24864).
My reddit n-grams: average score of all texts (23453) with some other text having the highest score (25110) and the 408 scoring (24609).
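
One way to read these numbers is to divide each file's 408 score by its average score over the whole battery; a ratio well above 1 suggests the file favors the 408. The Python sketch below only redoes that arithmetic with the figures quoted above. The dictionary layout and the name `bias_ratio` are illustrative, not part of any tool mentioned here.

```python
# Scores quoted above: (average over the 100 texts, score on the 408 excerpt).
scores = {
    "beijinghouse": (22154, 25018),
    "practical cryptography": (24676, 24864),
    "jarlve reddit": (23453, 24609),
}

def bias_ratio(average, score_408):
    """Ratio of the 408 score to the battery average (1.0 means no preference)."""
    return score_408 / average

ratios = {name: bias_ratio(avg, s408) for name, (avg, s408) in scores.items()}
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.3f}")
```

By this measure the 408 sits roughly 13% above the battery average for the first file, against about 1% and 5% for the other two.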

AZdecrypt

Posted : February 25, 2019 8:32 pm

beijinghouse

(@beijinghouse)

Posts: 34

Eminent Member

Topic starter

The ngrams are augmented with all of Zodiac’s writings and phone calls. Nudging the model to prefer more Zodiac-like phrases and words is desirable since it’s designed to be more capable of attempting a 340 solve. The 408 is just a very Zodiac-like text, so it scores fabulously because of it.

You’re right that these models "score lower" in general.

But they also have:
1. Higher solve rates
2. Faster solve times
3. More accurate final solutions
4. 88% smaller files

The corollary of the n-grams favoring Zodiac-like text is they necessarily have artificial scoring deficits on all non-Zodiac inspired texts. Your scoring battery is mostly normal text, so these ngrams have an innate lower scoring potential. That’s fine. They’re only designed to score their highest when they’re correctly solving a Zodiac cipher. And it’s not really a big deal if they have lower relative scores on mainstream texts while still correctly solving a wider range of ciphers than any other model.

The Zodiac used a very small vocabulary. A nice outgrowth of this is that biasing the ngrams to favor his writing style doesn’t significantly affect their ability to simultaneously decode normal English writing extremely well. The power of these files isn’t coming from their biasing towards Zodiac-style text by tweaking the weighting of 0.06% of the ngrams by over-weighting his writings. It’s coming from the superior ordering of the other 99.9% of ngrams the Zodiac never used. These ngrams were ordered with an overwhelming sample of input text. They’re far more representative of general written English text than any previously constructed ngram models. The Practical Cryptography and even the reddit ngrams routinely get very high scores even when they’re trapped in gibberish decipherments. Meanwhile, my ngrams get lower scores even when they solve a cipher instantly.
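
For anyone following along who hasn't built an ngram scorer: a common approach is to sum a log-frequency weight over every n-letter window of the candidate plaintext. The Python sketch below assumes that approach with a toy table built from a placeholder sentence. It is not AZdecrypt's actual scoring, and `build_table`, `score`, and the `floor` penalty are hypothetical names chosen for illustration.

```python
import math
from collections import Counter

def letters_only(text):
    """Uppercase the text and drop everything that is not a letter."""
    return "".join(c for c in text.upper() if c.isalpha())

def build_table(corpus, n):
    """Count every n-letter window in the corpus and convert counts to log10 probabilities."""
    letters = letters_only(corpus)
    counts = Counter(letters[i:i + n] for i in range(len(letters) - n + 1))
    total = sum(counts.values())
    return {gram: math.log10(count / total) for gram, count in counts.items()}

def score(text, table, n, floor=-10.0):
    """Sum log-probabilities over all windows; unseen n-grams get a fixed floor penalty."""
    letters = letters_only(text)
    return sum(table.get(letters[i:i + n], floor) for i in range(len(letters) - n + 1))

# Toy corpus (placeholder); a real table would come from a very large text sample.
table = build_table("the quick brown fox jumps over the lazy dog", 3)
likely = score("THEQUICK", table, 3)
unlikely = score("XQZJKVWP", table, 3)
print(likely > unlikely)
```

Because the total depends on how each table was built and normalized, raw scores from two different ngram files are not on the same scale.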

For example, which solution should we prefer? The right one? Or the wrong one with the 12% higher score?

6-grams_english_beijinghouse_beta2.txt
Score: 21534.64 Ioc: 0.05263 Multiplicity: 0.43589
N-grams: 0 PC-cycles: 37

FORTYFEETBELOWTWOMILLIONSOUNDSAREBURIED


6-grams_english_jarlve_reddit.txt
Score: 24274.67 Ioc: 0.05668 Multiplicity: 0.43589
N-grams: 1 PC-cycles: 37

NTRDONEEDBESTODOTHISSITACTUALLYREBURIEL
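
As a quick check of the figure above, the relative gap between the two reported scores can be computed directly. This is only arithmetic on the numbers shown in the solver output; the variable names are illustrative.

```python
right_score = 21534.64  # 6-grams_english_beijinghouse_beta2.txt, correct plaintext
wrong_score = 24274.67  # 6-grams_english_jarlve_reddit.txt, incorrect plaintext

# Relative gap of the wrong solution's score over the right one's.
relative_gap = (wrong_score - right_score) / right_score
print(f"{relative_gap:.1%}")
```

The gap comes out a little under 13%, in line with the roughly 12% quoted.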

Posted : February 26, 2019 2:18 am

Jarlve

I was surprised to find the bias since you didn’t mention it, and now I am left wondering what other biases there are towards existing texts/ciphers.

Posted : February 26, 2019 8:29 pm

beijinghouse

There is no special tuning to any known cipher plaintexts. There’s only an implicit tuning to the 408 because I augment the ngrams with all of Zodiac’s writing and the 408 is 2.5% of that writing.
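
To make "augmenting" concrete, here is a minimal Python sketch, assuming it means adding weighted ngram counts from a small target corpus to the counts from a large general corpus. The exact procedure isn't spelled out in this thread, so `count_ngrams`, `augment`, the `weight` value, and the toy corpora are all hypothetical.

```python
from collections import Counter

def count_ngrams(text, n):
    """Count n-letter windows after stripping everything except letters."""
    letters = "".join(c for c in text.upper() if c.isalpha())
    return Counter(letters[i:i + n] for i in range(len(letters) - n + 1))

def augment(general, target, weight):
    """Return the general counts plus `weight` times the target counts."""
    combined = Counter(general)
    for gram, count in target.items():
        combined[gram] += weight * count
    return combined

# Toy stand-ins: a real general corpus would be vastly larger than the target text.
general = count_ngrams("the weather was fine and the road was long", 3)
target = count_ngrams("this is the zodiac speaking", 3)
augmented = augment(general, target, weight=5)
print(augmented["ZOD"], general["ZOD"])
```

Ngrams that never occur in the target text keep their original counts, so only the small share of ngrams the target author actually used is shifted.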

In an alternate universe where somehow the 408 had never been solved, augmenting ngrams with all of Zodiac’s other writing (the other 97.5% of known text w/o any of 408’s plaintext) would have been one of the best strategies to assist any given solver with solving the 408. For instance, his first known letter "The Confession" contains nearly every ngram in the 408 and in roughly the same proportions. He even uses the same shortened average word length (4 letters vs 5 in typical English).
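
The word-length comparison is easy to reproduce on any transcript. A small Python sketch follows, assuming words are separated by whitespace and punctuation is ignored; `average_word_length` is an illustrative helper and the sample sentence is only a placeholder, not Zodiac text.

```python
def average_word_length(text):
    """Mean number of letters per whitespace-separated word, ignoring punctuation."""
    words = ["".join(c for c in w if c.isalpha()) for w in text.split()]
    words = [w for w in words if w]  # drop tokens that contained no letters
    if not words:
        return 0.0
    return sum(len(w) for w in words) / len(words)

# Placeholder sample; substitute a letter transcript to compare against ordinary English.
sample = "the quick brown fox"
print(average_word_length(sample))
```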

Pretending the 408 is unknown today and not including knowledge of it would only be useful as a diagnostic exercise to make ngram files more directly comparable, because prior versions didn’t implement this optimization. But unbiased files wouldn’t actually be useful for attempting to solve the 340. If someone came back from the future and told me that vanilla 10-gram files are needed to solve the 340, I would bet my bottom dollar that augmented 9-grams would work too. This general argument applies to every length of n-gram. I’d estimate augmenting provides approximately 1 extra n-gram worth of solving power. If the 340 can be solved, it’s a safe assumption that an augmented model will be the first type of model to solve it.

Perhaps I’ll construct augmented versions of all your current ngram files so we can do more comparable, apples-to-apples testing. Even without the original input, I could make very precise updates to each that would give them the exact same advantages that my new files are getting from Zodiac text augmenting. Then you could compare scores head-to-head for all the files and any score differences would actually be totally comparable and meaningful.

Posted : February 28, 2019 2:21 pm

Jarlve

I am okay with the augmenting/bias; I was just a little worried.

beijinghouse wrote: For instance, his first known letter "The Confession" contains nearly every ngram in the 408 and in roughly the same proportions. He even uses the same shortened average word length (4 letters vs 5 in typical English).

Yeah, I also noticed the 408 had a shortened average word length. "The Confession" is not a confirmed Zodiac communication, but if you can make a strong argument then that could be a very valuable contribution.

beijinghouse wrote: Perhaps I’ll construct augmented versions of all your current ngram files so we can do more comparable, apples-to-apples testing. Even without the original input, I could make very precise updates to each that would give them the exact same advantages that my new files are getting from Zodiac text augmenting. Then you could compare scores head-to-head for all the files and any score differences would actually be totally comparable and meaningful.

Not needed. I still would like to test your n-grams without bias if possible.

Thanks for all your hard work so far.

Posted : February 28, 2019 6:54 pm

Zodiac Discussion Forum