Ever notice how potential 340 solves almost inevitably get stuck exploring solutions like:
GSHESCENDUPONSOFM…
KSONTOBESKHFLAPRE…
KTOLDOMEVESTITUNA…
Having it lead off with fragments like "GSHES", "KSON", or "KTOLD" seems pretty unlikely, which is unfortunate. The greatest strength of ngram models is their positional independence, but that same independence keeps biasing the search away from the correct solution space in a way that’s intuitively hard to accept. How can we fix this?
Consider the top 30 ngrams from 1) the first position of messages in a test corpus (the "1ST" row below), and 2) the full distribution (the "ALL" row):
1ST: THANK THATS THERE ITHIN THISI IDONT IFYOU IHAVE YOURE IWOUL YEAHI IMNOT YOUCA BECAU IKNOW ILOVE MAYBE ILIKE DOYOU WELLI IJUST WHATI SORRY ANDTH EVERY THATI ITSNO WHATA YOUAR
ALL: THING THERE INGTH OFTHE WOULD INTHE ATION OTHER ABOUT EOPLE PEOPL ANDTH INGTO THINK THATS CAUSE NGTHE ETHAT NDTHE ATTHE ECAUS IFYOU STHAT BECAU EVERY TOTHE REALL ONTHE TIONS
- Obviously, firstgrams don’t contain any leading-edge fragments like EOPLE, INGTHE, STHAT, ETHAT, etc.
- Another anomaly is how few firstgrams there are. Only 4% of the distinct 5-grams in the full distribution ever occur in position 1, even though, by definition, 100% of them occur in the full distribution. For reference, positions 2–5 have 50%, 54%, 54%, and 55% of the full distribution represented, respectively. Only the 1st position is this severely depleted of diversity.
- Also, the rankings of each 5-gram are distributed rather differently. For example, even the most common ngrams in a given position (the top 1000) often won’t appear in the top 5000 of the full, overall distribution. The fractions missing for positions 1–5 are 52%, 58%, 31%, 6%, and 3% (see the sketch after this list for how these numbers can be tallied). So by the 4th position, the extreme skewing of the distributions has relaxed and the full distribution is once again at least semi-representative. But the first 3 positions of each new message are quite different, with the 1st position being an outlier both in how depleted its distribution is and in how skewed it is. 6-grams and 7-grams show a similar pattern.
- Firstgrams also look incredibly appealing as lead-off cribs for z340. Consider that "ILIKE" (z408’s correct lead-off crib) naturally occurs 18th on the firstgram list. For comparison, it’s buried 1164th in the full distribution. Sure, 1164th is still high in a list of 11 million, but in a logarithmic distribution, it’s not really the same ballpark. And if it turns out that the 340 key was created on the fly, then there’s perhaps even more reason than normal to more carefully rank and score the lead-off portion of z340 than we’d otherwise need to. What a happy coincidence that it’s possible to do this, since the distribution is so naturally small and different.
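For anyone who wants to reproduce this sort of bookkeeping, here is a minimal sketch of how the coverage, skew, and rank numbers above could be tallied. The corpus path, the cleaning assumptions (one message per line, A–Z only, no spaces), and the 5-gram order are illustrative, not a description of any particular solver.
[code]
from collections import Counter

N = 5  # ngram order

def ngrams(line, n=N):
    return [line[i:i + n] for i in range(len(line) - n + 1)]

# Hypothetical corpus: one cleaned message per line (A-Z only, no spaces).
full, by_pos = Counter(), [Counter() for _ in range(5)]
with open("corpus.txt") as f:
    for line in f:
        line = line.strip().upper()
        for pos, gram in enumerate(ngrams(line)):
            full[gram] += 1
            if pos < 5:
                by_pos[pos][gram] += 1

# Coverage: what fraction of the distinct 5-grams in the full distribution
# ever shows up at position k?
for k in range(5):
    print(f"position {k + 1}: {len(by_pos[k]) / len(full):.0%} of the full distribution represented")

# Skew: how much of each position's top 1000 is missing from the full top 5000?
top_full = {g for g, _ in full.most_common(5000)}
for k in range(5):
    top_k = [g for g, _ in by_pos[k].most_common(1000)]
    missing = sum(g not in top_full for g in top_k) / len(top_k)
    print(f"position {k + 1}: {missing:.0%} of its top 1000 absent from the full top 5000")

# Rank of a lead-off crib in each list (e.g. ILIKE from the z408 solution).
def rank(counter, gram):
    ordered = [g for g, _ in counter.most_common()]
    return ordered.index(gram) + 1 if gram in counter else None

print("ILIKE rank in firstgrams:", rank(by_pos[0], "ILIKE"))
print("ILIKE rank in full distribution:", rank(full, "ILIKE"))
[/code]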
- Firstgram files should be ~10x smaller than regular, full-distribution ngram files.
- Once you’re using Firstgrams, you could also slightly sharpen the rest of the ngram distribution by subtracting the firstgram differences from the main distribution too.
- There are also opportunities for model mixing. You know what firstgrams look like? The beginnings of new words! And the Zodiac Killer’s writing is highly staccato, registering a new word every 3.97 characters. This is well below typical English writing, where new words only occur every 5.0 characters. So mixing in at least 50% firstgram weight into roughly every 4th character of the 340 is another broad, easy-to-implement, potential improvement to help the hill climber get going in the proper direction in many locations, and not just at the beginning. (A sketch of this sort of mixing appears after this list.)
What would it take to track and use this sort of data in practice? Perhaps we could start building special firstgram files that contain only these distributions: free of leading-edge anomalies and heavily weighted towards the typical phrases used in opening remarks.
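To make the mixing idea concrete, here is a minimal sketch of a scoring function that blends firstgram and full-distribution log-probabilities at chosen positions. It assumes the full and by_pos[0] counters from the sketch above; the crude smoothing, the 50% weight, and the every-4th-character spacing are illustrative knobs, not recommendations.
[code]
import math

N = 5

def log_prob(counter, total, gram):
    # Crude add-one smoothing; a real solver would use something more careful.
    return math.log((counter.get(gram, 0) + 1) / (total + 1))

def score_candidate(plaintext, full, firstgrams, mix_positions, first_weight=0.5):
    """Score a candidate plaintext, blending in firstgram scores at mix_positions."""
    full_total = sum(full.values())
    first_total = sum(firstgrams.values())
    score = 0.0
    for i in range(len(plaintext) - N + 1):
        gram = plaintext[i:i + N]
        s = log_prob(full, full_total, gram)
        if i in mix_positions:
            s_first = log_prob(firstgrams, first_total, gram)
            s = first_weight * s_first + (1 - first_weight) * s
        score += s
    return score

# Example: extra firstgram weight at the lead-off and at roughly every 4th index.
# mix = {0} | set(range(4, 340, 4))
# score_candidate(candidate_text, full, by_pos[0], mix)
[/code]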
Ever notice how potential 340 solves almost inevitably get stuck exploring solutions like:
Yes we have. It’s a small local maximum.
– Obviously, firstgrams don’t contain any leading-edge fragments like EOPLE, INGTHE, STHAT, ETHAT, etc.
– Another anomaly is how few firstgrams there are. Only 4% of the distinct 5-grams in the full distribution ever occur in position 1, even though, by definition, 100% of them occur in the full distribution. For reference, positions 2–5 have 50%, 54%, 54%, and 55% of the full distribution represented, respectively. Only the 1st position is this severely depleted of diversity.
– Also, the rankings of each 5-gram are distributed rather differently. For example, even the most common ngrams in a given position (the top 1000) often won’t appear in the top 5000 of the full, overall distribution. The fractions missing for positions 1–5 are 52%, 58%, 31%, 6%, and 3%. So by the 4th position, the extreme skewing of the distributions has relaxed and the full distribution is once again at least semi-representative. But the first 3 positions of each new message are quite different, with the 1st position being an outlier both in how depleted its distribution is and in how skewed it is. 6-grams and 7-grams show a similar pattern.
– Firstgrams also look incredibly appealing as lead-off cribs for z340. Consider that "ILIKE" (z408’s correct lead-off crib) naturally occurs 18th on the firstgram list. For comparison, it’s buried 1164th in the full distribution. Sure, 1164th is still high in a list of 11 million, but in a logarithmic distribution, it’s not really the same ballpark. And if it turns out that the 340 key was created on the fly, then there’s perhaps even more reason than normal to more carefully rank and score the lead-off portion of z340 than we’d otherwise need to. What a happy coincidence that it’s possible to do this, since the distribution is so naturally small and different.
Very interesting, thank you for your research.
– Once you’re using Firstgrams, you could also slightly sharpen the rest of the ngram distribution by subtracting the firstgram differences from the main distribution too.
That doesn’t seem intuitive to me. First-grams are the beginnings of sentences, and the exact number of sentences in the cipher text is unknown.
– There are also opportunities for model mixing. You know what firstgrams look like? The beginnings of new words! And the Zodiac Killer’s writing is highly staccato, registering a new word every 3.97 characters. This is well below typical English writing, where new words only occur every 5.0 characters. So mixing in at least 50% firstgram weight into roughly every 4th character of the 340 is another broad, easy-to-implement, potential improvement to help the hill climber get going in the proper direction in many locations, and not just at the beginning.
Since only 1 character out of 4 is the start of a new word, there would be a first-gram mismatch 3 times out of 4, because the actual starts of the words are unknown.
In general, I think first-grams can improve things when multiplicity, or something acting similarly, is an issue. And positional awareness (or range) is the weakness of n-gram models. Using word-level n-grams on top of letter-level n-grams would certainly help. https://quipqiup.com/ does that.
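As a rough illustration of layering word-level information on top of a letter-level score: the sketch below uses a toy word-frequency dictionary and greedy segmentation as a stand-in for true word-level n-grams, and it is not a description of quipqiup's actual scoring.
[code]
# Toy word log-frequencies; a real table would be built from a corpus.
WORD_LOGFREQ = {"THE": -2.0, "THAT": -3.1, "THIS": -3.3, "LIKE": -3.6, "PEOPLE": -4.2}

def word_level_score(plaintext, max_len=8):
    """Greedily match dictionary words; unmatched letters take a small penalty."""
    score, i = 0.0, 0
    while i < len(plaintext):
        for length in range(max_len, 1, -1):
            word = plaintext[i:i + length]
            if word in WORD_LOGFREQ:
                score += WORD_LOGFREQ[word]
                i += length
                break
        else:
            score -= 1.0
            i += 1
    return score

def combined_score(plaintext, letter_score, word_weight=0.3):
    # letter_score could come from score_candidate() in the earlier sketch.
    return (1 - word_weight) * letter_score + word_weight * word_level_score(plaintext)
[/code]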
– Once you’re using Firstgrams, you could also slightly sharpen the rest of the ngram distribution by subtracting the firstgram differences from the main distribution too.
That doesn’t seem intuitive to me. First-grams are the beginnings of sentences, and the exact number of sentences in the cipher text is unknown.
It’s a quirk of how ngrams are collected and compiled. As you move along each line, ngrams go from having a 0.0% chance of being a mid-word fragment (at position 1) to being about 80% likely.
The mid-position ngrams are more fully mixed. I visualize the middle ngrams as all being "smeared" into the average of all words at all positions, whereas the leading ngrams collected from each line form a "crisp" distribution that is still the average of all words, but without any positional smear.
Imagine a degenerate source corpus where every message is only 9 characters long. You collect 5 x 5-grams per line and combine 1,000,000 of them into an ngram file. Now 20% of all ngrams come from position 1 and lack the "normal" distributional smear. Contrast that with a more normal message corpus that had 90 character sentences. Now you can collect 86 x 5-grams per line and combine 1,000,000 of those into a second file. Now only 1.16% of ngrams are un-smeared.
I agree we can’t know how many sentences there are in the 340 or exactly where words start. But our ignorance about these features essentially boils down to our model needing to contain the same sort of positional smear over the entire 340 that all the middle-message-derived ngrams exhibit naturally. The one exception is the lead-off of the message, which is where we could somewhat confidently use firstgrams, because neither the message lead-off nor the firstgram distribution contains this smear.
So if you only have a single distribution to draw from, you’d ideally want ngrams collected from messages that were an average of 340 characters long, so the proportion of smeared ngrams is the same. Failing that, you could re-weight, or mix in a corpus with longer continuous passages.
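The arithmetic above generalizes to a one-liner: the un-smeared (position-1) fraction of collected n-grams is 1/(L - n + 1) for messages of length L.
[code]
def unsmeared_fraction(msg_len, n=5):
    """Fraction of collected n-grams that come from position 1 of each message."""
    return 1 / (msg_len - n + 1)

# 9-char messages -> 20%, 90-char messages -> ~1.16%, one 340-char message -> ~0.30%
for length in (9, 90, 340):
    print(length, f"{unsmeared_fraction(length):.2%}")
[/code]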
Since only 1 character out of 4 is the start of a new word, there would be a first-gram mismatch 3 times out of 4, because the actual starts of the words are unknown.
Well, the 1st-gram distribution and the regular distribution aren’t that radically different, so making mistakes is probably fine. The hope would be to weight it so the info you gain outweighs the info you lose. You can do that even if you’re wrong more than 50% of the time.
From a theoretical perspective, if you only had 1 shot to try and solve it, and you wanted to maximize average informational gain, you would only mix in 25% of the firstgram score into 85 different spots on the 340.
But if you know you’re doing hundreds of solve attempts, you’d instead overweight firstgram scores above average-informational-gain levels so as to maximize the information gain on just the luckiest runs of your solver.
As an aside, you could easily improve the word-start guess ratio from 1:4 to ~1:3. To see why, visualize how bad it would be to put all 85 guesses in a row; many random guess patterns are nearly that degenerate. Iteratively applying a realistic word-sizing heuristic recovers almost a full position of accuracy. So 33% firstgram mixing might be best for average information gain, but then you bump up to 50% to maximize the information gain on the lucky runs.
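Here is one hypothetical version of such a word-sizing heuristic: sample word lengths from a plausible distribution, lay them end to end across the 340 positions, and give extra firstgram weight to the positions that come up as word starts most often across many samples. The length weights below are illustrative, not measured from the Zodiac corpus; only the ~3.97 characters-per-word average is taken from the discussion above.
[code]
import random

# Illustrative word-length weights with a mean near 3.97 characters per word.
WORD_LEN_WEIGHTS = {1: 4, 2: 15, 3: 24, 4: 22, 5: 14, 6: 10, 7: 6, 8: 3, 9: 2}

def sample_word_starts(text_len=340, rng=random):
    """Lay sampled word lengths end to end and return the resulting start indices."""
    starts, pos = [], 0
    lengths, weights = zip(*WORD_LEN_WEIGHTS.items())
    while pos < text_len:
        starts.append(pos)
        pos += rng.choices(lengths, weights)[0]
    return starts

# Tally how often each of the 340 positions lands on a word start.
counts = [0] * 340
for _ in range(10_000):
    for s in sample_word_starts():
        counts[s] += 1

# Near the anchored start, the tally swings well above and below the ~25%
# baseline before decaying toward it; those above-baseline positions are better
# candidates for extra firstgram weight than a rigid "every 4th character" rule.
for i in range(12):
    print(i, f"{counts[i] / 10_000:.0%}")
[/code]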