Quicktrader, Subject: CIPHER STRUCTURE Thu Dec 27, 2012 8:02 am
OK, I had a closer look at the structure of the 408 cipher to better understand the framework of the 340 and the way Z ‘worked’.
First, to clarify the ‘errors’ Z made in his 408 cipher:
FORREST, EXPERENCE and PARADICE were presumably made through lack of knowledge.
‘ANAMAL’ (instead of ‘animal’) and
‘MOAT’ (instead of ‘most’)
however, were caused by one simple error: Z mixed up 3-4 triangle-like symbols (accidentally or on purpose):
Instead of using [symbol] (representing the letter ‘S’) he used [symbol] (representing the letter ‘A’). This led to the two ‘MOAT’ errors.
He also used [symbol] (representing the letter ‘A’) instead of [symbol], which would have represented the letter ‘I’. This led to the ‘ANAMAL’ error.
All three of those symbols look quite similar, which is why I think Z made this error accidentally. There is another switch of [symbol] to the also similar symbol [symbol], which would have correctly represented the letter ‘W’. This led to the ‘SLOI’ error.
When correcting/ignoring those errors, you get a solid structure of how Z set up his homophone cipher using sequences:
While about 45 symbols do not seem to be part of the solid sequence structure, and 6 errors still occur, most of the cipher follows these sequences (green area, representing 87.5%). This is why I believe that looking for sequences in the 340 is the best approach.
The overall length of the sequences for each letter therefore is:
E – 7 (plus ‘interruptors’)
T – 4
A – 4
I – 4
N – 5>4>4>4>3>3 (decreasing)
S – 4
R – 3
O – 3
L – 3 (mixed)
H – 2
F – 2
D – 2 (first one mixed)
B – 1
C – 1
G – 1
K – 1
M – 1
P – 1
U – 1
V – 1
W – 1
X – 1
Y – 1
J – 0
Q – 0
Z – 0
According to this, Z used sequences mainly for the more frequent letters, while almost half of the letters (the less frequent ones) were replaced by one symbol only.
The 340 has more homophones, therefore it may be assumed that the sequences for ETAINSROLHFD would be either longer or at least more present (or both). Most sequences with a length of 5 or longer presumably would have been used for the ETAOIN letter family (to hide any cipher structure).
Z also varied the way of sequencing, e.g. with the letter ‘N’ (decreasing) or the letter ‘L’ (completely mixed, although consistently using three symbols). With ‘L’ he tried to hide structures such as ‘will’, ‘kill’ etc. It might be assumed that 2-dump-n-grams would relate to double letters, such as mm, ff, dd, ss.
If used once in a sequence, the symbols occurred more often. This is why it may be assumed that rare symbols, such as the square with the spot inside, indeed represent rare letters, such as letters from the JQXZ family.
Overall, if Z had used a similar ciphering method, the last third of the cipher is less relevant as the sequences dilute more the longer the cipher is. By figuring out the top e.g. 5 sequences with a length of >4, this could lead to 5! = 5*4*3*2*1 = 120 variations of the cipher. All of those variations would show up with approx. 40% of the cipher filled out, only one of them being the right combination of letters and sequences.
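To make the 5! count concrete, here is a minimal sketch that enumerates every way of pairing five sequences with five letters. The sequence labels and the choice of E, T, A, O, I are placeholders for illustration only; they are not taken from the 340.

```python
from itertools import permutations

# Hypothetical labels: five long sequences found in the cipher and the five
# letters they are assumed to stand for. Neither list comes from the 340.
sequences = ["seq1", "seq2", "seq3", "seq4", "seq5"]
letters = ["E", "T", "A", "O", "I"]

# Each permutation pairs every sequence with a different letter.
assignments = [dict(zip(sequences, p)) for p in permutations(letters)]

print(len(assignments))  # 120
```

Each entry in `assignments` is one candidate partial key that could be filled into the cipher and inspected.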
It seems to be a fact that the 340 has only shorter sequences to be recognized, however more homophones were used to make the cipher ‘harder’ to crack (+500m possible variations = level of Z’s idiocy..).
Please also be aware that, for recovering the cleartext, the order within a sequence is not relevant, as all symbols of one sequence represent the same letter (cipher errors are the real troublemakers). It should also be considered that the changes in the sequences might not have been made on purpose or by accident at all. They might instead result from a cipher that is more developed than a ‘simple’ homophone cipher, e.g. one that adds such asymmetry, in which case the sequences would all correctly follow that asymmetric enciphering method.
Furthermore it might be assumed that symbols that occur regularly in the beginning of the cipher, however don’t appear at the end anymore, could be parts of decreasing sequences and vice versa.
As there is no need for setting up a sequence, the letters of the BCGKMPWY group might rather occur as single substitutions (letters that are not very frequent but presumably appear at least 3-5 times in the 340 cipher).
QueenOfClews, Subject: Re: CIPHER STRUCTURE Fri Feb 08, 2013 3:32 pm
I first want to say that I am very impressed by the level of sophistication and detail of the discussions and analyses in this thread.
I get the impression when reading some of the letters that there is some connection between Z’s misspellings and his code. It makes me think that my brain is recognizing something on a subtle level, but cannot bring the connection to the conscious thought level.
In an effort to flesh the idea out a bit, I started going through the letters and finding all the instances of misspellings. I entered them into a spreadsheet in a format which documented the following characteristics: the word as Z spelled it in the letter, the correct spelling of the word, and the error made. The errors fell into different repetitive categories, for example, omitted letters, added letters, correct letter in a wrong position, wrong letter in correct position, etc. I was trying to find out if the resulting list of omissions or additions may have spelled something or revealed a pattern that may have been relevant to solving the cipher.
I have only done the Mikado letter thus far, and until my temperamental tablet has a new female plug receiver installed, I can’t access the file.
In any case, I am wondering if anyone else can see what I am trying to get at here, and whether the idea has any merit or if it is likely a dead end.
tahoe27, Subject: Re: CIPHER STRUCTURE Fri Feb 08, 2013 4:08 pm
…It makes me think that my brain is recognizing something on a subtle level, but cannot bring the connection to the conscious thought level.
^I love this comment. :cheers:
Welcome to my world. I think many feel the answer is right there waiting to smack us upside the head.
smithy, Subject: Re: CIPHER STRUCTURE Sat Feb 09, 2013 10:00 am
A nice diagram, that.
I agree that Zodiac used a sequential substitution code for the 408 and am surprised that more people don’t try to find a solution to the 340 by finding sequences in the 340. There are dozens, but many of them may be just random coincidence and/or overlap each other, which makes things difficult.
See: http://blog.jgc.org/2011/06/how-zodiac- … -cipher.ht
See: http://s262.photobucket.com/user/tony59 … t.gif.html
See: http://www.ciphermysteries.com/2011/09/ … 40-ciphers
The trick is to sort through all of the sequences in the 340 and separate out the sequences that are not random. Short sequences could easily be random. But longer sequences most likely are not.
My suggestion is to:
1. identify all of the sequences in the 340 and mathematically score them somehow. Longer sequences would have a higher mathematical score than shorter sequences. For example, ABABABABAB could have a score of 10. CDCD could have a score of 4. EFGHEFGH could have a score of 8. IJKI*K (where the * represents a missing symbol), would have a score of only 5 because of the missing symbol.
3. Randomize the 340 and find and score all of the sequences in the randomized version. Do this numerous times, perhaps, for better statistics.
4. Compare the scores from actual 340 sequences to the scores from the randomized 340 sequences. You could do this with two bell curves. If Zodiac used sequences in the actual 340, then the bell curve for the actual 340 sequence scores would be different than the bell curve for the randomized 340 sequence scores. The actual 340 bell curve scores would be outside and to the right of the randomized 340 bell curve scores.
5. At this point, the symbol sequences with scores that are higher than the randomized 340 scores could be attacked with frequency analysis.
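The steps above can be sketched roughly as follows. This is only an illustration under stated assumptions: it scores two-symbol cycles only (the longest alternating run of a pair, so ABABABABAB gives 10 and CDCD gives 4), it does not handle the missing-symbol case, and the `toy` string is made up rather than a transcription of the 340. The function names are my own.

```python
import random
from itertools import combinations

def alternation_score(text, a, b):
    """Longest strictly alternating run of a and b, ignoring other symbols."""
    best = run = 0
    prev = None
    for ch in text:
        if ch not in (a, b):
            continue
        run = run + 1 if prev is not None and ch != prev else 1
        prev = ch
        best = max(best, run)
    return best

def pair_scores(text):
    """Score every unordered pair of distinct symbols in the text."""
    symbols = sorted(set(text))
    return {(a, b): alternation_score(text, a, b)
            for a, b in combinations(symbols, 2)}

def shuffled_scores(text, trials=100, seed=0):
    """Pool the pair scores from many random shuffles of the same symbols."""
    rng = random.Random(seed)
    chars = list(text)
    scores = []
    for _ in range(trials):
        rng.shuffle(chars)
        scores.extend(pair_scores("".join(chars)).values())
    return scores

# Toy stand-in for a cipher: A and B cycle perfectly, X/Y/Z are filler.
toy = "AXBYAZBXAYBZAXBYAZB"
actual = pair_scores(toy)
baseline = shuffled_scores(toy)

# Strict cut: keep only pairs that beat everything seen in the shuffles.
threshold = max(baseline)
standouts = {pair: s for pair, s in actual.items() if s > threshold}
print(actual[("A", "B")], threshold, standouts)
```

Taking the maximum of the shuffled scores as the cut-off is a deliberately harsh choice; a high percentile of `baseline` would be a gentler alternative, and plotting both score lists would give the two curves described in step 4.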
My scoring formula is very rudimentary, and needs some work. I have a fairly elaborate Excel spreadsheet that finds and scores sequences, but I do not have any computer programming skills beyond that.
If anyone with computer programming abilities is interested, it would be nice to know if Zodiac used sequences in the 340, or if the sequences in the 340 are just random. Some of the sequences in the 340 are quite long, and probably not random. If we knew that the sequences were not random, then that would eliminate so many other ways to attack the 340, including finding different routes in the message other than traditional. It would mean that Zodiac worked from left to right, from the top of the message down, line by line.
A while ago, I wrote a program that looks for sequences in the 408 and the 340. Here are the results:
http://zodiackillerciphers.com/wiki/ind … _sequences
That is very interesting. I see that you scored the sequences. I’ll have to look at this more closely later. Thank you very much.
There are some high scoring long or multi-repeated sequences with one or more missing symbols. My thinking is that there may be a few symbols that Zodiac used as wildcards, which often appear in the message where the missing sequence symbols should be. He did that with one or two symbols on the 408.
See the "q" and "+" symbols (which I call symbols 5 and 19). Take any high scoring long or multi-repeated sequence with a missing symbol. Find the string of symbols in the message where you would expect to find the missing symbol, and you will often find a "q" or "+" there. That would explain why Zodiac could have used sequences, but also have "qq" and "++" appear in the message side by side like they do. That would also explain why the "q" and "+" symbols are not included in any sequences with other symbols (except that symbol "q" cycles with symbol "<" (symbol 29) in the second half of the message).
Note that if my theory is true, then the total quantity of any symbol sequence with a missing symbol will be higher to account for the wildcard symbol. And that would affect the comparison between frequency of symbol(s) and frequency for a letter.
Would you be interested in randomizing the 340, and finding the sequences in that? Compare the results to the actual 340 sequences to find and isolate the actual 340 sequences that are most likely not random. Where do we draw the line between what is most likely a genuine Zodiac created sequence and what is possibly a random sequence? Or do your probability formulas tell us that?
EDIT: I have been reviewing your tables, and am impressed. I see now at the bottom that you have considered randomizing and comparing. If we did that numerous times and made a distribution of the scores, we could work on finding the probability that any particular 340 sequence is author created instead of random.
Below is a distribution of two symbol sequence scores. Randomization (only one) is in blue, Actual 340 is in red. The X axis is a rudimentary scoring system by percentage of symbols that are "bracketed" by symbols that they should be "bracketed" by. The Y axis is the number of sequences with that score.
Some of the red sequences on the far right are the ones that we could attack, starting with the ones with the highest probability that they are author made.
Do you follow my thinking?
There are some high scoring long or multi-repeated sequences with one or more missing symbols. My thinking is that there may be a few symbols that Zodiac used as wildcards, which often appear in the message where the missing sequence symbols should be. He did that with one or two symbols on the 408.
See the "p" and "+" symbols (which I call symbols 5 and 19). Take any high scoring long or multi-repeated sequence with a missing symbol. Find the string of symbols in the message where you would expect to find the missing symbol, and you will often find a "p" or "+" there. That would explain why Zodiac could have used sequences, but also have "pp" and "++" appear in the message side by side like they do. That would also explain why the "p" and "+" symbols are not included in any sequences with other symbols (except that symbol "p" cycles with symbol "<" (symbol 29) in the second half of the message).
I hope you don’t mind me jumping in. If your theory is true/statistically significant then it is an exciting discovery. Your wildcards would essentially be a hint towards polyalphabetism of some kind, right? The symbols you mention (numbered by appearance) are my top 2 finds for causing interruptions in the 340. But if only these symbols were polyalphabetic, would that explain the lack of ngrams in the horizontal direction and the cipher not solving in ZKDecrypto? The cipher does score higher after removing the "+" symbol though.
Could we possibly come up with a restoration?
If we consider each + to be a wildcard, each standing for one of the symbols that is involved in high-scoring homophone sequences, then maybe we could conduct a brute force search for all replacements of the + symbol that yield "fixed" sequences.
It’s somewhat feasible if you limit the search to 3 symbols (for example, for each +, you pick from one of only three possible symbols). This would yield 282 billion candidates that could be scored.
But if we consider more than 3 symbols, it gets to be infeasible to look through them all, so we’d need to do some kind of hillclimbing search.
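For reference, the 282 billion figure is what you get from three choices per "+" if the "+" appears 24 times in the 340; that count is an assumption in this sketch, chosen because it matches the number quoted above. The same arithmetic shows how quickly the search space grows with a fourth choice.

```python
plus_count = 24  # assumed number of "+" symbols in the 340

# Every "+" is replaced independently, so the counts multiply.
three_choices = 3 ** plus_count
four_choices = 4 ** plus_count

print(f"{three_choices:,}")  # 282,429,536,481
print(f"{four_choices:,}")   # 281,474,976,710,656
```

Going from three to four candidate symbols multiplies the number of candidates by roughly a thousand, which is why an exhaustive search stops being practical beyond three.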
That’s a very good idea doranchak (hillclimbing the symbols).
I superimposed the "+" symbol onto a 340 character part of the 408 and then also removed it. I’m trying to get a solve on these. ZKDecrypto is running in the background but hasn’t found it so far. Also, ngram counts seem to be lower. Could the Zodiac possibly have tried to mask bigrams visually with the "+" symbol etc? That would also be very much in line with smokie treats’ wildcard idea. And it was a problem with the 408.
ZKDecrypto found a weak solve to the 408plusover after 17 minutes on my i5. Score 32066, quite readable actually. But it is only a 53 symbol cipher.
Thank you for working on this. I have many Excel spreadsheets but I cannot do what you are doing. All I know how to do is make observations and find patterns.
There are three other possible wildcards (symbols that could represent multiple letters):
"B", which I call symbol 20, appears 12 times in the message, but does not cycle with any other symbol in a high scoring sequence.
"5", which I call symbol 50, appears 11 times in the message, but does not cycle with any other symbol in a high scoring sequence.
"F", which I call symbol 51, appears 10 times in the message, but does not cycle with any other symbol in a high scoring sequence.
These also should make appearances in the message where a high scoring sequence is missing a symbol.
I think that Zodiac used multiple symbols in sequences, just like in the 408, to represent some letters. But he made the 340 more confusing by using the wildcards to represent multiple letters. Some of the wildcards, or maybe all of them, are used as substitutes in the sequences. That’s why many of the longer, higher scoring sequences are not perfect.
At the moment I’m totally convinced that this is the problem with the 340.
Smokie treats, can you identify some probable symbol substitutes for some of the wildcard symbols? Even only a few candidates would be really helpful. Thank you for posting some additional possible wildcard symbols!
Best ZKDecrypto of 340uniplus.txt:
Best AZdecrypt of 340uniplus.txt (experimental version with up to 6-grams)
myewindhandllethe rinusothisleastap usfrnmentssploome iiicdndshsitiveth ernformedtoaccons idedonthsemnowhis otztdhwheretheold stheamotherlocalr eandmenehoursabot honehadenticaoush erestrydiluimentt harneedsteonewcuo obrisofrdtourswhe ntordathomerhorsu sualsformywhentha dresoutstideandth ereallbmaponsudal kwasposaatimessai donesacrueltushpa wnimrinionthadtwy
There are similarities.
Smokie treats, can you identify some probable symbol substitutes for some of the wildcard symbols? Even only a few candidates would be really helpful.
I am not sure what you mean. Would you like me to find places where the wildcard symbols are used to substitute for the missing symbols in the higher scoring sequences? Is that what you are asking for?
EDIT: O.k. I think that you are asking me to figure out, for example, where a "+" could substitute for another symbol, and also tell you what that symbol is. Is that what you want?
And if so many symbols could potentially be wildcards we may have a big problem with finding the solution. We need a way to come up with a list of probable wildcard symbols (your work) and also a list of probable symbol substitutes for each individual wildcard symbol. And then generate permutations of the 340 based on these lists and score them in a solver.
O.k., let me dig into my notes some more. I have dozens of spreadsheets that I haven’t worked on in a year. One of them ranks the symbols according to the total score of all of the sequences that the symbol is included in. My scoring system is much more rudimentary. Let me find my work. I’ll be back soon.
I will work on a hill climber that scores candidate wildcard substitutions.
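A hill climber of that kind could look roughly like the sketch below. It is not the program referred to in this thread; it is a minimal illustration that assumes a single wildcard symbol, a short list of candidate replacements, and a score equal to the summed longest alternating runs of some chosen symbol pairs. The toy cipher and all function names are hypothetical.

```python
import random

def alternation_score(text, a, b):
    """Longest strictly alternating run of a and b, ignoring other symbols."""
    best = run = 0
    prev = None
    for ch in text:
        if ch not in (a, b):
            continue
        run = run + 1 if prev is not None and ch != prev else 1
        prev = ch
        best = max(best, run)
    return best

def fill(cipher, wildcard, choice):
    """Replace the i-th wildcard in the cipher with choice[i]."""
    it = iter(choice)
    return "".join(next(it) if ch == wildcard else ch for ch in cipher)

def total_score(text, pairs):
    return sum(alternation_score(text, a, b) for a, b in pairs)

def hill_climb(cipher, wildcard, candidates, pairs, steps=2000, seed=0):
    rng = random.Random(seed)
    n = cipher.count(wildcard)
    if n == 0:
        return cipher, total_score(cipher, pairs)
    choice = [rng.choice(candidates) for _ in range(n)]
    best = total_score(fill(cipher, wildcard, choice), pairs)
    for _ in range(steps):
        i = rng.randrange(n)           # pick one wildcard position
        old = choice[i]
        choice[i] = rng.choice(candidates)
        score = total_score(fill(cipher, wildcard, choice), pairs)
        if score >= best:              # keep sideways and uphill moves
            best = score
        else:                          # undo a downhill move
            choice[i] = old
    return fill(cipher, wildcard, choice), best

# Toy example: two "+" wildcards interrupt an A/B cycle.
restored, score = hill_climb("ABA+AB+BAB", "+", ["A", "B"], [("A", "B")])
print(restored, score)
```

On a real cipher the score would more sensibly come from a solver or an n-gram model, and random restarts would help against local optima; the sequence-based score here only mirrors the "fixed sequences" idea discussed above.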
"5", which I call symbol 50, appears 11 times in the message, but does not cycle with any other symbol in a high scoring sequence.
Symbol number 50 by order of appearance only appears 7 times. The only symbol that appears 11 times is number 5.
So your list of possible wildcards would be (symbol number by order of appearance): 5, 19, 20 and 51. Right?