CIPHER STRUCTURE

Quicktrader · 2013-04-07T14:58:34Z

Quicktrader, Subject: CIPHER STRUCTURE Thu Dec 27, 2012 8:02 am Ok, had a better look on the 408 cipher structure to better understand the framework of the 340 and the way how Z had 'worked'. First to clarify with the 'errors' Z had made in his 408 cipher: FORREST, EXPERENCE and PARADICE presumably have been made by lack of knowledge. 'ANAMAL' (instead of 'animal') and 'MOAT' (instead of 'most') however were caused because Z had made one simple error by mixing 3-4 triangle-like symbols (accidentially or by purpose): Instead of using (representing the letter 'S') he had used the (representing the letter 'A'). This lead to the two 'MOAT' errors. Also he had used the (representing the letter 'A') instead of using which would've represented the letter 'I'. This lead to the 'ANAMAL' error. All of those three symbols look quite similar, which is why I think that Z had made this error accidentially. There is another switch of the to the - also similar - symbol, which would have correctly represented the letter 'W'. This lead to the 'SLOI' error. When correcting/ignoring those errors, you get a solid structure of how Z had set-up his homophone cipher by using sequences: While about 45 symbols seem to not be part of the solid sequence structure and even 6 errors still occur, most of the cipher follows these sequences (green area, representing 87.5%). This is why I believe that looking for sequences in the 340 is the best approach. The overall length of the sequences for each letter therefore is: E - 7 (plus 'interruptors') T - 4 A - 4 I - 4 N - 5>4>4>4>3>3 (decreasing) S - 4 R - 3 O - 3 L - 3 (mixed) H - 2 F - 2 D - 2 (first one mixed) B - 1 C - 1 G - 1 K - 1 M - 1 P - 1 U - 1 V - 1 W - 1 X - 1 Y - 1 J - 0 Q - 0 Z - 0 According to this, Z had used sequences mainly for the more frequent letters while almost half of the letters (the less frequent) have been replaced by one symbol only. The 340 has more homophones, therefore it may be assumed that the sequences for ETAINSROLHFD would be either longer or at least more present (or both). Most sequences with a length of 5 or longer presumably would have been used for the ETAOIN letter family (to hide any cipher structure). Z also had differed the way of sequencing, e.g. with the letter 'N' (decreasing) or the letter 'L' (completely mixed, although consistantly using three symbols). With 'L' he had tried to hide any structures such as 'will', 'kill' etc. It might be assumed that 2-dump-n-grams would rely to double-letters, such as mm, ff, dd, ss. If used once in a sequence, the symbols occured more often. This is why it may be assumed that rare symbols, such as the square with the spot inside, indeed represent rare letters, such as letters from the JQXZ family, too. Overall, if Z had used a similar ciphering method, the last third of the cipher is less relevant as the sequences dilute more the longer the cipher is. By figuring out the top e.g. 5 sequences with a length of >4, this could lead to 5! = 5*4*3*2*1 = 120 variations of the cipher. All of those variations would show up with approx. 40% of the cipher filled out, only one of them being the right combination of letters and sequences. It seems to be a fact that the 340 has only shorter sequences to be recognized, however more homophones were used to make the cipher 'harder' to crack (+500m possible variations = level of Z's idiocy..). Please also be aware that for getting the cleartext, the sequence itself is not relevant as all symbols of one sequence do represent the same letter..(cipher errors are the real troublemakers). Overall it should also be considered that the change of sequences might have not been made by purpose or accidentially, but rather by using a cipher that is more developed than a 'simple' homophone cipher, e.g. by adding such asymmetry, therefore the sequences all being correctly following such asymmetric enciphering method. Furthermore it might be assumed that symbols that occur regularly in the beginning of the cipher, however don't appear at the end anymore, could be parts of decreasing sequences and vice versa. As there is no need for setting up a sequence, the letters of the BCGKMPWY might also rather occur as a single substitution (letters that are not very frequent but presumably appear at least >3-5 times in the 340 cipher). QT QueenOfClews, Subject: Re: CIPHER STRUCTURE Fri Feb 08, 2013 3:32 pm I first want to say that I am very impressed by the level of sophistication and detail of the discussions and analyses in this thread. I get the impression when reading some of the letters that there is some connection between Z's misspellings and his code. It makes me think that my brain is recognizing something on a subtle level, but cannot bring the connection to the conscious thought level. In an effort to flush the idea out a bit, I started going through the letters and finding all the instances of misspellings. I entered them into a spreadsheet in a format which docum...

smokie treats

(@smokie-treats)

Posts: 1626

Noble Member

I am working on fixing my mistake with Symbol 50, and will post a new chart showing the total count and scores for each symbol.

I will also work up some sort of simple way to compare:

1. Symbols that are low in count and are not in sequences (score = low)
2. Symbols that are low in count and are in sequences (score = high)
3. Symbols that are high in count and are in sequences (score = high); and
4. Symbols that are high in count and are not in sequences (score = low and what I think may be wildcards).

By grouping the symbols this way and looking at how the groups compare to each other, maybe we can learn something else. I don’t know.

S.T.

Posted : May 31, 2015 1:53 am

smokie treats

(@smokie-treats)

Posts: 1626

Noble Member

O.k., here is my new chart.

It shows the distribution of scores for each symbol as compared to every other symbol. Two symbol sequences only, as I think that initially that is all that is necessary. A symbol with a higher total score has more sequential relationships with other symbols. A symbol with a lower score will have fewer or no relationships with other symbols.

Next I will take out all of the stuff on the right and start grouping symbols according to score and count to show that some symbols are very different than others. Maybe this will help to verify whether there is any statistical significance in the data.

S.T.

Posted : May 31, 2015 3:49 am

smokie treats

(@smokie-treats)

Posts: 1626

Noble Member

1. Symbols that are low in count and are (probably) not in sequences (score = low)

Obviously Symbol 60, the "@", only appears once, and therefore could not be in any sequence.

2. Symbols that are low in count and are in sequences, depending on the count above 50% (score = high)

Posted : May 31, 2015 4:16 am

smokie treats

(@smokie-treats)

Posts: 1626

Noble Member

3. Symbols that are high in count and are in sequences, depending on the how many score above 50% (score = high)

4. Symbols that are high in count and are not in sequences (score = low and what I think may be wildcards)

Posted : May 31, 2015 4:43 am

smokie treats

(@smokie-treats)

Posts: 1626

Noble Member

I am sure that you guys follow me, but I need to summarize for myself too.

1. Check out the distribution shift at the 50% score line when comparing Actual 340 scores and ONE randomized 340 score:

Because of this, and the fact that Zodiac used sequences in the 408, I am convinced that Zodiac used sequences in the 340. But many of the sequences in the 340 are also random. What needs to be done is randomize the 340 about 30 times, and make a distribution that shows means of the randomized scores. That will more clearly define a cutting off point and how we can calculate the probability that any two-symbol sequence is a Zodiac sequence or a random sequence.

2. Many of the longer sequences are most likely Zodiac sequences, because of their score. Also, almost if not all of them have missing symbols. I have looked at a lot of them, many which have more than two symbols, and there is often one of my proposed wildcard symbols found where it should be if substituted for a sequence symbol.

Now to argue against myself, there are a total of 57 of my proposed wildcard symbols in the message ("F" at 10 + "q" at 11 + "B" at 12 + "+" at 24 = 57), which makes the probability of this high. But, these wildcard symbols are high in count and do not sequence with other symbols well. If they did sequence with other symbols, then they wouldn’t be sitting next to each other, like the "qq" and "++". And, there are a lot of missing symbols in all of the high scoring sequences, collectively speaking.

3. This is the closest system to what Zodiac used with the 408. It is only a minor departure from his 408 system. In fact, he did use one or two symbols in more than one sequence in the 408. That’s what gave me the idea. See http://www.ciphermysteries.com/2011/09/ … 40-ciphers (the solid triangle cipher shape in Z408, and how it appears to encipher different letters at different times. The view often put forward elsewhere is that this varied due to copying errors, perhaps arising because the Zodiac Killer’s pen was too thick, causing him to misread his draft version).

It’s just a theory. But there is a statistical phenomenon with "F", "q", "B" and "+". There may be others. What about "C", my Symbol 56? That one doesn’t cycle well with any of the other symbols either.*

Similar to what Jarlve said, we can find some high scoring symbol sequences that have mostly mutually exclusive symbols that do not appear in other sequences (which is not easy, by the way – Zodiac may have had a complicated cycle system just to mess with our heads but with no message). Check to see what happens when looking for one of the missing symbols, and see if there isn’t a wildcard symbol there in the message.

A lot more work needs to be done, but this is my best guess as to what Zodiac did. He just decided to make the 340 more difficult than the 480, and substituted a bunch of wildcard symbols in the sequences.

We really need to sort through all of the sequences and figure out the probability of genuineness for each one. I suggest multiple randomizations of the 340 for comparison, which is beyond my skill level. Or what about just deleting the proposed wildcards, and possibly the "c" to see what happens then? What about finding the sequences that are most likely Zodiac made, and tallying up the number of missing symbols. Compare to the tally of wildcard symbols.

Any thoughts? Think about all of the different ways that we can figure this out if it is true.

Smokie

EDIT: Your symbol "C", which is my symbol 56, does cycle with your symbol |, which is my symbol 11, in the second half of the message. It could be a random coincidence.

Posted : May 31, 2015 5:11 am

Jarlve

(@jarlve)

Posts: 2547

Famed Member

Because of this, and the fact that Zodiac used sequences in the 480, I am convinced that Zodiac used sequences in the 340. But many of the sequences in the 340 are also random. What needs to be done is randomize the 340 about 30 times, and make a distribution that shows means of the randomized scores. That will more clearly define a cutting off point and how we can calculate the probability that any two-symbol sequence is a Zodiac sequence or a random sequence.

100 randomised 340’s. If you need anything else that can be achieved programmaticly let me know and I’ll see what I can do. I’m not a big fan of comparing the 340 to randomised counterparts because the 340 is not random at all.

2. Many of the longer sequences are most likely Zodiac sequences, because of their score. Also, almost if not all of them have missing symbols. I have looked at a lot of them, many which have more than two symbols, and there is often one of my proposed wildcard symbols found where it should be if substituted for a sequence symbol.

Yes that’s your theory, I guess doranchak is working on some kind of verification. I’m wondering if what you see could also be randomness in the cycle, because he did that in the 408 as well. For some time now I believed that some symbols are not part of the cipher because of a simple counting method (non-repeats) I came up with that can to some extent measure how cyclic a homophonic substitution cipher is. The theory I came up with is that the 340 is a word search and that some symbols are filler for the empty squares. So that’s why I want to have verification also.

In my thread a diagonal shift (highly experimental stuff in there) I did a crude test removing a random number of random symbols for a large number of iterations scoring the non-repeats and I came up with the following image showing the larger interruptions in the cipher as darker squares. Does it look familiar?

We really need to sort through all of the sequences and figure out the probability of genuineness for each one. I suggest multiple randomizations of the 340 for comparison, which is beyond my skill level. Or what about just deleting the proposed wildcards, and possibly the "c" to see what happens then? What about finding the sequences that are most likely Zodiac made, and tallying up the number of missing symbols. Compare to the tally of wildcard symbols.

I can do a crude test with the non-repeats.

The 340 scores 4462 normally, higher = more cyclic. I will attempt to normalise this number by the amount of symbols and characters.
After removing 5, 19, 20, 51: 6028! These symbols do interrupt the cycles.
After removing 5, 19, 20, 36, 51: 5941. I’m not convinced.

By the way this is the string of removed wildcards: "pp+BB+ppp+++F+pFF+++++FB++pFB+Bp+BF+FBF++B+Bp++FB++pFBBp+".

AZdecrypt

Posted : May 31, 2015 12:44 pm

doranchak

(@doranchak)

Posts: 2614

Member Admin

Haven’t had time to do statistical comparisons to random shuffles yet but I re-ran my brute force search for L=4 on Z340, Z408, and 3 shuffles of Z340.

Z340: https://docs.google.com/spreadsheets/d/ … sp=sharing
Z408: https://docs.google.com/spreadsheets/d/ … sp=sharing
Z340 shuffled #1: https://docs.google.com/spreadsheets/d/ … sp=sharing
Z340 shuffled #2: https://docs.google.com/spreadsheets/d/ … sp=sharing
Z340 shuffled #3: https://docs.google.com/spreadsheets/d/ … sp=sharing

Only sequences that occur at least twice are considered. The probability calculation is different this time. I simply computed the product of all the symbol frequencies divided by the cipher length (i.e., f1/340 * f2/340 * f3/340 … etc …).

Shuffle #1 seems to have some low probability cycles in it, so I’m curious what the statistics will end up showing if I run more shuffle tests.

The best pattern in Z408 shows up with probability 0 only because the float value was so low, it ended up getting rounded down to 0.

http://zodiackillerciphers.com

Posted : May 31, 2015 1:14 pm

smokie treats

(@smokie-treats)

Posts: 1626

Noble Member

I guess in scientific terms, it is really a hypothesis because it is an explanation for observed phenomenon. What makes it attractive to me, besides that it seems to explain what I observe, is that it is not a big departure from the 408.

I sincerely appreciate your interest, and would not be disappointed at all if someone proves the hypothesis wrong. That’s how scientists work together.

However, you guys are really over my head with the computer programming. I wrote my own video game programs on a Commodore 64 many years ago, but then entered a different trade. So I understand a little bit about what you are doing. I also took statistics in college and got an A, and I know Excel very well from my former job. Those are my tools.

I get sucked into this stuff about once or twice a year. I just cannot let it go. I’ll find some more stuff to post in a bit, to sort of show you what I have been looking at after trying to find all of the sequences manually.

Smokie

Posted : May 31, 2015 4:20 pm

smokie treats

(@smokie-treats)

Posts: 1626

Noble Member

O.k., I have been working on a few examples of my observations.

Symbols 23, 29 and 38 ( O,<,U ) all cycle with each other really well in a multi repeated sequence, but there are missing symbols. There are Symbols 5, 19, 20 and 50 in places where the missing symbols should be.

I’ll be back later.

S.T.

Posted : May 31, 2015 6:44 pm

doranchak

(@doranchak)

Posts: 2614

Member Admin

It certainly is a hypothesis worth exploring further.

So, I ran a quick experiment. In the 340, the cycle "RlNZ" appears, like this: [RlNZ] lR [RlNZ] RlRlRN [RlNZ] [RlNZ]. In my list of 4-symbol cycles ( https://docs.google.com/spreadsheets/d/ … =892030580), the "RlNZ" cycle is the best one, because the relative probability based on symbol frequencies is the lowest in the entire list: 4.93E-29. The number 4.93E-29 is not meant to convey the exact probability that the sequence occurred; instead, it is meant to convey that it is less probable to occur by chance than the other sequences.

That cycle seems rather strong, like it might be intentionally created by the cipher author.

I wanted to answer this question: How many times do you have to shuffle the 340 ciphertext before a sequence with similar relative probably appears completely by chance? I.e., how many shuffles does it take before you reach a probability estimate equal to or less than the best one in the 340 for L=4?

Well, I coded it up and it took only 11 shuffles:

Symbols: #5kN
Sequence: 5 [#5kN] [#5kN] # [#5kN] 55Nk [#5kN]
Shuffled ciphertext:

|y2BcMcXG(98BcU9F
BF*6CTKVEM+Z)q<Y8
5;/c1JM^(#Fl9KD<R
)+C)5p-CLzlV.p2HR
2/ZpRB:>+yMkKU/4^
(_dSbZBB+NEt|dWzd
&PL6c+B#5K.BCPqkH
PO++^>p*NKLVYppB6
f#|lG7cFOF#|W<%|Z
OS8-75G.2EJO+R4R+
4(G2pYR-jt-:+kzj*
|+|z*NYVX1c.Jz2pz
)OlFd|2+.l;%p5O+T
BF3Azy<ffM@<+TO+W
-9)t(C+Hp.MADK+G7
llO;+|^(F8zWVV+cB
5WD2_MTFOFpcWdy>U
b31NKtTJ424y+SkS#
>LB+5^_<+k*R^OLRU
+b&zc|G*f(+4ULDNH

I was surprised how few shuffles were required, so I ran it again, and it took only one shuffle!

Symbols: .^Gy
Sequence: [.^Gy] [.^Gy] .G.^ [.^Gy] yG [.^Gy] ^
Shuffled ciphertext:

O.A^BU/Tt+<FG#|Fy
4.TY_4%2cZkzXMW21
(#7cWl)M*8+p2^BNC
+YljKdJp962cRN#k(
VzG-LF<(>U+ydZNR.
p+VGc>+P|zbOT*|5.
+HSZF^*++|UKfBB-F
.YKk)+C_5+Y++J27z
dMW|*9OFO:)PETFB|
LW(p6(3VK&t|c-Ep<
f5FCz8z/R;pR&Sc9q
L+5-<(<dfAcR8ODOC
tBWSj^NlBb|5lO4Vk
+%#E+JR-tppPD+l/6
OLOU<(WG8MKyzpB*B
K+f):5S75LT>y;+b+
1FJXql4)_cVG+R+p+
L4RNOk+2c4lUzH.C3
#BD;1MpMB*9ZHH^|2
MGVcK2@2Fy^d>Bz|D

Ran shuffles one more time, and it took only 7 tries:

Symbols: 9^lG
Sequence: Gl [9^lG] [9^lG] G [9^lG] ^ll [9^lG] ^
Shuffled ciphertext:

dMGl>#3F*.CqOcpBd
p%(cR9dF+BKT;@fp_
N4.D|p6_CXT-<zY+^
OyV2Oz+V;KL6l)+EA
F+%TWSWMG4pkUFV8L
1W(9|WZzO5+^pc/7S
<+2D-H2*>>qt62BBp
M&+(-)l|.-pG5FcR*
M)YSEt(5t2N7T+KbH
G)394AKpy#4Ht4(N2
8HFN#VUW^|OU*V+<#
z1WUR|cBy/JBkcD(:
-2yk#*_z|J&zPl|)K
KzD4++ZOObF|G.+C+
J+Y^CCFORcbc+:zT5
F>2l.L5KcNl5fF|dd
k|/V<RMf9S^+lL<52
yBOEGc+PLXL;pZBzY
8UBRpP<+Rj++R8+MO
^j*Jf.+ZB(+7kBM1B

So – I’m back to the question: Are the sequences we’re seeing random or intentional cycles? I think you are right, that we need to compare the cycle score distributions of shuffled vs original.

It could also be the case that the cycles simply don’t appear strong, due to being interrupted by the wildcard symbols.

http://zodiackillerciphers.com

Posted : May 31, 2015 11:36 pm

doranchak

(@doranchak)

Posts: 2614

Member Admin

The situation is a little different with L=3, because the cycle l*M, which appears almost perfectly as [l*M] [l*M] [l*M] lM [l*M] [l*M] [l*M], has a much lower relative probability (1.7517163E-31) than any of the L=4 cycles.

To get an equally good or better sequence to appear randomly, I had to run 584 shuffles:

Symbols: ^K5
Sequence: [^K5] [^K5] 5K [^K5] [^K5] [^K5] [^K5]
Shuffled ciphertext:

<kd+M6W+M7q9PBczW
O^:K(L4%+Cy+RcJJk
TR||zbO2TpB/>-t+5
2<Zd+&lM<WcLZ+.M+
yE2z+k<^-G6jK+B<L
;-Y|j(+)-lpTXFt5.
X|pU.BA6JMcH>cbCF
*+k1OOR8Dfzzp.2(O
d5LU++SKBtcP(WTG#
^K>EO/+5*N3+^T9#D
7z2MF)U+yGNG)_cY1
F3Z9lF/8(RUKF8D)B
dRBVlFHV>VOVf(Bq*
UN+48B-CPAdMcy2p&
|zRblVSOB|Nl#.C5+
W%)D4Y^*+Kk4Oyc.t
;|OE|J2#LpS7Flfcp
5F;ZpzG*p2_H|LN+W
VBB^H+p*GSz+(RC44
:_KR|<@#p92f15+FY

http://zodiackillerciphers.com

Posted : June 1, 2015 12:16 am

smokie treats

(@smokie-treats)

Posts: 1626

Noble Member

O.k. I realize that the first order of business is to verify once and for all whether Zodiac used sequences in the 340. Your most recent results are very exciting.

You found what I call sequence 6 30 37. I have a score of 80% for that with my rudimentary system.

Note that my * in the top green shaded string of symbols is for the wildcard, and it is not a symbol.

See that there are several wildcards in the string where the missing 30 (or your "*") should be.

I am saying that if this sequence is Zodiac made, and the wildcard hypothesis is correct, then those symbols should represent H, N, S, L, A, R, I, O, in that order of probability.

Again, the first order of business is to show one way or the other if Zodiac used sequences.

The rest of my hypothesis can be worked on later; the above serves to illustrate what I was talking about.

Thank you so much.

Smokie

Posted : June 1, 2015 2:57 am

Jarlve

(@jarlve)

Posts: 2547

Famed Member

The 340 uses cycles.

Ran a couple of tests with 10 million randomisations each counting the non-repeats and the highest I’ve gotten was 4350, on average 2950 or so. In the results that were the highest the "+" symbol was contributing towards the cycles. Just saying that if the 340 per chance would have bumped to 4462 where it is now, I guess it will also be likely that the higher frequency symbols would count towards that (which is not the case for the 340). Maybe there’s almost as much chance for outliers to occur in a randomised counterpart just by the sheer amount of combinations your checking… Take any shuffle and I bet you’ll find some outlier in there which will also be rare if you just look far enough.

I believe it will be very hard to really distinguish between real and chance cycles. I would start with a textbook example which cycles without error. Hill climb your probability system to increasingly better results (fake/real cycles). Use all information you can extract about cycles, positions, distances, symbols in between. A cycle which is real will probably have a better spread throughout the cipher since that is the nature of cyclic homophonic, etc. Would like to help with a system of my own but I’m preoccupied with working on my solver, if this theory pans out we need a solver which can do higher multiplicity than the ones available.

Keep up the good job andalso take it easy!

AZdecrypt

Posted : June 1, 2015 10:20 am

smokie treats

(@smokie-treats)

Posts: 1626

Noble Member

Hey guys, thanks for working on this.

I was looking back through all of my spreadsheets and found one randomization that I did for 3 symbol cycles and forgot about. It probably took me a day to do it.

Here is a simple idea about how to tackle this:

Check out the distribution below. The Zodiac 340 scores, totaled by percentage, are in red. The ONE randomization score, totaled by percentage, are in blue.

Now that doranchak has identified all of the cycles, we need to find the probability for each cycle that Zodiac made the cycle, and then attack the ones with the highest confidences.

See the 50% mark, for example. There are 215 L-3 cycles in the Zodiac 340 that score 50%. There are 84 L-3 cycles (you would have to see the spreadsheet) in the Randomized 340 that score 50%. The probability that any particular 50% L-3 cycle in the Zodiac 340 is author made should be: (215-84)/215=61%.

Now to the 71% mark. There are 16 L-3 cycles in the Zodiac 340 that score 71%. There are only 3 L-3 cycles (you would have to see the spreadsheet) in the Randomized 340 that score 71%. The probability that any particular 71% L-3 cycle in the Zodiac 340 is author made should be: (16-3)/16=81%.

What really needs to be done is to generate only 30 randomizations to find good statistical data to work with. Then find the mean (and possibly the standard deviations) for each score percentage. Make a distribution graph and compare to the Z340 graph. Do a separate graph for L-2, L-3, etc.

Then, with our list of cycles, we can work with the highest confidence cycles to try to solve the Z340.

The wildcard hypothesis may fizzle out, I don’t know. But either way, we will have a list of cycles to work with and know for a fact for any particular cycle how probable it is that Zodiac made it. The cycles need to be sorted into different groups to work with.

I don’t know what you guys are talking about sometimes, and if you find a more efficient way to do this, then that is fine. I loved doranchak’s random shuffle idea because it was a very efficient test to find whether we should move forward.

My humble suggestion is to keep it simple so that other people (not just me) can work with the data on it if necessary. So that people can see that Zodiac DID use cycles, see how the cycles are sorted into confidence groups, and see what cycles to attack. It is possible that the message could reveal itself very quickly with your computer programs.

Smokie

Another thought: All of the symbols in each high confidence cycle could be replaced with only one new symbol that represents that cycle. That may make solving easier.

Posted : June 1, 2015 1:41 pm

Jarlve

(@jarlve)

Posts: 2547

Famed Member

What really needs to be done is to generate only 30 randomizations to find good statistical data to work with. Then find the mean (and possibly the standard deviations) for each score percentage. Make a distribution graph and compare to the Z340 graph. Do a separate graph for L-2, L-3, etc.

You may not have seen this 100 randomised 340’s. I posted it earlier. If you need anything else let me know.

AZdecrypt

Posted : June 1, 2015 2:22 pm

Zodiac Discussion Forum