Homophonic substitution

Jarlve

(@jarlve)

Posts: 2547

Famed Member

Topic starter

This thread is a continuation of viewtopic.php?f=81&t=267 in which several aspects of the Zodiac 340 cipher are discussed and researched. I’d like to continue the work from there in this thread because then I can use the main post to reference and update all the cipher material being discussed.

Some of the questions which the contributors are trying to answer:

– Is the 340 a straightforward homophonic substitution cipher or is there something else going on?
– The 340 does not seem to cycle as well as the 408, what is going on? (doranchak: http://zodiackillerciphers.com/wiki/ind … _sequences)
– To what extent is the 340 cyclic or random? Can we find areas – as for instance with the last part of the 408 – that are more random?
– Is it possible to attribute the 340 not cycling as well as the 408 (despite its higher symbol count) to some transposition after encoding?
– Some of the medium-high count symbols do not seem to cycle well, are these possibly wildcards/polyalphabetic or 1:1 substitutes? (smokie treats)
– Can we make a system that can adequately group homophones that belong to the same letter without having to solve the cipher? (smokie treats, glurk)
– Is there a discrepancy between symbols/cycles/etc on odd and even positions for the 340? If so, what could be causing this? (daikon, doranchak, smokie treats)
– There is a significant bigram repeat peak at period 19, is this a lead to the encryption scheme of the 340? (daikon)

340 cipher numeric and symbolic version:

1  2  3  4  5  6  7  8  9  10 11 12 13 14 15 16 17
18 5  19 20 21 22 23 24 25 26 27 28 29 30 31 32 33
20 34 35 36 37 19 38 39 15 26 21 33 13 22 40 1  41
42 5  5  43 7  6  44 30 8  45 5  23 19 19 3  31 16
46 47 37 19 40 48 49 17 11 50 51 9  19 52 53 10 54
5  44 3  7  51 6  23 55 30 17 56 10 51 4  16 25 21
22 50 19 31 57 24 58 16 38 36 59 15 8  28 40 13 11
21 15 16 41 32 49 22 23 19 46 18 27 40 19 60 13 47
17 29 37 19 61 19 39 3  16 51 20 36 34 62 63 53 31
55 40 6  38 8  19 7  41 19 23 5  43 29 51 20 34 55
38 19 3  54 50 48 2  11 25 27 20 5  61 14 37 31 23
16 29 36 6  3  41 11 30 50 14 53 37 28 19 52 20 51
40 63 47 42 34 22 19 18 11 50 51 20 36 21 58 44 3
6  15 51 18 7  32 50 16 53 61 28 36 8  53 48 19 19
34 20 59 12 30 35 53 47 56 2  4  8  38 39 50 55 19
11 36 28 45 40 20 31 21 23 5  7  28 32 37 57 15 16
3  36 14 19 13 12 63 56 29 19 51 6  26 20 11 33 13
19 19 33 26 56 40 26 36 9  23 42 1  14 54 21 33 5
11 51 10 17 26 29 43 48 20 46 27 23 20 30 55 56 36
4  37 25 1  18 5  10 42 40 39 23 44 62 11 31 58 19

HER>pl^VPk|1LTG2d
Np+B(#O%DWY.<*Kf)
By:cM+UZGW()L#zHJ
Spp7^l8*V3pO++RK2
_9M+ztjd|5FP+&4k/
p8R^FlO-*dCkF>2D(
#5+Kq%;2UcXGV.zL|
(G2Jfj#O+_NYz+@L9
d<M+b+ZR2FBcyA64K
-zlUV+^J+Op7<FBy-
U+R/5tE|DYBpbTMKO
2<clRJ|*5T4M.+&BF
z69Sy#+N|5FBc(;8R
lGFN^f524b.cV4t++
yBX1*:49CE>VUZ5-+
|c.3zBK(Op^.fMqG2
RcT+L16C<+FlWB|)L
++)WCzWcPOSHT/()p
|FkdW<7tB_YOB*-Cc
>MDHNpkSzZO8A|K;+

Alterations of the 340:
– In relation to the bigram peak at period 19:
Scheme: move 1 row down, 2 columns right and repeat (wrap around cipher): 340_1rd-2cr-w.txt (doranchak)
Grid 19 by 18, direction North-East (vertical) and 2 "?" symbols added: 340_19by18_n-e.txt
Grid 20 by 17, direction SW-SE (diagonal): 340_20by17_sw-se.txt
Grid 17 by 19, 17 symbols filler at end, vertically untransposed: 340_323_17.txt (smokie treats)
Grid 17 by 20, 16 symbols filler at end, vertically untransposed: 340_324_16.txt (smokie treats)
Grid 17 by 20, 15 symbols filler at end, vertically untransposed: 340_325_15.txt (smokie treats)
Grid 17 by 20, 14 symbols filler at end, vertically untransposed: 340_326_14.txt (smokie treats)
Grid 17 by 20, 13 symbols filler at end, vertically untransposed: 340_327_13.txt (smokie treats)
– In relation to the odd/even encoding scheme:
Evens only: 340evens.txt
Odds only: 340odds.txt
Randomized, shuffled: 340shuffled.txt (doranchak)

Tools/links/solvers:
– David Oranchak
Zodiac Killer Ciphers: http://www.zodiackillerciphers.com/
Zodiac Ciphers wiki: http://zodiackillerciphers.com/wiki/ind … =Main_Page
CryptoScope: http://www.oranchak.com/zodiac/webtoy/stats.html
340 Webtoy: http://www.oranchak.com/zodiac/webtoy/index.html
Zodiac Pattern Drawer: http://zodiackillerciphers.com/zodiac-pattern-drawer
Word Search Gadget: http://zodiackillerciphers.com/word-search-gadget
– glurk
ZKDecrypto: https://code.google.com/p/zkdecrypto/ and viewtopic.php?f=81&t=2268
– Michael Cole
The Zodiac Revisited: http://zodiacrevisited.com/
– Jarlve
AZdecrypt: http://zodiackillersite.com/viewtopic.php?f=81&t=2435

Visualizations:
– In relation to the bigram peak at period 19 and 15 (mirrored 340):
Doranchak’s ngram viewer.
Doranchak’s period calculator.
Doranchak’s fragment explorer.

Test ciphers:
I’d like to introduce a whole new range of ciphers to test on, mainly homophonic substitution but with different schemes. More will be added and particular schemes can be requested. All of these ciphers can have low count 1:1 substitutes. Please use the proper names of the ciphers when referencing them. There should be no errors in these ciphers, but the number of homophones per letter was handpicked each time to introduce a human element.

Perfect cycles:
c_p1.txt
c_p2.txt
c_p3.txt

Randomization of cycles:
(the number behind the "r" in the file format stands for the chance of randomization to occur, 1=0.1, 2=0.2, 3=0.3, no number is totally random)
(a higher "r" means more randomization in the cycles)
r1_p1.txt
r2_p2.txt
r3_p3.txt
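A minimal sketch of what the "r" value above could look like in an encoder, for anyone who wants to generate comparable test ciphers. This is not the generator used for the files in this thread: the function name `encode_homophonic`, the key layout, and the choice not to advance the cycle on a randomized pick are all assumptions made for illustration.

```python
import random


def encode_homophonic(plaintext, key, r=0.0, rng=None):
    """Encode plaintext with a homophonic key.

    key maps each plaintext letter to a list of cipher symbols (its cycle).
    r is the chance that a symbol is picked at random from the cycle
    instead of taking the next one in sequence (r=0.0 gives perfect cycles).
    """
    rng = rng or random.Random()
    next_index = {letter: 0 for letter in key}  # current position in each cycle
    out = []
    for letter in plaintext:
        cycle = key[letter]
        if rng.random() < r:
            # Randomized step: any homophone of this letter, chosen uniformly.
            # Assumption: the cycle position is left untouched in this case.
            out.append(rng.choice(cycle))
        else:
            # Sequential step: take the next homophone and advance the cycle.
            out.append(cycle[next_index[letter] % len(cycle)])
            next_index[letter] += 1
    return out


# Hypothetical toy key: 'A' has four homophones, 'M' is a 1:1 substitute.
key = {"A": [1, 2, 3, 4], "M": [5]}
print(encode_homophonic("MAMAMAMAMA", key))         # perfect cycles
print(encode_homophonic("MAMAMAMAMA", key, r=0.3))  # 30% chance of a random pick
```

With r=1.0 every pick is random, which corresponds to the "no number is totally random" case in the naming scheme.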

Perfect cycles + 1:1 substitutes:
(the number behind the "s" stands for the amount of medium-high count 1:1 substitutes in the cipher)
c_s4_p1.txt
c_s4_p2.txt
c_s4_p3.txt

Randomization of cycles + 1:1 substitutes:
(the number behind the "r" in the file format stands for the chance of randomization to occur, 1=0.1, 2=0.2, 3=0.3, no number is totally random)
(the number behind the "s" stands for the amount of medium-high count 1:1 substitutes in the cipher)
r1_s4_p1.txt
r2_s4_p2.txt
r3_s4_p3.txt

Columnar transposition after encoding:
Perfect cycles + 1 high count 1:1 substitute: c_s1_ct_p1.txt
Randomized cycles (0.2): r2_ct_p2.txt
Randomized cycles (0.1) + 1 high count 1:1 substitute + more orderly columnar transposition: r1_s1_ct_p3.txt

Half split perfect cycles + fully randomized cycles: (1st batch of mystery ciphers)
c_r_p1.txt (previously m1p1)
r_c_p1.txt (previously m2p1)

Polyalphabetic ciphers:
Perfect cycles + 1 high count 1:1 substitute + 15% chance of assigning a random symbol: c_s1_pi15_p1.txt

Mystery ciphers:
(the idea here is for the contributors to adapt their systems to find out and identify what is going on with the cycles, knowledge gained can then possibly be related to the 340)
m3p1.txt
m4p1.txt (codename cascade)
m5p1.txt
m6p28.txt
m7p77.txt

Smokie treats ciphers:
Perfect cycles + 3 high count 1:1 substitutes + 4 medium-high count wildcards/polyalphabetic symbols: smokie.txt
Mixed cycles + 3 medium-high count wildcards/polyalphabetic symbols: smokie2.txt
Mixed cycles + 1 medium count 1:1 substitute + some small diverse interruptions: smokie3.txt
Mixed cycles + 4 polyalphabetic symbols (cycle swap) + a few wildcards symbols: smokie4.txt
Polyalphabetic, one key for uneven rows and another key for even rows: smokie5.txt
Polyalphabetic, different keys for odd/even positions: smokie6.txt
Polyalphabetic, 3 keys (interval parts): smokie7.txt
Polyalphabetic, different keys for odd/even positions with added masking of the scheme: smokie8.txt
Partial columnar un-transposition: smokie9.txt
Partial columnar un-transposition: smokie10.txt
Partial columnar un-transposition including misalignments: smokie11.txt
Partial columnar un-transposition including misalignments: smokie12.txt
Partial columnar un-transposition including misalignments: smokie13.txt
Partial columnar un-transposition including misalignments: smokie14.txt
Partial columnar un-transposition including misalignments: smokie15.txt
Normal cipher with only 35 unique symbols: smokie16a.txt
Message randomly distributed over 55 different "period 19" lines: smokie17.txt
Bigram repeat period 19 emulation attempt without transposition: smokie18a.txt
Bigram repeat period 19 emulation attempt without transposition: smokie18b.txt

AZdecrypt

Posted : August 2, 2015 4:42 pm

Jarlve

(@jarlve)

Posts: 2547

Famed Member

Topic starter

By the way, smokie, I’m not asking you to analyze all the new ciphers that I made available. They are just there if you need them. When you have time I’d like you to check out the 2 new mystery ciphers, tyvm!

1. What is the probability of this being a Zodiac made cycle and not a random cycle? Can we merge them together into one symbol? Why or why not?

2. Considering the start position of 6 and end position of 325, is there any way to use plaintext frequencies to guess at what plaintext 6 and 37 map to, with the hypothetical assumption that there are no other symbols in the cycle?

1) I think it’s quite possible for it to be a real 2 symbol cycle. Merging symbols at this stage may be a bit too early, I believe. First we should apply a significant number of merges (or maybe combinations and permutations of a top list) to one of the perfectly cycling ciphers and then be able to solve it with one of the solvers, which would mean that the merges were more or less correct. If we can get that far then we could start with the 340 (my view).

2) We could use English or Zodiac letter frequencies and look at the top correlations for that particular count, certainly a possibility.

AZdecrypt

Posted : August 2, 2015 4:53 pm

daikon

(@daikon)

Posts: 179

Estimable Member

This was a great idea, Jarlve!
I believe I found a way to identify, and even quantify somewhat, whether the homophone substitutions were sequential (i.e. perfect cycles), or random. I need to run a few more tests to make sure I’m not suffering from confirmation bias :), but it looks promising. I’ll explain everything in detail later today, or tomorrow. According to my method, Mystery Cipher #1 tests weakly for random cycles, i.e. it likely has a small-to-medium amount of randomization in homophone assignments. Mystery Cipher #2 tests strongly for random cycles, i.e. it has a high amount of randomization in homophone assignments. Please confirm if I got it right? And why is there no mystery cipher with perfect cycles? 🙂

Posted : August 2, 2015 11:38 pm

Jarlve

(@jarlve)

Posts: 2547

Famed Member

Topic starter

Thanks daikon,

I’m not releasing any information until smokie has taken a look at it, or the other way around if it comes to that. All the contributors should have an unspoiled look at it. Looking forward to seeing your method. I devised a simple counting method which can give a cheap and rough estimate of cycling, though it can be tricked by high count 1:1 substitutes. I included this measurement (sum of non-repeats) with each cipher file.

AZdecrypt

Posted : August 3, 2015 12:29 am

daikon

(@daikon)

Posts: 179

Estimable Member

I tried running the test ciphers from this thread through my solver (that is based on simulated annealing algorithm) and noticed an interesting pattern. It seems that ciphers with random cycles are harder to solve. In the sense that it takes more restarts for the solver to converge on the correct solution. Not by much, but while the same cipher with perfect cycles gets solved every time, my auto-solver gets stuck on an incorrect solution about every other time in case of random cycles. I always thought that it only depended on the plaintext (i.e. usage of rare words, or unusual combinations of words, etc.) and the multiplicity (i.e. the number of unique symbols used in the cipher). Well, here we have the same plaintext encrypted using the same number of symbols, so clearly it also depends on how homophones are assigned.

I never noticed it myself, so it got me thinking. I’m not exactly sure why random assignment of homophones makes it harder to solve. My guess would be that there is less "order" in the cipher for the auto-solver to rely on: with homophones being more random, there are fewer crumbs to follow to get to the correct solution. So I checked the bigram repeats, which can be used to estimate the "randomness" of the cipher. Totally random ciphers (i.e. a random string of letters encoded with homophones) have very few bigram repeats, and so does a valid cipher with column transpositions, since transposition mixes up the letters. Interestingly enough, ciphers with perfect cycles have fewer bigram repeats. So, at least from the point of view of bigram repeats, they appear to be more random. And yet they are easier to solve. Ciphers with random cycles have more bigram repeats, so they appear to be less random by the bigram repeats metric, and yet they are harder to auto-solve. Generally, the more order there is in the cipher (= more bigram repeats), the easier it is to solve. Look at Z408, it has a huge number of bigram repeats, many trigram repeats, and even a few 4-gram repeats. And it was solved with pen and paper. There is a seeming contradiction here: more bigram repeats should be a sign of an easy cipher, yet when we compare perfect cycles vs. random cycles the opposite is true. It got me thinking, and I realized the reason must be that there is still some order in the perfect cycles. It’s just a hidden order, that the bigram repeats don’t reveal, and yet the auto-solver can "sense" that order and uses it to get to the solution. So, how do we tease that order out of the cipher? And if we can measure it somehow, maybe we can tell ciphers with perfect cycles from random cycles.
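For reference, a rough sketch of how bigram repeats can be counted. The counting convention here (a bigram seen n times contributes n - 1 repeats) is an assumption, and other tools may count differently. The `period` parameter is added so the same hypothetical function can also look at bigrams formed by symbols a fixed distance apart, as in the period 19 question from the first post.

```python
from collections import Counter


def bigram_repeats(cipher, period=1):
    """Count repeated bigrams, where a bigram is (cipher[i], cipher[i + period])."""
    bigrams = Counter(
        (cipher[i], cipher[i + period]) for i in range(len(cipher) - period)
    )
    # Assumed convention: a bigram seen n times contributes n - 1 repeats.
    return sum(n - 1 for n in bigrams.values())


print(bigram_repeats("ABABAB"))            # adjacent symbols
print(bigram_repeats("ABABAB", period=2))  # symbols two positions apart
```

The cipher can be a string or a list of symbol numbers, such as the numeric version of the 340 read row by row.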

Here’s the method I propose to identify whether sequential homophone assignment was used, or if the cycles were randomized, and to what degree. I call it "normalized variance of same symbol distances". The idea is pretty simple, but constructing a strong enough metric out of the data proved to be more difficult.

This is the idea: if you look at what happens when the plaintext is encrypted with perfect cycles (i.e. homophones are assigned strictly sequentially), you’ll see that it spreads unique symbols relatively far apart in the ciphertext. Observe the following (devious) plaintext: "MAMADADAGAGAPAPA". As you can see, the distance between the letter A and the next letter A is always 2. Let’s encrypt it using homophones ‘1234’ for the letter A and I’ll use ‘—’ for all other letters for clarity. If you assign homophones sequentially, you’ll get: "—1—2—3—4—1—2—3—4". As you can see, now all the distances between the same letters (from ‘1’ to ‘1’, from ‘2’ to ‘2’, etc.) became 8! If you randomize assignments, you might get something like: "—4—2—4—1—2—3—3—1". Now you have distances of 4 (for ‘4’), 6 (for ‘2’), 8 (for ‘1’) and 2 (for ‘3’). If you average all distances, you get 5.
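The toy example above can be checked with a few lines of code. This is only a sketch: the function names are made up for illustration, and a plain "-" stands in for the placeholder used for all non-A letters.

```python
from collections import defaultdict


def same_symbol_distances(cipher, placeholder="-"):
    """Map each symbol to the gaps between its consecutive occurrences."""
    last_seen = {}
    gaps = defaultdict(list)
    for pos, sym in enumerate(cipher):
        if sym == placeholder:
            continue  # placeholder stands for "some other letter", skip it
        if sym in last_seen:
            gaps[sym].append(pos - last_seen[sym])
        last_seen[sym] = pos
    return dict(gaps)


def average_distance(cipher):
    """Average of all same-symbol gaps in the cipher."""
    gaps = [g for sym_gaps in same_symbol_distances(cipher).values() for g in sym_gaps]
    return sum(gaps) / len(gaps)


sequential = "-1-2-3-4-1-2-3-4"  # homophones of A assigned in order
randomized = "-4-2-4-1-2-3-3-1"  # homophones of A assigned at random

print(same_symbol_distances(sequential))  # every gap is 8
print(average_distance(sequential))       # 8.0
print(same_symbol_distances(randomized))  # gaps 4, 6, 8 and 2
print(average_distance(randomized))       # 5.0
```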

Great, you’ll say, we are done! Just compute all distances for all symbols, average it all, and the bigger the number you get, the stronger the signal is for perfect cycles, right? Not quite. Observe the following (even more devious) plaintext: "MUAHAHAHAHAHAHAHA". Let’s say we decided for whatever reason not to use homophones for the letter H, but to encode it using a single symbol (1:1 substitution). Jarlve probably already knows where I’m going with that :). Now, it doesn’t matter how we assign homophones for the letter ‘A’, ‘H’ will always have a distance of 2, which means when we average the distances, it will strongly "pull" the final result towards low numbers. Basically, any 1:1 assignments will spoil the test, especially when you have a likely high frequency 1:1 assignment like ‘+’ in Z340.

I believe Jarlve’s counting of non-repeats method is based on the same idea (it represents the distance between the same symbols in ciphertext), and I suspect it also suffers from the same 1:1-substitutions-messing-things-up issue. Webtoy also computes the symbol distances table, but it omits the important "distance to itself" metric for some reason.

I’ve done a few tests with several test ciphers, and I confirmed my suspicion. Using the average symbol distance is not a good signal. It’s just too inconsistent across different ciphers. It does work if you take the same plaintext and compare different encryption schemes. For example, it can differentiate the test ciphers in this thread, but just barely. You get numbers like 53 for perfect cycles, 50 for random cycles, 57 for perfect cycles + 1:1 and 51 for random cycles + 1:1. So higher numbers (larger distances between same symbols) point towards perfect cycles. This supports our original observation that perfect cycles spread unique symbols further apart in the cipher. However, these numbers are just too close together to be a reliable signal, and if you try different ciphers (with different plaintexts), you get a wide range of numbers, from 30 to 60. Can’t really tell anything. Probably too dependent on the distances of letters in the plaintext. I was stuck at this point for a while.

Then it hit me. We need to normalize the data somehow. So that instead of absolute numbers we will be talking about relative numbers. But relative to what? I’ve looked at a couple of ideas, until I realized this. When perfect cycles are used, not only does it spread symbols further apart, but it also spreads them more *evenly*. For example, let’s say you have ‘EE’ in your plaintext. If you have perfect cycles, it’ll always be encoded as different symbols. But if you use random cycles, it’s possible that by chance you’ll encode both E’s with the same symbol. So in case of perfect cycles, you are likely to have very few small distances (most of which will be 1:1 substitutions). But in case of random cycles, you’ll have a mix of both relatively small and relatively large distances. No big surprise here, random cycles result in more random symbol distances.

Someone who knows statistics better than me can probably (pun intended 🙂) offer a better metric for "evenness" of a sequence of numbers, but I propose to use variance. The variance of a set (or a sequence) of numbers is calculated by computing the average (the mean) and then averaging the squared differences between each number and that mean. It’s kind of like having a bunch of points in space, computing the center point, and then finding the average squared distance from all points to that center. So now, instead of an absolute measure, we have a relative metric. It is a metric of a group of numbers relative to each other.

But enough theory, let’s look at the actual results. For each symbol in the cipher, I measured the distances between its occurrences in the cipher, then computed the variance of the distances. I dropped all symbols that occur fewer than 5 times, since you need at least 4 numbers to get a reliable variance, and you need 5 occurrences to get 4 distances between them. It also filters out 1:1 substitutions, as there will be just a few corresponding symbols in the cipher. Then I computed the median of the variances across all remaining symbols. The median is better than the average when you have a wide range of numbers. It further helps reduce the effect of high frequency potential 1:1 substitutions (‘+’ in Z340) on the result, as they’ll have the biggest variance.
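The steps just described can be sketched roughly as follows. This is a reconstruction from the description only, not the actual code behind the numbers below: the function name is hypothetical, population variance (`statistics.pvariance`) is an assumption since the post does not say whether population or sample variance was used, and no extra normalization step is applied because none is described.

```python
from collections import defaultdict
from statistics import median, pvariance


def median_distance_variance(cipher, min_count=5):
    """Median, over symbols, of the variance of same-symbol distances."""
    positions = defaultdict(list)
    for pos, sym in enumerate(cipher):
        positions[sym].append(pos)

    variances = []
    for occ in positions.values():
        if len(occ) < min_count:
            continue  # fewer than 5 occurrences gives fewer than 4 distances
        gaps = [b - a for a, b in zip(occ, occ[1:])]
        # Assumption: population variance of the gaps.
        variances.append(pvariance(gaps))

    # None signals that no symbol passed the count filter.
    return median(variances) if variances else None


# Evenly spaced symbols give zero variance for every symbol.
print(median_distance_variance("ABABABABAB"))
```

The input can be a string or a list of symbol numbers, so the numeric version of a cipher works directly.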

First, let’s analyse test ciphers from this thread. I looked at P3 first (third one in each group), as it had the most randomized cycles.

P3 perfect cycles:
Median variance: 362.1, average symbol distance: 52.7, symbols: 47

P3 random cycles:
Median variance: 775.9, average symbol distance: 50.0, symbols: 40

Low variance correlates very strongly with perfect cycles. As discussed above, a high average symbol distance also points to perfect cycles, but it’s a much weaker correlation.

P3 perfect cycles + 1:1:
Median variance: 352.7, average symbol distance: 57.0, symbols: 38

P3 random cycles + 1:1:
Median variance: 705.4, average symbol distance: 51.4, symbols: 36

Again, low variance (below 400) = perfect cycles, variance above 600 = random cycles. Number of observed symbols reduced due to introduction of 1:1 substitutions, which end up being present fewer than 5 times in the cipher, so they are filtered out.

P1 perfect cycles:
Median variance: 367.6, average symbol distance: 53.3, symbols: 47

P1 random cycles:
Median variance: 544.1, average symbol distance: 52.7, symbols: 46

P1 perfect cycles + 1:1:
Median variance: 284.4, average symbol distance: 49.4, symbols: 35

P1 random cycles + 1:1:
Median variance: 496.6, average symbol distance: 53.6, symbols: 37

The same observations hold for the P1 cipher, except its random cycles were less randomized, so it shows a smaller variance increase when going to random cycles, but still high enough to tell it apart from perfect cycles. Average symbol distance fails us here, as it shows a low number for "P1 perfect cycles + 1:1", when it should be a high number, which further shows that it’s an unreliable/weak signal.

Then I looked at Z408 without the random filler at the end and I got: Median variance: 647.5, average symbol distance: 50.0, symbols: 49. Which was surprising. It was testing strongly for a random cycles cipher (high variance). Then I remembered that Z followed mostly perfect cycles only in the first 2 parts (with a few errors), and then switched to practically random assignments in the 3rd part. So I tested the first 340 symbols of Z408 and the last 340 symbols (again, excluding the random filler) separately.

Z408, first 300 symbols:
Median variance: 286.0, average symbol distance: 45.0, symbols: 38

Z408, last 300 symbols, sans filler:
Median variance: 568.7, average symbol distance: 42.7, symbols: 36

Now it all fits! Beginning of Z408 tests strongly for perfect cycles (low variance), end of Z408 tests strongly for random cycles (high variance). My method seems to be working!

Then I tested random ciphers, just to see what kind of variance we can see in random data:

Random-340, perfect cycles (random string of 340 letters, encrypted using 63 symbols)
Median variance: 723.6, average symbol distance: 49.0, symbols: 41

Random-340, random cycles (random string of 340 letters, encrypted using 63 symbols)
Median variance: 1266.8, average symbol distance: 46.1, symbols: 42

As you can see, the random nature of the plaintext skews the result too much. The all-random cipher shows variance completely off the scale, as expected. But at least now we have a good measure of what to expect if we are looking at a random cipher.

Next I tested a homophonic substitution cipher (Z408 plaintext) with perfect cycles, and a random transposition applied over 17 columns (i.e. letters were rearranged the same way for each separate row), before and after encryption. I got very interesting results. I even tried 3 different transpositions (all random), to make sure the numbers are consistent. I was expecting variance close to Random-340, since rearranging/shuffling letters should randomize cycles pretty well. Instead, I got:

17-column transposition before encryption #1:
Median variance: 338.1, average symbol distance: 52.5, symbols: 44

17-column transposition before encryption #2:
Median variance: 318.5, average symbol distance: 52.3, symbols: 44

17-column transposition before encryption #3:
Median variance: 319.9, average symbol distance: 52.2, symbols: 44

17-column transposition after encryption #1:
Median variance: 400.8, average symbol distance: 52.2, symbols: 44

17-column transposition after encryption #2:
Median variance: 419.5, average symbol distance: 52.2, symbols: 44

17-column transposition after encryption #3:
Median variance: 436.4, average symbol distance: 52.4, symbols: 44

Which means transpositions hardly affect variance at all! Applying transposition after encryption affects it a bit, since it randomizes the cycles a bit. If you think about it, it actually makes sense. Transposition only shuffles/randomizes symbols within each row. But inter-row relationships and symbol distances between rows are much less affected/randomized.
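The kind of transposition tested above (one random rearrangement of the 17 columns, applied the same way to every row) can be sketched like this. The function name and the handling of a short final row are assumptions; for 340 and 408 symbols the last row is always complete, since both are multiples of 17.

```python
import random


def transpose_rows(cipher, width=17, rng=None):
    """Apply one random column permutation to every row of the grid."""
    rng = rng or random.Random()
    perm = list(range(width))
    rng.shuffle(perm)  # the same column order is reused for all rows

    out = []
    for start in range(0, len(cipher), width):
        row = cipher[start:start + width]
        # Assumption: columns missing from a short final row are skipped.
        out.extend(row[c] for c in perm if c < len(row))
    return out


# Two rows of 17 symbols: both rows are rearranged in the same way.
shuffled = transpose_rows(list(range(34)), rng=random.Random(1))
print(shuffled[:17])
print(shuffled[17:])
```

Because symbols never leave their row, a symbol moves by at most 16 positions, which fits the observation that distances between rows are much less disturbed.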

Then I tested Mystery Ciphers from this thread.

M1:
Median variance: 627.5, average symbol distance: 47.0, symbols: 40

M2:
Median variance: 818.7, average symbol distance: 43.2, symbols: 33

My conclusions: M1 has medium randomization of cycles and likely only a few 1:1 substitutions. M2 has high randomization of cycles with likely a good number of 1:1 substitutions. I’m less sure about 1:1 substitutions (based on the number of symbols that were not filtered out, which can be affected by randomization of cycles), but I have high confidence that cycles are random in both M1 and M2 ciphers.

Finally, let’s test the guest of honor. 🙂

Z340, sans signature:
Median variance: 657.1, average symbol distance: 47.1, symbols: 34

What does this mean? First off, it shows that Z340 is not a totally random string of symbols. That’s the good news. The bad news, it tests strongly for random cycles. Somewhere between P1 and P3 (i.e. more random than P1, but less random than P3). However, the "degree of randomness" could be affected by the plaintext being less or more repetitive, so you can’t be too sure about it. But the homophone cycles are definitely pretty random. Now the million dollar question is — what randomized the cycles? I was working on the theory that it was due to a transposition. Now that I saw variance numbers for several random 17-column transpositions, I’m not so sure. The variance of Z340 is much bigger. Back to the drawing board.

I’m too tired to edit this right now, so I’m going to post everything with all the typos intact, please excuse the mess. I’ll re-read and fix everything tomorrow…

Posted : August 3, 2015 12:37 pm

smokie treats

(@smokie-treats)

Posts: 1626

Noble Member

K, give me a little time to work on M1 and M2. Thanks.

Posted : August 3, 2015 1:54 pm

masootz

(@masootz)

Posts: 415

Reputable Member

daikon – excellent work. i have a question about random vs perfect cycles. you mentioned "if you use random cycles, it’s possible that by chance you’ll encode both E’s with the same symbol". is it more complicated if they’re not truly either? if zodiac started with perfect cycles but then substituted to avoid anything obvious, it wouldn’t truly be random (thus you’d expect not to see some adjacent letters encoded with the same symbol). could that cause issues with finding the cycles or, no, anything that isn’t perfect is random even if it’s manipulated to remove clues?

Posted : August 3, 2015 5:55 pm

Jarlve

(@jarlve)

Posts: 2547

Famed Member

Topic starter

Take all the time you need smokie.

It got me thinking, and I realized the reason must be that there is still some order in the perfect cycles. It’s just a hidden order, that the bigram repeats don’t reveal, and yet the auto-solver can "sense" that order and uses it to get to the solution. So, how do we tease that order out of the cipher? And if we can measure it somehow, maybe we can tell ciphers with perfect cycles from random cycles.

It may be solver specific but I can’t say since I haven’t tested these ciphers with AZdecrypt. If true then I believe it could be due to the symbols having a more uniform spread, since that promotes more overall change. It could also be that the cipher has more variance because the randomization I added just selects a symbol at random every time, which promotes more uneven counts in the cycles.

I believe Jarlve’s counting of non-repeats method is based on the same idea (it represents the distance between the same symbols in ciphertext), and I suspect it also suffers from the same 1:1-substitutions-messing-things-up issue.

Yes, and I noted in my previous post that it could be tricked by 1:1 substitutes. But on the other hand, it could also be used to identify them by simply removing them and seeing if the normalized score improves.

Then it hit me. We need to normalize the data somehow. So that instead of absolute numbers we will be talking about relative numbers. But relative to what? I’ve looked at a couple of ideas, until I realized this. When perfect cycles are used, not only does it spread symbols further apart, but it also spreads them more *evenly*. For example, let’s say you have ‘EE’ in your plaintext. If you have perfect cycles, it’ll always be encoded as different symbols. But if you use random cycles, it’s possible that by chance you’ll encode both E’s with the same symbol. So in case of perfect cycles, you are likely to have very few small distances (most of which will be 1:1 substitutions). But in case of random cycles, you’ll have a mix of both relatively small and relatively large distances. No big surprise here, random cycles result in more random symbol distances.

Yes, it will be spread out more evenly. If you remember, it’s something that I said in the previous thread. I have another measurement system in place that seems to be a bit similar to yours. For each symbol: abs((sum_of_symbol_positions/symbol_count)-(cipher_length/2))*log(symbol_count+3). It may require a bit of tweaking still. I think your system may be a bit more thorough.
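A small sketch of the per-symbol formula quoted above, for anyone who wants to try it. The post does not say whether positions are counted from 0 or 1, which logarithm is meant, or how the per-symbol values are combined into one score, so this assumes 1-based positions and the natural logarithm and simply returns the value for each symbol. The function name is made up.

```python
import math
from collections import defaultdict


def position_balance_scores(cipher):
    """abs(mean position - cipher_length / 2) * log(count + 3), per symbol."""
    positions = defaultdict(list)
    for pos, sym in enumerate(cipher, start=1):  # assumption: 1-based positions
        positions[sym].append(pos)

    half = len(cipher) / 2
    return {
        sym: abs(sum(occ) / len(occ) - half) * math.log(len(occ) + 3)
        for sym, occ in positions.items()
    }


# Symbols bunched at one end of the cipher score higher than evenly spread ones.
for sym, score in sorted(position_balance_scores("AAABBBABAB").items()):
    print(sym, round(score, 3))
```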

I’m glad to see that other people are also coming up with their own measurement systems to find out things about the ciphers, it’s a very exciting thing to do!

Number of observed symbols reduced due to introduction of 1:1 substitutions, which end up being present fewer than 5 times in the cipher, so they are filtered out.

No, they are all medium-high count 1:1 substitutes, it’s just that there are 4 of these in that cipher type. Because I did not assign homophones to these symbols I had to assign more homophones to other symbols, resulting in more low count symbols.

Which means transpositions hardly affect variance at all! Applying transposition after encryption affects it a bit, since it randomizes the cycles a bit. If you think about it, it actually makes sense. Transposition only shuffles/randomizes symbols within each row. But inter-row relationships and symbol distances between rows are much less affected/randomized.

This is something I noted and documented as well in the previous thread. It’s actually transposition schemes that are mainly horizontal (or subtle) that do not upset our measurement systems as much. Also because we try to capture the whole cipher in a single number. smokie treats actually made a find that shows there are things which change (versus expected) with columnar transposition if you compare various parts of the cipher. That also added more evidence towards the 340 not being columnar transposition after encoding. A previous test I came up with is to score and average many different columnar transpositions and then compare this value with the actual one. There should be a somewhat static relation (factor) between the averaged number and the actual one. I dug it up:

I devised some kind of sketchy test to see if columnar transposition – or possibly any other – was done after symbols assignment (probably more accurate for cyclic h.s.) where I average data from the non-repeats (unique strings) for a large number of random columnar transpositions. I found that then multiplying this average by 1.05 gives the likely score (approximation) the cipher would have, for the 408 this is: 4683.5/4692. Where the bold number is the approximation. For the 340: 4457/4462. I think it could mean that the cipher was not columnary transposed after the symbols were assigned as I have thought earlier since then the number of the non-repeats would likely have shifted away and the approximation would not match. I’m less sure of the whole columnar transposition thing atm.

Since part of this thread is to find out what is possibly going on with the cycles in the 340 (possibly to start merging homophones at one point) and that transposition after encoding is possible I find it very important to be able to somewhat rule out various transposition schemes that fit what we are looking for (mainly horizontal).

It seems that by your measurement system the 340 correlates more with m1p1. I also see that a lot of symbols were dropped due to counts being fewer than 5, which in turn may be a hint towards some wildcards/1:1 substitutes being present in the cipher. Thank you very much for comparing all these ciphers by the way.

Here is the observation of smokie in relation to columnar transposition (after encoding). It may just be that some columnar transpositions do not display this effect, but I think in general it does shift the cycles around awkwardly and that this should be measurable.

3. Except when you look at the Mystery Cipher, which inexplicably has more cycles in the second half than the first half. Jarlve made the Mystery Cipher perfectly cyclic and included only one high count symbol, 17, with count of 24.

Now the million dollar question is — what randomized the cycles? I was working on the theory that it was due to a transposition. Now that I saw variance numbers for several random 17-column transpositions, I’m not so sure.

Yes, it is questions like these that this thread, as ongoing work, tries to answer. I’ve gone through the same thought process as you, coming up with almost similar systems and observations. Maybe we are just equally mad, or in the same boat (I think you are a bit faster though). It’s pretty funny. When I came up with several of my measurement systems I concluded that some transposition had to be going on because I was expecting "less variance" (in your terms). I tried so many things, and decided for a while on columnar transposition. Thanks to smokie I have found out that most of the "variance", for my measurement system anyway, was due to some medium-high count symbols not cycling well (possibly wildcards/1:1 substitutes/or not part of the cipher). These symbols are number 5, 19, 20 and 51 by appearance, remove them and see if things improve.

AZdecrypt

Posted : August 3, 2015 7:37 pm

daikon

(@daikon)

Posts: 179

Estimable Member

daikon – excellent work. i have a question about random vs perfect cycles. you mentioned "if you use random cycles, it’s possible that by chance you’ll encode both E’s with the same symbol". is it more complicated if they’re not truly either? if zodiac started with perfect cycles but then substituted to avoid anything obvious, it wouldn’t truly be random (thus you’d expect not to see some adjacent letters encoded with the same symbol). could that cause issues with finding the cycles or, no, anything that isn’t perfect is random even if it’s manipulated to remove clues?

Well, what is random is actually quite a complicated question. Technically, the string "AAAAA" could be a string of random letters, if you are lucky enough (or unlucky?) to randomly choose the same letter 5 times in a row. Conversely, if you ask someone to write a string of random letters, what they write will likely not be random at all, in the sense that an average person would try to avoid putting the same letters next to each other, and will probably go as far as trying to spread the same letters as far apart as possible. In fact, that would be quite the opposite of truly random. Without using a computer, about the only reliable way to generate truly random numbers is to use some random physical process, like flipping a coin, or rolling dice.

So you are probably correct, if Z was for some reason using random cycles, he would still try to avoid encoding doubled letters to the same symbols in the cipher, even though it would technically not be random any more. But the thing is, I see no reason why Z would randomize cycles on purpose. Z340 was likely an incremental improvement on the encryption method used in Z408. And that improvement was likely based on how Z408 was cracked. I.e. he learned how the Hardens solved Z408, and probably thought: "Ok, what can I do to make sure my next cipher can’t be cracked the same way?". As we know, Z408 was solved using the crib "kill". So his main goal would be to prevent using cribs to solve his next cipher. How? One obvious way is to stop using the word "kill". But then someone might use a different crib, and Z obviously had no way of knowing what it could be. So he probably did something that would prevent anyone from plugging in any words at all. My guess would be that he rearranged letters in the plaintext somehow, or mixed them up. That would have a similar effect on the homophone cycles, randomizing them as well. So the signs of random cycles that we are seeing in Z340 are likely not the result of an intention to randomize them on purpose, but a side effect of the extra step he took to make Z340 harder to crack.

Or he could’ve simply been sloppy about assigning his homophones, which would randomize the cycles somewhat, which is what likely happened in the 3rd part of Z408…

Posted : August 4, 2015 4:50 am

daikon

(@daikon)

Posts: 179

Estimable Member

I’m glad to see that other people are also coming up with their own measurement systems to find out things about the ciphers, it’s a very exciting thing to do!

It was definitely a great idea to start this thread! We need to learn more about different types of encryptions, to find which one Z340 behaves and looks more like, and maybe it’ll help us solve it. It would be half the battle to identify the exact encryption method used. Or at least rule out encryption methods that couldn’t have been used.

Number of observed symbols reduced due to introduction of 1:1 substitutions, which end up being present fewer than 5 times in the cipher, so they are filtered out.
No, they are all medium-high count 1:1 substitutes, it’s just there are 4 of these in that cipher type. Because I did not assign homophones to these symbols I had to assign more homophones to other symbols resulting in more low count symbols.

Ah, yes, of course you are right! You had to spread the symbols to other letters to keep it at 63 overall symbols.

Which means transpositions don’t affect variance almost at all! Applying transposition after encryption affects it a bit, since it randomizes the cycles a bit. If you think about it, it actually makes sense. Transposition only shuffles/randomizes symbols in each row. But intra-row relationships and symbol distances between rows are much less affected/randomized.

This is something I noted and documented as well in the previous thread.

I’ve thought about it some more and I have to say that at least my variance metric doesn’t rule out letter transpositions for Z340, unfortunately. In fact, it’s the opposite. Since transpositions within the row boundaries have very little effect on the overall variance, we can’t tell if Z340’s medium-to-high variance is just the sign of random cycles, or of both random cycles + transposition. However, I’ve now done tests with various *columnar* transpositions (when you read the cipher along the columns, top-to-bottom, left-to-right) applied after the homophone substitutions and I’m happy to report that this type of transposition does affect the variance metric a lot, by increasing it towards nearly the "random plaintext" level. I actually did every columnar transposition of Z340 of each width between 2 and 40, in both directions (encoding and decoding), without rearranging the columns (i.e. no key), and every time the variance metric increased by at least 50%, and more than doubled for most of the widths. So I think we can safely rule out at least columnar transpositions after homophones. I’ve tested a bunch of other ciphers and they all behaved the same: variance goes up if you apply columnar transposition. The exception was 2 test ciphers where I specifically applied a transposition during encryption, after encoding homophones, and my test correctly showed a significant decrease in variance at the correct matrix width (the one used during the encryption). So I would have detected a columnar transposition if there was one in Z340. But it still doesn’t rule out columnar transpositions on the plaintext, *before* homophone substitutions, unfortunately.
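
A minimal sketch of the unkeyed columnar transposition scan described above, for anyone who wants to repeat it. The variance metric itself is defined in an earlier message and is not reproduced here, so `metric` is a placeholder parameter; the function names are illustrative only.

```python
def columnar_encode(seq, width):
    """Write seq row by row into `width` columns, then read column by column."""
    return [seq[i] for col in range(width) for i in range(col, len(seq), width)]

def columnar_decode(seq, width):
    """Invert columnar_encode for the same width (also handles a ragged last row)."""
    n = len(seq)
    out = [None] * n
    pos = 0
    for col in range(width):
        for i in range(col, n, width):
            out[i] = seq[pos]
            pos += 1
    return out

def scan_widths(seq, metric, widths=range(2, 41)):
    """Return {width: (metric after encoding, metric after decoding)}."""
    return {
        w: (metric(columnar_encode(seq, w)), metric(columnar_decode(seq, w)))
        for w in widths
    }
```

Comparing each pair of values against `metric(seq)` for the untouched cipher gives the kind of table the paragraph above refers to.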

I’ve gone through the same thought process as you, coming up with almost similar systems and observations. Maybe we are just equally mad, or in the same boat (I think you are a bit faster though). It’s pretty funny.

I’m pretty sure I didn’t invent anything new. Someone somewhere probably already thought about the exact same thing a long time ago. But it was new to me, and it was exciting to figure it out on my own, so I regret nothing. 🙂

These symbols are number 5, 19, 20 and 51 by appearance, remove them and see if things improve.

Using Webtoy’s ASCII representation, which is what I use for my tests, I got ‘p’, ‘+’, ‘B’ and ‘F’, correct? Removing those from Z340 does lower the variance metric, but definitely not enough to classify the remaining cipher as "perfect cycles". So either those are not 1:1s, or they are not the only medium frequency 1:1s, or they are, but the cycles are just random. Basically, can’t make any conclusions with any degree of certainty.

Posted : August 4, 2015 5:51 am

smokie treats

(@smokie-treats)

Posts: 1626

Noble Member

So he probably did something that would prevent anyone from plugging in any words at all. My guess would be that he rearranged letters in the plaintext somehow, or mixed them up. That would have a similar effect on the homophone cycles, randomizing them as well. So the signs of random cycles that we are seeing in Z340 are likely not the result of an intention to randomize them on purpose, but a side effect of the extra step he took to make Z340 harder to crack. Or he could’ve simply been sloppy about assigning his homophones, which would randomize the cycles somewhat, which is what likely happened in the 3rd part of Z408…

O.k., but if Zodiac rearranged the plaintext somehow, he could have still encoded the rearranged plaintext with perfect cycles and they would show up in our analysis as perfect cycles. It would just be perfect cycles of symbols that map to rearranged plaintext. I don’t think that he was that smart. My scenario has him saying to himself while encoding the 340, "Ha, ha ha, I’ll show you," and simply peppering the message with a bunch of +’s, B’s, F’s and maybe q’s instead of the cycle symbols. Just like he did with the solid triangle in the 408, only more. Anybody could do that with 9 beers in their system and make the damn thing impossible to solve because every time you try to manually change a symbol assignment, half of the words that you were working on disappear. By the way I am really happy that you are thinking about the cycles. You are fast, that’s for sure. I am retooling my spreadsheet a bit each night, so if you want answers to the messages that you worked on, maybe check Jarlve’s post. I hate to keep you waiting, but will have my own answers in some time. Jarlve did a brilliant job on M1. I like how he made the distribution of symbol assignments in the key much more uniform and then somehow caused the cycles to not be perfect. The distribution of cycling is also somewhat flat across the list of symbols. I still don’t know exactly what he did, but there is an amazingly low count of perfect cycles. I would have thought that there would have been many more random cycles than that.

Posted : August 4, 2015 5:56 am

daikon

(@daikon)

Posts: 179

Estimable Member

O.k., but if Zodiac rearranged the plaintext somehow, he could have still encoded the rearranged plaintext with perfect cycles and they would show up in our analysis as perfect cycles. It would just be perfect cycles of symbols that map to rearranged plaintext.

Not quite! I actually thought of the same thing. That’s why I also tested 2 special "random" ciphers. One was a random string of letters, that was subsequently encoded using perfect cycles. And the other one was just pure mess of random symbols (which would be the same as random plaintext encoded using random cycles). If you look back at my original message about the variance metric, you’ll see that the first random cipher tested strongly for random cycles (even though in fact it was encoded using perfect cycles). So at least my variance metric can’t tell randomness in the plaintext from randomness in the cycles. Going back to Z doing something with the plaintext before homophonic substitutions — if he did a good job at that extra step, the plaintext would look and behave closely to random text (which is what a good encryption scheme does — it makes the result look totally random). And Z340’s relatively high variance shows just that — relatively high degree of randomness. But from what — random cycles, or induced randomness of underlying plaintext — I can’t tell yet.
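
To make the difference between the two encoding styles concrete, here is a small sketch of a homophonic encoder with either perfect or random cycles. The key layout (a dict from letter to its list of symbols) and the function name are assumptions made for illustration, not the format of any tool mentioned in this thread.

```python
import random

def encode_homophonic(plaintext, key, mode="perfect", seed=0):
    """Encode plaintext with a homophonic key {letter: [symbols]}.

    mode="perfect": each letter walks through its symbols in fixed order.
    mode="random":  each occurrence picks one of the letter's symbols at random.
    """
    rng = random.Random(seed)
    next_index = {letter: 0 for letter in key}
    out = []
    for letter in plaintext:
        symbols = key[letter]
        if mode == "perfect":
            out.append(symbols[next_index[letter] % len(symbols)])
            next_index[letter] += 1
        else:
            out.append(rng.choice(symbols))
    return out
```

For example, with `key = {"E": [1, 2, 3], "T": [4, 5]}` the perfect-cycle encoding of "EETEETE" is `[1, 2, 4, 3, 1, 5, 2]`. Feeding a random string of letters through the perfect mode gives the first kind of test cipher described above.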

By the way, I now tested variance of different parts of Z340 (i.e. beginning, middle, end), and I even ran a "sliding window" across Z340, where I selected 200 sequential symbols, starting with 1st symbol in the cipher, 2nd, 3rd and so forth, and I’m seeing a fairly even level of variance throughout. Unfortunately you can’t compute variance of a smaller section, such as individual rows, as you need to have a long enough string to get a good representation of distances between the same symbols. But I think it is fairly safe to say that there are no sudden changes in the way different parts of Z340 were encoded. Unlike Z408, which has a distinct increase of variance (almost twice!) towards the end, when by the 3rd section Z abandoned his initial attempt at maintaining perfect cycles, either by design, or because he got tired and sloppy.

Which I think can also rule out that different parts (halves?) of Z340 were encrypted using different methods or different homophones, which I’ve seen suggested before. I’m not 100% sure about it yet, as I need to do more tests to confirm. Or maybe the change in encryption was very granular, like for every row, as you can’t measure change in variance in such short sections.
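
The sliding-window check described above can be sketched like this. Again the variance metric is left as a placeholder parameter, since its definition is in an earlier message; only the windowing is shown.

```python
def sliding_windows(seq, size=200):
    """Yield (start, window) for every run of `size` consecutive symbols."""
    for start in range(len(seq) - size + 1):
        yield start, seq[start:start + size]

def windowed_metric(seq, metric, size=200):
    """Apply `metric` to each window; returns one value per start position."""
    return [metric(window) for _, window in sliding_windows(seq, size)]
```

For a 340-symbol cipher and a window of 200 this yields 141 values, one per starting position, which can then be plotted or scanned for sudden jumps.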

I don’t think that he was that smart. My scenario has him saying to himself while encoding the 340, "Ha, ha ha, I’ll show you," and simply peppering the message with a bunch of +’s, B’s, F’s and maybe q’s instead of the cycle symbols.

Yes, that’s another distinct possibility! The presence of nulls, or random meaningless symbols in the ciphertext, that should be dropped. They can be specific symbols, which should be fairly simple to test: just test removing every combination of up to N specific symbols. Or they can be symbols at specific positions in the ciphertext: remove every 4th, or 3rd, or even every other symbol and feed the rest to the auto-solvers. Has anyone tried that by the way? I’m worried that removing symbols from Z340 might bring the multiplicity to such high levels that it couldn’t be solved any more. And yet another scheme would be to insert nulls before, or after, or in the middle of every *word*. However, if done excessively, it can result in a cipher that can’t be reliably decrypted even if the key is known, because the word boundaries are not known (no spaces). For example, look at this decrypted plaintext, where I added nulls both before and after each word: "…ATIGONENORTHEEFROMETHERETOL…". Can you read it? I doubt it. Let me lowercase the nulls and insert spaces: "…At iGOn eNORTHe eFROMe tHEREt oL…" (first ‘A’ and last ‘L’ are from previous/next words). That is what might have happened: Z came up with such a convoluted scheme that it can’t be decrypted even if you know the key. And I highly doubt that Z tested whether his new cipher could be decrypted, considering he left several errors in Z408 which he could’ve easily found and corrected if he had actually tried decrypting it.
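
A rough sketch of how the two simpler null tests could be generated before feeding candidates to an auto-solver. The function names are made up for illustration, and combinations are used rather than permutations because the order in which symbols are removed does not matter.

```python
from itertools import combinations

def drop_every_nth(seq, n, offset=0):
    """Remove the symbols at 0-based positions offset, offset+n, offset+2n, ..."""
    return [s for i, s in enumerate(seq) if i % n != offset]

def drop_symbols(seq, nulls):
    """Remove every occurrence of the given symbols."""
    nulls = set(nulls)
    return [s for s in seq if s not in nulls]

def null_candidates(seq, max_nulls):
    """Yield (removed_symbols, remaining_cipher) for 1..max_nulls distinct symbols."""
    alphabet = sorted(set(seq))
    for k in range(1, max_nulls + 1):
        for combo in combinations(alphabet, k):
            yield combo, drop_symbols(seq, combo)
```

Note that the number of candidates grows quickly: with 63 symbols there are already 1953 ways to pick 2 of them, so N has to stay small for this to be practical.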

I hate to keep you waiting, but will have my own answers in some time.

No worries, take your time! I’m fast because my system only detects if the cycles are perfect or random, and your system will likely identify the actual cycles.

Posted : August 4, 2015 6:49 am

Jarlve

(@jarlve)

Posts: 2547

Famed Member

Topic starter

Using Webtoy’s ASCII representation, which is what I use for my tests, I got ‘p’, ‘+’, ‘B’ and ‘F’, correct?

Yes.

Z340 was likely an incremental improvement on the encryption method used in Z408. And that improvement was likely based on how Z408 was cracked. I.e. he learned how the Hardens solved Z408, and probably thought: "Ok, what can I do to make sure my next cipher can’t be cracked the same way?". As we know, Z408 was solved using the crib "kill". So his main goal would be to prevent using cribs to solve his next cipher. How? One obvious way is to stop using the word "kill". But then someone might use a different crib, and Z obviously had no way of knowing what it could be. So he probably did something that would prevent anyone from plugging in any words at all. My guess would be that he rearranged letters in the plaintext somehow, or mixed them up.

I agree that the 340 was probably made in response to the 408 and that he found out how it was solved and then improvised.

That’s why I also tested 2 special "random" ciphers. One was a random string of letters, that was subsequently encoded using perfect cycles. And the other one was just pure mess of random symbols (which would be the same as random plaintext encoded using random cycles). If you look back at my original message about the variance metric, you’ll see that the first random cipher tested strongly for random cycles (even though in fact it was encoded using perfect cycles).

That’s really interesting, and may be very useful. I’ll need to implement your metric in one of my programs as soon as possible to do some testing.

By the way I found something a bit interesting perhaps. Note that the last 10 rows of the 340 have only 1 bigram repeat. Now increase the period to 3 and be slightly dazzled. At the start of the previous thread I discussed the relatively low bigram repeat counts of the 340 in relation to smokie’s wildcard idea: that the Zodiac possibly revised the cipher to remove/replace repeats, most likely visually, and therefore would not easily spot bigrams at a distance. It’s just an idea to go with this strange observation.

AZdecrypt

Posted : August 4, 2015 10:16 pm

doranchak

(@doranchak)

Posts: 2614

Member Admin

By the way I found something a bit interesting perhaps. Note that the last 10 rows of the 340 have only 1 bigram repeat. Now increase the period to 3 and be slightly dazzled.

What do you mean by this? How do you increase the period to 3?

http://zodiackillerciphers.com

Posted : August 4, 2015 10:23 pm

Jarlve

(@jarlve)

Posts: 2547

Famed Member

Topic starter

By the way I found something a bit interesting perhaps. Note that the last 10 rows of the 340 have only 1 bigram repeat. Now increase the period to 3 and be slightly dazzled.

What do you mean by this? How do you increase the period to 3?

I used to call it bigrams at a distance but daikon has been using the term period. It’s the distance between the first and the last symbol of the bigram; a normal bigram has a distance of 1. A distance of 3 would be A..B.
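
For anyone who wants to try this, a minimal sketch of counting bigram repeats at a given period. The counting convention here (each bigram that occurs c times contributes c - 1 repeats) is an assumption; other tools may count repeats differently.

```python
from collections import Counter

def bigram_repeats(seq, period=1):
    """Count repeats among the pairs (seq[i], seq[i + period])."""
    counts = Counter(zip(seq, seq[period:]))
    return sum(c - 1 for c in counts.values() if c > 1)
```

To look at only the last 10 rows of a 17-wide cipher, pass `cipher[-170:]` as `seq` and compare `period=1` with `period=3`.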

Edit: added an image of the involved symbols.

AZdecrypt

Posted : August 4, 2015 10:34 pm
