Not all homophonic substitutions can be auto-solved

daikon · 2015-07-07T10:42:16Z

... we have software that can solve the 408 in a matter of seconds. People have hammered the 340 with all this great software for years and never cracked it. This would suggest it is not homophonic substitution.

I wanted to address this common misconception. There seems to be a general assumption that any homophonic substitution can be automatically solved by currently available programs that implement hill-climb algorithms, like ZKDecrypto and AZDecrypt. And I'm not talking about carefully crafted plaintext that doesn't have certain common letters (like passages from the novel "Gadsby"), or that has an abundance of rare letters, or a very high or very low IoC. That's actually not the problem. Nor am I talking about plaintexts in languages other than English, as that's solvable too, obviously. But there is still a way to fool a hill-climb algorithm quite easily.

Here is an example cipher I've constructed to illustrate this problem. It is a straight homophonic substitution, without any tricks of any kind. Each ciphertext symbol translates to only one plaintext letter, and there are no additional transformations to the plaintext that need to be done to read it:

9V[J%:;*S#K&_/$T'_A( ]#4$B)$M$+USP<$`# W,Q5L1a#Y-$&E2$$F#G3 9$.X:*N/#$_+VW#6J#=; ^C,T0ZD$^#HI`#R7K#>Y ?-P8L^A.^*QEJ4#a#@`# U1<9#K^B+^,RFL5#a#=` #S2:;#J$C#>#3<$D#[#9 $PG#6$Q'H7RO#8M#:^A- ^B?/$#X.@(P$*Z=)]C;% IE[#+]F#^D,A&a#VG'`` _B(4K<)5L0>?1$%$26YH &QZC9'IEND3FJ@#7^A-( =#R[/8K)$#:%4L0T]G.> #_*P^1$B2U[H&$$'IE]# +[F#^C,(?#;D)b#35YA% 6J<9B&b#C'7K:G/b#$D( $)8L%HI

It has a fairly normal plaintext too. In fact, it's a passage from one of the Zodiac letters I picked at random. The cipher has exactly the same number of unique symbols as Z340, and it's slightly longer at 347 letters, so it should actually be easier to solve. Yet neither ZKDecrypto nor AZDecrypt can crack it at default settings. I've used the highest setting of "keys per cipher" for AZDecrypt and I let ZKDecrypto run for nearly an hour. I can see that it gets close at times, and I can even find a few correct words in the jumble of letters, but I can only see them because I know the plaintext. Without knowing the plaintext, it is still generally unreadable. It is by no means a clever way of constructing a cipher, and it should actually be fairly clear how I did it after manual analysis, but it does appear to fool automated cracking attempts. I'll post a hint tomorrow, if nobody solves it before then.

EDIT: It appears I misunderstood some of the options of ZKDecrypto, and how to use it in general. So I stand corrected, the cipher above *can* be auto-solved. But definitely not in seconds.

EDIT#2: See this cipher which proved to be truly unsolvable without manual analysis and careful merging of cycles.

daikon

(@daikon)

Posts: 179

Estimable Member

Topic starter

OK, I cracked it (daikon, please confirm the solution I pm’d you), but it required many manual steps. Hint: Reduce the multiplicity before trying to crack this.

Yes, congrats, you did solve it! Well done. It’s not the cleanest solution, but close enough to read the information I was trying to convey in it. I’m quite amazed! I was actually convinced that it would never be cracked, as I had to reduce the unique symbol count so low before ZKD/AZD were able to crack it automatically that it was mostly a monoalphabetic substitution at that point! I’m guessing you merged repeated cycles manually to improve multiplicity, right? How many unique symbols did you end up with before it was cracked?

Posted : July 12, 2015 12:20 am

doranchak

(@doranchak)

Posts: 2614

Member Admin

I got it down to 49 by identifying two really strong cycles and merging their constituent symbols. It was enough to get one of the solvers to produce a few small bits and pieces of recognizable text. I then saw what you were going for in the plain text and did a manual solve for the rest.

I identified four cycles in total but was afraid of false positives.

I think if this cipher did not have the cycles in it, the solution would be out of reach (hopefully only temporarily). You should encode another one with the cycles randomized.
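For anyone who wants to try the cycle-merging step themselves, here is a minimal sketch in Python. It assumes the cipher is a plain string with one character per symbol and whitespace removed. The names `alternation_score`, `candidate_cycles` and `merge_symbols`, and the 0.9 threshold, are made up for illustration and are not taken from any of the solvers mentioned in this thread. The idea: if two symbols are homophones used in strict rotation, then the subsequence containing only those two symbols should alternate almost perfectly.

```python
from itertools import combinations


def alternation_score(cipher, a, b):
    """Fraction of adjacent pairs that alternate in the a/b-only subsequence."""
    seq = [c for c in cipher if c in (a, b)]
    if len(seq) < 4:
        # Too few occurrences to say anything meaningful.
        return 0.0
    alternations = sum(1 for x, y in zip(seq, seq[1:]) if x != y)
    return alternations / (len(seq) - 1)


def candidate_cycles(cipher, min_score=0.9):
    """List (score, a, b) for symbol pairs that alternate strongly, best first."""
    found = []
    for a, b in combinations(sorted(set(cipher)), 2):
        s = alternation_score(cipher, a, b)
        if s >= min_score:
            found.append((s, a, b))
    return sorted(found, reverse=True)


def merge_symbols(cipher, keep, drop):
    """Treat `drop` as the same homophone as `keep`, lowering the symbol count."""
    return cipher.replace(drop, keep)
```

Short alternating runs also turn up by chance, so a high score is only a candidate, not proof. Each merge removes one unique symbol, and the merged cipher can then be fed back to a solver.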

http://zodiackillerciphers.com

Posted : July 12, 2015 12:27 am

daikon

(@daikon)

Posts: 179

Estimable Member

Topic starter

I think you tried to induce high multiplicity by assigning some high counts to a few symbols and low counts to the bulk.

That’s not quite it, but you are on the right track. The classical way of creating homophonic substitution ciphers is to assign many ciphertext symbols to high frequency letters in the plaintext, so that it has a "smoothing" effect on the frequency distribution in the ciphertext. Well, Z340 has one symbol (+) with peculiarly high frequency. We still don’t know why, so I can’t draw any real conclusions from it, but I thought I’d try to do the same thing in my cipher, just to mimic the same behavior.

One way of doing it is to follow the classic method of "more ciphertext symbols for higher frequency letters", save for one somewhat frequent letter, to which we assign just one ciphertext symbol. We obviously don’t want it to be "e", as it would be an easy guess, but the rest of the frequent letters in English text have roughly equal frequencies, so we can pick one and still have enough leeway to confuse the decryptors. It turned out to be quite a stumbling block for hill-climb based auto-solvers too, but for a different reason, I believe, as a hill-climb algorithm (HCA) doesn’t care about frequencies in any way. I think it has to do with the smoothness of the solution space. Basically, HCA expects that a small change in the inputs (i.e. changing one symbol of the decryption key) results in a relatively small change in the fitness/score of the solution. That is no longer true if we assign just one ciphertext symbol to a high frequency letter: changing that symbol in the key results in a huge change in the score, because the symbol is repeated so many times that it affects many parts of the text. So HCA will often jump over the correct solution without realizing it, backtrack, and never approach the real top of the hill. It will eventually find the top, I believe, but it might take a lot of tries, close to brute-force iteration through all possible keys.
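To make the "one key change moves the score a lot" argument concrete, here is a rough sketch in Python. It is a toy, not how ZKDecrypto or AZDecrypt actually score: the bigram model, the function names and the idea of counting "touched" bigrams are all assumptions for illustration. It only shows that the number of scored n-grams depending on one key entry grows with how often that symbol occurs, which is the effect described above, not proof that this is why the solvers fail.

```python
import math
from collections import Counter


def bigram_model(corpus):
    """Log-probabilities of letter bigrams from a toy corpus (add-one smoothing)."""
    letters = [c for c in corpus.upper() if c.isalpha()]
    counts = Counter(zip(letters, letters[1:]))
    total = sum(counts.values()) + 26 * 26
    return lambda a, b: math.log((counts[(a, b)] + 1) / total)


def score(cipher, key, logp):
    """Sum of bigram log-probabilities of the decryption under `key`."""
    plain = [key[s] for s in cipher]
    return sum(logp(a, b) for a, b in zip(plain, plain[1:]))


def touched_bigrams(cipher, symbol):
    """Number of adjacent pairs whose score depends on this symbol's key entry."""
    return sum(1 for a, b in zip(cipher, cipher[1:]) if symbol in (a, b))


def score_delta(cipher, key, symbol, new_letter, logp):
    """Score change caused by re-assigning a single symbol in the key."""
    trial = dict(key)
    trial[symbol] = new_letter
    return score(cipher, trial, logp) - score(cipher, key, logp)
```

Here `cipher` is a list (or string) of symbols and `key` maps each symbol to a plaintext letter. A symbol that occurs 40 times touches up to 80 bigrams, while a symbol that occurs 4 times touches at most 8, so a single step on the frequent symbol can swing the total far more than a step on a rare one.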

Now, I have no idea if that’s what Z was doing with the "+", or how did he know about HCA back in the ’60s, but it’s possible that he just stumbled on something that’s hard to crack using HCA by pure chance. Or it could be that Z used "+" for something entirely different, or that it can’t be cracked because it’s not a straight substitution cipher.

Posted : July 12, 2015 12:43 am

daikon

(@daikon)

Posts: 179

Estimable Member

Topic starter

So, in this one, the + symbol must be either E, T, A, O, I, or N.

It’s the last one, I didn’t want to make it too easy. 🙂

Just to add, this is honestly the strangest cipher I’ve ever seen. If it turns out to be fairly normal English text, I bow down to you.

Well, that was my thinking process. In what way can plaintext be "special", but still be an English text and something that Z could possibly write? I’ll give you a hint, I took inspiration from Z32 "map cipher" as he hinted that it "concerns radians & inches along the radians".

Posted : July 12, 2015 12:50 am

daikon

(@daikon)

Posts: 179

Estimable Member

Topic starter

Maybe the repeats/patterns can be explained by the plaintext being very repetitive. The same thing repeating over and over again, or use of a very limited vocabulary.

You’ve hit the nail on the head! 🙂 It’s not the same thing repeated over and over, but the vocabulary is very limited indeed. You’d think that it would make it easy to crack automatically, but it seems to be the other way around.

Posted : July 12, 2015 12:52 am

daikon

(@daikon)

Posts: 179

Estimable Member

Topic starter

I think if this cipher did not have the cycles in it, the solution would be out of reach (hopefully only temporarily). You should encode another one with the cycles randomized. 🙂

Yes, you are correct, randomizing cycles would definitely help. But it would only make the manual solving more difficult. Auto-solvers don’t use cycles in any way, as best as I can tell. So it would have no effect for them. And the whole point of this exercise was to show that there are cases of homophonic substitution ciphers that can’t be cracked automatically. Not by current hill-climb based algorithms anyway. Whether Z340 is such a special case of a straight homophonic substitution cipher, or it uses some extra steps to obfuscate plaintext, it still remains to be seen. I just wanted to hopefully spur further improvements to the automatic solvers, as the current prevailing misconception seems to be that they are already powerful enough to solve any homophonic substitution cipher.

Posted : July 12, 2015 1:04 am

glurk

(@glurk)

Posts: 756

Prominent Member

I’ve got this now, I guess I was close earlier after seemingly finding EIGHT, HUNDRED, FIFTY, etc… I should have stuck with it.

I didn’t know if I was on the right track or not, after a while your eyes start to glaze over, LOL.

Daikon, can you PM me the exact plaintext? Trying to get this all correct as you intended is horrible! I want to try an experiment or two using the exact solution.

-glurk

PS. I like the bookends.

EDIT: I actually think I have it exact now. I’ve sent you a PM.

——————————–
I don’t believe in monsters.

Posted : July 12, 2015 11:45 am

Jarlve

(@jarlve)

Posts: 2547

Famed Member

Well done doranchak and glurk!

daikon,

As I’ve said before, I believe the limited vocabulary artificially raises the multiplicity. So it’s a multiplicity problem more than anything else. I agree that some ciphers may be unsolvable due to the nature of the plaintext and multiplicity of the cipher. It’s fair to assume that such a scheme is not present in the 340, because there are not many patterns and it scores much lower than your 3rd cipher in both of the solvers.

To answer your question from the other thread, I’m planning to release the next version of my solver near the end of the month. It will have 3 modules for now, the old solver, a new 4-gram solver and a 5-gram solver. I’m spending most of my time optimizing the 5-gram solver to improve recovery rate for various ciphers, yours included.

Edit: and btw, I’m really thankful for your ciphers and work on this. It’s something I needed.

AZdecrypt

Posted : July 12, 2015 1:19 pm

daikon

(@daikon)

Posts: 179

Estimable Member

Topic starter

EDIT: I actually think I have it exact now. I’ve sent you a PM.

You do have it *exact* to the letter. Wow! I had to abbreviate degrees/minutes/seconds to just one letter, as the message was already getting too long and I wanted to fit everything in. I figured one single letter every few words shouldn’t complicate things too much for auto-solvers, and Zodiac did shorten his words on multiple occasions.

I think there is no need to post the plaintext here any more, in case someone else comes across it later on and wants to solve it independently. I’ve given so many "clews" in this thread anyway, it practically solves itself now. 🙂

Posted : July 12, 2015 9:39 pm

daikon

(@daikon)

Posts: 179

Estimable Member

Topic starter

I’ve got this now, I guess I was close earlier after seemingly finding EIGHT, HUNDRED, FIFTY, etc… I should have stuck with it.

So, how did you solve it, I’m curious? Similar to doranchak, by merging repeating cycles, i.e. guessing homophones that solve to the same plaintext letter?

Posted : July 12, 2015 9:43 pm

daikon

(@daikon)

Posts: 179

Estimable Member

Topic starter

As I’ve said before, I believe the limited vocabulary artificially raises the multiplicity. So it’s a multiplicity problem more than anything else.

You probably mean something like "word entropy", right? Since multiplicity (length of cipher divided by number of unique ciphertext symbols) of my last cipher is actually better than Z340 and almost as high as Z408 (if you drop the last 18-symbol filler at the end of Z408). My plaintext also uses 23 letters out of 26, same as Z408. IoC (0.075, I think) is a tad high but still within the expected range for English. I think the main reason why my latest cipher wasn’t auto-solved is because the corpus used to build 4-grams/5-grams stats doesn’t represent my plaintext too well. This issue was already demonstrated by my previous Zodiac signs cipher. Once you updated AZD to use 5-grams from a wider corpus, it is now solved quite easily, whereas the currently available version of ZKD/AZD still can’t solve it. I believe if the 4-grams/5-grams are updated to represent a good variety of numerals spelled out as words, my latest "unsolvable" cipher will fall to automatic cracking quite easily too.
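The two statistics mentioned here are easy to compute. Below is a minimal Python sketch. It uses the definition of multiplicity given in this post (cipher length divided by number of unique symbols); the reciprocal convention also turns up in cipher discussions, so it is worth checking which one is meant before comparing numbers. The function names are just for illustration.

```python
from collections import Counter


def multiplicity(cipher):
    """Cipher length divided by unique symbol count (definition used in this post)."""
    return len(cipher) / len(set(cipher))


def index_of_coincidence(text):
    """Probability that two symbols drawn without replacement are identical."""
    n = len(text)
    if n < 2:
        return 0.0
    counts = Counter(text)
    return sum(c * (c - 1) for c in counts.values()) / (n * (n - 1))
```

Both functions take any string or list of symbols. For a plaintext IoC, strip spaces and punctuation first so that only letters are counted.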

I agree that some ciphers may be unsolvable due to the nature of the plaintext and multiplicity of the cipher. It’s fair to assume that such a scheme is not present in the 340, because there are not many patterns and it scores much lower than your 3rd cipher in both of the solvers.

Yes, you are correct of course. Z340 is sufficiently different from my cipher to rule out that its plaintext is similar. However, what I’m worried about is this. It’s almost certain now that Z340 was manipulated in some additional ways besides homophonic substitution. Be it reversing rows, or using rail fence, transpositions, bifid encoding, what-have-you. I believe a lot of cracking attempts along this attack vector involve brute forcing all possible permutations of a given manipulation and then feeding the results to AZD/ZKD without too much manual analysis of each intermediate step. And that’s what I’m worried about: you might come across the correct extra manipulation that Z did to the plaintext, but then you’ll miss it because AZD/ZKD will potentially fail to solve it. Even though the original Z340 doesn’t have a lot of patterns, it’s quite possible that those patterns will emerge in full force if you transpose columns using the key "ZODIAC", for example. And it will be missed because that intermediate result will be among hundreds of other possible transpositions that were fed to auto-solvers without any manual analysis. Maybe we need to incorporate automatic merging of repeated cycles into auto-solvers, which is how my latest "unsolvable" cipher was cracked?
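As an illustration of the kind of intermediate step meant here, below is a small Python sketch of a keyed columnar transposition and its inverse. This is the textbook scheme (write the text in rows under the key, read the columns in alphabetical order of the key letters), chosen only as an example. Nothing here claims Z340 was built this way, and the function names are hypothetical.

```python
def column_order(key):
    """Column indices in reading order (alphabetical by key letter, ties left to right)."""
    return sorted(range(len(key)), key=lambda i: (key[i], i))


def transpose_encrypt(text, key):
    """Write text in rows of len(key), then read the columns in key order."""
    cols = len(key)
    return "".join(text[c::cols] for c in column_order(key))


def transpose_decrypt(text, key):
    """Undo transpose_encrypt, including when the last row is incomplete."""
    cols = len(key)
    rows, extra = divmod(len(text), cols)
    out = [None] * len(text)
    pos = 0
    for c in column_order(key):
        # The first `extra` columns are one symbol longer than the rest.
        length = rows + (1 if c < extra else 0)
        out[c::cols] = text[pos:pos + length]
        pos += length
    return "".join(out)
```

In a brute-force run, each candidate key would produce one `transpose_decrypt` output to hand to a solver. The concern above is that the right candidate can still be rejected if the solver cannot handle the substitution layer underneath.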

Posted : July 12, 2015 10:17 pm

Jarlve

(@jarlve)

Posts: 2547

Famed Member

With IoC 0.075 a reasonable automatic solve came through, though it wasn’t the highest scoring example.

Don't scroll down this box if you don't want to see the solution yet to daikon's 3rd cipher.












moriacvictedsmalla
bothirtolightsseve
ndthirtythreednort
honahundrastwentot
herallmendtwentyse
venswasthaniciathi
rtoeightsfimadinfo
rtyonldnorthenehun
dresthentotworeigh
tdthirtyaightswast
lakeharrolsdathirt
yeightsthirtothree
dfortyeightsnorthe
nahundrasthentotwo
rthirtalndfiftyfou
rswestdanfrancisco
thirtosevensfertys
emandninateandnort
honlhundresthentot
wortwentysevandtha
ntofimeswasthesiac

You are correct about the corpus not really adhering to the message, I was still assuming an IoC of around 0.0667. glurk was kind enough to send me the solution but I didn’t check the IoC.

Your worries are understandable. But with, for instance, reversing of rows and many other schemes, there is a lot of overlap, meaning that the solution will be recovered multiple times. Also, better solutions will score higher, and the top results can easily be checked manually. But yes, there is the risk that the solution might be overlooked. You are probably aware that I’ve done a lot of work in this direction, and I have thought of doing the tests over again with better versions of my solver.

I’m not sure about merging patterns because there are not many of them in the 340. I feel it’s a niche.

AZdecrypt

Posted : July 12, 2015 10:55 pm

daikon

(@daikon)

Posts: 179

Estimable Member

Topic starter

With IoC 0.075 a reasonable automatic solve came through

Hmm, interesting. It didn’t occur to me to adjust IoC for automatic solves. I couldn’t even find how to do that in ZKD when I looked just now, but I can adjust it easily in AZD. Perhaps an improvement could be made that you could specify an IoC range, and the program will randomly pick an IoC within the range for each random restart? Maybe even use normal/Gaussian distribution, so it hits mostly close to IoC but once in a while picks something relatively far from the expected IoC?
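Something like the following Python sketch is what I have in mind for picking a target IoC per restart. The default centre of 0.0667 is the figure for English quoted in this thread; the spread and the bounds are placeholder assumptions and would need tuning, and `sample_target_ioc` is a made-up name, not an option in either solver.

```python
import random


def sample_target_ioc(expected=0.0667, sigma=0.004, low=0.0534, high=0.0800, rng=random):
    """Draw a target IoC near `expected`, redrawing until it falls in [low, high]."""
    while True:
        value = rng.gauss(expected, sigma)
        if low <= value <= high:
            return value
```

Each random restart would call this once and use the result as its IoC target, so most restarts stay close to the expected value while a few explore further out. Passing a seeded `random.Random` instance as `rng` makes the runs repeatable.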

, though it wasn’t the highest scoring example.

I’ve seen that happen in ZKD when I was reducing the number of unique symbols in ciphertext — as it was iterating over the solution, I could see some of the correct words develop, but then it would blow right past and scramble everything again, with something that scored higher, but was farther away from the correct solution. I think using 5/6-grams would help with that.

I’m not sure about merging patterns because there are not many of them in the 340. I feel it’s a niche.

Yes, that’s true for "vanilla" Z340. But once you start applying transpositions/bifid/etc., patterns might show up as you get close to finding the correct extra step.

Posted : July 13, 2015 12:33 am

Jarlve

(@jarlve)

Posts: 2547

Famed Member

The IoC range is a good idea. I was thinking either to allow some adjustable sway in the IoC, or to add a module that tries to solve just one cipher while going over an adjustable range of IoC, for example (0.0534 -> 0.0800), with the scoring renormalized by IoC afterwards. I once had a secondary hill climber which figured out the IoC, so that’s another thing to consider.

For now I want to focus on maximizing results from 5-grams, probably for quite a while. But you are right.

Edit: also testing a weighted IoC model as in ZKDecrypto.
Edit 2: testing the IoC target and the weighted IoC model together produced superior results, so I will allow adjusting the strength of both. This is exactly the adjustable sway I was looking for.

AZdecrypt

Posted : July 13, 2015 11:53 am

Norse

(@norse)

Posts: 1764

Noble Member

General question to you crypto heads: Is there a term for a cipher which simply "doubles" the same basic substitution?

Example: % = 1 = A?

Now, that particular one would be pretty pointless. But what if I settled on something like this:

% = 0
# = 1
001 = A

The plaintext would be "A" and the cipher would read "%%#".

Is there a term for this sort of thing?
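To pin down the scheme being asked about, here is a small Python sketch. The fixed code width, the A=1 numbering and the function names are assumptions made for illustration: the example above uses 3 digits for "A", but covering all 26 letters needs at least 5 binary digits per letter.

```python
def encode(plaintext, width=5, zero="%", one="#"):
    """Turn each letter into a fixed-width binary number (A=1), then swap digits for symbols."""
    out = []
    for ch in plaintext.upper():
        if "A" <= ch <= "Z":
            bits = format(ord(ch) - ord("A") + 1, f"0{width}b")
            out.append(bits.replace("0", zero).replace("1", one))
    return "".join(out)


def decode(cipher, width=5, zero="%", one="#"):
    """Reverse of encode: symbols back to digits, then fixed-width groups to letters."""
    bits = cipher.replace(zero, "0").replace(one, "1")
    groups = (bits[i:i + width] for i in range(0, len(bits), width))
    return "".join(chr(int(g, 2) + ord("A") - 1) for g in groups)
```

With `width=3`, `encode("A", width=3)` gives the "%%#" from the example above.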

Posted : July 13, 2015 5:35 pm

Zodiac Discussion Forum