Zodiac Discussion Forum


Cipher challenge

23 Posts
5 Users
0 Reactions
4,581 Views
doranchak
(@doranchak)
Posts: 2614
Member Admin
Topic starter
 

My experiments to produce Z340-like ciphers are starting to bear fruit.

Here’s an example cipher – I challenge anyone to try to break it. It is a standard homophonic substitution cipher, read and enciphered in the normal reading direction.

9%(1UyKbjJ.#5BR+3
+28@TSp1l-^NBtHER
+B|JLY8OzFR(4>bl*
VLk+FU2)^RJ/c5.DO
zB(WH8MNR+|c+.cO6
|5FU+<+RJ|*b.cVOL
|5FBc)T(ZU+7XzR+k
>+lpyV)D|(#kcNz):
68Vp%CK-*<WqC2#pc
-Ff2B9+>;ZlCP^BU-
7tLRd|D5.p9O)*ZM6
Bctz:&yVOp%<K+>^C
FqNLPp*-WfzZ2d7;k
l<S^+/|dT9f4YK+WG
j4EyM+WAlH#+VB+L<
z|4&+OkNpB1V2Ff/)
z+Mp_*(;KSp2(TGO+
FBcMSEG3dWKc.4_G5
pDCE4GyTY+_BAdP2p
|+tFMPHYGK+F6pX^2

What makes these generated ciphers unique is that they share many qualities with the actual Z340:

  • Very similar symbol distribution
  • Same number of repeated bigrams/trigrams
  • Appearance of pivots
  • Top 20 L=2 homophone cycles have the same "strength" as the top 20 from Z340
  • Same "prime phobia" feature of + and B
  • Simulated transcription errors (up to 20 polyphones, where the alternate symbol assignments come from a set of visually similar symbols)
  • Same number of repeated ngrams when even (or odd) positions are removed
  • Same absence of repeated symbols on lines 1-3, 11-13
  • Matches Jarlve’s non-repeat measurement of the Z340 (4462)
  • Same appearance of word-like symbol sequences ("zodiak", "her", "god")
  • Same phenomenon of a trigram that repeats in the same column and intersects with another repeating trigram
  • Similar appearance of "box corner" patterns
  • Similar appearance of "fold marks" on line 10
  • Misspellings and missing words in the plaintext
  • There is a section of filler

Here’s a visual of some of the patterns in the generated cipher that are similar to Z340:

The hint for the plaintext is that it comes from a Gilbert and Sullivan play.

I fed the cipher into zkdecrypto and azdecrypt. They each were able to recover around 60% to 65% of the plaintext.

A 10-symbol section of the cipher is filler, and there are 20 intentional transcription errors, so about 9% of the cipher text (30 of 340 characters) is messed up. On top of that, there are several intentional misspellings. These factors are probably making it difficult to automatically recover more of the plaintext.

Another hint: because of the poetic nature of the plaintext, some phrases appear more often than normal. Perhaps that is also increasing the difficulty.

http://zodiackillerciphers.com

     
Posted : October 6, 2015 7:29 am
traveller1st
(@traveller1st)
Posts: 3583
Member Moderator
 

Got it.


I don’t know Chief, he’s very smart or very dumb.

 
Posted : October 6, 2015 10:26 am
Jarlve
(@jarlve)
Posts: 2547
Famed Member
 

You managed to reproduce a lot of patterns!

Do these patterns just emerge or are there certain things in your cipher generation process that increase the odds of it happening? I’m guessing the repeating (poetic) nature of the plaintext contributes to some of these patterns.

Some discrepancies that I find worthy of note:

Although it matches the non-repeat score, your cipher appears much more random than the 340 (based on other measurements).

2-symbol cycle score (percentage difference vs. randomized mean) 340: 178%, yours: 125%.
3-symbol cycle score (percentage difference vs. randomized mean) 340: 180%, yours: 124%.

Also, the individual distribution of symbols throughout your cipher is much more random.

The following image is a non-repeat graph comparison between the 340 and your cipher. For your cipher there is some extra bulk towards the right side of the graphs; I believe this sub-peak may have happened because of hill climbing versus the top 20 cycles. I could alter the non-repeats measurement to give less weight (logarithmic?) to longer unique strings.

AZdecrypt

 
Posted : October 6, 2015 12:44 pm
doranchak
(@doranchak)
Posts: 2614
Member Admin
Topic starter
 

Nice job, trav and Jarlve, for quickly solving the cipher. Do you think you would have gotten it without the hints?

Here’s another one. Should be easy to solve.

RT;%cqtp/+2F4Wk*|
5FBcK9O#y)HER6SW8
<+JlFM(2/k45R.zZ)
d9|k2C52WT&+zc9|L
GG+tLf>W^zjBd>6<c
4N#OVDA+zSYCZ1NB.
F^pNY+OT+@OpA69+7
*L(MdVc.bUXRKl2#B
O+5RlcHt8B4|MpD5T
-NBJ+.2^+|ylCp2<-
O+UpMbdSY1#j<Ff)z
pZ>KEUFL)V(&zN+:c
B+k6yl3MO(Cb*|JRH
zLq_pl+4GlCX|3.*-
OM#>_c2H;D6BJ4^|T
5(ktKP+)GG%dR(^8B
-KWVL.;+FBcSPFDy6
2B/pKP|MZY1f+UcF+
<+8p:-*fB/Ez+73^|
5FpO<(V7+yOG*KzR_

Do these patterns just emerge or are there certain things in your cipher generation process that increase the odds of it happening?

My algorithm is divided into two stages. In the first stage, it identifies plain texts that fulfill some of the criteria needed for specific patterns to appear (trigrams in the right place, pivots, fold marks, box corners). This is also the stage where misspellings, omitted/duplicated words, and filler are randomly introduced. In the second stage, a genetic algorithm conducts a multiobjective search to optimize each of the remaining objectives. So the algorithm isn’t intended to discover how often the curious patterns happen as a side effect of encipherment. Instead, it is more to answer the question: Under our construction hypothesis, is it possible for generated ciphers to resemble Z340 and its curious phenomena? And is it possible for us to effectively solve the generated ciphers? If so, this may provide stronger evidence against specific construction methods.

Thanks for your analysis of discrepancies. I’d like to identify as many of those as possible. Do you have a link to where you describe how you compute your cycle scores?

Since my algorithm only cares about the "top 20" cycles of 2 symbols, that might explain why there’s a big difference from the Z340. It may also provide more evidence that the apparent cycling in Z340 is truly information about the encipherment process "leaking" into our observations.

I wanted to compute a more thorough cycle comparison but it slows down the search considerably due to all the combinations of symbols involved. Perhaps I can remedy this simply by expanding from the top 20 to the top 50 or similar.

Also, the individual distribution of symbols throughout your cipher is much more random.

Is that conclusion based on your non-repeat graph or something else? I wonder if the graph will look more similar if I include more than 20 cycles in the comparison. Or if I also include 3-symbol cycles. I will do that if it doesn’t slow the multiobjective search down to a crawl. :)

http://zodiackillerciphers.com

 
Posted : October 6, 2015 2:20 pm
Jarlve
(@jarlve)
Posts: 2547
Famed Member
 

Is that conclusion based on your non-repeat graph or something else?

I have to go but can share this measurement for now.

A=summed_positions_per_symbol/symbol_frequency
B=total_chars_in_cipher/2
C+=abs(A-B)
repeat above for all symbols
score=C/B*340

In the 340, the first "+" symbol appears at position 20, and these positions are summed to get A. I also have the same measurement with 2D (grid) logic. It can, for instance, be used to determine the correct order of a cyclic h.s. cipher given in parts (such as the 408 and jroberson) because the correct order typically scores lower. A higher score is more random, so h.s. ciphers will score lower. It’s another measurement that I made with homophonic substitution in mind, so it’s more of the same really.

340: 1599
408: 1146
yours 1: 2644
yours 2: 2087
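
A minimal Python sketch of the idea (the function name and 1-based indexing are just for illustration):

def symbol_distribution_score(cipher):
    # B: the midpoint of the cipher, the expected mean position of a
    # symbol that is spread uniformly throughout the text.
    b = len(cipher) / 2
    c = 0.0
    for symbol in set(cipher):
        # 1-based positions of this symbol (in the 340 the first "+" is at 20)
        positions = [i + 1 for i, ch in enumerate(cipher) if ch == symbol]
        a = sum(positions) / len(positions)  # A: mean position of the symbol
        c += abs(a - b)                      # distance from the balance point
    return c / b * 340                       # normalize to a 340-char length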

AZdecrypt

 
Posted : October 6, 2015 3:14 pm
doranchak
(@doranchak)
Posts: 2614
Member Admin
Topic starter
 

Thanks for that measurement description. It is an ideal addition to my set of objectives since it is fast to compute.

I’m trying to gain an intuitive sense of what the calculation represents; let me know if it is right:

A=summed_positions_per_symbol/symbol_frequency <== this appears to represent a "clustering" measurement. For instance, write the cipher in a straight line, and remove everything that isn’t a + symbol. This measurement would represent where you’d hold the line to balance all 24 +’s.

B=total_chars_in_cipher/2 <== I assume this is a midpoint of the line described above.

C+=abs(A-B) <== At each step, this measures how far a given symbol’s balance point is from the expected balance point if the symbol were uniformly distributed.

repeat above for all symbols
score=C/B*340 <== A ratio of the sum of differences to the midpoint of the cipher, scaled to a 340-character length so that ciphers of different sizes are comparable.
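
A worked example with hypothetical numbers: in a 340-character cipher, B = 170. If a symbol’s 24 occurrences have a mean position of A = 150, it contributes |150 - 170| = 20 to C; a symbol clustered near the start with a mean position of 60 would contribute 110.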

http://zodiackillerciphers.com

 
Posted : October 6, 2015 4:32 pm
daikon
(@daikon)
Posts: 179
Estimable Member
 

I’m getting a stable solve for both ciphers as well using my own solver, but with a pretty low overall score, which means the recovered plaintext isn’t represented well in the N-grams corpus. Probably because of the introduced transcription errors and misspellings; the seemingly "arcane" subject of the plaintext didn’t help either. 🙂 But I’m pretty sure I can clean it up manually to get closer to the intended plaintext. Now if only you can figure out how to mimic the near absence of bigram repeats in the second half, and for symbols in odd positions, we’d be all set. 🙂

 
Posted : October 6, 2015 10:15 pm
Jarlve
(@jarlve)
Posts: 2547
Famed Member
 

Under our construction hypothesis, is it possible for generated ciphers to resemble Z340 and its curious phenomena? And is it possible for us to effectively solve the generated ciphers? If so, this may provide stronger evidence against specific construction methods.

I really like this idea, it must be fun and exciting but also quite a bit of work.

Do you have a link to where you describe how you compute your cycle scores?

The idea is based on my interpretation of smokie’s cycle measurement. Measure every n-symbol cycle in the cipher in some way. I feel it’s better for it to be not too exponential in nature. For instance, "ABABABABAB" should not have twice the score of "ABABABAB". Depending on how you wish to measure the cycles, some normalization for matching symbols with different counts may be needed. You could also consider expected cycle length. My algorithm repeats this for 1000 randomizations of the cipher string and then calculates the percentage difference between the mean of the randomizations and the actual cipher.
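
A minimal Python sketch of the randomization step (cycle_score stands in for whatever cycle measurement you choose; the names are just for illustration):

import random

def randomization_contrast(cipher, cycle_score, trials=1000):
    # Score many shuffles of the cipher to estimate the mean score of a
    # "random" cipher with the same symbol distribution.
    chars = list(cipher)
    total = 0.0
    for _ in range(trials):
        random.shuffle(chars)
        total += cycle_score("".join(chars))
    randomized_mean = total / trials
    # The actual cipher's score as a percentage of the randomized mean.
    return cycle_score(cipher) / randomized_mean * 100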

I wanted to compute a more thorough cycle comparison but it slows down the search considerably due to all the combinations of symbols involved. Perhaps I can remedy this simply by expanding from the top 20 to the top 50 or similar.

Yes it’s typically very slow, I’ll think about it.

Thanks for that measurement description. It is an ideal addition to my set of objectives since it is fast to compute.

You understand correctly. My current implementation of the idea is certainly not perfect and some improvements could be made.

AZdecrypt

 
Posted : October 7, 2015 1:18 am
traveller1st
(@traveller1st)
Posts: 3583
Member Moderator
 

Nice job, trav and Jarlve, for quickly solving the cipher. Do you think you would have gotten it without the hints?

To answer that specific question … not sure. It would have certainly taken longer. It’s kinda tricky to be sure because I basically took what I assumed was the longest section of deciphered plaintext and did a search based off that. Then narrowed it down to what I felt was the correct section and then continued from there, working back through the ciphertext inputting the plaintext. So I was sorta cheating, sorta not, 100% lazy lol. Not sure without a reference if I could have made sense of it so readily.


I don’t know Chief, he’s very smart or very dumb.

 
Posted : October 7, 2015 2:18 am
doranchak
(@doranchak)
Posts: 2614
Member Admin
Topic starter
 

I really like this idea, it must be fun and exciting but also quite a bit of work.

It is very tedious. And I haven’t even moved on to other construction hypotheses yet (a nightmarish endless pit of possibilities). I’m still only focused on tuning the optimizations for normal homophonic substitution. I did want to try to rule out Dan Olson’s hypothesis (splice together lines 1-3, 11-13), but the high multiplicity of those lines might be against me.

The idea is based on my interpretation of smokie’s cycle measurement. Measure every n-symbol cycle in the cipher in some way. I feel it’s better for it to be not too exponential in nature. For instance, "ABABABABAB" should not have twice the score of "ABABABAB". Depending on how you wish to measure the cycles, some normalization for matching symbols with different counts may be needed. You could also consider expected cycle length. My algorithm repeats this for 1000 randomizations of the cipher string and then calculates the percentage difference between the mean of the randomizations and the actual cipher.

Makes sense. I like the idea of comparing them to randomizations as a reference point.

I wanted to compute a more thorough cycle comparison but it slows down the search considerably due to all the combinations of symbols involved. Perhaps I can remedy this simply by expanding from the top 20 to the top 50 or similar.

Yes it’s typically very slow, I’ll think about it.

For my multiobjective search I implemented a slightly faster cycle detection. The basic idea is to keep track of a list of perfect cycles encountered while doing one pass of the cipher. As it scans the cipher, it tracks a list of "active" and "inactive" cycles. At first, the list of active cycles is large. But as breaks in the cycles are found, they are removed and placed in a list of inactive cycles. So, the list of active cycles soon dwindles down to a reasonable size as you progress further into the cipher text. I think this is faster than the naive method of calculating every possible combination of n symbols and isolating the resulting sequences. My approach is similar to the King and Bahler method described in their REMOVE_HOMOPHONES algorithm, but does not yet take advantage of their other optimizations. Also, the weakness in this approach is that the cycles have to start out perfectly — there’s no accommodation for perfect cycles that appear after a first section of "junk".
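
Here’s a rough Python sketch of the idea for 2-symbol cycles (the names are mine, and my actual implementation differs in the details):

from collections import defaultdict
from itertools import combinations

def perfect_cycle_lengths(cipher):
    # One pass over the cipher. Each symbol pair starts "active"; as soon
    # as its alternation breaks, it is retired to the "inactive" dict.
    pairs_for = defaultdict(list)  # symbol -> pairs containing it
    active = set()
    for pair in combinations(sorted(set(cipher)), 2):
        active.add(pair)
        pairs_for[pair[0]].append(pair)
        pairs_for[pair[1]].append(pair)
    last, length, inactive = {}, defaultdict(int), {}
    for ch in cipher:
        for pair in pairs_for[ch]:
            if pair not in active:
                continue
            if last.get(pair) == ch:      # same symbol twice: cycle breaks
                active.remove(pair)
                inactive[pair] = length[pair]
            else:
                last[pair] = ch
                length[pair] += 1
    for pair in active:                   # cycles that never broke
        inactive[pair] = length[pair]
    return inactive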

http://zodiackillerciphers.com

 
Posted : October 7, 2015 2:26 am
doranchak
(@doranchak)
Posts: 2614
Member Admin
Topic starter
 

To answer that specific question … not sure. It would have certainly taken longer. It’s kinda tricky to be sure because I basically took what I assumed was the longest section of deciphered plaintext and did a search based off that. Then narrowed it down to what I felt was the correct section and then continued from there, working back through the ciphertext inputting the plaintext. So I was sorta cheating, sorta not, 100% lazy lol. Not sure without a reference if I could have made sense of it so readily.

Makes me wonder about a possible enhancement to an autosolver, where at the end of its search it presents lists of "fuzzy matches" of full words and phrases to the candidate plaintexts (sort of like when Google sees your typo and asks "did you mean ‘Forest’?"). It could even try to build combinations of words based on language models that favor certain word combinations over others.

Easier said than done. :)
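
As a toy starting point, Python’s standard library can already do crude fuzzy word matching (the word list here is a hypothetical stand-in for a real dictionary):

import difflib

# Suggest dictionary words close to a (possibly garbled) decoded word.
words = ["forest", "priest", "paradise", "slaves"]  # hypothetical word list
print(difflib.get_close_matches("forrest", words, n=3, cutoff=0.7))
# ['forest']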

Jarlve, I believe you mentioned a "progressive solve" feature of azdecrypt. I wonder if there is any merit to starting with ngram statistics where n is small and then, as the solution stabilizes, increasing the value of n. Did you already do something like that?

http://zodiackillerciphers.com

 
Posted : October 7, 2015 2:32 am
Jarlve
(@jarlve)
Posts: 2547
Famed Member
 

Your first cipher is quite hard. Without the Gilbert and Sullivan tip I wouldn’t have found the plaintext, but I was quite sure of some fragments of the plaintext, which turned out to be correct. Also, the cipher scores higher than the 340, and the difference is more pronounced with higher n-grams. Bottom line, we should have gotten it in a group effort. I still believe the 340 could be like the 408, but to a much, much greater extent.

Jarlve, I believe you mentioned a "progressive solve" feature of azdecrypt. I wonder if there is any merit to starting with ngram statistics where n is small and then, as the solution stabilizes, increasing the value of n. Did you already do something like that?

I thought about it a while ago. I haven’t tried it, but it may be a good optimization. I see the implementation as follows, for example, given n iterations.

n=1.000.000
1 to 100.000 (use 2-grams)
100.000 to 200.000 (use 3-grams)
200.000 to 300.000 (use 4-grams)
300.000 to 1.000.000 (use 5-grams)

And just optimize the ranges afterwards against a wide variety of ciphers. I’m very sure that it would not be an improvement to the multiplicity, but it could be an improvement to the recovery rate per random restart within an equal time period.
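
A minimal sketch of such a schedule in Python (boundaries per the example above, all subject to tuning):

def ngram_size(iteration):
    # Map the current hill-climber iteration to an n-gram order.
    if iteration < 100_000:
        return 2
    if iteration < 200_000:
        return 3
    if iteration < 300_000:
        return 4
    return 5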

The progressive feature is in the latest version (0.99) and just doubles the number of iterations after n restarts.

AZdecrypt

 
Posted : October 7, 2015 10:29 am
doranchak
(@doranchak)
Posts: 2614
Member Admin
Topic starter
 

I also wonder how well the search would work if each ngram size is treated as a separate objective. For instance, instead of a single ngram score, you maintain an array of scores (such as: [2gram, 3gram, 4gram, 5gram]). Then you maintain a set of solutions at each iteration (instead of just one), such that no single solution is dominated by any other solution in that set. At a given iteration, you may have a solution that excels at the 2gram score but not the others, and a solution that excels at the 5gram score but not the others. I think it may improve the exploratory aspect of the search.

That is essentially how my cipher generator algorithm attempts to optimize multiple objectives during its search.

http://zodiackillerciphers.com

 
Posted : October 7, 2015 1:42 pm
Jarlve
(@jarlve)
Posts: 2547
Famed Member
 

I’m not sure if I understand your idea. Hill climb 4 copies of the cipher, one with 2-grams, one with 3-grams and so forth. And then?

AZdecrypt

 
Posted : October 7, 2015 2:32 pm
doranchak
(@doranchak)
Posts: 2614
Member Admin
Topic starter
 

No; let me try to explain it better.

Take one cipher (call it C) and make an array of its ngram scores: [n(2), n(3), n(4), n(5)].

Modify its key, then score the resulting cipher (call it C’): [n'(2), n'(3), n'(4), n'(5)].

The new score is worse than the old one if n’(i) < n(i) for all i.
The new score is better than the old one if n’(i) > n(i) for at least one i, and n’(i) < n(i) for no i.

But C’ is "non-dominated" by C (that is, neither better nor worse than C) if n’(i) > n(i) for some i, and n’(i) < n(i) for some other i. Think of it as a "trade-off" among objectives.

Here’s a more concrete example: Let’s say you are building cars. You want to minimize cost, but maximize safety. You can build the safest car, but at terrible cost. Or, you can build the cheapest car, but with terrible safety.

So, you simulate a car factory with different parameters, and measure the resulting safety and cost of cars:

The red dots are the "non-dominated" results. All the gray ones are dominated by the red ones.

So, instead of keeping the "one best" cipher, you keep a collection of the non-dominated ones. When you score a new cipher, you see if it dominates any of them. If so, you remove them from your collection and put in the new cipher. Continue in this way, and you are exploring a wide variety of optimal solutions with respect to multiple objectives.
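
Here is a minimal Python sketch of maintaining such a collection, assuming higher scores are better (the names are just for illustration):

def dominates(a, b):
    # a dominates b if a is at least as good on every objective
    # and strictly better on at least one.
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def update_front(front, candidate):
    # Keep only mutually non-dominated score vectors.
    if any(dominates(kept, candidate) for kept in front):
        return front  # the candidate is dominated; discard it
    return [kept for kept in front if not dominates(candidate, kept)] + [candidate]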

http://zodiackillerciphers.com

 
Posted : October 7, 2015 3:35 pm