Homophonic substitution

Jarlve · 2015-08-02T16:42:57Z

This thread is a continuation of viewtopic.php?f=81&t=267 in which several aspects of the Zodiac 340 cipher are discussed and researched. I'd like to continue the work from there in this thread because then I can use the main post to reference and update all the cipher material being discussed.

Some of the questions which the contributors are trying to answer:
- Is the 340 a straightforward homophonic substitution cipher or is there something else going on?
- The 340 does not seem to cycle as well as the 408, what is going on? (doranchak:... _sequences)
- To what extent is the 340 cyclic or random? Can we find areas - as for instance with the last part of the 408 - that are more random?
- Is it possible to attribute the 340 not cycling as well as the 408 (despite its higher symbol count) to some transposition after encoding?
- Some of the medium-high count symbols do not seem to cycle well, are these possibly wildcards/polyalphabetic or 1:1 substitutes? (smokie treats)
- Can we make a system that can adequately group homophones that belong to the same letter without having to solve the cipher? (smokie treats, glurk)
- Is there a discrepancy between symbols/cycles/etc on odd and even positions for the 340? If so, what could be causing this? (daikon, doranchak, smokie treats)
- There is a significant bigram repeat peak at period 19, is this a lead to the encryption scheme of the 340? (daikon)

Related:
2 symbol cycle analysis for the 340 evens only. (doranchak)
2 symbol cycle analysis for the 340 odds only. (doranchak)
Symbol position factors for the 340, 408 and smokie ciphers. 
(doranchak) 340 cipher numeric and symbolic version: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 5 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 20 34 35 36 37 19 38 39 15 26 21 33 13 22 40 1 41 42 5 5 43 7 6 44 30 8 45 5 23 19 19 3 31 16 46 47 37 19 40 48 49 17 11 50 51 9 19 52 53 10 54 5 44 3 7 51 6 23 55 30 17 56 10 51 4 16 25 21 22 50 19 31 57 24 58 16 38 36 59 15 8 28 40 13 11 21 15 16 41 32 49 22 23 19 46 18 27 40 19 60 13 47 17 29 37 19 61 19 39 3 16 51 20 36 34 62 63 53 31 55 40 6 38 8 19 7 41 19 23 5 43 29 51 20 34 55 38 19 3 54 50 48 2 11 25 27 20 5 61 14 37 31 23 16 29 36 6 3 41 11 30 50 14 53 37 28 19 52 20 51 40 63 47 42 34 22 19 18 11 50 51 20 36 21 58 44 3 6 15 51 18 7 32 50 16 53 61 28 36 8 53 48 19 19 34 20 59 12 30 35 53 47 56 2 4 8 38 39 50 55 19 11 36 28 45 40 20 31 21 23 5 7 28 32 37 57 15 16 3 36 14 19 13 12 63 56 29 19 51 6 26 20 11 33 13 19 19 33 26 56 40 26 36 9 23 42 1 14 54 21 33 5 11 51 10 17 26 29 43 48 20 46 27 23 20 30 55 56 36 4 37 25 1 18 5 10 42 40 39 23 44 62 11 31 58 19 HER>pl^VPk|1LTG2d Np+B(#O%DWY.<*Kf) By:cM+UZGW()L#zHJ Spp7^l8*V3pO++RK2 _9M+ztjd|5FP+&4k/ p8R^FlO-*dCkF>2D( #5+Kq%;2UcXGV.zL| (G2Jfj#O+_NYz+@L9 d<M+b+ZR2FBcyA64K -zlUV+^J+Op7<FBy- U+R/5tE|DYBpbTMKO 2<clRJ|*5T4M.+&BF z69Sy#+N|5FBc(;8R lGFN^f524b.cV4t++ yBX1*:49CE>VUZ5-+ |c.3zBK(Op^.fMqG2 RcT+L16C<+FlWB|)L ++)WCzWcPOSHT/()p |FkdW<7tB_YOB*-Cc >MDHNpkSzZO8A|K;+ Alterations of the 340: - In relation to the bigram peak at period 19: Scheme: move 1 row down, 2 columns right and repeat (wrap around cipher): 340_1rd-2cr-w.txt (doranchak) Grid 19 by 18, direction North-East (vertical) and 2 "?" 
symbols added: 340_19by18_n-e.txt
Grid 20 by 17, direction SW-SE (diagonal): 340_20by17_sw-se.txt
Grid 17 by 19, 17 symbols filler at end, vertically untransposed: 340_323_17.txt (smokie treats)
Grid 17 by 20, 16 symbols filler at end, vertically untransposed: 340_324_16.txt (smokie treats)
Grid 17 by 20, 15 symbols filler at end, vertically untransposed: 340_325_15.txt (smokie treats)
Grid 17 by 20, 14 symbols filler at end, vertically untransposed: 340_326_14.txt (smokie treats)
Grid 17 by 20, 13 symbols filler at end, vertically untransposed: 340_327_13.txt (smokie treats)
- In relation to the odd/even encoding scheme:
Evens only: 340evens.txt
Odds only: 340odds.txt
Randomized, shuffled: 340shuffled.txt (doranchak)

Tools/links/solvers:
- David Oranchak Zodiac Killer Ciphers:Zodiac Ciphers wiki:... =Main_Page CryptoScope:340 Webtoy:Zodiac Pattern Drawer:| (info) Word Search Gadget:
- glurk ZKDecrypto:and viewtopic.php?f=81&t=2268
- Michael Cole The Zodiac Revisited:
- Jarlve AZdecrypt:

Visualizations:
- In relation to the bigram peak at period 19 and 15 (mirrored 340):
Doranchak's ngram viewer.
Doranchak's period calculator.
Doranchak's fragment explorer.

Test ciphers:
I'd like to introduce a whole new range of ciphers to test on, mainly being homophonic substitution but with different schemes. More will be added and particular schemes can be requested. All of these ciphers can have low count 1:1 substitutes. Please use the proper names of the ciphers when referencing them. There should be no errors in these ciphers but the number of homophones per letter were handpicked each time to introduce a human element.
Perfect cycles: c_p1.txt c_p2.txt c_p3.txt
Randomization of cycles: (the numb...

smokie treats

(@smokie-treats)

Posts: 1626

Noble Member

I am going to try to operate within tighter constraints. I will try to make a message with initially high bigram repeats, so that after masking there are about 46 left over. Yet there have to be enough wildcards to make the message unsolvable. The last one was easy because I didn’t have a constraint on the final number of bigram repeats. I don’t know much about that subject, but I am going to start with a key that makes a symbol distribution as flat as possible, then adjust the key. I am guessing that fewer symbols mapped to high frequency letters and more symbols mapped to lower frequency letters might do the trick. But I don’t know yet. I will try to do some randomizing of the cycles as well, so that I can match 340 stats. If the cipher will fit into a neat little box, then we may have something. If I can’t make an unsolvable message with 46 bigram repeats, then we may not have something. I just finished a spreadsheet that makes messages and allows me to select different cycles for randomization or not, and different percentages of randomization depending on where in the message you are.

Posted : August 9, 2015 10:05 pm

smokie treats

(@smokie-treats)

Posts: 1626

Noble Member

I have a question.

Method 1: I have a message, and randomly place X number of wildcards on top of X number of symbols.

Method 2: I use the same message from Method 1, and use X number of wildcards to mask X number of bigrams.

On average, will the altered message from Method 2 be more difficult to solve than the altered message from Method 1?

If nobody answers, no problem, I will keep going anyway and eventually figure it out. But if anyone knows the answer, it would be greatly appreciated.
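The two methods can be put side by side in a few lines of Python. This is only a sketch under stated assumptions: the cipher is a list of symbol numbers, a single hypothetical wildcard symbol `WILDCARD` is used, bigrams that already contain a wildcard are ignored, and the function names are made up for illustration.

```python
import random
from collections import defaultdict

WILDCARD = 0  # hypothetical wildcard symbol; assumed not to occur in the cipher


def mask_random(cipher, x, rng):
    """Method 1: place a wildcard on x randomly chosen positions."""
    out = list(cipher)
    for i in rng.sample(range(len(out)), x):
        out[i] = WILDCARD
    return out


def mask_bigram_repeats(cipher, x, rng):
    """Method 2: spend up to x wildcards, each one breaking a repeated bigram."""
    out = list(cipher)
    for _ in range(x):
        # Collect the start positions of every bigram without a wildcard in it.
        positions = defaultdict(list)
        for i in range(len(out) - 1):
            if WILDCARD not in (out[i], out[i + 1]):
                positions[(out[i], out[i + 1])].append(i)
        # Every occurrence after the first counts as a repeat.
        repeats = [i for pos in positions.values() if len(pos) > 1 for i in pos[1:]]
        if not repeats:
            break
        i = rng.choice(repeats)
        # Cover either half of the chosen bigram.
        out[i + rng.randint(0, 1)] = WILDCARD
    return out
```

Running both functions on the same message with the same X over many random seeds, and feeding each result to a solver, would be one way to compare the two methods empirically.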

Smokie

Posted : August 10, 2015 3:37 am

daikon

(@daikon)

Posts: 179

Estimable Member

If I had to guess, and this is purely a guess at this point, I would say Method #2, where you mask bigram repeats, will be harder to solve. The reason is that the auto-solvers use the intrinsic order (in the sense of the opposite of randomness) in the cipher to converge on the correct solution. Basically they tweak the key (which letter each symbol in the cipher maps to) little by little until the solution makes the most sense. It helps a lot for different parts of the cipher to have the same patterns. It’s kind of like assembling a jigsaw puzzle, where different pieces are linked together, so that when you place one piece (assign a letter to a symbol) all other linked pieces (all occurrences of that symbol in the cipher) have to fit as well (i.e. form a text that looks like normal English sentences). A bigram, a two-letter sequence, being the simplest form of a pattern, is bound to repeat the most, so it is the main building block of all repeating patterns in the cipher. And if you target bigram repeats, and destroy them specifically, you are more likely to arrive at a hard-to-solve cipher because there will be less order, and more randomness, in the cipher.

Hmm, now that I put the whole reasoning into words, I’m convinced that Z could’ve followed the same logic, and did specifically try to mask as many bigram repeats as he could spot. There were no known automatic software solvers at the time of course, but Z might have thought that the FBI had some powerful mainframes that were capable of that task at the time.

Posted : August 10, 2015 4:00 am

smokie treats

(@smokie-treats)

Posts: 1626

Noble Member

Thanks for the answer. I think that you are probably right, but had to ask. I ask because I am trying to make a message that has enough bigram repeats so that after I mask, I get to about 47 remaining bigram repeats. Most of my messages have around 100 bigram repeats to start with. With one I got 109. If I try to randomize the cycles, the number goes down. I am wondering if I start with about 100 bigram repeats and mask about 24 bigrams, will that do the job? If I can find a way to distribute about 60 symbols amongst the letters in the key to make that work, then no problem.
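For anyone following along, here is one way to count bigram repeats in Python. The thread does not spell out the exact counting rule, so this sketch assumes that a bigram seen n times contributes n - 1 repeats, and it also reports the number of distinct bigrams that repeat at all. The function names are illustrative.

```python
from collections import Counter


def bigram_repeats(cipher):
    """Total repeats: a bigram that occurs n times contributes n - 1."""
    counts = Counter(zip(cipher, cipher[1:]))
    return sum(n - 1 for n in counts.values())


def distinct_repeating_bigrams(cipher):
    """Number of different bigrams that occur more than once."""
    counts = Counter(zip(cipher, cipher[1:]))
    return sum(1 for n in counts.values() if n > 1)
```

The cipher is assumed to be a flat list of symbol numbers read row by row, so bigrams that wrap from the end of one row to the start of the next are included.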

Still working on it, but the exercise is instructive. There are several high count (about 10) symbols in the 340. Some cycle well with each other, and some do not. I suspect that Zodiac may have lowered the count of symbols for high frequency letters, and raised the count for low frequency letters. He may have done that thinking that it could make frequency analysis more difficult. But it raised the number of bigrams a bit.

I also think that he may have masked the traditionally high frequency bigrams, targeting them on purpose instead of just masking at random. The reason I am saying this is because with the list of bigram repeats, there are several symbols which I have always suspected as wildcards because they are high count, and don’t particularly cycle well with others. But, if he could have randomized a bit, those symbols may not be wildcards at all. Not sure yet. But if he targeted traditionally high frequency bigrams, then those may be the ones that he didn’t target.

Also, if the message has at least a few low frequency letters, like B, K, or Z, that makes a big difference. I can assign more symbols to those letters and fewer symbols to the high frequency letters. All in all, the key does not look the way it would if I were making the distribution of symbol counts as flat as possible. It’s usually a more uniform or flat-looking key. That would explain why there are so many low count symbols in the 340. He may have assigned them to low frequency letters, resulting in low count.

I’ve just been messing around with that, and have not decided on any message to mask yet. I have been trying some of Zodiac’s letters. And some other stuff too, which gives me a few more initial bigram repeats.

Thanks again.

Another question for anyone out there. Has anyone ever tried to do this before? Has anyone ever tried to figure out a way that Zodiac could have masked bigrams with wildcards in a fairly cyclic message? I mean, if someone has already tried to do this, I would prefer to just read about it instead of doing it. Just wondering if I am going down a rabbit hole that someone else has already gone down.

Smokie

Posted : August 10, 2015 4:38 am

smokie treats

(@smokie-treats)

Posts: 1626

Noble Member

Anyway, I’ve done my second masking exercise. It was very educational. I made a message with perfect cycles, then randomized them a little bit toward the end of the message. I started with 110 bigram repeats and masked 32 of them with wildcards, leaving about 46 bigram repeats. I have to double check that. ZKD is working on it right now, stuck at about 32k after fifteen minutes.

I found out that bigrams sometimes sit next to each other. During the masking process, I doubled up some of the wildcards. I didn’t set out to do that, but when given a choice between two bigrams to mask, and masking one already caused another bigram to be created, it was easier to mask the bigram that caused a double wildcard symbol. Going through the process really helped me understand a lot. I encourage people to try it.

I would like to see if I can make an unsolvable message with only 24 wildcard symbols. That would have been a lot easier, because I could have used a key with mountains and valleys instead of trying to level out the symbols across the letters so that there would be more initial bigrams, then shifting symbols around.

I have to get some sleep now. If anyone wants to try to solve it, here it is. Symbols 61, 62 and 63 are the wildcards. The cycles are randomized 2% to 6% incrementally as you go through Rows 9-20.

17 25 1 15 27 62 9 18 26 35 16 10 13 31 28 37 58
7 36 15 9 32 10 22 17 39 9 8 2 16 59 3 4 18
61 25 27 61 56 26 60 33 35 44 49 17 31 36 45 41 10
35 62 28 21 9 11 18 22 61 61 57 42 17 36 16 35 61
9 10 58 40 34 59 12 53 27 32 23 51 61 25 7 2 26
28 61 46 55 52 24 9 61 21 58 27 31 44 10 36 56 8
32 45 47 60 31 9 33 1 25 49 55 16 28 22 10 54 61
35 15 26 59 35 63 17 58 14 18 25 17 35 36 27 34 18
35 57 28 41 26 59 58 27 32 61 28 9 2 35 17 36 42
56 51 60 15 59 48 3 18 61 16 27 21 10 1 61 40 36
15 2 61 23 9 56 26 52 5 28 23 11 59 31 36 17 35
16 61 7 1 29 10 32 12 9 6 36 22 44 31 27 38 58
8 49 61 59 32 21 18 19 10 2 30 27 31 61 15 28 22
9 29 2 17 25 36 10 57 13 32 9 62 26 53 62 63 16
60 33 15 17 58 45 55 9 21 62 59 54 4 31 1 34 51
20 25 27 47 18 26 36 62 10 9 43 2 5 35 24 17 40
7 21 61 36 61 61 8 63 59 32 27 30 61 58 9 49 28
25 35 59 56 36 37 3 10 52 61 60 29 9 57 15 1 22
21 22 18 19 61 61 35 38 26 58 9 21 56 39 10 31 44
6 27 50 11 59 32 36 60 3 22 9 35 37 25 26 10 21

Hope I didn’t make any mistakes. Good night.

EDIT: I made some mistakes. I started with 113 bigram repeats, but made some too, ending with 56 total because of the double wildcards. But I have a little system down now, and can work on fewer wildcards and bigram repeats next time.

Posted : August 10, 2015 6:26 am

smokie treats

(@smokie-treats)

Posts: 1626

Noble Member

I think that Zodiac used a relatively flat key to encode the 340, meaning that he used close to the same number of symbols for each letter. Maybe there was some variation, but generally high frequency letters resulted in high count symbols, and low frequency letters resulted in low count symbols. That’s why there are quite a few low count symbols. Then he realized that because he was using cycles, he was creating a lot of bigram repeats. For example, if you use three symbols for T, H and E respectively in a cyclic message, there are going to be a lot of repeating bigrams. Or even if you use four symbols for E and T and three symbols for H (Does that cause more bigram repeats in a cyclic message to do it that way?). Daikon, that’s the hidden order.
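The effect of cycles on bigram repeats can be seen with a small cyclic encoder. This is a sketch, not anyone's actual tool: the key layout (consecutive symbol numbers per letter, letters in alphabetical order), the function names, and the `randomize` parameter are all assumptions made for illustration.

```python
import random
from itertools import count


def make_key(plaintext, homophones_per_letter):
    """Assign consecutive symbol numbers to each letter (hypothetical layout)."""
    next_symbol = count(1)
    return {
        letter: [next(next_symbol) for _ in range(homophones_per_letter.get(letter, 1))]
        for letter in sorted(set(plaintext))
    }


def encode_cyclic(plaintext, key, randomize=0.0, rng=None):
    """Encode with each letter stepping through its homophones in order.

    With probability `randomize`, a random homophone of the letter is
    used instead of the next one in its cycle.
    """
    rng = rng or random.Random()
    position = {letter: 0 for letter in key}
    cipher = []
    for letter in plaintext:
        symbols = key[letter]
        if rng.random() < randomize:
            cipher.append(rng.choice(symbols))
        else:
            cipher.append(symbols[position[letter] % len(symbols)])
            position[letter] += 1
    return cipher
```

With three symbols each for T, H and E, encoding "THETHETHETHE" makes the fourth THE reuse the symbols of the first, so its bigrams repeat. In the artificial case where T and H always occur together, giving T four symbols and H three would delay the first repeated TH pair to the thirteenth occurrence, since the two cycles only realign after 12 steps. Real text does not keep the letters in lockstep like that, so this says nothing definite about the question in parentheses.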

He needed a way to change the message somehow so that it could not be solved so easily. So he used some combination of randomization and wildcards to mask all of the bigram repeats that he created with his unsophisticated cycle system. He was just an amateur who improvised one extra step that he didn’t use with the 408. I don’t know how to solve it, but that’s what makes more sense to me. I scrap the randomly placed wildcard hypothesis. Yesterday I burned a lot of time messing around with different messages, keys, and lists of bigram repeats. I wonder if anyone has just tried to make a fairly flat key (not flat distribution of symbols throughout the message, flat distribution of symbols across the letters on the key) for the first 340 of the 408 and then tried to mask 24 bigrams with the same symbol. EDIT: Will masking the most common digraphs, like TH, rather than some of the less common digraphs, make a message more difficult for ZKD to solve? Can you get the same effect with fewer more strategically masked bigrams as compared to more less strategically masked bigrams?

Posted : August 10, 2015 2:54 pm

masootz

(@masootz)

Posts: 415

Reputable Member

He was just an amateur who improvised one extra step that he didn’t use with the 408.

yes yes yes yes. there’s no reason to assume he was an expert cryptographer. it doesn’t require advanced knowledge to make a one time encryption that no one has to be able to decode. the 408 was broken because he did something obvious (using "i" and "kill") so with the 340 he wanted to add a step to make it harder. he might even have wanted to make it close to impossible but i don’t think he wanted to make it completely impossible, because the part of his brain that was playing a game with the public and le wanted to feel superior. that feeling of superiority comes from giving someone something complicated that they’re too "stupid" to figure out, not giving them something impossible. just my 2 cents.

Posted : August 10, 2015 4:17 pm

daikon

(@daikon)

Posts: 179

Estimable Member

Anyway, I’ve done my second masking exercise. It was very educational. […]
17 25 1 15 27 62 9 18 26 35 16 10 13 31 28 37 58

Cracked it! It’s a quote from J.R.R. Tolkien. Feeding this cipher to the solver as-is didn’t produce any results. So I was ready to give up right there and report that I couldn’t solve it, but then I thought, well, let’s see if we can find a way to crack it after all. The method I came up with takes more time, but I got it to produce an approximate text that I was able to manually clean up further. I had the solver running overnight to attempt to crack Z340 using the same method, but it didn’t produce anything meaningful. I’ll keep it running for a while longer, to see if something pans out. Should I disclose the method, or does someone else want to give it a try to solve this cipher?

Posted : August 10, 2015 8:53 pm

daikon

(@daikon)

Posts: 179

Estimable Member

Another question for anyone out there. Has anyone ever tried to do this before? Has anyone ever tried to figure out a way that Zodiac could have masked bigrams with wildcards in a fairly cyclic message? I mean, if someone has already tried to do this, I would prefer to just read about it instead of doing it. Just wondering if I am going down a rabbit hole that someone else has already gone down.

I could be mistaken, but I think you are trailblazing here. I’m not aware of any other attempts to explore the wildcards idea. I think I’ve seen it mentioned before, and someone might’ve done it privately, but I don’t think I’ve come across any serious research in that direction.

By the way, what are the prime candidates for wildcards in Z340 in your opinion? The most frequent symbols: ‘+’, ‘B’ and ‘p’?

Posted : August 10, 2015 9:00 pm

daikon

(@daikon)

Posts: 179

Estimable Member

Will masking the most common digraphs, like TH, rather than some of the less common digraphs, make a message more difficult for ZKD to solve? Can you get the same effect with fewer more strategically masked bigrams as compared to more less strategically masked bigrams?

I would say, yes, masking the most common digraphs would make it harder to crack, but probably not hugely. I.e. it won’t turn a solvable cipher into an unsolvable one. Auto-solvers use N-gram stats from common English texts, so if you take away the most frequent stats by masking TH, they are more likely to get confused, but there will still be enough information in the cipher to get to the solution. For example, if you mask TH in "ITWAS*HEDAY", there are still N-grams before and after ‘*’ that can be used to arrive at the correct solution.

Hmm, this makes me think it is best to place wildcards very evenly throughout the cipher, so it masks/destroys as many higher-order N-grams as possible, so that ZKD/AZD have nothing reliable to use to score the solution. Every 4th or 5th symbol would be ideal. I highly doubt Z would realize that though. I think he would be much more likely preoccupied with masking bigrams, to hide "KILL", etc., so that it won’t be solved as easily as Z408.

A while back I actually thought of another way of masking repeating bigrams, that doesn’t involve wildcards. What if Z used some of the cipher symbols to represent more than one letter? A bigram perhaps. ‘+’ would *not* be one of them though, as it occurs too often in Z340. Even the most frequent bigram, ‘TH’, isn’t that frequent. Also ‘+’ doubles itself 3 times, and ‘THTH’ just doesn’t happen. But let’s focus on wildcards for now. I just wanted to mention this to perhaps research later.

Posted : August 10, 2015 9:30 pm

smokie treats

(@smokie-treats)

Posts: 1626

Noble Member

Thanks for working on this with me. My updated list of wildcard suspects is:

19 +
26 W
20 B
51 F
5 q

In that order. When I total all of the cycle scores for each symbol when compared to every other symbol (e.g. 1 and 2, 1 and 3 … 1 and 63), these symbols have the lowest overall total scores in their respective categories sorted by count from high to low.

Symbol 26 could be a 1:1 substitute or a wildcard because it has very low overall cycle scores and does not cycle at all with any other symbol. After doing the masking exercise, I decided to add it to the list because I cannot assume that a wildcard must be high count. To try to mask even 30 bigrams when trying to get a final bigram repeat count of 47 was a bit challenging. So with 19 having a count of 24, I see that 26 could easily be a wildcard if Zodiac did it this way. In fact, it could be the only other wildcard besides 19.

Symbol 20 does not cycle with another symbol in any remarkable way.

Symbol 51 does cycle with symbol 23 O a bit: 23 23 51 51 23 51 23 51 23 51 23 51 51 51 23 51 23 51 23 23, which is not particularly remarkable given all of the random cycles generated. On the other hand, Zodiac randomized the cycles as well. 51 is close to the bottom of the list.

Symbol 5 only cycles with symbol 29 < about 1/3rd of the way into the message: 5 5 29 5 5 5 5 29 5 29 5 29 5 29 5 29 5, which includes ten consecutive alternations. That can happen about 1-2 times throughout the entire 340 if I scramble the 340. A lot of the cycles get more randomized the further you go down the message. This one is the opposite. I suspect that Zodiac could have done the reverse with only this one cycle. Or it is random. If this isn’t an intentional cycle, then the 5 q may be a wildcard. I find it difficult to believe that 5 could be a wildcard, yet just coincidentally also cycles with symbol 29. The double q q could just be randomization. It is at the bottom of the list.
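The "ten consecutive alternations" can be checked mechanically. The sketch below is a simple stand-in, not the cycle score used for the rankings above (that score is not defined in this thread): it pulls out the merged sequence of two symbols and measures the longest run of back-and-forth switches. Function names are illustrative.

```python
def merged_sequence(cipher, a, b):
    """Keep only symbols a and b, in the order they appear in the cipher."""
    return [s for s in cipher if s in (a, b)]


def longest_alternation(sequence):
    """Largest number of consecutive switches from one symbol to the other."""
    best = run = 0
    for previous, current in zip(sequence, sequence[1:]):
        # A switch extends the current run; a repeated symbol resets it.
        run = run + 1 if current != previous else 0
        best = max(best, run)
    return best


# The 5/29 sequence quoted above.
seq_5_29 = [5, 5, 29, 5, 5, 5, 5, 29, 5, 29, 5, 29, 5, 29, 5, 29, 5]
print(longest_alternation(seq_5_29))  # 10
```

Running the same function on many shuffled copies of the cipher would give a rough idea of how often a run of that length shows up by chance.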

Try +, W and B. See what happens. I am thinking that because of bigram repeat stats, it is likely either + and W or + and B. Likely not all three.

The Tolkien message was more cyclic than the 340, and doesn’t have the same symbol count stats. It’s not particularly easy to mimic the 340 with this cipher method, but I am going to keep working on it. If I can make a message, distribute the symbols across the key that will closely mimic the 340 symbol count stats, then cycle the message and randomize it to mimic the cycle stats, then mask enough bigrams to mimic the bigram repeat stats, and it cannot be solved, then we may be in business.

I have tried to randomly place all of the 19, 20, 51 and 5’s on a perfectly cyclic message and the message won’t solve. But I didn’t check the bigram repeat stats. I should probably go back and do that to see what they were.

Posted : August 10, 2015 11:26 pm

daikon

(@daikon)

Posts: 179

Estimable Member

Try +, W and B. See what happens. I am thinking that because of bigram repeat stats, it is likely either + and W or + and B. Likely not all three.

Ok, I’m running my solver with the following sets of possible wildcards for Z340: {‘+’,‘B’,‘W’}, {‘+’,‘W’}, {‘+’,‘F’} and {‘+’,‘B’,‘F’}. It’ll take a few hours to get enough restarts to see if it’s going anywhere, since I have to test each set separately and it’s a slow method. I’ve already tested {‘+’,‘B’} and {‘+’,‘B’,‘p’} overnight, since they are the most frequent symbols in Z340. ‘q’ only occurs twice, so it shouldn’t be in the way of solving the cipher even if it’s a wildcard. ‘W’ is borderline, as it occurs 6 times, but mostly at the end of the cipher, and twice at the beginning, so it should leave a big section in the middle perfectly solvable. But I thought I’d test it anyway. I’ll let you know if there are any results!

Posted : August 11, 2015 2:30 am

smokie treats

(@smokie-treats)

Posts: 1626

Noble Member

‘q’ only occurs twice

Symbol 5 for me is the Symbol at Column 1, Row 1. I guess it looks more like a backwards P. There is a double "qq" on Row 4, which is part of what made me think that it may be a wildcard, because Zodiac treated it differently.

Sorry about that, but I doubt that q is a wildcard anyway. Let’s see what happens with your solver with what we have now and I will be interested in what you tried. Thanks!

Posted : August 11, 2015 3:04 am

daikon

(@daikon)

Posts: 179

Estimable Member

‘q’ only occurs twice

Symbol 5 for me is the Symbol at Column 1, Row 1. I guess it looks more like a backwards P. There is a double "qq" on Row 4, which is part of what made me think that it may be a wildcard, because Zodiac treated it differently.

Oh, I see, that’s ‘p’ in WebToy transcription. It’s the 3rd most frequent symbol in Z340, so I think it should be included in the tests based on that. I already included it in one set I tested: {‘+’,‘B’,‘p’}, and I’m planning to test {‘+’,‘p’} as well later.

Let’s see what happens with your solver with what we have now and I will be interested in what you tried. Thanks!

I think there is no harm in describing it. If you don’t want to learn how I cracked the most recent cipher in this thread (starts with "17 25 1 …") just yet then STOP READING NOW. 🙂 The first idea I had is a bit technical, but I’ll try to describe it in the simplest terms. Basically, a "wildcard" means "anything goes here". Solvers generally can’t handle that because they use specific 4- or 5-gram stats from a collection of existing books, news articles, Usenet postings, Wikipedia articles, etc. (usually called a corpus). Let’s use 4-grams just for the sake of simplicity. These tables look like this:
…
NTER 2617
ENTS 2616
ECON 2602
NING 2595
COMP 2558
…
For each 4-gram you have the number of times it was found in the corpus. If you divide that number by the total count of all 4-grams in the corpus, you’ll get the frequency, or probability, of each 4-gram in the English language. You can use that data to score how well each proposed solution reads as coherent English text (as opposed to some gibberish). The idea is that if we split the solution into overlapping 4-grams, and then add up the frequencies/probabilities for each 4-gram, we’ll get an overall score which will be higher for texts that look more like English (i.e. contain frequent 4-grams found in English texts), and lower for gibberish. There is actually a proper mathematical basis for all this, based on statistics and probabilities, but the simplification also makes sense from a purely common-sense point of view.
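The scoring idea described above can be written down in a few lines. This is a toy sketch: the table holds only the five counts quoted in the excerpt, and the total is taken over those five rows, whereas a real solver would use the total over the whole corpus. The names `COUNTS`, `TOTAL` and `score` are made up for illustration.

```python
# Toy table with the five counts quoted above; a real table is far larger.
COUNTS = {"NTER": 2617, "ENTS": 2616, "ECON": 2602, "NING": 2595, "COMP": 2558}
# Assumption: total over this toy table only, not over a real corpus.
TOTAL = sum(COUNTS.values())


def score(text, counts=COUNTS, total=TOTAL, n=4):
    """Add up the frequencies of all overlapping n-grams in the text."""
    return sum(
        counts.get(text[i:i + n], 0) / total
        for i in range(len(text) - n + 1)
    )
```

Here `score("ENTER")` picks up the NTER entry while a string of unrelated letters scores zero. Solvers commonly sum log-probabilities rather than raw frequencies, which is presumably the "proper mathematical basis" mentioned above, but that detail is not spelled out in the post.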

The problem with wildcards is that you cannot use the 4-gram tables to score the solution, as they contain only actual letters. At first I thought about building new tables, with wildcards included. But then I realized that I can already use the existing tables. You just need to iterate through all possible letter substitutions for each wildcard, and look for the maximum score out of all of them. So, for example, if you need to score "AB*D", where "*" is a wildcard, you substitute it with each letter in the alphabet, and then consult the 4-grams table. Let’s say you have these entries (purely fictional) that match "AB*D", and their respective scores:
ABAD 37
ABOD 21
ABID 12
You take the maximum score, which is 37, and as an added bonus, you get the best candidate for the wildcard (‘A’). You add the score to the overall score, and that’s it! I was super excited about this, but it turns out it doesn’t work. Why? The 4-grams overlap in the solution. For example, if you have: "ABC*FGH", you need to add up scores for the following 4-grams: "ABC*", "BC*F", "C*FG" and "*FGH". It is very likely that you’ll end up with maximum scores that have different letters in the place of the wildcard, and therefore with several best candidates for the exact same wildcard. Which one do you use? I was thinking about figuring out the combined maximum for each letter, but it was already getting quite complicated, and slow… And then it hit me, there is an easier way!
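Both the "take the maximum" step and the overlap problem can be reproduced with a short sketch. `TABLE` holds the fictional entries from the post; `CONFLICT_TABLE` is a second, equally fictional table invented here only to show windows disagreeing. The function names are illustrative, and each window is assumed to contain a single wildcard.

```python
import string

TABLE = {"ABAD": 37, "ABOD": 21, "ABID": 12}  # fictional entries from the post


def best_fill(pattern, table, wildcard="*"):
    """Try every letter in place of the wildcard; return (letter, score)."""
    candidates = (
        (letter, table.get(pattern.replace(wildcard, letter), 0))
        for letter in string.ascii_uppercase
    )
    return max(candidates, key=lambda pair: pair[1])


def window_fills(text, table, n=4, wildcard="*"):
    """Best fill for every n-gram window that contains the wildcard."""
    return [
        best_fill(text[i:i + n], table, wildcard)
        for i in range(len(text) - n + 1)
        if wildcard in text[i:i + n]
    ]


print(best_fill("AB*D", TABLE))  # ('A', 37)

# Hypothetical table, made up to show the overlapping windows disagreeing.
CONFLICT_TABLE = {"ABCD": 5, "BCEF": 9, "CEFG": 4, "DFGH": 7}
print(window_fills("ABC*FGH", CONFLICT_TABLE))
# [('D', 5), ('E', 9), ('E', 4), ('D', 7)]
```

In the second example two windows prefer D and two prefer E for the same wildcard, which is exactly the ambiguity described above.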

What I ended up doing was the old idea of replacing each ‘+’ symbol with a new unique symbol. It essentially turns it into a wildcard (i.e. each occurrence of ‘+’ will be its own letter) without the solver knowing anything about wildcards. You just don’t stop there and replace the second wildcard with a new set of unique symbols. And the third one, if you have to. Yes, you end up with a very high multiplicity cipher (i.e. many unique symbols). But I thought if you do enough restarts, the solver might converge on the correct solution. Or somewhat correct, as the case may be. And I was right! For the Tolkien cipher, in about an hour, I got 2 top solves that looked remarkably similar. When I compared the parts that were the same in both, I started to recognize English sentences, with a few letters off. For example "HOBBIT", being a fairly rare word, was consistently solved to "HOBBIS". 🙂 You can even try it on your own with ZKD or AZD.
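The expansion step is easy to script. A minimal sketch, assuming the cipher is a list of symbol numbers and that fresh symbols can simply be numbered upward from the largest symbol in use; `expand_symbols` is a made-up name, not part of ZKD or AZD.

```python
def expand_symbols(cipher, suspects):
    """Replace every occurrence of each suspect symbol with a fresh unique symbol."""
    next_symbol = max(cipher) + 1  # first number not used by the cipher
    expanded = []
    for symbol in cipher:
        if symbol in suspects:
            expanded.append(next_symbol)
            next_symbol += 1
        else:
            expanded.append(symbol)
    return expanded


# Small made-up example: symbols 19 and 20 are the suspected wildcards.
print(expand_symbols([1, 19, 2, 19, 3, 20, 19], {19, 20}))
# [1, 21, 2, 22, 3, 23, 24]
```

Every suspect occurrence becomes its own symbol, so the number of distinct symbols grows by roughly the total count of the suspects. That is the high multiplicity mentioned above, and it is why many restarts are needed.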

Let’s hope this works for Z340. 🙂

Posted : August 11, 2015 4:23 am

daikon

(@daikon)

Posts: 179

Estimable Member

Forgot to mention that this method also worked for the earlier "purple haze" cipher, when I "expanded" the top 3 out of 4 wildcards (37, 49 and 51). Took about 3 hours to get a solution. So I think if Z340 was indeed encrypted with several wildcards, and we can correctly guess the top 3, we should be able to solve it. Here’s hoping, right? 🙂

Posted : August 11, 2015 4:29 am
