Zodiac Discussion Forum


I know what he (maybe) did with the 340!

61 Posts
9 Users
0 Reactions
8,718 Views
daikon
(@daikon)
Posts: 179
Estimable Member
 

Ah, right, I saw that table, but didn’t make the connection to Kevin Knight. I was hoping for a more formal paper, but I guess there isn’t much there to talk about. I should probably post my own findings in a separate thread. No breakthroughs and I’m almost sure you guys already stumbled over the same observations, but it’s always good to get an independent confirmation.

 
Posted : July 25, 2015 6:27 am
daikon
(@daikon)
Posts: 179
Estimable Member
 

I know this is not exactly within the topic of this thread, but I can’t resist sharing this. So I started reading different scientific papers that attempt to solve Z340 (thanks, doranchak, for pointing me in the right direction). All the usual suspects among algorithms seem to be covered: hill climbing, simulated annealing, genetic algorithms, etc. The Bayesian inference approach is very interesting, as from my understanding it can actually handle polyphonic substitution as well (when more than one plaintext letter shares the same cipher symbol). But the authors didn’t include any source code, and I’m not fluent enough in Bayesian theory to understand everything they’ve talked about.

Then I come across the Beam Search paper. First off, they say "(Ravi and Knight, 2011a) report the first automatic decipherment of the Zodiac-408 cipher", which is… well, not true. But I keep reading. They proceed to claim that their approach "outperforms the current state of the art while being conceptually simpler and keeping computational demands low". Hell, sign me up! Better than the "current state of the art"? Simpler? And computational demands low? Even a rube like me will be able to implement it in a couple of hours and solve Z340 by the morning, I’m sure!

So I keep reading. Then they claim "the optimal solution is found in 1/400th of the time needed using A* search". I don’t know what "A* search" is, but 1/400th of the runtime seems huge. Sign me up! I feel all excited! Z340 plaintext, here I come! So I trudge through a few paragraphs of dense math, trying to remember my college days. I get most of it, but they lose me near the end, once they start talking about "notation for relations" and "partial cipher functions".

Well, I look over the pseudo-code for the algorithm, which is only 21 lines long, and they lose me pretty much right away. And I know way more about programming than maths. Well, so much for "conceptually simpler". I start suspecting that it’s not going to be everything they promised. So I skip over the lengthy explanation of their "simple" algorithm, to the section with the actual results. First, they talk about a 1:1 substitution cipher. And I see a table with runtimes of 120.38 and 151.85 seconds for a 128-character cipher. Uh-oh. They need over 2 minutes to solve a simple 1:1 substitution?! There goes their promise of low computational demands, and my hopes of tackling Z340 in a few hours.

By the time they get to homophonic substitutions (and they use Z408 as a test), I don’t expect much. Make sure you are sitting down, just in case. :) Because I sure am glad I was when I saw the time it took their algorithm to solve Z408. It took 1,706 seconds to get to a 94% correct solution, and 14,724 seconds to reach 96.8%. And I couldn’t believe my eyes when I saw this under the table: "Runtimes are reported on a 128-core machine." What?! Half an hour on a 128-core machine?! Which means it’ll take over 14 hours on a "regular" 4-core computer, in case you don’t have access to a 128-core monster. And if you want to get to a 97% correct solution, you’ll need 5 and a half days! I’m sure glad ZKD and AZD don’t use their "better than current state of the art" algorithm. :))

 
Posted : July 25, 2015 8:50 am
doranchak
(@doranchak)
Posts: 2614
Member Admin
 

It’s funny (and frustrating) how often papers claim to have invented the first automatic decipherment. I think it’s because a lot of these researchers simply aren’t aware of prior efforts, especially since older successful programs like zkdecrypto aren’t documented in academic papers.

I hope to raise some awareness of this in my upcoming crypto symposium talk in October.

http://zodiackillerciphers.com

 
Posted : July 25, 2015 1:41 pm
glurk
(@glurk)
Posts: 756
Prominent Member
 

It’s funny (and frustrating) how often papers claim to have invented the first automatic decipherment. I think it’s because a lot of these researchers simply aren’t aware of prior efforts, especially since older successful programs like zkdecrypto aren’t documented in academic papers.

I hope to raise some awareness of this in my upcoming crypto symposium talk in October.

Part of the problem – from my end, anyway – was that ZKD was never meant to be an ‘academic’ project and was not written as a thesis or attempt to get a degree. It was written for fun, and also to make an honest attempt at the Z340 cipher!

That said, there was one other paper that really impressed me, and I can’t find it! It was basically a paper about using a hill-climber, but it had an additional step – or extra loop – and I wish I had used those ideas. I can’t find it now, and forgot who wrote it.

-glurk

——————————–
I don’t believe in monsters.

 
Posted : July 25, 2015 1:49 pm
glurk
(@glurk)
Posts: 756
Prominent Member
 

Much thanks to doranchak who has updated his cipher reference wiki with research papers here:

http://zodiackillerciphers.com/wiki/ind … rch_papers

In my own cynical opinion, many of these papers were written more with the goal of obtaining a degree by writing a thesis than with actually making an effort at effective cipher solving. Which is understandable, but my personal opinion is that the best work is being done by interested amateurs working on the ciphers because they like to.

That said, the paper I referenced before, and that doranchak has found, is this one:

http://www.cs.sjsu.edu/faculty/stamp/RUA/homophonic.pdf

There are some great ideas here that I would have used in ZKD if I had it all to do over again. Just my opinion here, but I think that this research is by FAR the best idea on solving homophonic ciphers.

-glurk

——————————–
I don’t believe in monsters.

 
Posted : July 26, 2015 2:33 pm
Jarlve
(@jarlve)
Posts: 2547
Famed Member
Topic starter
 

Just imagine if ZKDecrypto didn’t exist, with all these academic papers complicating the matter.

AZdecrypt

 
Posted : July 26, 2015 3:15 pm
Jarlve
(@jarlve)
Posts: 2547
Famed Member
Topic starter
 

The best results from my tests with AZdecrypt have been from horizontal transposition schemes (row mirroring in 17×20 and 20×17, random columnar transposition runs). This is not just because of the number of combinations generated by those, but also because of the higher average results. It seems that on average the 340 likes to sit just where it is!!!

Something I said earlier: I believe that my solver will score higher in the direction of the encoding even when no solution can be found, because there seems to be a positive correlation between higher non-repeat scores and solver scores. It’s an observation of some worth for my quoted statement, and possibly in general.

It’s something I noted earlier but had forgotten about. As usual, I don’t have much data to back this up at the moment, because I move through stuff as fast as possible.

AZdecrypt

 
Posted : July 26, 2015 5:05 pm
daikon
(@daikon)
Posts: 179
Estimable Member
 

In my own cynical opinion, many of these papers were written more with the goal of obtaining a degree by writing a thesis than with actually making an effort at effective cipher solving. Which is understandable, but my personal opinion is that the best work is being done by interested amateurs working on the ciphers because they like to.

Or the method offered in the paper is so impractical that it can’t be used for real-world cases. Like that paper that claimed a "fast" solving method but used a 128-core computer to crack 64-letter ciphers in "mere minutes". I came across another paper that claimed good results, until I saw that they were using 7- and 8-grams to score the solution. Well, you need an array with 26^8 = 208,827,064,576 elements to store the 8-gram stats in memory. Or over 200 GB of RAM if you are using one byte to store the logs. And you need a lot more to actually build the table from a corpus, although in that case it doesn’t have to be in memory. Must be nice to have access to a 128-core supercomputer with 256 GB of RAM, but for all practical purposes, they might as well be using unicorns to solve the ciphers for them. :)
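
For reference, the arithmetic is easy to check; a quick sketch (assuming one byte per table entry, as above):

```python
# Back-of-the-envelope check of dense n-gram table sizes,
# assuming one byte per entry (e.g. a quantized log count).
def dense_table_bytes(n, alphabet=26, bytes_per_entry=1):
    return alphabet ** n * bytes_per_entry

for n in range(5, 9):
    size = dense_table_bytes(n)
    print(f"{n}-grams: 26^{n} = {26 ** n:,} entries ~= {size / 2**30:.1f} GiB")
```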

 
Posted : July 26, 2015 10:00 pm
Jarlve
(@jarlve)
Posts: 2547
Famed Member
Topic starter
 

In my own cynical opinion, many of these papers were written more with the goal of obtaining a degree by writing a thesis than with actually making an effort at effective cipher solving. Which is understandable, but my personal opinion is that the best work is being done by interested amateurs working on the ciphers because they like to.

Or the method offered in the paper is so impractical that it can’t be used for real-world cases. Like that paper that claimed a "fast" solving method but used a 128-core computer to crack 64-letter ciphers in "mere minutes". I came across another paper that claimed good results, until I saw that they were using 7- and 8-grams to score the solution. Well, you need an array with 26^8 = 208,827,064,576 elements to store the 8-gram stats in memory. Or over 200 GB of RAM if you are using one byte to store the logs. And you need a lot more to actually build the table from a corpus, although in that case it doesn’t have to be in memory. Must be nice to have access to a 128-core supercomputer with 256 GB of RAM, but for all practical purposes, they might as well be using unicorns to solve the ciphers for them. :)

I have experimented with the range of 3- to 10-grams (at the same time). It’s not that impractical when just trying to solve one cipher; I used a 2-dimensional array with some indexing and short-circuiting. The algorithm still pushed 10-20k keys per second and could have been optimized further.

Next year we will get AMD’s Zen (16 core, 32 threads) and probably Intel’s (response) MorphCore (8 threads per core). Can’t wait to upgrade to one of these new platforms.

@daikon, you mentioned compiling 6-grams from a 1 GB corpus somewhere. I’m looking for a large, high-quality corpus, as big as possible, to compile n-grams from, so I’m wondering what material you used. I’ve been looking at this page: http://corpora2.informatik.uni-leipzig.de/download.html The issue I have with this data is that the sentences are no longer in their original context (they have been sorted).

AZdecrypt

 
Posted : July 27, 2015 12:30 pm
daikon
(@daikon)
Posts: 179
Estimable Member
 

I have experimented with the range of 3- to 10-grams (at the same time). It’s not that impractical when just trying to solve one cipher; I used a 2-dimensional array with some indexing and short-circuiting. The algorithm still pushed 10-20k keys per second and could have been optimized further.

I would love to know more about how you stored the higher-order N-grams in memory! I tried using hash maps and binary/ternary trees, but it’s just too slow and the memory savings are not that significant over a straight array. Could you describe your idea in more detail? Or you could even just PM me the lookup subroutine and I should be able to figure it out from the source code.

Next year we will get AMD’s Zen (16 core, 32 threads) and probably Intel’s (response) MorphCore (8 threads per core). Can’t wait to upgrade to one of these new platforms.

Number of threads per core doesn’t really matter if you load the cores 100% all the time. It only matters if you have a lot of relatively short-lived threads, or if they spend a lot of time waiting for I/O operations (such as reading/writing from/to disk, or sending/receiving network data, etc.). So it would make sense for database types of applications, or web/media/email serving, or even having a bunch of small virtual servers on one super-server. But once you start fully loading the cores for more than a few seconds at a time (say, when cracking a cipher), it doesn’t matter how many threads a core can accommodate – if it’s loaded 100% by one thread, it’s loaded 100%, no room for any other threads.

I was thinking about using OpenCL to offload some calculations to GPUs. They do that a lot for scientific research. GPUs don’t have as high a clock rate as CPUs, I think it’s about 1/3rd on average, but the local memory access is super-fast, and GPUs have a lot of cores. Modern high-end GPUs can have up to 4,000 cores, and you can get an older Radeon with 800 cores for about $100. And you can even place several of them in one computer, only limited by the number of PCI slots (up to 6, I think) and the rating of your power supply. So technically, it would be possible to build a monster with 6*4,000 = 24,000 cores of GPU power for about $10,000, I think. However, it would still take an eternity and then some to brute-force all keys for Z340. :)

@daikon, you mentioned compiling 6-grams from a 1 GB corpus somewhere. I’m looking for a large, high-quality corpus, as big as possible, to compile n-grams from, so I’m wondering what material you used.

I used the same one you have linked to below (I found the link on the excellent Practical Cryptography website). There is also the Usenet corpus. It’s definitely not "high quality", as it’s a bunch of random dudes posting their ramblings in a bunch of discussion groups on various topics, not unlike this forum :). But I think that’s a good thing, since it has more samples of informal speech, with lot’s of grammer errars and typoes :), which riddle Z’s letters as well, so it’s a good representation of what we are trying to crack (I think). However, it’s 8 GB compressed and nearly 40 GB uncompressed, so it takes a long while to process. It’s a good candidate for higher-order N-grams though, due to its vast size, once I figure out how to fit 8/9/10-grams into memory (with your help, I hope!). They did a lot of redundant text removal, and they also removed all non-English text and most of the technical headers, but they kept the quotation "preface" for some reason ("Aaaaaa Aaaaaaa in reply to Bbbbbbb Bbbbbbb on the date of zzzzz-zzzzz wrote/said:").
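
For what it’s worth, the basic counting step is simple; a rough sketch of the idea (the file name is just a placeholder, and it ignores n-grams that span line breaks):

```python
# Rough sketch of compiling letter n-gram counts from a plain-text corpus.
# "usenet.txt" is a placeholder path, not a real file.
from collections import Counter
import re

def count_ngrams(path, n):
    counts = Counter()
    with open(path, encoding="utf-8", errors="ignore") as f:
        for line in f:
            text = re.sub(r"[^A-Z]", "", line.upper())   # letters only
            for i in range(len(text) - n + 1):
                counts[text[i:i + n]] += 1
    return counts

sixgrams = count_ngrams("usenet.txt", 6)
print(sixgrams.most_common(10))
```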

Interestingly enough, it turns out the Usenet corpus has enough Italian in it to automatically crack the 3rd stage of Simon Singh’s old cipher challenge. I was searching for examples of homophonic substitution ciphers, came across it, and ran it through a simple auto-solver of mine without much analysis, which produced a stable solve that I wasn’t able to read or understand. I was intrigued, searched for the correct solution, which turned out to be in Italian, and it matched the automated solution pretty closely.

I’ve been looking at this page: http://corpora2.informatik.uni-leipzig.de/download.html The issue I have with this data is that the sentences are no longer in their original context (they have been sorted).

I thought that was bad at first too, but now I think it’s good. You do want to separate sentences from one another, as most sentences can be swapped around without too much loss of coherency in the overall meaning of the text. But most words within English sentences have very defined places. Or at least groups of words do. For example, you could potentially rewrite the previous sentence as "But within English sentences most words have places that are very defined". However, you’ll never write it as "But words most English within…". Unless you start speaking like Yoda. :)

 
Posted : July 27, 2015 10:09 pm
Jarlve
(@jarlve)
Posts: 2547
Famed Member
Topic starter
 

Hey daikon,

I just used a 2-dimensional array: Array(count_of_ngrams, 9), where the second dimension holds the 10 numbers which make up the n-gram. Order them alphabetically and you can use some indexing. Granted, I only compiled them from a 100 MB corpus, but maybe you could throw out the lower counts until it fits in your memory.

MorphCore (8-way SMT) is said to have a 22% performance increase over Hyperthreading (2-way SMT). Intelligent sharing of resources and whatnot I guess.

GPU use is very exciting!

Thanks for the Usenet corpus link. It’s amazing that you got a solve on the Italian cipher; it’s included with ZKDecrypto as well, I believe. I suspected you were working on a solver of your own, it’s a lot of fun and very rewarding at times! Planning to stomp our current solvers?

:)

AZdecrypt

 
Posted : July 28, 2015 12:32 pm
daikon
(@daikon)
Posts: 179
Estimable Member
 

I just used a 2-dimensional array. Array(count_of_ngrams,9), the second dimension are the 10 numbers which make up the n-gram. Order them alphabetically and you can use some indexing. Granted I only compiled them from a 100mb corpus but maybe you could throw out the lower counts until it fits your memory.

Ah, I see! You mean you literally put all N-grams into one long "flat" list, ordered alphabetically, indexed by the first 3-4 letters to quickly find the starting point in the list, and then did a binary search for the remainder of the N-gram? That should work. Definitely not the fastest, but the memory savings should be pretty good. You can even discard the first letters that you indexed the list by, to save some more memory.
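
In other words, something like this, I imagine (just a sketch to check my understanding; the 3-letter prefix and the names are mine, not your actual code):

```python
# Sketch of a flat, sorted n-gram table with a small prefix index and a
# binary search for the rest of the n-gram.
import bisect

class NgramTable:
    def __init__(self, ngram_scores, prefix_len=3):
        # ngram_scores: dict mapping n-gram string -> score (e.g. log count)
        self.prefix_len = prefix_len
        self.keys = sorted(ngram_scores)
        self.scores = [ngram_scores[k] for k in self.keys]
        self.index = {}                      # prefix -> first position in keys
        for i, k in enumerate(self.keys):
            self.index.setdefault(k[:prefix_len], i)

    def score(self, ngram, default=0.0):
        start = self.index.get(ngram[:self.prefix_len])
        if start is None:
            return default                   # unseen prefix: short-circuit
        i = bisect.bisect_left(self.keys, ngram, lo=start)
        if i < len(self.keys) and self.keys[i] == ngram:
            return self.scores[i]
        return default

table = NgramTable({"THEREF": 8.1, "THERES": 7.9, "ANDTHE": 9.3})
print(table.score("THERES"), table.score("QQQQQQ"))   # 7.9 0.0
```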

GPU use is very exciting!

Funnily enough, just a couple of days after I thought of that, I came across another technical paper of someone doing just that back in 2013: Decipherment with a Million Random Restarts. (By the way, why the hell do they never put the first-published date in the paper?! I have to google the title of the paper each time to find out how recent it is.) They use a different approach – the Expectation-Maximization (EM) algorithm – running on three GPUs, to do a million restarts to try to crack Z340. Might be the best proof yet that Z340 is not a straight homophonic substitution cipher.

I suspected you were working on a solver of your own, it’s a lot of fun and very rewarding at times! Planning to stomp our current solvers?

Well, I was pretty happy with ZKD and AZD, until I had the idea of doing a hill-climb for both homophonic substitutions and column transpositions. It’s the current working theory for me – that Z simply wrote down his message in the same matrix of 17 letters per row, but then literally cut up the paper with scissors into one-letter columns and rearranged those vertical strips using some sort of rule that made sense to him. I don’t have any real proof to back up this theory, other than that Z clearly wasn’t very patient or at all meticulous, so I can’t see him constructing a 26×26 Vigenere table, or using a Polybius square and then fiddling with bifid, or even Playfair with its somewhat complicated rules, but rearranging paper strips is so dead simple, even a toddler could do it. Another hint that he might’ve done it that way – it’s easy to choose a poor way to rearrange the strips, so that it does not destroy the bigram counts as thoroughly as bifid or Vigenere would. I.e., I think if Z used bifid or vig, we’d see far fewer bigram repeats in Z340. But since we do see plenty of bigram repeats in Z340, though nowhere near enough for a straight homophonic substitution, I think he mixed up the columns just enough to confuse any manual, pen-and-paper decipherment attempts. And yet another hunch – the Military Cryptanalysis book lists columnar transpositions as one of the few encryption methods that don’t have a general solution. Which means you have to rely on cribs, or multiple ciphers that start the same, or end the same, or have the same length, etc., to recover the plaintext (i.e. using special solutions). If Z was in the Army/Navy, he would’ve trained on that book, and would’ve known that.
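
Just to illustrate the mechanics I have in mind (a toy sketch; the key below is an arbitrary example, not a claim about what Z actually used):

```python
# Toy illustration of the "cut the grid into one-letter vertical strips and
# rearrange them" idea. 17 columns x 20 rows matches the 340.
def rearrange_columns(text, width, key):
    # key[j] = which original column ends up in output position j
    rows = [text[i:i + width] for i in range(0, len(text), width)]
    return "".join("".join(row[key[j]] for j in range(width)) for row in rows)

plaintext = "THISISJUSTANEXAMP" * 20        # 340 letters, 17 per row
key = [3, 0, 16, 7, 1, 12, 5, 9, 2, 14, 6, 11, 4, 15, 8, 13, 10]
print(rearrange_columns(plaintext, 17, key)[:17])
```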

One of the main problems with columnar transpositions is that there are just too damn many ways of doing it. I’ve tested all possible permutations of up to 7 columns, which was already 2! + 3! + 4! + 5! + 6! + 7! = 5,912 different ways. Then I realized that if he did rearrange the columns, it was probably across all 17 of them. And you have 17! = 355,687,428,096,000 possible ways of doing that. As you can see, brute-forcing it won’t be possible. I thought that I could filter the candidates by throwing away all ciphers that don’t increase the number of bigram counts over the original. Well, it turns out that even with 8-column rearrangements, a full 5% of the resulting ciphers had higher bigram counts. Extrapolating to 17… well, even 1% of 17! is still a huge number, whatever it is. :)
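
For reference, the counts and the bigram filter look roughly like this (a sketch, not my actual test code):

```python
# The permutation arithmetic from above, plus a sketch of the "keep only
# column keys that raise the bigram repeat count" filter.
from itertools import permutations
from math import factorial

print(sum(factorial(k) for k in range(2, 8)))    # widths 2..7 -> 5,912 keys
print(factorial(17))                             # 355,687,428,096,000

def bigram_repeats(text):
    pairs = [text[i:i + 2] for i in range(len(text) - 1)]
    return len(pairs) - len(set(pairs))

def keys_raising_bigrams(cipher, width):
    base = bigram_repeats(cipher)
    rows = [cipher[i:i + width] for i in range(0, len(cipher), width)]
    for key in permutations(range(width)):
        candidate = "".join("".join(row[j] for j in key) for row in rows)
        if bigram_repeats(candidate) > base:
            yield key
```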

So finally I had the idea of doing a simultaneous hill-climb over homophonic substitutions *and* 17-column transpositions together, to see if it could be doable. I tried modifying the zkdecrypto-lite source code, but I was spending too much time trying to understand what it was doing in several places (probably later optimizations, which obfuscated the original algorithm). So I ended up writing my own hill-climber, which was surprisingly easy, but it was way too finicky due to the multiple parameters involved in solving a given cipher. I would get it to solve one cipher reliably, but when I tried it on a different cipher, it would hang up too often. After adjusting the parameters, it would solve the new cipher, but then it would start getting confused on the previous one. So I implemented the Simulated Annealing algorithm, and it is currently my go-to method, as it is fast and pretty robust and doesn’t need to know anything about homophone distributions (i.e. how many homophones to allocate per plaintext letter).
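
For anyone curious, the core of it really is small; a bare-bones sketch of the substitution-only annealer (the scorer and cooling schedule are placeholders, not my exact settings):

```python
# Bare-bones simulated annealing over a homophonic substitution key.
# score_plaintext is any n-gram scorer (higher = more English-like).
import math, random, string

def solve_sa(cipher, score_plaintext, steps=200_000, t_start=25.0, t_end=0.1):
    symbols = sorted(set(cipher))
    key = {s: random.choice(string.ascii_uppercase) for s in symbols}
    decode = lambda k: "".join(k[s] for s in cipher)
    current = best = score_plaintext(decode(key))
    best_key = dict(key)
    for i in range(steps):
        t = t_start * (t_end / t_start) ** (i / steps)   # geometric cooling
        s = random.choice(symbols)                       # neighbor: remap one symbol
        old = key[s]
        key[s] = random.choice(string.ascii_uppercase)
        new = score_plaintext(decode(key))
        if new >= current or random.random() < math.exp((new - current) / t):
            current = new
            if new > best:
                best, best_key = new, dict(key)
        else:
            key[s] = old                                 # undo rejected move
    return best_key, best
```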

So I was finally able to try my idea of doing 17-column transpositions at the same time as letter substitutions, but Simulated Annealing just does too good a job of optimizing the result. I created a test cipher, and it routinely comes up with solutions that score much better than the actual plaintext. Which makes sense – if you shuffle 17 columns long enough, you can probably rearrange the letters in such a way that they spell all sorts of things that score highly. But that’s with 6-grams. I believe it might just work with 7- or 8-gram stats from a bigger corpus. So either I need to get more memory for my computer, or find a way to fit higher N-grams into memory.

I also implemented the Beam Tree Search from Malte Nuhn’s recent paper, but I couldn’t get it to solve even Z408. He must be doing something else in his algorithm that he didn’t mention in the paper. Even thinking logically, I can’t actually see how it would solve any higher-multiplicity ciphers without either a lot of luck being the deciding factor, or extensive manual analysis and multiple restarts. I’ve also implemented Monte Carlo Tree Search from a different paper, but it quickly gobbles up all available memory before it gets even close to solving Z408, and I allowed it to go all the way up to 4 GB. I also tried a Genetic Algorithm, but it’s much slower than Simulated Annealing, and it gets stuck on local maxima way too easily. It does solve Z408, but it takes a lot of restarts and each restart takes much longer to finish, compared to SA. So far, Simulated Annealing is the king, at least in my book. The only 2 algorithms that I haven’t tried implementing are Bayesian Inference from Kevin Knight’s paper and the Expectation-Maximization (EM) algorithm from the paper I mentioned above. I tried to understand them, but the maths involved are just way above my pay grade :). I so wish that people would include the actual source code implementing their methods, in addition to describing them mathematically, so that not only scholars in the same field but also laymen can use those methods for practical purposes. Is everyone just too shy about the quality/clarity of their code?! I just wonder what’s the point of sharing your idea with the world if only a select few can understand it well enough to use it to solve real-world problems? And I really don’t want to spend several weeks reading statistics textbooks just to be able to understand a method, so that I can implement it, and finally confirm that it doesn’t actually help me with my problem. But I digress. :)

 
Posted : July 29, 2015 3:16 am
Jarlve
(@jarlve)
Posts: 2547
Famed Member
Topic starter
 

Thanks for the indexing idea.

About columnar transposition (for a 17×20 grid). I’ve done some work on this. I don’t think it was applied after homophonic encoding, because when done properly it usually destroys bigram counts. I’ve also done another test with my non-repeats, which is a little bit more vague, and it tested negative. I may be wrong, but if it is the case then I believe it was done before encoding, so yeah, I certainly don’t want to discount the possibility at the moment.

Somewhere at the end of last year I wrote a secondary hill-climber for columnar transposition w/o any additional intelligence. I was able to retrieve columnar transpositions of up to 10 columns with a search-space reduction factor of about 3 or 4. Not very impressive; the problem is that it’s very hard to hill-climb, the solution is very spiky. It sits on a telephone pole, in hill-climber terms. I’m not sure if higher-order n-grams will help much. One idea I have is to apply a special anagram algorithm to the rows and combine the scores. Another is to use some form of skip-grams, where not every letter has to be a match.

I’ve also seen some higher scoring results from random columnar transposition experiments on the 340, but that was with 4-grams from a limited corpus. I should try it with my new solver, it’s much less prone to score variations. I believe in 5 years or so we will be able to brute force the permutations.

Thanks for the Malte Nuhn link. I’ve tried beam search for a while, even extended the idea to using cones and whatnot. It’s essentially something in between stochastic and steepest-ascent hill climbing. I currently believe it can’t compete with hill-climbing algorithms like Simulated Annealing, but I may want to try a couple more ideas in the future.

Math is supposed to be a universal language, right? Often, what is actually going on is not difficult at all, it’s just in another language.

I think you’re doing great work btw!

AZdecrypt

 
Posted : July 29, 2015 12:34 pm
Jarlve
(@jarlve)
Posts: 2547
Famed Member
Topic starter
 

Just want to note that smokie’s latest work on the ciphers in viewtopic.php?f=81&t=267&start=160 (his last post) also seems to add evidence against columnar transposition being applied after encoding for the 340. It seems to juggle the cycles around, though it’s certainly not conclusive until someone decides to do a really big test to see what actually happens to the cycles with columnar transposition after encoding.

AZdecrypt

 
Posted : July 29, 2015 1:13 pm
doranchak
(@doranchak)
Posts: 2614
Member Admin
 

So either I need to get more memory for my computer, or find a way to fit higher N-grams into memory.

I’ve been thinking about that problem, too, especially when considering Google’s humongous n-gram datasets (available for download here http://storage.googleapis.com/books/ngr … etsv2.html). I think the Google n-gram data would be very useful for cribbing, particularly when you are looking at sections of cipher text that have many repeated symbols ( http://www.zodiackillerciphers.com/?p=144 ). But how to deal with the massive amount of data? One way could be an array of solid state drives, with something like BigTable, Hadoop or Apache Cassandra handling distributed searches. Might not be as fast as RAM but still much faster than traditional disk I/O. Or maybe there is a similar method for pooling RAM across multiple computers, treating them all as one huge RAM array.

I also wonder about compressing the n-gram data in RAM. A byte can hold 256 different values, but we only care about 26 different values for each letter of an n-gram. Can we apply something like modular arithmetic to encode more than one letter per byte? Or perhaps a more general kind of RAM compression can be applied to increase the effective size of RAM.
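
For instance, treating an n-gram as a base-26 number gets each letter down to about 4.7 bits; a rough sketch of that idea:

```python
# Sketch of the base-26 idea: pack an n-gram into a single integer, so each
# letter costs log2(26) ~ 4.7 bits instead of a full byte. An 8-gram fits in
# 38 bits (5 bytes) this way.
def pack(ngram):
    value = 0
    for ch in ngram:
        value = value * 26 + (ord(ch) - ord('A'))
    return value

def unpack(value, n):
    letters = []
    for _ in range(n):
        value, r = divmod(value, 26)
        letters.append(chr(ord('A') + r))
    return "".join(reversed(letters))

idx = pack("THEREIN")
print(idx, unpack(idx, 7))   # round-trips back to THEREIN
```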

I so wish that people would include the actual source code implementing their methods, in addition to describing them mathematically, so that not only scholars in the same field but also laymen can use those methods for practical purposes. Is everyone just too shy about the quality/clarity of their code?! I just wonder what’s the point of sharing your idea with the world if only a select few can understand it well enough to use it to solve real-world problems?

Have you contacted the authors? They are often glad to share. But I guess many who do the actual coding are graduate students, and they feel pressure to keep their source code for different reasons (future projects, commercialization, or maybe just not wanting to document/clean up the code). Personally, when people ask for my code, I feel automatically obligated to clean it up and make it presentable, because it frequently isn’t. And that takes time and effort. Still, I find that many researchers are happy to share, especially if you give them a good reason or show them how interested you are in their work.

http://zodiackillerciphers.com

 
Posted : July 29, 2015 2:39 pm
Page 3 / 5