Is there a need for an OpenSource solver?

Largo · 2018-10-04T00:06:31Z

Hi, I've spent a lot of my spare time working on z340 and my tools over the last few weeks. From now on, other things have priority and I will spend less time with z340. I will not stop working on z340, but I have only a limited amount of time. My next attempts will be limited to a lot of automatic tests that can simply run in the background. It doesn't take much time to implement them. As already mentioned in other threads, I will now concentrate only on simple manipulations and interlocked ciphers. I have put a lot of work into my solver. Thanks again to Jarlve who gave me a lot of good tips and always had time for questions. Now it would be a pity if this work could not be used by others. Long story short: Are there people here who are interested in the source code? I am willing to share my work on github or bitbucket. If so, I would refactor it, comment it, and sweep it in the corners again. However, I only want to make this effort if there is enough interest. I will continue to work on the solver and would be happy if other people could contribute as well. What exactly does the project offer? - Solver for homophonic substitutions, developed in C#. - For a 340 character long cipher with 500000 iterations about 480 milliseconds are needed to find a usable solution (i7 8700 single thread). - The solverate is between 25 and 80% per restart, depending on the complexity of the cipher (one restart takes the 480 milliseconds just mentioned). - I support multithreading, but have little experience in this area. My code works and many solves can be done in parallel. However, my implementation may not be the most skillful. - There is source code that can be used to automatically create nGram files from any text source. You only have to specify a list of text files, the rest is done automatically. This is useful if you want to create your own nGram corpus. - So far only 5grams are supported. I have deliberately refrained from keeping the code generic. I don't like that, but it gives a huge performance gain. Before I clean up my source code, writing comments and create a public git repository: Is anybody interested?

f.reichmann

(@f-reichmann)

Posts: 30

Eminent Member

Thank you for submitting your code. In particular, it has the comments exactly where needed.

For the algorithm from reading the code, I understand:
– The provided 5-gram frequency table is read into memory, and initialized to the logarithm of their empirical likelihoods. Unseen n-grams in corpus get a default of 1/(100*#samples in corpus), to avoid the logarithm of 0.
– The score is computed as the sum of logarithms from the table, multiplied by (1+IC*ICweight), and ICweight is set to 7.
– The new score is accepted, if it is higher, or on a random basis if a random number is in exp(newscore-oldscore/temperature). temperature starts at 27, and decays in even quota of a cooling rate, which is initialized as target iterations at a value of 50000, down to 1, so that it becomes exponentially less likely to accept a downhill step with attempts getting closer to the target iterations.
– Letter frequencies are kept to the language standard by choosing random from a table with a standard distribution
– Next round, until temperature sinks to 1 or below, then print result.

Questions:
* How about to move to standard ISO C++ in order to keep the code portable?
* How about including the code to build the frequency table from input text, so that it can be applied to historic text, or different languages? Including the hard-coded letter frequencies table.
* How about giving priority to readable code, instead than fast code? After 50 years, execution speed of the successful code will be of low relevance. But designing the algorithms will be crucial. We could take advantage of the STL.
* What is the motivation behind the ICweight of 7? Why not a different value?
* What is the motivation behind choosing 1/100 of corpus samples for the likelihood of unseen n-grams, why not a different value?
* The algorithm plays two honeypots against each other. Sum of logs of likelihoods of 5-grams makes it attractive to collect the most frequent words in repetition, ideally self-overlapping (like THESETHESETHESE). IC is essentially the unigram distribution skewness. So it looks for the most popular 5-grams in a string that complies to unigram skewness, but without comparing to the unigram distribution itself. The link between the two is 7.

Although it proves to work and zkdecrypto has a similar approach with including entropy, I do not fully understand why it must work. The algorithm gets its justification from that it worked for what it was tried with. Making the algorithm work well for what we anticipate, may make it blunt for the uncertain that it shall find. That is a generic challenge, and seems impossible to avoid. However, I feel we should document the design criteria of the algorithms, to understand what they might be blind for.

Posted : April 21, 2019 2:30 pm

f.reichmann

(@f-reichmann)

Posts: 30

Eminent Member

A documentation of experiments on how to provide a better scoring algorithm for solving.

Disclaimer at the beginning: None of them is superior to the heuristic approaches that are known from zkdecrypto, HomophonicSolver, or AZDecrypto (as much as visible). Just that the mathematical reasoning sounds more solid at least in my ears.

Few observations:
1. Only for the letter E the n-gram frequency is double-digit. For 3-grams, only THE is above 1%. So that we are in a "small p" scenario.
2. The ciphers of focus are several hundred letters long, 340 and 408.
3. IoC and Entropy are stable numbers with low variance. Entropy for bigrams is lower than the double of unigrams, etc, reflecting the next letter to statistically depend on the previous. IoC and Entropy would happily have the same values if e.g. X and E were swapped, so that it might be more attractive to directly compare to the distribution.
4. Long n-grams naturally simulate the conditional likelihood for following letters for shorter n-grams, and starts of new words.

Attempt 1: Chi2 statistic over all of 1-grams, 2-grams, 3-grams, 4-grams, 5-grams.
In short, the (b-x)^2/x sum leads to the scoring relevant function
b*(b-2)/x
(with b as observed count, and x as expected) to compare between new and old, and fails miserably, due to the low likelihoods associated to every n-gram. For good results, the expected value is 1-5, depending on literature. With sub-promille likelihoods, and 340 characters, the n-grams need to be summed in baskets to achieve an expected value above 1, which makes the tool blurry.

Attempt 2: Poisson Distribution
Very attractive, because Poisson has only one input, the expected mean, which is easy to compute. The conditions of small p (below 5%), and large n (above 50) are met, and violated for the fact that the next letter depends on the previous (which is a local effect). Calculating the likelihood of an observed distribution can be simplified:
log(P(X=B))
=Sum(all possible N-Grams, log(P(X=bi))
=Sum(all possible N-Grams, log(P(X=bi))-Sum(all possible N-Grams, log(P(X=0))+Sum(all possible N-Grams, log(P(X=0))
=Sum(all observed N-Grams, log(P(X=bi))-Sum(all observed N-Grams, log(P(X=0))+Sum(all possible N-Grams, log(P(X=0))
In the last line, the last term is constant, and can be left out between iterations, drastically reducing the summation to run only over the observed n-grams, instead of all in the corpus. The first term is as well what zkdecrypto, HomophonicSolver, and to my knowledge AZDecrypto do from a heuristic approach, to sum up the log of the likelihoods. The second value is missing in all scoring functions of solvers that I have seen so far.
Putting in the Poisson definition, the scoring function is
b*log(p)-log(b!)
summed over all observation counts b and their expected likelihoods p from the corpus. n-grams that have never been seen can be said to have not been seen in #samples in the corpus, maybe they would be seen one in another #samples as a minimum knowledge argument, meaning the likelihood for double of samples would be 1/2, and put that into Poisson means the minimum log likelihood is log(2)/#samples (which I find better reasoned than ad-hoc of 1/100). Current solvers miss the "-log(b!)". For long n-grams, b is typically 1, so that the error is negligible, and only relevant for unigrams.
The result is a slow convergence, for ciphers long enough and not too many cipher symbols. Works in principle.

Attempt 3: Gauß distribution
Almost as attractive and more generic, the Gauß distribution is the asymptotic behaviour of the Binomial distribution for large n and large p. The approximation for the Binomial distribution á la Laplace and Gauß is B(n,p,k)=1/sqrt(2*PI*n*p*q)*exp(-0,5*(k-np)^2/(n*p*q)), and leads to the scoring function
b*(2*n*p-b)/(n*p*(1-p))
where n is the length of the cipher minus n-gram length plus one. Sorry for double-booking n here. And works only for short n-grams, if at all. Little surpise retrospecively, because the Laplace approximation of the Binomial distribution is only good for n*p*q>9, and this condition is almost always violated in the frequencies of the language corpus, except in unigrams.

Attempt 4: Binomial distribution
B(n,p,k)=(n over k)*p^k*q^(n-k). There is no relevant simplification possible, so that I just downloaded the Apache math library, called the logBinomial, ét voila, it more or less works at least slowly. Tendency to swap letters to increase the likelihood but look less readable. Computationally heavy in particular in Java, but who cares after 50 years.
Seems unigrams are more or less following a Binomial distribution.

So long story short, some comments on the solver functions:
1. The ad-hoc algorithms documented in zkdecrypto, HomophonicSolver, and maybe AZDecrypto, receive their late justification with being equivalent to performing a Poisson type likelihood calculation in logarithm, and all seem to lack reducing the score with log(b!), which is important only if b is anticipated above 1, like in unigrams. Possible due to the small error and practically only relevant for short n-grams, this has not been corrected through try and err so far.
2. None of my attempts above works better than the ad-hoc methods that have been discussed here before (zkdecrypto, AZDecrypto), on the standard texts I am using (Martin Luther King, and other).
3. It is possible to build un-biased scoring functions from statistical arguments, rather than engineering with try-and-err-and-tune. This gives myself more trust, because less fear to implicitly take assumptions that may not be true, and hinder the solution.

Personally I think it is worth investing thoughts into the scoring function. If there was a way to solve with only unigrams, then the transposition discussion could kick off from a plain-text cross-word puzzle, instead of wild-guessing on an encrypted one. The less assumptions, explicit or implicit, are in the scorer function, the broader it will find the solutions.

And last but not least: If homophonic solvers do not solve and otherwise solve pretty much anything else, then chances are it is not a homophonic cipher. I am sure this insight was had before already.

What I could do next is to compute the full histograms of the corpus, and compare it to Poisson and Binomial, to justify which one to use.

Posted : April 22, 2019 1:46 am

Largo

(@largo)

Posts: 454

Honorable Member

Topic starter

Hi f.reichmann,

I’m just taking a break from z340 again, so only a few short answers for now. But thank you in advance for your interest and the good suggestions you made:

For the algorithm from reading the code, I understand:
…
…

That’s a very good summary. The values for the start temperature of 27 and the target iterations of 500000 have been found to be quite useful by various tests. Depending on the cipher other values may be useful. z408 can be solved with only a few iterations (40k) (if you are lucky).

How about to move to standard ISO C++ in order to keep the code portable?

That’s a good idea. I turned my back on C++ years ago, but I can’t deny that it makes sense in terms of portability.

How about including the code to build the frequency table from input text, so that it can be applied to historic text, or different languages? Including the hard-coded letter frequencies table.

That’s a very good suggestion. Generally nothing should be hardcoded at all, everything should be configurable. But then we should be able to save the generated tables as a text file, so that they don’t have to be rebuilt every time we start the solver. Also, we should allow to completely disable the use of the frequency tables.

How about giving priority to readable code, instead than fast code? After 50 years, execution speed of the successful code will be of low relevance. But designing the algorithms will be crucial. We could take advantage of the STL.

I think the speed of execution has high priority. I ran tests that took a few days and tested millions of permutations / transpositions. A few milliseconds difference per run of the solver can make up days or weeks at the end. The "core" of the solver is also manageable and has comparatively few lines of code. In my opinion, detailed comments are sufficient. I haven’t worked with the STL for a long time. I remember it as very performant, but manually optimized low-level code should be faster. But I gladly let the opposite convince me.

What is the motivation behind the ICweight of 7? Why not a different value?

Jarlve recommended this value to me. I seem to remember that it was created by empirical testing and proved to be useful.

What is the motivation behind choosing 1/100 of corpus samples for the likelihood of unseen n-grams, why not a different value?

I’m not sure what you mean. Can you please post the relevant position in the code?

Making the algorithm work well for what we anticipate, may make it blunt for the uncertain that it shall find. That is a generic challenge, and seems impossible to avoid. However, I feel we should document the design criteria of the algorithms, to understand what they might be blind for.

You’re absolutely right about that. However such a solver is by nature very target-oriented. Assuming that z340 is a homophonic substitution with a certain "extra", you align the solver to that assumption. In my opinion, generic approaches only work to a certain degree here. Since I can’t describe it any other way, here’s a metaphor:
Imagine walking through an orchard looking for fruit. Ideally you would like apples or maybe pears. If you come across a plum tree, you can also be satisfied. You may not have expected it or may not even know this type of fruit, but you have reached your goal (to find fruit). However, if there are no trees on the meadow, you will starve. The strawberries that grow on the ground are overlooked because you only look upwards.
Stupid comparison, I know. But a certain limitation of the expected result has to be made.

I hope that I will soon have more energy for z340 again.

Translated with http://www.DeepL.com/Translator

Posted : April 22, 2019 3:23 pm

Jarlve

(@jarlve)

Posts: 2547

Famed Member

A documentation of experiments on how to provide a better scoring algorithm for solving.

Thanks for your extensive post. Most of the stuff you talk about is over my head but I am interested.

1. The ad-hoc algorithms documented in zkdecrypto, HomophonicSolver, and maybe AZDecrypto, receive their late justification with being equivalent to performing a Poisson type likelihood calculation in logarithm, and all seem to lack reducing the score with log(b!), which is important only if b is anticipated above 1, like in unigrams. Possible due to the small error and practically only relevant for short n-grams, this has not been corrected through try and err so far.

How the AZdecrypt scoring function works:

1. Sum the n-gram logs of the solution plain text into a score (higher is better).
2. Divide the score by the total amount of n-grams in the solution plain text (normalization).
3. Divide the score by the IOC in such a way that a higher IOC lowers the score to keep the solution plain text away from stuff like "THESETHESETHESE" as you mentioned.

Can you explain in layman terms what exactly is missing? What is log(b!) and where does it fit in?

Thanks!

AZdecrypt

Posted : April 28, 2019 10:29 pm

f.reichmann

(@f-reichmann)

Posts: 30

Eminent Member

Hi Jarlve,

thank you for your summary. I saw that you released the AZD source code. My deep respect, and thankfulness to you. I still need to analyse and understand, and your summary of the algorithm helps me a lot in that.

To motivate the log(b!), I have a mathematical argumentation asking for the maximum likelihood of an observed n-gram in a real text. Lets assume that we have a huge dictionary, and we know the statistical distribution of all possible n-grams (for now, let’s assume with no n-gram occurring 0 times, for simplicity. I will work around the issue later that some n-grams never occurr).

Let’s take n-grams that are not too short, then each of these have a low (e.g. 5-grams, the most likely is one percent, and the rest drops off to far below) likelihood of occurrence. And the cipher is 340 characters long, so we make 340-n "experiments". In statistics, to calculate the distribution, there is an approximation called Poisson. It says, that for a experiment of likelihood p_i, that is repeated N times, the likelihood of observing b_i occurrences of the n-gram number i, equals to P(X=b_i)=(((N*p_i)^b_i)/b_i!)*exp(-N*p_i). See https://en.wikipedia.org/wiki/Poisson_distribution for a nicer formatting of the formula. This approximation is valid for independent events that occur seldom, with many repeated attempts, and n-grams (except for 1-grams, and except we are not fully independent) more or less comply to these conditions.

Now, to observe the n-grams b_1, b_2, b_3, … b_(N-n), where N=340 for Z340, we would need to take a product over all i of P(X=b_i)=((N*p_i)^b_i/b_i!)*exp(-(N*p_i)), product over all n-grams in our dictionary. Numerically, this will be a tiny number, and unstable in floating point arithmetic. So let’s take the logarithm of it to turn it into a sum of log((((N*p_i)^b_i)/b_i!)*exp(-(N*p_i))), for all n-grams in our dictionary.

Because log(a*b)= log(a)+log(b), and log(a^b)=b*log(a), and log(exp(a))=a, this can be simplified to the sum over all i of log(P(X=b_i))=b_i*log(N)+b_i*log(p_i)-log(b_i!)-(N*p_i).

Now, lets look at these terms more closely. log(N) is a constant, depending on the cipher text length. The sum over all b_i is a constant equal to N-n (cipher length minus the length of the n-gram). So that the first term b_i*log(N) is constant, and for comparing two scores, we do not need to compute it.

The last term as sum over -(N*p_i) is a constant, and equals to -N, because the sum over all p_i must be 1. In a comparison between scores, we do not need to compute it.

Remains the sum of +b_i*log(p_i)-log(b_i!) as scoring function, over all i in the dictionary. For the very most of n-grams, b_i will be equal to zero, and because the faculty of zero equals to 1, and the logarithm of 1 is 0, the term of the function is 0 every time that b_i is 0. Good news, it means we do not need to sum up all entries in the dictionary, but only let i iterate over the n-grams that we observed in the dictionary.

The term b_i*log(p_i) is what HomophonicSolver, zkdecrypto, and AZD (to my understanding so far) do, before putting more statistical metrics like chi^2, IoC, and entropy to prevent the solution to repeat highly frequent n-grams, like THETHETHE in english. Despite the solver algorithms have been found heuristically to my knowledge, the summation of log(p_i) I see justifiable from a statistical mathematical point of view with above arguments.

But the -log(b_i!) I have seen in no implementation so far. And because b_i will be 1 in the very most cases and the logarithm of 1 is 0, it should not matter for the score of the true solution, but it will matter for an intermediate solution that repeats an n-gram too often, because it will punish THETHETHE when THE occurs more often than expected. Which is exactly the weakness in the summation that HS, zkd and AZD work around with chi^2, IoC and entropy, all of which effectively add an emphasis of the 1-gram statistic, using what looks like an experimental ad-hoc formula in order to balance. Including -log(b_i!) in the sum looks like the more systematic approach to me.

So far it sounds great, to my ears. Unfortunately my results are less shiny in practise. I tried the above, building the sum over all 1,2,3,4,5-grams. Apart from being slow in my Java code, it converges to only something similar like the solution, and I believe because the high standard deviation of the higher n-grams tend to "pollute" the 1-grams. The score function delivers better results when summing 1-grams, and 5-grams only, and in particular when giving the 1-grams 7.0 (copying Homophonic Solver) times the weight of the 5-grams. But then my hill-climber implementation is not yet as mature to find the global maximum, it gets stuck at a local maximum, so that I will copy the hill-climber implementation as well.

I want to try to justify the value of 7.0 as well, maybe with optimising the signal (the p_i) to noise (the std deviation, N*p_i*(1-p_i)). In the meanwhile, I expect the heuristic algorithms are more neutral to the specific solutions that I believed they would be at first glance. I still would like to have an algorithm that is understood to be neutral, which means it has no factors involved that have been found with try and err and tune. 7.0 is the last one to get rid of.

Now to the question what to do with n-grams that are never seen in the dictionary. I suggest to use a "zero knowledge" argument. I have not seen it in the dictionary of the given size M, with M being the number of letters in the dictionary, and assumed to be very large. If p is the likelihood of occurrence of the n-gram, then P(X=0)=exp(-M*p), with above Poisson distribution. I do not know if it might not occur in a dictionary of the double size, but if so, I expect it once at most, and estimate my chances to no better than 50:50, because I actually have no clue and no information at all. Then I estimate P(X=0)>1/2, which means p<log(2)/M. So that I put the likelihood of a n-gram that I have never seen, to log(2)/M. If I want to do better than that, then I need bigger dictionaries, which makes sense.

A sample run of the above algorithm mimicking the 408 with an even distribution of the cipher symbols is attached here:

Version 1.5.2019 11:43
Parallel threads: 1
Iterations per thread to run: unlimited
Filter for language alphabet: [^a-z]+
Filter to limit letters on inner disk: [a-z]+
Reading /home/fritz/git/jDecryptor/Texts/OANC.txt
Collected 68577404 characters of text sample.
Counting each of 1-grams with weight 7.0: 26 distinct found.
Counting each of 5-grams with weight 1.0: 1179370 distinct found.
Debug mode output start
Solution: COPIED,"[a-z]+",{[a:Pp+][b:0Qq][c:1Rr][d:2Ss][e:3Tt][f:4Uu][g:5Vv][h:6Ww][i:7Xx][j:8Yy][k:9Zz][l:Aa][m:Bb][n:Cc][o:Dd][p:Ee][q:Ff][r:Gg][s:Hh][t:Ii][u:Jj][v:Kk][w:Ll][x:Mm][y:Nn][z:Oo]}
Encrypted: xa793z7aAxCvEtDeAtQ31+Jh37ixhHdBJRwUjcXixHBdGTUjCI6pc9XaAXCVL7A2vpb37ciW3UDgtHi0t1PJhTbpc7hi6tbdhi2pC5tGdJhPCxBpaD4PaAIDz7aaHdB3iwXC55xktHB3IwtBDHII6GxaA7CvTMeTgxtCrtxixHtk3cQTIitGIWPc53IiXC5NDjGGdrZhduUl7iW+VxGaIw3QThIEPGId47IxHI6pILW3cxS7TxLXAaqtgtQDgCxce+G+S7RtPCSPAaIWtx6+k397aAT2LxAAQT1dBTbnHapK3hxlxaaCDIv7KTNdjBNCPBt0T1pJhTnDJL7aAiGNiDhaDl2DLcdghideBN1dAAT1I7C5duhA+kTHUDgBnPuITGAX4TTqTDgXtiTBTiW6e7i7
Cipher to operate: xa793z7aAxCvEtDeAtQ31+Jh37ixhHdBJRwUjcXixHBdGTUjCI6pc9XaAXCVL7A2vpb37ciW3UDgtHi0t1PJhTbpc7hi6tbdhi2pC5tGdJhPCxBpaD4PaAIDz7aaHdB3iwXC55xktHB3IwtBDHII6GxaA7CvTMeTgxtCrtxixHtk3cQTIitGIWPc53IiXC5NDjGGdrZhduUl7iW+VxGaIw3QThIEPGId47IxHI6pILW3cxS7TxLXAaqtgtQDgCxce+G+S7RtPCSPAaIWtx6+k397aAT2LxAAQT1dBTbnHapK3hxlxaaCDIv7KTNdjBNCPBt0T1pJhTnDJL7aAiGNiDhaDl2DLcdghideBN1dAAT1I7C5duhA+kTHUDgBnPuITGAX4TTqTDgXtiTBTiW6e7i7
Cipher symbols count: 54
Cipher symbols : [+, 0, 1, 2, 3, 4, 5, 6, 7, 9, A, B, C, D, E, G, H, I, J, K, L, M, N, P, Q, R, S, T, U, V, W, X, Z, a, b, c, d, e, g, h, i, j, k, l, n, p, q, r, t, u, v, w, x, z]
Random fraction: 2
Language alphabet: [a, b, c, d, e, f, g, h, i, j, k, l, m, n, o, p, q, r, s, t, u, v, w, x, y, z]
Language alphabet Size: 26
Decrypted: Score: -1,958561762397272E4 : ilikekillingpeoplebecauseitissomuchfunitismorefunthankillingwildgameintheforestbecausemanisthemostdangerousanimalofalltokillsomethinggivesmethemostthrillingexperienceitisevenbetterthangettingyourrocksoffwithagirlthebestpartofitisthatwhenidieiwillbereborninparadiceandalltheihavekilledwillbecomemyslavesiwillnotgiveyoumynamebecauseyouwilltrytoslowdownorstopmycollectingofslavesformyafterlifeebeorietemethhpiti
Test mode output end
Wed May 01 11:44:17 CEST 2019: Executor initialization completed.
Wed May 01 11:44:17 CEST 2019 Climber 0 Iterations: (1,0) Score: -2,367265676745303E4 RANDOM,"[a-z]+",{[a:][b:Sx9L][c:Pp][d:BU][e:A5J][f:EIjzK][g:g][h:Z][i:teV][j:a][k:Rl][l:bcTGh][m:7][n:12Ci][o:03WH][p:DX][q:Q][r:dN][s:][t:w+][u:v][v:4][w:qu][x:rkn][y:M][z:6]} bjmbofmjebnufipieiqontelomnblordektdflpnbodrlldfnfzclbpjepnibmenuclomlnoodpgionoincelllclmlnzilrlnncneilrelcnbdcjpvcjefpfmjjordontpneebxiodoftidpoffzlbjemnulyilgbinxibnboixolqlfnilfocleofnpnerpfllrxhlrwdkmnotibljftoqllffclfrvmfbofzcfboolbbmlbbpejwigiqpgnblitltbmkicnbcejfoibztxobmjelnbbeeqlnrdllxojcfolbkbjjnpfumflrrfdrncdiolncellxpebmjenlrnpljpknpblrglnridrnreelnfmnerwletxlodpgdxcwfllepvllwlpgpinldlnozimnm
...
Wed May 01 11:45:27 CEST 2019 Climber 0 Iterations: (68,390) Score: -1,952040649711423E4 COPIED,"[a-z]+",{[a:Pp+][b:0Q][c:1Rr][d:2S9][e:3Tt][f:Uu][g:5V][h:q6Ww][i:7Xx][j:][k:Z][l:Aa][m:B][n:Cc4][o:Dd][p:Ee][q:][r:bGg][s:vHh][t:IiK][u:Jj][v:k][w:zLl][x:M][y:Nn][z:]} ilidewillinspeoplebecauseitissomuchfunitismorefunthandillingwildsareintheforestbecauseranistherostdangerousanimalonalltowillsomethinggivesmethemostthrillinsexperienceitisevenbetterthangettingyourrocksoffwithagirlthebestpartonitisthatwhenidieiwillhereborninparadiceandalltheihavedilledwillbecomeryslatesiwillnotsiteyoumynamebecauseyouwilltrytoslowdownorstopmycollectingofslavesformyafterlineeheorietemethhpiti

With the real 408 with its much more inhomogeneous distribution, my implementation of the hill-climber has difficulties converging. I need to start 8 threads, for getting a quicker result, and thread 5 did was the quickest today:

Version 1.5.2019 11:43
Parallel threads: 8
Iterations per thread to run: unlimited
Filter for language alphabet: [^a-z]+
Filter to limit letters on inner disk: [a-z]+
Reading /home/fritz/git/jDecryptor/Texts/OANC.txt
Collected 68577404 characters of text sample.
Counting each of 1-grams with weight 7.0: 26 distinct found.
Counting each of 5-grams with weight 1.0: 1179370 distinct found.
Reading cipher from file /home/fritz/git/jDecryptor/Ciphers/408.zodiac.txt
Cipher from disk : 9%P/Z/UB%kOR=pX=BWV+eGYF69HP@K!qYeMJY^UIk7qTtNQYD5)S(/9#BPORAU%fRlqEk^LMZJdrpFHVWe8Y@+qGD9KI)6qX85zS(RNtIYElO8qGBTQS#BLd/P#B@XqEHMU^RRkcZKqpI)Wq!85LMr9#BPDR+j=6N(eEUHkFZcpOVWI5+tL)l^R6HI9DR_TYrde/@XJQAP5M8RUt%L)NVEKH=GrI!Jk598LMlNA)Z(PzUpkA9#BVW+VTtOP^=SrlfUe67DzG%%IMNk)ScE/9%%ZfAP#BVpeXqWq_F#8c+@9A9B%OT5RUc+_dYq_^SqWVZeGYKE_TYA9%#Lt_H!FBX9zXADd7L!=q_ed##6e5PORXQF%GcZ@JTtq_8JI+rBPQW6VEXr9WI6qEHM)=UIk
Cipher to operate: 9%P/Z/UB%kOR=pX=BWV+eGYF69HP@K!qYeMJY^UIk7qTtNQYD5)S(/9#BPORAU%fRlqEk^LMZJdrpFHVWe8Y@+qGD9KI)6qX85zS(RNtIYElO8qGBTQS#BLd/P#B@XqEHMU^RRkcZKqpI)Wq!85LMr9#BPDR+j=6N(eEUHkFZcpOVWI5+tL)l^R6HI9DR_TYrde/@XJQAP5M8RUt%L)NVEKH=GrI!Jk598LMlNA)Z(PzUpkA9#BVW+VTtOP^=SrlfUe67DzG%%IMNk)ScE/9%%ZfAP#BVpeXqWq_F#8c+@9A9B%OT5RUc+_dYq_^SqWVZeGYKE_TYA9%#Lt_H!FBX9zXADd7L!=q_ed##6e5PORXQF%GcZ@JTtq_8JI+rBPQW6VEXr9WI6qEHM)=UIk
Cipher symbols count: 54
Cipher symbols : [!, #, %, (, ), +, /, 5, 6, 7, 8, 9, =, @, A, B, D, E, F, G, H, I, J, K, L, M, N, O, P, Q, R, S, T, U, V, W, X, Y, Z, , ^, _, c, d, e, f, j, k, l, p, q, r, t, z]
Random fraction: 2
Language alphabet: [a, b, c, d, e, f, g, h, i, j, k, l, m, n, o, p, q, r, s, t, u, v, w, x, y, z]
Language alphabet Size: 26
Wed May 01 11:53:43 CEST 2019: Executor initialization completed.
Wed May 01 11:53:43 CEST 2019 Climber 5 Iterations: (1,0) Score: -2,629932705334498E4 RANDOM,"[a-z]+",{[a:L][b:Uj][c:BRT][d:#f][e:!^][f:KN][g:F][h:A][i:][j:p][k:QE6J][l:@][m:][n:YZ][o:5][p:kM/][q:dW+][r:q=_][s:7][t:ct][u:Iz][v:][w:r(8][x:eX9][y:D)][z:PS%VGHlO]} xzzpnpbczpzcrjxrcqzqxzngkxzzlfernxpknebupsrctfknyoyzwpxdczzchbzdczrkpeapnkqwmjgzzqxwnlqrzyxfuykrxwouzwcftunkzzwrzcckzdcaqpzdclxrkzpbeccptnfrjuyqrewoapwxdczycqbrkmfwxkbzpgntjzzquoqtayzeckzuxycrcnwmqxplxkkhzopwcbtzayfzkfzrzwuekpoxwapzfhynwzubjphxdczqmqzctzzerzwzdbxksyuzzzupfpyztkpxzzndhzdczjxxrqrrgdwtqlxhxczzcocbtqrqnrrezrqznxznfkrcnhxzdatrzegcxxuxhyqmsaerrrxqddkxozzcxkgzztnlkctrrwkuqwczkqkzkxwxqukrkzpyrbup
...
Wed May 01 11:54:12 CEST 2019 Climber 5 Iterations: (13,83) Score: -1,973685141591260E4 COPIED,"[a-z]+",{[a:S7G8l][b:V][c:e][d:fz][e:pE6WZ+N][f:QJ][g:R][h:)M][i:PU9k][j:][k:/][l:B#%][m:q][n:D(^O][o:!TdX][p:=][q:][r:rt][s:@FjK][t:5HIL][u:Y][v:c][w:A][x:][y:_][z:]} ilikekillingpeoplebecauseitissomuchfunitiamorefunthankillingwildgameintheforrestbecausemanisthemoatdangertueanamalofalltokillsomethinggivesmethemoatthrillingesperenceitisevenbetterthangettingyourrocksoffwithagirlthebestpartofitiathaewhenidieiwillbereborninparadiceandalltheihavekilledwillbecomemyslavesiwillnotgiveyoumynamebecauseyouwilltrytosloidownoratopmycollectingofslavesformyafterlifeebeorietemethhpiti

The scoring function used, formulated in Java, for the above is

	public Score(final Map<Integer, LanguageStatistics> iNorm, final String iClear) throws Exception {
		Double aScore = 0.0;
		org.apache.commons.math3.analysis.function.Log aLog=new org.apache.commons.math3.analysis.function.Log();

		for (Integer aLength : iNorm.keySet()) {
			final Double aNeverSeen=aLog.value(2.0)/iNorm.get(aLength).getSamples().doubleValue();
			Double aLengthScore = 0.0;
			LanguageStatistics aObservedStatistics = new LanguageStatistics(aLength);
			aObservedStatistics.addStringAsSequences(iClear);

			Integer aN=iClear.length()-aLength+1;

			for (String aString:aObservedStatistics.getSequences()) {

				Double aP=iNorm.get(aLength).empiricalMean(aString);
				if (aP==0.0)
					aP=aNeverSeen;
				Integer aK=aObservedStatistics.getCount(aString);

//				Gauss
//				aLengthScore += aK*(2*aN*aP-aK)/(aN*aP*(1-aP)); // Only good for npq>9

//				Binomial
//				org.apache.commons.math3.distribution.BinomialDistribution aBinDist = new org.apache.commons.math3.distribution.BinomialDistribution(aN,aP);
//				aLengthScore += aBinDist.logProbability(aK)-aBinDist.logProbability(0);

//				Poisson
				aLengthScore += aK*aLog.value(aP)-org.apache.commons.math3.util.CombinatoricsUtils.factorialLog(aK);
			}
			
			aScore += iNorm.get(aLength).getWeight()*aLengthScore;

			if (GlobalStore.getInstance().getVerbose()) {
				if (_log.length()>0)
					_log.append(" ");
				_log.append(aLength
						+ ":" + iNorm.get(aLength).getWeight()
						+ "*" + aLengthScore
						+ "=" + iNorm.get(aLength).getWeight()*aLengthScore);
			}
		}
		_patternScore=aScore;		
	}

And that’s it. No IoC, no Chi^2, no entropy forcing the 1-gram distribution back to sanity is used.

The OANC dictionary is a 68M corpus from http://www.anc.org/ , so that it is actually quite small.

An attack on the 340 gives the usual gibberish, I have no sensations to offer, unfortunately.

Version 1.5.2019 11:43
Parallel threads: 8
Iterations per thread to run: unlimited
Filter for language alphabet: [^a-z]+
Filter to limit letters on inner disk: [a-z]+
Reading /home/fritz/git/jDecryptor/Texts/OANC.txt
Reading /home/fritz/git/jDecryptor/Texts/web.txt
Reading /home/fritz/git/jDecryptor/Texts/news.txt
Reading /home/fritz/git/jDecryptor/Texts/wiki.txt
Collected 354005589 characters of text sample.
Counting each of 1-grams with weight 7.0: 26 distinct found.
Counting each of 5-grams with weight 1.0: 2303851 distinct found.
Reading cipher from file /home/fritz/git/jDecryptor/Ciphers/340.zodiac.txt
Cipher from disk : HER>pl^VPk|1LTG2dNp+B(#O%DWY.<*Kf)By:cM+UZGW()L#zHJSpp7^l8*V3pO++RK2_9M+ztjd|5FP+&4k/p8R^FlO-*dCkF>2D(#5+Kq%;2UcXGV.zL|(G2Jfj#O+_NYz+@L9d<M+b+ZR2FBcyA64K-zlUV+^J+Op7<FBy-U+R/5tE|DYBpbTMKO2<clRJ|*5T4M.+&BFz69Sy#+N|5FBc(;8RlGFN^f524b.cV4t++yBX1*:49CE>VUZ5-+|c.3zBK(Op^.fMqG2RcT+L16C<+FlWB|)L++)WCzWcPOSHT/()p|FkdW<7tB_YOB*-Cc>MDHNpkSzZO8A|K;+
Cipher to operate: HER>pl^VPk|1LTG2dNp+B(#O%DWY.<*Kf)By:cM+UZGW()L#zHJSpp7^l8*V3pO++RK2_9M+ztjd|5FP+&4k/p8R^FlO-*dCkF>2D(#5+Kq%;2UcXGV.zL|(G2Jfj#O+_NYz+@L9d<M+b+ZR2FBcyA64K-zlUV+^J+Op7<FBy-U+R/5tE|DYBpbTMKO2<clRJ|*5T4M.+&BFz69Sy#+N|5FBc(;8RlGFN^f524b.cV4t++yBX1*:49CE>VUZ5-+|c.3zBK(Op^.fMqG2RcT+L16C<+FlWB|)L++)WCzWcPOSHT/()p|FkdW<7tB_YOB*-Cc>MDHNpkSzZO8A|K;+
Cipher symbols count: 63
Cipher symbols : [#, %, &, (, ), *, +, -, ., /, 1, 2, 3, 4, 5, 6, 7, 8, 9, :, ;, <, >, @, A, B, C, D, E, F, G, H, J, K, L, M, N, O, P, R, S, T, U, V, W, X, Y, Z, ^, _, b, c, d, f, j, k, l, p, q, t, y, z, |]
Random fraction: 2
Language alphabet: [a, b, c, d, e, f, g, h, i, j, k, l, m, n, o, p, q, r, s, t, u, v, w, x, y, z]
Language alphabet Size: 26
Wed May 01 12:10:23 CEST 2019: Executor initialization completed.
Wed May 01 12:10:23 CEST 2019 Climber 3 Iterations: (1,0) Score: -2,006941535094980E4 RANDOM,"[a-z]+",{[a:W|_][b:-][c:AD][d:][e:23X^][f:V7;][g:l][h:qB][i:fJz][j:4(O][k:G>][l:6][m:H][n:PT][o:EkM][p:yN][q:][r:@1U+][s:pR9YZjL][t:&):][u:<][v:b#t.][w:CSd58][x:c%/][y:K][z:F*]} mosksgefnoarsnkewpsrhjvjxcasvuzyithptxorrskajtsvimiwssfegwzfesjrrsyeasorivswawznrtjoxswsezgjbzwwozkecjvwryhxferxekfvisajkeiisvjrapsirrsswuorvrssezhxpcljybigrfreirjsfuzhpbrrsxwvoacshsvnoyjeuxgsiazwnjovrthzilswpvrpawzhxjfwsgkzpeiwejvvxfjvrrpherztjswokfrswbraxveihyjjseviohkesxnrsrlwurzgahatsrrtawiaxnjwmnxjtsazowaufvhasjhzbwxkocmpsowisjwcayfr
...
Wed May 01 12:11:20 CEST 2019 Climber 3 Iterations: (11,74) Score: -1,643309134369114E4 COPIED,"[a-z]+",{[a:@2S(9:Z][b:;][c:Gy][d:UL][e:Bt8+k<l][f:1][g:q][h:Ab7W][i:T6f|][j:][k:][l:Cc][m:P][n:5)*M/][o:VX-.>][p:J][q:][r:RzO][s:4D&jN^_][t:p#dFYK][u:3%][v:][w:E][x:][y:H][z:]} ywrotesomeifdicatsteeatrushtoentinecalnedachandtrypatthseenoutreertasanerestintmessentersterontletoasatnetgubadlocoordiacapistresstreadatenehearatelchistoredoesperthetecodernnewistethintraelerpinnisnoesetriaactesintelaberectssinasholoseeeceofnasalwoodanoeilouretartsoingcarliedfileeteheindeenhlrhlmrayinantitetheheestrenollonsysteararehitbe
Wed May 01 12:12:18 CEST 2019 Climber 7 Iterations: (23,152) Score: -1,642199613474405E4 COPIED,"[a-z]+",{[a:2S(9:Z][b:;][c:Gy][d:UL][e:Bt8+kl][f:1][g:q][h:AbW][i:T6f<|][j:][k:][l:Cc][m:P][n:@5)*M/][o:VX-.>][p:J][q:][r:RzO][s:4D&7jN^_][t:p#dFYK][u:3%][v:][w:E][x:][y:H][z:]} ywrotesomeifdicatsteeatrushtointinecalnedachandtrypattsseenoutreertasanerestintmessentersterontletoasatnetgubadlocoordiacapistresstrendatinehearatelchistoredoespertsitecodernnewistethintrailerpinnisnoesetriaactesintelaberectssinasholoseeeceofnasalwoodanoeilouretartsoingcarliedfilieteheindeenhlrhlmrayinantitethiseestrenollonsysteararehitbe
Wed May 01 12:14:28 CEST 2019 Climber 7 Iterations: (52,334) Score: -1,641616985495204E4 COPIED,"[a-z]+",{[a:@23SfWZ/][b:j][c:VK][d:c][e:BdH9+.][f:1][g:y][h:Pbk][i:T7l|>][j:][k:][l:ADJ][m:&][n:5UG)*<][o:q6(8X:-][p:%][q:][r:RCz_][s:tMN^O][t:p4FYL][u:#][v:][w:E][x:][y:;][z:]} ewritischhiftinaesteeousplatenncanegodsenanaonturelattisioncatseercaresersbeinthemthatorstisonerhtialounecopyandoncertionalabuserstreateensehearatedglotcorinceslestintegoneranswiltethiscsandirlinnitseemetroeaguesintedoyorintssanathedctseegeofnoterwicnanoeidearecostseasonardietfornetiaeinteenarradhsaeiaontitheanisertsenordislestharasolicye
...
Wed May 01 12:14:30 CEST 2019 Climber 7 Iterations: (52,336) Score: -1,640216846970138E4 COPIED,"[a-z]+",{[a:@23fWHZ/][b:j][c:VK][d:c][e:BSd+.][f:1][g:y][h:Pbk][i:T7l|>][j:][k:][l:ADJ][m:&9][n:5UG)*<][o:q6(8X:-][p:%][q:][r:RCz_][s:tMN^O][t:p4FYL][u:#][v:][w:E][x:][y:;][z:]} awritischhiftinaesteeousplatenncanegodsenanaonturalettisioncatseercarmsersbeinthemthatorstisonerhtialounecopyandoncertionalabuserstreatmensehearatedglotcorinceslestintegoneranswiltethiscsandirlinnitseemetromeguesintedoyorintssanathedctseegeofnotmrwicnanoeidearecostseasonardietfornetiaeinteenarradhseaiaontitheanisertsenordislastherasolicye
...
Wed May 01 12:20:19 CEST 2019 Climber 5 Iterations: (119,845) Score: -1,639037591471840E4 COPIED,"[a-z]+",{[a:@23E6f8Xl][b:j][c:VY][d:c][e:BSdW+-.][f:)][g:y][h:Pbk][i:T7|>][j:][k:][l:RJ][m:t][n:D5UG<][o:q%(:Z/][p:K][q:][r:Cz;O][s:14MN^_][t:pFH9*L][u:#][v:][w:A][x:][y:&][z:]} talitaschhistinaesteeouronecentpafegodsenoneofturtlettisaatcatreelpastsermbeintheyshotalstareterhtianounepoorandancertionalaburesscreattenseheolatedgwasperanceslertintegenelonmaincethisprandallitnisseeyetrateguesintedoralantssanashedcsmeegeastostraicnoneeideareportseasonaldietsarnetaeeifteeferredhretiooftitheenimescreterdisntstherorawipre
...
Wed May 01 12:25:11 CEST 2019 Climber 4 Iterations: (187,1255) Score: -1,634977651329243E4 COPIED,"[a-z]+",{[a:28X:Zl/][b:E][c:W][d:cy][e:BSdH+.>][f:K][g:P][h:bjk][i:T6Vf|][j:][k:][l:RJ^][m:D][n:3U&G)<][o:qC7(-][p:Y][q:][r:%zO][s:@15LMN_][t:p4tF9*][u:#][v:][w:;][x:][y:A][z:]} ebletalighissinaesteeourrmcpentfinedadsenanconsurelettolaatintreelfastsertheistgenthatalltaroteohteamouseforwandaniersionalihuresspresstensehealateddyitforaniellertontedonelastbimpethisfrandallitsitseenetriteduesistedowalantslisatheditteedeastattobeinasoeidenrefortleisonaldiessionetaceinseencorcdgreeiaontithecnotespretoodesmestherarayifwe
...
Wed May 01 12:32:03 CEST 2019 Climber 6 Iterations: (279,1839) Score: -1,633703138902371E4 COPIED,"[a-z]+",{[a:28HXl/][b:J][c:P9][d:cy][e:BSd+.>][f:K][g:Z][h:bjk][i:T6Vf:|][j:][k:][l:R^][m:DE][n:3U&G)<][o:qC7(-][p:Y][q:][r:%zO_][s:@15WLMN][t:p4tF*][u:#][v:][w:;][x:][y:A][z:]} amletalichissinaesteeourrmspentfinedidsengnsonsurabettolaatintreelfarcsertheistcenthatalltaroteohteamouseforwandaniersionabihurerspresscenseheglateddyitforanielbertontedonelastmimpethisfrandalbitsitseenetriceduesistedowalantslisatheditteedeastitcomeingsoeidenrefortleisonaldiessionetaseinseensorsdcreaiaontithesnoterpretoodesmasthergrayifwe
...
Wed May 01 12:45:46 CEST 2019 Climber 4 Iterations: (455,2990) Score: -1,632874327057550E4 COPIED,"[a-z]+",{[a:28XZlN/][b:j][c:F][d:c&][e:BSd:+.>][f:K][g:*][h:bEk][i:T6Vf|][j:][k:][l:RJ^][m:y][n:5G)<][o:qC7(-][p:HY][q:][r:%UzO][s:@3DWLM_][t:Pp14t9][u:#][v:][w:;][x:][y:A][z:]} phletalithitsinaeateeourrsspengfinemedseransonsurplettolaagistreelfastsertbeinctedthatallcarogeohceasouneforwardaniersionaliburesapresstensehealacedmyitforariellertoncemorelanthispethisfrandallignitseedecritemueaincedowalancalinatheditteemeatgettoheiranoeidesrefortleisonaldiestionecaseinseensorsdtrepiaontichesnotespregoodesspatherarayifwe

That of course looks attractive to waste human time searching for some words, but unfortunately has never delivered anything meaningful so far, and is not stable either. No surprise, the algorithm is built to find something that resembles language. From experience, the first that converges are the positions of the vowels and consonants, and similar sounds go to similar places (e,a,i, or p,b,v).

I leave it running a bit, and then invest more time into reasoning the 7.0, or to maybe enhance it with some mathematical argument. And another good step would be to move from Java to C++, to increase the speed, and safe some electricity and pollute the environment a bit less with the search.

Posted : May 1, 2019 2:24 pm

Jarlve

(@jarlve)

Posts: 2547

Famed Member

Hey f.reichmann,

Thanks for your explanations, though I am only but a hardworking layman and at best only understand 50% of what you say.

I still do not understand what log(b!) is and where it should fit in:

1. Sum the n-gram logs of the solution plain text into a score (higher is better).
2. Divide the score by the total amount of n-grams in the solution plain text (normalization).
3. Divide the score by the IOC in such a way that a higher IOC lowers the score to keep the solution plain text away from stuff like "THESETHESETHESE" as you mentioned.

With the real 408 with its much more inhomogeneous distribution, my implementation of the hill-climber has difficulties converging. I need to start 8 threads, for getting a quicker result, and thread 5 did was the quickest today:

That’s great! Getting a solve on the 408 is what starts it and now you can start optimizing.

And that’s it. No IoC, no Chi^2, no entropy forcing the 1-gram distribution back to sanity is used.

It appears that you are using the 1-gram frequencies with a higher weight to control the frequencies so that is similar to using Chi^2. My thinking/experience is that it constrains the freedom of the hill-climber and using the IOC or entropy may be better in that respect. Using entropy is simple "score = summed_5gram_logs * entropy" and it should work great.

The OANC dictionary is a 68M corpus from http://www.anc.org/ , so that it is actually quite small.

Yep, but for up to 5-grams it should be okay.

From experience, the first that converges are the positions of the vowels and consonants, and similar sounds go to similar places (e,a,i, or p,b,v).

You are totally right, I noticed this as well and have some optimizations that play into it.

I leave it running a bit, and then invest more time into reasoning the 7.0, or to maybe enhance it with some mathematical argument. And another good step would be to move from Java to C++, to increase the speed, and safe some electricity and pollute the environment a bit less with the search.

You know your stuff, your solver may become quite powerful.

AZdecrypt

Posted : May 1, 2019 3:52 pm

Zodiac Discussion Forum