Thanks doranchak,
So: do not accept if none of the ngram scores are better, and accept if at least one of the ngram scores is better?
Yes, with an additional step:
Once you accept a new solution, check the old ones you are keeping track of. If any of them are completely dominated by the new solution, remove them. A "completely dominated" solution is one for which none of its ngram scores is better than the corresponding ngram score of the new solution.
Also, depending on how the search proceeds, the list of non-dominated solutions may grow too large, so you have to come up with a strategy to trim it to keep the search speedy.
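Here’s a minimal sketch of that bookkeeping (Python, purely illustrative; the function names and the score-tuple layout are assumptions, not from any actual solver). Each candidate solution is reduced to a tuple of ngram scores, e.g. (4-gram score, 5-gram score, 6-gram score), with higher meaning better.

def completely_dominates(a, b):
    """True if `b` is completely dominated by `a`: none of b's scores
    is better than the corresponding score of `a` (higher = better)."""
    return all(y <= x for x, y in zip(a, b))

def try_accept(archive, new):
    """Add `new` to the list of non-dominated solutions if it qualifies."""
    # Reject if some existing solution completely dominates the new one,
    # i.e. none of the new solution's scores improves on it.
    if any(completely_dominates(old, new) for old in archive):
        return archive
    # Accept, and drop any existing solutions the new one completely dominates.
    kept = [old for old in archive if not completely_dominates(new, old)]
    kept.append(new)
    return kept

# Neither (3, 2) nor (1, 4) completely dominates the other, so both are kept:
# try_accept([(3, 2)], (1, 4))  ->  [(3, 2), (1, 4)]

Trimming the list when it grows too large is a separate step; see further down the thread.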
It’s an interesting idea, but what exactly does using 4/5/6-grams in parallel accomplish? I must’ve missed an important key point in the discussion. 🙂
In my experience, higher-order N-grams tend to be much more reliable at scoring the correct plaintext, and the lower-order N-grams tend to be more tolerant of misspellings. Just like Jarvle (although I’m not sure about his exact reasons), I’m actually maintaining an option to run my solver with either 6- or 7-grams. If I don’t get a stable solve with one, I try the other. Higher-multiplicity ciphers tend to be solved more cleanly by 7-grams, whereas 6-grams produce a lot of nearly equally scored plaintexts, only one of which is the correct solution. On the other hand, some of the ciphers get solved by 6-grams almost right away, and the 7-grams require a lot of restarts to get to the correct solution. That seems to be due to misspellings or liberal use of rare words in the plaintext. It makes sense: shorter N-grams "fit" around the misspellings more easily than longer ones.
Thus,
Every cipher has a fixed number of different scores.
Do not accept if none of the scores are better than the corresponding scores of the ciphers in the list. Accept if at least one of the scores is better than the corresponding score of one of the ciphers in the list.
If accepted, then remove the ciphers in the list for which all scores are lower. If none of the ciphers in the list can be removed and the list size limit is met, then do not keep the accepted cipher.
If the iteration limit is met, then output the cipher list.
Okay?
It’s an interesting idea, but what exactly does using 4/5/6-grams in parallel accomplish? I must’ve missed an important key point in the discussion.
The idea is that a single-objective search might get stuck in local optima, whereas multiple objectives might allow for a better sampling of broader criteria, increasing the chances of approaching the global optimum.
I think it really depends on the problem. Sometimes a good solution has a tradeoff among measurements. For example, if we want our solvers to handle ciphers that are just lists of names or places, we have to use different ngram stats. So you could have one ngram score based on normal language stats, and another based on expected stats for lists of names/places. Without knowing which kind of cipher we have, you could track both objectives, and maintain a set of non-dominated solutions that explore the tradeoffs between those objectives.
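As a hypothetical illustration (the table names and log-frequency scoring below are assumptions, not something defined in this thread), the two objectives could simply be the same candidate plaintext scored against two different 5-gram tables:

from math import log

def ngram_score(text, table, n=5):
    """Sum of log-frequencies of the text's n-grams under `table`
    (a dict mapping n-gram strings to frequencies)."""
    return sum(log(table.get(text[i:i + n], 1e-9))
               for i in range(len(text) - n + 1))

def objectives(plaintext, english_5grams, namelist_5grams):
    """One candidate, two objectives: fit to ordinary English text
    vs. fit to lists of names/places."""
    return (ngram_score(plaintext, english_5grams),
            ngram_score(plaintext, namelist_5grams))

The list of non-dominated solutions then tracks these score pairs exactly as in the sketch earlier in the thread.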
Thus,
Every cipher has a fixed number of different scores.
Do not accept if none of the scores are better than the corresponding scores of the ciphers in the list. Accept if at least one of the scores is better than the corresponding score of one of the ciphers in the list.
Yes, but to be clear: accept if at least one of the new cipher’s scores is better than the corresponding score of a cipher in the list, even if its other scores are worse than that cipher’s.
So, the new cipher has scores [1,4]. A cipher in the list has scores [3,2]. Keep both, because neither is "totally better" than the other.
If accepted, then remove the ciphers in the list for which all scores are lower. If none of the ciphers in the list can be removed and the list size limit is met, then do not keep the accepted cipher.
That works, but you might get too many samples in one "spot" on the search landscape. The algorithm I’ve used (SPEA2) uses a density function to distribute solutions evenly along the objectives, so it can remove the extras fairly (so they don’t cluster).
A simpler approach is to just accept the new cipher and then remove one at random.
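For illustration, here are sketches of both trimming ideas: the random drop, and a rough nearest-neighbour density trim in the spirit of SPEA2’s truncation (a simplified stand-in, not the actual SPEA2 operator).

import random

def trim_random(archive, max_size):
    """Simplest strategy: drop randomly chosen members until the size limit holds."""
    while len(archive) > max_size:
        archive.pop(random.randrange(len(archive)))
    return archive

def trim_by_density(archive, max_size):
    """Rough density-based trimming: repeatedly drop the member that sits
    closest to its nearest neighbour in score space, so that clustered
    solutions are thinned out first."""
    def nearest_neighbour_dist(i):
        return min(sum((a - b) ** 2 for a, b in zip(archive[i], archive[j]))
                   for j in range(len(archive)) if j != i)
    while len(archive) > max_size and len(archive) > 1:
        archive.pop(min(range(len(archive)), key=nearest_neighbour_dist))
    return archive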
If the iteration limit is met, then output the cipher list.
Okay?
Yes.
It’s an interesting idea, but what exactly does using 4/5/6-grams in parallel accomplish? I must’ve missed an important key point in the discussion. 🙂
The idea is that a single-objective search might get stuck in local optima, whereas multiple objectives might allow for a better sampling of broader criteria, increasing the chances of approaching the global optimum.
Ah, I see. A different approach to help with the "sparsity" of the higher N-gram samples. There are actually a number of methods that have been developed for word-level N-grams, but they can be equally applied to letter-level N-grams. Except that they require quite a bit of math background to even begin to understand the ideas: http://www.ee.columbia.edu/~stanchen/pa … report.pdf
I’ve given up trying to understand everything they are talking about and used a simple interpolation method that falls back on lower-order N-gram data for higher N-grams that are missing (i.e. not present in the corpus). I’ve seen people use a weighted average of different N-grams, something like this:
combined_score = 0.2 * 3_grams_score + 0.3 * 4_grams_score + 0.5 * 5_grams_score
The weights (0.2, 0.3, 0.5) can be different, but they have to add up to 1. The problem I found with this approach is that if you already have a reliable 5-gram score, the lower-order scores will "contaminate" it. So for my solver I computed the scores "recursively" instead. For each N-gram level, if the score is over a certain small threshold, I use that value; if it’s less (the N-gram is not present in the corpus, or found just once or twice), I average the scores of the overlapping lower-order N-grams and subtract a small "penalty" value. I start with 3-grams, as all 2-grams are pretty much present, then compute the 4-gram data using the 3-grams, and so forth "recursively", one level up at a time.

For example, suppose you have a "ZFACT" 5-gram that’s not present in the corpus (probably; it’s just for illustration purposes). Instead of assigning it a score of 0, I look at the overlapping "ZFAC" and "FACT" lower-order 4-grams. One is probably 0 as well, but the other is pretty well represented, so let’s say it has a score of 100. So I assign the score (0 + 100) / 2 – 20 = 30, where 20 is the "penalty". Obviously, if you end up with less than 0, you clamp it to 0. So you end up with "interpolated" 5-gram data that combines information from the 3-grams and 4-grams, which means you only need to do one lookup instead of three.
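Here’s a rough sketch of that fallback rule, with placeholder threshold and penalty values (the 20 comes from the example above; the threshold is an assumption). The approach described above precomputes full interpolated tables level by level so that only one lookup is needed at solve time; for brevity, this sketch applies the same rule recursively at lookup time, which yields the same values.

THRESHOLD = 2   # "over a certain small threshold"
PENALTY = 20

def interpolated_score(tables, gram):
    """Score `gram` from per-level tables (tables[n] maps an n-gram to its score).
    If the gram is missing or too rare, fall back on the average of its two
    overlapping (n-1)-grams minus a penalty, clamped at zero.
    2-grams are assumed to be fully covered by the corpus."""
    n = len(gram)
    score = tables[n].get(gram, 0)
    if n == 2 or score > THRESHOLD:
        return score
    left, right = gram[:-1], gram[1:]   # e.g. "ZFAC" and "FACT"
    avg = (interpolated_score(tables, left) +
           interpolated_score(tables, right)) / 2
    return max(0, avg - PENALTY)

# The "ZFACT" example above, with the lower-order grams of "ZFAC" also missing:
# tables = {2: {}, 3: {}, 4: {"FACT": 100}, 5: {}}
# interpolated_score(tables, "ZFACT")  ->  (0 + 100) / 2 - 20 = 30.0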