Something you may not have heard of are "higher-order homophonic ciphers". The name is misleading but it allows the encipherer to stack 2 texts on top of each other. Given the amount of symbols this encipherment produces it does not fit the Z408 or Z340 in any way.
Example from: http://www.uobabylon.edu.iq/eprints/pap … 568_49.pdf
A 3rd-order homophonic would then have a 3-dimensional matrix and so 3 texts can be stacked on top of each other.
Example cipher:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 20 26 9 27 28 29 30 31 32 33 15 34 35 36 37 34 38 32 39 40 41 14 42 43 44 45 46 47 48 49 29 50 51 52 53 54 28 55 11 56 44 57 21 58 59 60 61 62 19 63 36 64 65 66 67 68 69 70 28 71 72 73 74 75 76 77 5 78 45 35 79 9 72 80 81 82 83 84 53 85 86 87 88 89 72 90 2 91 92 57 52 68 70 34 53 76 12 93 60 94 95 96 51 97 64 74 98 37 99 27 100 101 102 103 104 105 7 67 43 44 57 106 107 15 108 109 110 111 51 19 112 113 114 115 116 117 118 2 72 119 47 58 49 64 120 70 121 122 123 124 125 65 126 127 11 128 129 130 22 11 131 132 89 112 133 134 135 15 136 7 137 138 75 45 139 2 45 72 140 141 89 142 118 143 91 144 145 35 146 9 36 147 2 148 149 24 53 150 72 151 152 153 148 154 84 155 57 156 157 53 25 82 132 158 159 71 60 112 41 160 97 161 81 127 95 162 15 74 38 115 163 126 164 27 165 45 35 38 166 95 167 20 168 45 169 152 170 171 126 75 138 172 173 102 149 174 45 45 27 28 68 175 176 73 64 58 99 28 29 177 9 178 95 179 63 180 57 181 124 156 182 183 184 100 146 26 183 8 185 95 186 187 89 98 167 15 188 189 71 81 190 191 65 192 119 107 113 154 193 79 76 194 181 146 60 195 41 2 143 133 196 191 197 10 8 198 183 199 200 201 107 156 202 193 8 203 28 71 159 204 205 182 206 207 26 208 18 154 5 209 7 210 106 211 212 20 58 138 83 213 11 214 215 41 109 9 45 148 216 217 44 91 179 196 218 219 49 167 220 221 146 120 10 222 223 217 44 57 66 168 195 214 224 70 225 65 40 134 154 226 227 110 23 44 35 228 229 129 230 58 231 65 232 139 233 234 204 235 236 112 70 51 161 237 55 45 238 11 74 239 95 186 240 17 200 161 26 74 67 231 217 37 8 5 172 241 119 242 151 154 7 177 163 243 244 245 71 172 111 91 246 161 61 219 9 226 129 247 91 248 45 74 23 249 20 138 167 71 250 182 124 251 126 96 252 92 35 38 253 111 5 155 115 103 151 3 151 247 231 254 255 64 122 98 50 26 97 60 145 33 256 231 101 5 257 258 103 45 71 245 25 259 17 19 260 72 140 261 154 35 18 167 68 262 163 177 101 193 263 230 93 140 242 36 264 42 97 265 217 151 10 214 223 22 231 266 38 50 31 95 267 268 20 36 200 240 62 172 269 32 196 15 270 41 60 271 272 11 273 195 179 274 275 230 215 237 263 258 217 276 277 44 91
The example cipher has two 2 valid 1-to-1 decodings:
DEARSCOTTIAPPRECIATEYOURLETTER ANDTHECHANCETOWORKWITHYOUANDYO URFAMILYOFSOFTWAREAUTHORSIPROM ISETHATIWILLWORKDILIGENTLYTOCO MPOSEANDARRANGEMUSICYOUWILLBEP ROUDTOINCLUDEINYOURSOFTWAREIHA VETOAPOLOGIZEFORNOTYETINCLUDIN GTHETAPEIPROMISEDITWILLFOLLOWA SSOONASIHAVERECEIVEDTHELATESTV ERSIONOFSPUTTERTHEPROGRAMTHATW ILLALLOWMETOTRANSLATEMYSIXTEEN BITSAMPLESTOEIGHTBITSOUNDBLAST ERSAMPLESTHATSHOULDBEONLYASHOR TTIMEINTHEFUTURESINCEILASTSPOK EWITHYOUIHAVECOMEBYSOMEGOODSOF TWAREANDINSTRUMENTSOUNDSTOMETH ESOUNDSOFTHEINSTRUMENTSPLAYING THEMUSICISASIMPORTANTASTHEMUSI CITSELFITHINKTHATTHEMUSICONDAR KAGESISCOMMENDABLEANDFORTHEMOS TPARTTHECHOICEOFIN
And:
ASIVEMENTIONEDINATLEASTONEOTHE RPOSTIRECORDSOUNDSWHENEVERITRA VELORHEARSOMELOCALUSABLESOUNDI VEBEENTOQUITEAFEWZOOSSEVERALRA INFORESTSANDOTHERPLACESWHEREIM IGHTLUCKUPONSOMEREALLYUSABLESO UNDSOFCOURSEIVEMADEUSEOFSOUNDE FFECTCDSTOOALOTOFTHEMECHANICAL SOUNDSIUSEDINDUKETHREEDWEREREC ORDEDATAPOGEEHEADQUARTERSTHECO KEMACHINECOINSERVODRINKSUPPLYM ECHANISMCOINSDROPPINGINTHECHAN GESLOTHUMOFTHEREFRIGERATIONUNI TETCTHECOPIERCYCLINGTHEURINALF LUSHINGTHEGENERALBUZZOFMULTIPL ECONVERSATIONSATONETIMETHESKYW ASTHELIMITINRECENTYEARSIHAVESA VEDTHEMULTITRACKAUDIOSOFTWAREF ILESANDALLOFTHETRACKSRAWEFFECT SSETTINGSAUDIOEFFECTSETCSOIKNO WHOWICAMEUPWITHTHE
Initially it was very counter-intuitive to me that a cipher could decode to 2 different plain texts like this. Very cool stuff!
A solver will generally pick up on the dominant text, to get the other text one could "anti-crib" the dominant.
That is pretty neat. Your sample cipher has multiplicity of 0.45. I wonder how low the multiplicity can get. The cipher alphabet size for 2nd order homophonic is capped at 26*26 = 626 letters so with really long messages, the multiplicity is very low and at least one of the solutions should be easy to find via a solver.
3rd order would have a cipher alphabet size of 17,576 so if you had a message that was, say, 200,000 characters long, the multiplicity should be very low and one of the solutions should pop out easily via solver.
I think you have found a good counterexample to the claim about unicity distance here: http://practicalcryptography.com/crypta … tatistics/
Cipher: Homophonic Substitution cipher
Number of keys: N!/(N-26)!*26^(N-26)
Unicity distance: 1.47N – 49.91 + 0.45(0.5+N)Log[N] – 0.45(N-25.5)Log[N-26] (approximate. N is the number of cipher letters. N must be greater than 26. Unicity for N=30 is 38.1 characters)
For your test cipher, I think N=277, which means the unicity distance (minimum length cipher needed to guarantee a UNIQUE solution) comes out to 434 based on that formula. Your cipher has length 618 which well exceeds that, yet it has more than one unique solution.
But if N=626 (the maximum alphabet size for 2nd order homophonic), then unicity distance estimate is 957 which is longer than your cipher. But you could simply extend your two messages to easily exceed that and yet still have non-unique solutions.
Unicity distance confuses me.
Perhaps for this kind of cipher, it is more accurate to say: if the length of the message is longer than unicity distance, then the chances that one of the non-unique solutions is unintentional is extremely low.
Yes, very confusing, is the unicity distance calculation different per cipher type? Because while it is called "higher-order homophonic" I would say it is rather a "higher-order substitution". And even then, what does "higher-order" imply?
It’s hard for me to put a name on it but "substitution * substitution" seems valid. 3rd order would be "substitution * substitution * substitution".
minimum length cipher needed to guarantee a UNIQUE solution
Interpretation is important here, because when you go gradually under the minimum length it is probably not so that all of a sudden you have another totally unique solution, it’s just that small changes can start to happen to the solution plain text. At least that’s what I have observed with high-multiplicity ciphers.
Yes, very confusing, is the unicity distance calculation different per cipher type?
Yes, but I think it is primarily driven by the number of possible keys, which for homophonic is based on the size of the cipher alphabet.
In that respect, n-order homophonics aren’t any different than regular homophonic, they just have larger cipher alphabets.
I was able to create toy ciphers that could be decoded to a unique message via substitution, and different unique message via Vigenere.
So an analyst would have to know that both cipher types were possible, and compute the unicity distance for both.
The number of possible keys for each cipher type is different.
A trivial worst case for this is: Any cipher can be interpreted as a one-time pad with an arbitrary solution (you can just invent your own pad, with the same length as the cipher, to force the solution you want.)
Interpretation is important here, because when you go gradually under the minimum length it is probably not so that all of a sudden you have another totally unique solution, it’s just that small changes can start to happen to the solution plain text. At least that’s what I have observed with high-multiplicity ciphers.
True – it is more like a smooth transition between probabilities.
Before unicity point: More likely to be spurious solutions, especially as cipher length grows smaller.
After unicity point: Less likely to be spurious solutions, especially as cipher length grows larger.
I’m curious: You could probably make a 2nd-order homophonic with a relatively small cipher alphabet, if you picked two different plaintext messages that use the fewest letters. For example, if you only use the top 10 letters in English to construct messages, then you’d only need 100 cipher symbols to represent both messages.
Here’s a list of almost 800 words that can be made from ETAOINSHRD, the top 10 letters in English: https://new.wordsmith.org/anagram/anagr … source=adv
Seems like you could make two reasonable messages out of those words and then produce a 2nd-order homophonic using only 100 cipher symbols.
Yes, but I think it is primarily driven by the number of possible keys, which for homophonic is based on the size of the cipher alphabet.
In that respect, n-order homophonics aren’t any different than regular homophonic, they just have larger cipher alphabets.
At least some of the key space, or even a big chunk of it are meaningless/impossible keys like "AAAAAA…" or "QQQZZZQQQ…", how about that?
I see, guess that’s why they called it homophonic.
A trivial worst case for this is: Any cipher can be interpreted as a one-time pad with an arbitrary solution (you can just invent your own pad, with the same length as the cipher, to force the solution you want.)
But then the pad would consist of random characters. It would not be a realistic looking plain text.
I’m curious: You could probably make a 2nd-order homophonic with a relatively small cipher alphabet, if you picked two different plaintext messages that use the fewest letters. For example, if you only use the top 10 letters in English to construct messages, then you’d only need 100 cipher symbols to represent both messages.
Out of the box thinking. I like that. You could go further and encode the plain text to a polyalphabetic with 8 symbols, where each symbol are 3 plain text letters, 8 by 8 is then 64 or 63 if you will in a Z340 format. Though no cycles, so highly unlikely.
Example cipher of the above for fun:
#J!:FPQS41$F@O,YQ _^H#[5Z+R-24LVEXH ^BL^)!_;?OZO-[.0! J!9&1^A#]C>I&8.$% G)X97;J/_32/M;&!W #8YCE89+_)-NGQ<_@ 4%"NP*QI3#QY:_OA* %DZA#64]C%#AU]1VV $*9T,>;^,ZE5?XFB4 S6-IOH<5?J>D1VNYI D55J$UY(7KV(_.0!Y CD=%3V4#!,(EI(AT, <ON_>.@19.*V0!Q;8 &A5&@8%QO5J/[@?. *T9;:24(H#1.$D< ,W"8Q(UP+%9;6!&N0 5,2/D?ND$'S#R&CET B@$#Q3!;YZRMD^41E >'X;-45UC1D%E&02/ 9D7+6.!,:G].S.FO%
Yes, but I think it is primarily driven by the number of possible keys, which for homophonic is based on the size of the cipher alphabet.
In that respect, n-order homophonics aren’t any different than regular homophonic, they just have larger cipher alphabets.At least some of the key space, or even a big chunk of it are meaningless/impossible keys like "AAAAAA…" or "QQQZZZQQQ…", how about that?
I believe that is taken into account somewhat in the calculation of the number of possible keys for unicity distance for homophonic ciphers: N!/(N-26)!*26^(N-26).
If meaningless keys were included then I guess it would just be 26^N.
A trivial worst case for this is: Any cipher can be interpreted as a one-time pad with an arbitrary solution (you can just invent your own pad, with the same length as the cipher, to force the solution you want.)
But then the pad would consist of random characters. It would not be a realistic looking plain text.
Random pads are more secure (unless someone finds a copy of the pad).
Otherwise, If the pad is a plaintext taken from some source, then an attacker could locate the source and try it out.
Example cipher of the above for fun:
#J!:FPQS41$F@O,YQ _^H#[5Z+R-24LVEXH ^BL^)!_;?OZO-[.0! J!9&1^A#]C>I&8.$% G)X97;J/_32/M;&!W #8YCE89+_)-NGQ<_@ 4%"NP*QI3#QY:_OA* %DZA#64]C%#AU]1VV $*9T,>;^,ZE5?XFB4 S6-IOH<5?J>D1VNYI D55J$UY(7KV(_.0!Y CD=%3V4#!,(EI(AT, <ON_>.@19.*V0!Q;8 &A5&@8%QO5J/[@?. *T9;:24(H#1.$D< ,W"8Q(UP+%9;6!&N0 5,2/D?ND$'S#R&CET B@$#Q3!;YZRMD^41E >'X;-45UC1D%E&02/ 9D7+6.!,:G].S.FO%
Can AZdecrypt crack that?
Random pads are more secure (unless someone finds a copy of the pad).
Otherwise, If the pad is a plaintext taken from some source, then an attacker could locate the source and try it out.
Good point.
Can AZdecrypt crack that?
With the "polyphones" solver it technically could, but it requires more n-gram and compute power than what we have available now.
OK Jarlve, here’s a 351 character cipher challenge for you (or for anyone else here who wants to give it a try):
^3K2P_8D273t-:Z_8bO(->(TtJK W)yN%2:BtEW[cTUPl-P2[Kk<[AJ y<q9.k[X<Z3+[L(DW/yfM26SkG# @JEdkLStdl;N28c(-+LU/*:<c(. 92t&WzPT1|bU1Yq.H;ZfZfM@_)2 ;-.GS@8O_8q7c.#C4(1#WG[TM%W k54j|Tk%-kjA|JA2j6_86H<dqJ^ <tKkbWMR<VCR-2j6254KZ_|&:([ pUk[AK2[+;Z7FCy<fMAU67D.:Bk FkX-k+73+6HF;U<6lSZHzdPHL._ :GCpk<Y)(*/7KJ6lftf8PqUkG*W j+>*_:k1PFtZy&+U1+q-.yNHbJ( PlG228A)JHLCp/&;|l-+CB><:dN
The cipher alphabet contains 64 different symbols.
The cipher hides two messages using 2nd order homophonic substitution.
Both messages use a plaintext alphabet of only 8 letters.
There are two keys and each unlocks a message that is somewhat nonsensical but is completely readable and valid English.
The cipher’s multiplicity is 0.182, which is slightly less than Z340’s multiplicity.
Using http://jarlve.vdm-service.be with default settings does not produce a correct message.
Happy solving!
somewhat nonsensical
Not in the least!
I’ve been busy the last couple of days with creating a generalized "n-order homophonic" solver and here follows the output for your challenge cipher. I ended up lowering the entropy weight so that a plain text with the on target 8 letters rolled out but even at the standard settings the solve was pretty decent.
Spoiler alert: IHATETHATTHISISTHEASSASSINA TIONITISINTHISSENSETHATTHEN OTIONTHATSHEHASATOOTHTATTOI ANNOTATIONINTHISSEASONITISN OTINTHESHOESHEINSISTSTHATIT ISNOTAHATHITININSSHITOHSHIT TISHOSTISTHEONETHATHASTOINI TIATETHETONESTHATISASTONISH ESTHEATHEISTSNOTTHESATANIST STASTETHEASSISTANTSSHOESANT IONETTEISNOTANANTITHEISTONT HEANTITHESISONESHEISNONSENS ENOTTHEINSANEONIONSENSATION HEASTONISHESANTONIOSASSISTA NTHEISNOSANTAINTHATSTATETOT HEEASTTHETESTESINTHISSTATIO NTASTEASSHOESNASASENTONEASS ASSINATIONINONESHOTITISNOTS OASIANNOONEHASONESOONITISIN THEINITIATIONTOSITONTHESETH ESATINSHEINHASITSHEATONINST ANTTOASTSOTHENHEISONTHISNOT ETHATSHESTHEONETHATHASTHESO NINATENTSOTHATTHISINTENTION ISSOONTOTESTHISNOSEASHEHITS THISSNOTTHENATIONHASNOSENSE
Wow, nice job!
I’d love to hear a description of how that particular solver works.
UDPATED: Here are the original formatted plaintexts:
plaintext 1:
I HATE THAT THIS IS THE ASSASSINATION. IT IS IN THIS SENSE THAT THE NOTION THAT SHE HAS A TOOTH TATTOO ANNOTATION IN THIS SEASON. IT IS NOT IN THE SHOES. HE INSISTS THAT IT IS NOT A HAT HIT IN ONE SHOT. OH SHIT, THE HOST IS THE ONE THAT HAS TO INITIATE THE TONES THAT HE ASTONISHES. THE ATHEISTS, NOT THE SATANISTS, TASTE THE ASSISTANT’S SHOES. ANTIONETTE IS NOT AN ANTITHEIST ON THE ANTITHESIS. ONE, SHE IS NONSENSE, NOT THE INSANE ONION SENSATION.
plaintext 2:
HE ASTONISHES ANTONIO’S ASSISTANT. HE IS NO SANTA IN THAT STATE TO THE EAST. THE TESTES IN THIS STATION TASTE AS SHOES. NASA SENT ONE ASSASSINATION IN ONE SHOT. IT IS NOT SO ASIAN. NO ONE HAS ONE. SOON IT IS IN THE INITIATION TO SIT ON THESE. THE SATIN SHEEN HAS ITS HEAT ON INSTANT TOAST. SO THEN HE IS ON THIS NOTE THAT SHE’S THE ONE THAT HAS THE SON IN A TENT SO THAT THIS INTENTION IS SOON TO TEST HIS NOSE. AS HE HITS THIS SNOT, THE NATION HAS NO SENSE.
Both use only the letters A, E, H, I, N, O, S, and T.
These texts are funny.
I’d love to hear a description of how that particular solver works.
Instead of one decryption, multiple (n-order) decryptions are set up. Say that the solver does 500,000 iterations, then after each iterations it alternates to another decryption, for example if n=3 then it cycles like this: 1, 2, 3, 1, 2, 3, etc. A penalty factor is calculated, for example if in decryption 1 symbol 15 is letter E and in decryption 2 symbol 15 is also E then this penalized to some extent. After each iteration all the decryption scores are merged into a single "best score". Though at the end of each iteration the decryption is still evaluated individually but linked via the penalty factor. There is also a penalty factor weight.
So there is a individual evaluation for each decryption at the end of each iteration (with the penalty factor included). And there is also a global evaluation after each iteration (also with penalty factor). The global evaluation decides the solver output.
Very nice approach! Thanks for the explanation. It reminds me of "fitness sharing" which is used in genetic algorithms (GAs).
GAs usually involve a population of multiple solutions, and they eventually produce a dense cluster of solutions with high scores.
"Fitness sharing" is a way to penalize the clusters. If a solution is too close to other solutions, then its fitness it reduced.
Eventually, multiple clusters form and they each represent good solutions but are not similar to each other.
For ciphers, a GA might encode a solution as a key, so the penalty is based on measuring the distance of one key to another (by measuring how many elements they don’t have in common, for example).
Thanks for the explanation, I like the concept of "fitness sharing", perhaps it could prove useful in other parts of my hill-climber.
I still wonder, too, if AZdecrypt would benefit from multiobjective optimization.
So instead of having to deal with penalties, weights, etc. for a single score, you evaluate an array of scores for each solution.
[s1, s2, s3, … sn] (for n metrics)
For example, s1 could be a bigram score, s2 could be trigram score, s3 could be ioc compared to English, s4 could be entropy, s5 could be a sum of frequency-weighted words found in a dictionary, etc.
Here’s an example review of multiobjective simulated annealing: https://www.hindawi.com/journals/aor/2019/8134674/
The nice thing about tracking a set of non-dominated solutions is that you can maintain tradeoffs. For instance, a solution might have a really good trigram score but a terrible IOC. Another may have a really good ioc and entropy but terrible bigram score.
Exploring those tradeoffs in the solution space might allow for solutions in edge cases to be found more effectively. And you wouldn’t have to play games with all the adjustment factors when trying to combine multiple metrics into a single score.
Exploring those tradeoffs in the solution space might allow for solutions in edge cases to be found more effectively.
I agree. For example, using entropy is not always better than using IOC.
Does a "multiobjective" return multiple solutions then? To be evaluated by humans?