We have seen that there seem to be unusually large distance gaps between some of the symbols in the 340.
I came up with a simple new statistic and called it unigram distance (it is available in AZdecrypt 1.05 as such). For each symbol it sums the distances between consecutive occurrences, and the total over all symbols comes out at 15034 for the 340. Interestingly, this turns out to be a rather large number which does not correlate with homophonic substitution. The observation is significant (but curious): with randomizations the rarity seems to be somewhere between 1 in 100,000 and 1 in 1,000,000. A 340-character part of the 408 scores only 13469.
Combinations processed: 1000000/1000000
Measurements:
- Summed: 13302965150
- Average: 13302.96515
- Lowest: 11249 (Randomize(415042))
- Highest: 15053 (Randomize(414948))
One way to replicate the observation is to create a cipher with 2 different keys, one key for the first 10 rows and another for the last 10 rows (encoding restart hypothesis). Here follows such a cipher with a unigram distance of 14897, which closely matches that of the 340.
1 2 3 4 5 6 7 2 2 8 9 10 11 12 13 14 2 15 16 5 17 18 19 20 12 21 22 1 23 20 13 24 25 26 27 28 19 29 3 30 7 23 31 13 32 15 33 25 34 35 36 37 38 39 8 2 2 21 40 10 41 1 2 42 43 44 31 5 3 45 46 27 12 47 13 48 49 15 50 51 52 5 53 54 19 20 12 24 18 34 7 23 55 36 15 31 13 50 22 56 37 38 57 5 32 13 25 20 44 40 8 24 54 2 13 58 18 2 2 30 13 59 21 2 2 23 13 31 12 35 27 1 45 10 43 3 60 15 50 24 5 46 36 12 31 13 20 51 55 27 48 7 2 2 8 9 57 15 61 62 5 49 12 29 17 15 21 22 1 23 63 2 1 39 23 20 57 44 1 3 14 6 13 38 9 20 47 40 42 39 52 10 17 53 51 12 46 28 35 54 25 58 58 32 26 57 62 19 9 42 43 41 44 24 1 59 20 36 14 30 33 3 47 63 58 26 40 42 27 57 6 13 44 22 62 1 39 26 8 42 20 26 34 42 29 41 59 1 51 20 15 17 12 38 26 39 56 19 43 33 48 42 16 1 13 38 8 19 29 41 57 24 20 26 6 33 2 1 31 42 29 41 20 60 32 26 29 41 23 1 55 46 61 20 7 37 11 29 13 2 1 54 42 5 26 41 29 38 25 44 52 42 2 20 4 63 49 21 10 39 19 61 1 15 20 28 33 18 36 1 37 17 50 22 26 41 29 14 3 4
Another way to replicate the observation is to change the row order; you may try to place top over bottom. For instance, placing a 340-character part of the 408 top over bottom increases the unigram distance from 13469 to 14026, while with the 340 it decreases the unigram distance from 15034 to 14302.
Here follows a period rollout for the 340, period 1 is the highest.
Periodic: (transposition, untransposition)
--------------------------------------------------
Period 1: 15034, 15034 <---
Period 2: 14237, 13495
Period 3: 13907, 13568
Period 4: 13723, 13768
Period 5: 13364, 13221
Period 6: 12973, 12568
Period 7: 13606, 13208
Period 8: 13223, 12832
Period 9: 13331, 13243
Period 10: 13674, 12105
Period 11: 13690, 13032
Period 12: 13549, 13099
Period 13: 13185, 13723
Period 14: 13424, 13176
Period 15: 13454, 13320
Period 16: 13443, 13068
Period 17: 13190, 13381
Period 18: 12468, 13204
Period 19: 13198, 12372
Period 20: 13381, 13190
Period 21: 12949, 12704
Period 22: 12957, 13640
Period 23: 13450, 13241
Period 24: 12805, 13117
Period 25: 13061, 13626
Period 26: 13262, 13260
Period 27: 13181, 12933
Period 28: 13542, 13973
Period 29: 12886, 13297
Period 30: 13642, 13893
Period 31: 13032, 13690
Period 32: 12888, 13691
Period 33: 13434, 13156
Period 34: 12105, 13674
Period 35: 12222, 13832
Period 36: 13360, 13244
Period 37: 13568, 12962
Period 38: 13382, 13346
Period 39: 13541, 13193
Period 40: 13400, 13424
Period 41: 12772, 13035
Period 42: 13166, 13078
Period 43: 12765, 13240
Period 44: 12634, 13362
Period 45: 13105, 13577
Period 46: 13188, 13374
Period 47: 13264, 13563
Period 48: 13755, 13771
Period 49: 12922, 13649
Period 50: 13224, 13734
Period 51: 13197, 13535
Period 52: 12979, 13173
Period 53: 12851, 13578
Period 54: 12889, 13663
Period 55: 12786, 13801
Period 56: 12822, 13557
Period 57: 12878, 12971
Period 58: 13105, 13265
Period 59: 13267, 13567
Period 60: 13465, 13719
Period 61: 13636, 13335
Period 62: 13683, 13665
Period 63: 13609, 13630
Period 64: 13828, 14204
Period 65: 13531, 13541
Period 66: 13543, 13369
Period 67: 13441, 12940
Period 68: 13221, 13364
Period 69: 12979, 13541
Period 70: 12940, 13476
Period 71: 13129, 13294
Period 72: 12761, 13411
Period 73: 13325, 13279
Period 74: 13379, 13279
Period 75: 13338, 13250
Period 76: 13257, 13593
Period 77: 13181, 13468
Period 78: 13172, 13470
Period 79: 13416, 13377
Period 80: 13432, 13512
Period 81: 13147, 13775
Period 82: 13145, 13904
Period 83: 12973, 13849
Period 84: 13590, 13626
Period 85: 13768, 13723
Period 86: 13684, 13607
Period 87: 13607, 13150
Period 88: 13593, 13097
Period 89: 13790, 13096
Period 90: 13850, 13464
Period 91: 13951, 13633
Period 92: 13985, 13765
Period 93: 13537, 13799
Period 94: 13776, 13977
Period 95: 13875, 14022
Period 96: 13841, 14196
Period 97: 14044, 14148
Period 98: 14081, 13968
Period 99: 13683, 14025
Period 100: 13648, 14117
Period 101: 13687, 13826
Period 102: 13722, 13877
Period 103: 13750, 13883
Period 104: 13636, 13652
Period 105: 13544, 13514
Period 106: 13712, 13595
Period 107: 13405, 13306
Period 108: 13277, 13395
Period 109: 13155, 13605
Period 110: 13059, 13674
Period 111: 12940, 13852
Period 112: 12980, 13860
Period 113: 12886, 13739
Period 114: 13578, 13846
Period 115: 13691, 14037
Period 116: 13575, 14092
Period 117: 13573, 14005
Period 118: 13518, 14052
Period 119: 13587, 14056
Period 120: 13559, 13990
Period 121: 13357, 13832
Period 122: 13395, 13885
Period 123: 13291, 13861
Period 124: 13239, 13964
Period 125: 13308, 13889
Period 126: 13279, 13825
Period 127: 13305, 13723
Period 128: 13273, 13700
Period 129: 13254, 13619
Period 130: 13455, 13625
Period 131: 13535, 13693
Period 132: 13775, 13528
Period 133: 13755, 13642
Period 134: 13625, 13477
Period 135: 13594, 13352
Period 136: 13548, 13151
Period 137: 13733, 13237
Period 138: 13638, 13489
Period 139: 13705, 13364
Period 140: 13797, 13476
Period 141: 13803, 13365
Period 142: 13738, 13495
Period 143: 13603, 13642
Period 144: 13693, 13849
Period 145: 13616, 13798
Period 146: 13732, 13681
Period 147: 13664, 13579
Period 148: 13539, 13590
Period 149: 13795, 13695
Period 150: 13947, 13923
Period 151: 14054, 14067
Period 152: 13878, 14161
Period 153: 13952, 14178
Period 154: 14012, 14257
Period 155: 14092, 14319
Period 156: 13994, 14337
Period 157: 13955, 14681
Period 158: 13693, 14662
Period 159: 13700, 14707
Period 160: 13767, 14828
Period 161: 13755, 14671
Period 162: 13430, 14715
Period 163: 13380, 14761
Period 164: 13181, 14873
Period 165: 13047, 14792
Period 166: 13246, 14679
Period 167: 13292, 14560
Period 168: 13204, 14345
Period 169: 13505, 14344
Period 170: 13495, 14237
Period 171: 13519, 14179
Period 172: 13531, 14140
Period 173: 13567, 14185
Period 174: 13609, 14094
Period 175: 13623, 14181
Period 176: 13549, 14121
Period 177: 13540, 14314
Period 178: 13439, 14313
Period 179: 13329, 14306
Period 180: 13365, 14314
Period 181: 13302, 14281
Period 182: 13325, 14283
Period 183: 13193, 14239
Period 184: 13267, 14310
Period 185: 13245, 14338
Period 186: 13179, 14309
Period 187: 13188, 14285
Period 188: 13177, 14306
Period 189: 13184, 14270
Period 190: 13175, 14259
Period 191: 13210, 14281
Period 192: 13151, 14291
Period 193: 13075, 14216
Period 194: 13118, 14203
Period 195: 13179, 14224
Period 196: 13225, 14202
Period 197: 13158, 14182
Period 198: 13185, 14157
Period 199: 13204, 14089
Period 200: 13251, 14137
Period 201: 13209, 14139
Period 202: 13309, 14146
Period 203: 13340, 14133
Period 204: 13348, 14141
Period 205: 13356, 14087
Period 206: 13201, 13986
Period 207: 13179, 13968
Period 208: 13209, 13950
Period 209: 13173, 13923
Period 210: 13185, 13961
Period 211: 13171, 13925
Period 212: 13209, 13974
Period 213: 13229, 13966
Period 214: 13238, 13950
Period 215: 13184, 13938
Period 216: 13228, 13905
Period 217: 13157, 13893
Period 218: 13153, 13945
Period 219: 12986, 13823
Period 220: 12958, 13817
Period 221: 12998, 13840
Period 222: 13017, 13857
Period 223: 13117, 13925
Period 224: 13131, 13809
Period 225: 13167, 13780
Period 226: 13204, 13825
Period 227: 13250, 13856
Period 228: 13270, 13813
Period 229: 13312, 13815
Period 230: 13331, 13827
Period 231: 13302, 13619
Period 232: 13305, 13596
Period 233: 13332, 13561
Period 234: 13374, 13619
Period 235: 13397, 13599
Period 236: 13406, 13508
Period 237: 13471, 13527
Period 238: 13357, 13489
Period 239: 13378, 13443
Period 240: 13547, 13419
Period 241: 13555, 13329
Period 242: 13582, 13571
Period 243: 13585, 13555
Period 244: 13626, 13745
Period 245: 13652, 13582
Period 246: 13708, 13473
Period 247: 13724, 13550
Period 248: 13733, 13612
Period 249: 13710, 13749
Period 250: 13750, 13743
Period 251: 13723, 13733
Period 252: 13680, 13749
Period 253: 13737, 13608
Period 254: 13660, 13545
Period 255: 13702, 13511
Period 256: 13741, 13520
Period 257: 13683, 13498
Period 258: 13739, 13497
Period 259: 13747, 13589
Period 260: 13784, 13570
Period 261: 13779, 13549
Period 262: 13777, 13560
Period 263: 13749, 13559
Period 264: 13700, 13585
Period 265: 13576, 13654
Period 266: 13729, 13676
Period 267: 13798, 13624
Period 268: 13819, 13596
Period 269: 13835, 13584
Period 270: 13877, 13563
Period 271: 13913, 13575
Period 272: 13929, 13580
Period 273: 14009, 13618
Period 274: 14051, 13578
Period 275: 14019, 13618
Period 276: 14086, 13615
Period 277: 14099, 13723
Period 278: 14101, 13729
Period 279: 14009, 13584
Period 280: 14068, 13583
Period 281: 14101, 13611
Period 282: 14033, 13594
Period 283: 13944, 13598
Period 284: 13859, 13646
Period 285: 13898, 13878
Period 286: 13945, 13856
Period 287: 13698, 13859
Period 288: 13699, 14092
Period 289: 13795, 14071
Period 290: 13793, 14071
Period 291: 13824, 14031
Period 292: 13856, 14005
Period 293: 13976, 14004
Period 294: 13944, 13960
Period 295: 13944, 13921
Period 296: 13901, 13895
Period 297: 13924, 13868
Period 298: 13945, 14072
Period 299: 13953, 14070
Period 300: 13925, 14102
Period 301: 13874, 14358
Period 302: 13804, 14365
Period 303: 13766, 14378
Period 304: 13834, 14381
Period 305: 13859, 14331
Period 306: 13825, 14366
Period 307: 13859, 14367
Period 308: 13786, 14283
Period 309: 13801, 14478
Period 310: 13822, 14618
Period 311: 13875, 14579
Period 312: 13894, 14550
Period 313: 13815, 14614
Period 314: 13799, 14589
Period 315: 13942, 14607
Period 316: 13925, 14698
Period 317: 13791, 14781
Period 318: 14047, 14764
Period 319: 14015, 14724
Period 320: 14107, 14746
Period 321: 14139, 14700
Period 322: 14194, 14615
Period 323: 14204, 14586
Period 324: 14180, 14653
Period 325: 14212, 14654
Period 326: 14291, 14761
Period 327: 14370, 14788
Period 328: 14430, 14864
Period 329: 14462, 14881
Period 330: 14448, 14885
Period 331: 14467, 14855
Period 332: 14567, 14835
Period 333: 14551, 14867
Period 334: 14683, 14853
Period 335: 14851, 14905
Period 336: 14915, 14937
Period 337: 14877, 14953
Period 338: 14872, 14995
Period 339: 14951, 15005
Period 340: 15034, 15034 <---
--------------------------------------------------
Transposition average: 13556.38
Untransposition average: 13853.61
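A minimal Python sketch of how a rollout like the one above could be produced. I am assuming that "period p" means reading every p-th symbol (offsets 0, 1, ..., p-1 in turn) and that untransposition is the exact inverse. That reading is consistent with the table (period 2 transposed matches period 170 untransposed, and periods 1 and 340 leave the cipher unchanged), but it is not taken from the AZdecrypt source. All function names here are mine.

```python
def unigram_distance(cipher):
    """Sum of gaps between consecutive occurrences of each symbol."""
    last_seen, total = {}, 0
    for pos, sym in enumerate(cipher):
        if sym in last_seen:
            total += pos - last_seen[sym]
        last_seen[sym] = pos
    return total


def read_order(length, period):
    """Positions visited when reading every period-th symbol from offsets 0..period-1."""
    return [i for start in range(period) for i in range(start, length, period)]


def transpose(cipher, period):
    """Assumed meaning of 'transposition' at a given period."""
    return [cipher[i] for i in read_order(len(cipher), period)]


def untranspose(cipher, period):
    """Inverse of transpose: put each symbol back where transpose took it from."""
    out = [None] * len(cipher)
    for dst, src in enumerate(read_order(len(cipher), period)):
        out[src] = cipher[dst]
    return out


def period_rollout(cipher):
    """Map each period to (transposed score, untransposed score)."""
    return {
        p: (unigram_distance(transpose(cipher, p)),
            unigram_distance(untranspose(cipher, p)))
        for p in range(1, len(cipher) + 1)
    }
```

Calling `period_rollout(cipher)` on a list of symbol numbers returns one `(transposition, untransposition)` pair per period, which can be printed in the same layout as the table.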
Here follows a unigram distance frequency rollout for the 340.
Unigram distance frequencies:
--------------------------------------------------
Distance 1: 4
Distance 2: 1
Distance 3: 2
Distance 4: 2
Distance 5: 2
Distance 6: 1
Distance 7: 3
Distance 8: 4
Distance 9: 3
Distance 10: 4
Distance 11: 2
Distance 12: 3
Distance 13: 6
Distance 14: 6
Distance 15: 2
Distance 16: 5
Distance 17: 5
Distance 18: 3
Distance 19: 4
Distance 20: 3
Distance 21: 6
Distance 22: 2
Distance 23: 8
Distance 24: 7
Distance 25: 4
Distance 26: 5
Distance 27: 2
Distance 28: 4
Distance 29: 6
Distance 30: 2
Distance 31: 4
Distance 32: 5
Distance 33: 3
Distance 34: 3
Distance 35: 6
Distance 36: 3
Distance 37: 2
Distance 38: 1
Distance 39: 3
Distance 40: 2
Distance 41: 4
Distance 42: 4
Distance 43: 4
Distance 44: 1
Distance 45: 1
Distance 46: 3
Distance 47: 1
Distance 48: 4
Distance 49: 4
Distance 50: 2
Distance 51: 1
Distance 52: 3
Distance 55: 4
Distance 56: 2
Distance 57: 1
Distance 59: 3
Distance 60: 2
Distance 61: 1
Distance 62: 1
Distance 63: 1
Distance 65: 1
Distance 66: 3
Distance 67: 1
Distance 68: 1
Distance 69: 1
Distance 70: 2
Distance 71: 7
Distance 72: 1
Distance 73: 1
Distance 74: 2
Distance 75: 2
Distance 76: 3
Distance 77: 2
Distance 78: 4
Distance 80: 1
Distance 81: 1
Distance 82: 1
Distance 83: 2
Distance 84: 3
Distance 87: 1
Distance 89: 1
Distance 91: 1
Distance 92: 2
Distance 95: 1
Distance 98: 1
Distance 101: 2
Distance 102: 2
Distance 103: 3
Distance 108: 1
Distance 109: 1
Distance 110: 2
Distance 112: 2
Distance 113: 1
Distance 115: 1
Distance 120: 2
Distance 128: 1
Distance 129: 1
Distance 133: 1
Distance 137: 1
Distance 142: 1
Distance 147: 1
Distance 148: 1
Distance 150: 1
Distance 151: 1
Distance 156: 1
Distance 163: 1
Distance 170: 1
Distance 173: 1
Distance 175: 1
Distance 186: 1
Distance 187: 1
Distance 198: 1
Distance 207: 1
Distance 212: 1
Distance 218: 1
Distance 230: 1
Distance 241: 1
Distance 242: 1
Distance 251: 1
And the FreeBASIC code for the unigram distance statistic.
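' Unigram distance: for every symbol, sum the distances between its consecutive occurrences.
' array() holds the cipher as symbol numbers 1..s, l is the cipher length, s is the number of symbols.
function m_unigramdistance(array()as integer,byval l as integer,byval s as integer)as double
    dim as integer i,j,score
    dim as short aud(s,l) ' aud(sym,0) = occurrence count, aud(sym,1..) = positions of sym
    ' pass 1: record the positions of every symbol
    for i=1 to l
        aud(array(i),0)+=1
        aud(array(i),aud(array(i),0))=i
    next i
    ' pass 2: add up the gaps between consecutive positions of each symbol
    for i=1 to s
        for j=1 to aud(i,0)-1
            score+=aud(i,j+1)-aud(i,j)
        next j
    next i
    return score
end function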
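For readers without FreeBASIC, here is a minimal Python sketch of the same statistic, plus the distance frequency count. The names `symbol_gaps`, `unigram_distance` and `distance_frequencies` are mine, not AZdecrypt's, and the cipher is assumed to be any sequence of hashable symbols (numbers or characters).

```python
from collections import Counter


def symbol_gaps(cipher):
    """Yield the gap between each pair of consecutive occurrences of the same symbol."""
    last_seen = {}
    for pos, sym in enumerate(cipher):
        if sym in last_seen:
            yield pos - last_seen[sym]
        last_seen[sym] = pos


def unigram_distance(cipher):
    """Sum of all gaps (the statistic computed by the FreeBASIC function)."""
    return sum(symbol_gaps(cipher))


def distance_frequencies(cipher):
    """How often each gap size occurs."""
    return Counter(symbol_gaps(cipher))


# Tiny worked example: A sits at 0, 3, 5 (gaps 3 and 2), B at 1, 4 (gap 3), C occurs once.
example = "ABCABA"
print(unigram_distance(example))        # 3 + 2 + 3 = 8
print(distance_frequencies(example))    # gap 3 twice, gap 2 once
```

One thing worth noting: the gaps of a single symbol telescope, so the statistic is the same as summing, over all symbols, the position of the last occurrence minus the position of the first. Symbols that occur once contribute nothing.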
Hi Jarlve,
I am not sure if the unigram distance is really curious. I think this depends only on the length of the used key and the underlying plaintext. For example take the first 20 lines from z408 and encode it with a longer key (e.g. the default one from peek-a-boo). Here is the result:
Plaintext:
ILIKEKILLINGPEOPL
EBECAUSEITISSOMUC
HFUNITIAMOREFUNTH
ANKILLINGWILDGAME
INTHEFORRESTBECAU
SEMANISTHEMOATDAN
GERTUEANAMALOFALL
TOKILLSOMETHINGGI
VESMETHEMOATTHRIL
LINGEXPERENCEITIS
EVENBETTERTHANGET
TINGYOURROCKSOFFW
ITHAGIRLTHEBESTPA
RTOFITIATHAEWHENI
DIEIWILLBEREBORNI
NPARADICEANDALLTH
EIHAVEKILLEDWILLB
ECOMEMYSLAVESIWIL
LNOTGIVEYOUMYNAME
BECAUSEYOUWILLTRY
Encoded (cyclic):
h:cjajCuU5AEYxXY:
39ybdro+hLc1z=;ve
FISqCw5D;Xn0Ir782
iRjhuUcAJlC:kEZ;P
5qspfI=GVgKN4abdv
Mx;D7hoLF3;XiwBZR
Jyt8S+dAD;iu=IZU:
sXjcuU1=;0N2CqEJ5
QPz;fLpg;Xdw8Fnh:
uc7Ea-YxG3ReyCs5K
+Q0A9PNLfVw2DqJg8
sh7ET=rtnXbjM=IIl
cNpiJCGULFa4xowYZ
V8XI5shdN2D3lpyRc
HC+5lh:u90tP4=nAc
qYiGZkCefd7BDU:LF
g52iQajhuUxHlc:u9
3bX;y;T1UZQ+zCl5:
uR=wEhQ0TXv;TAd;P
4feDSKgT=rlcU:8VT
The encoded version shows a very high unigram distance score of 15267.
What I find curious is the connection between unigram distance and transposition. If z340 is a transposed homophonic cipher, then untransposing p15/19 or columns odd/even + p18 (which is basically a diagonal transposition) should increase the score rather than decrease it.
I am not sure if the unigram distance is really curious. I think this depends only on the length of the used key and the underlying plaintext. For example take the first 20 lines from z408 and encode it with a longer key (e.g. the default one from peek-a-boo). Here is the result:
Good reasoning. Also, I was wrong that it did not correlate with sequential homophonic substitution. Still, the 340 does not have such a long key. And when trying to match the raw ioc and encoding randomness of the 340, the observation’s rarity may still be between 1 in 1,000 and 1 in 10,000. I would say it is at least still a bit curious.
In short, a lower ioc and more sequential homophonic substitution increase unigram distance. Randomizing the character order of your cipher many times gives an average unigram distance of about 13800, while for the 340 it is about 13302. You may normalize for the ioc by dividing by the average unigram distance, but that does not normalize for the quality of the sequential homophonic substitution.
What I find curious is the connection between unigram distance and transposition. If z340 is a transposed homophonic cipher, then untransposing p15/19 or columns odd/even + p18 (which is basically a diagonal transposition) should increase the score rather than decrease it.
Well, since we generally assume transposition before encoding and if the unigram distance is some property of encoding then untransposing p19 should not increase it since that would disturb the encoding.
Gosh! I’ve completely forgotten that! Next time I’ll reconsider my text before I post.
Another way to guarantee a high unigram distance is to start with what Largo calls a long key (or optimal suppression of frequencies) with more than 63 symbols, say 70+, and merge symbols afterwards. I kinda like this one, but it does not randomize symbol cycles by default, and depending on the number of symbols merged the cipher could still be solved.
Here is the base cipher with 75 symbols having a unigram distance of 16779.
37 32 48 3 24 3 49 44 46 65 5 55 35 67 33 7 2 69 9 70 58 40 74 27 4 1 42 37 29 62 66 15 12 10 36 17 51 31 48 73 49 13 25 75 30 24 52 74 43 14 39 50 59 3 65 28 32 1 5 6 19 37 44 20 55 56 53 67 48 31 42 47 69 17 8 63 72 70 27 73 26 4 34 11 12 29 24 15 40 43 49 62 14 57 67 25 33 13 42 45 50 59 6 69 18 66 51 27 56 5 65 53 11 46 75 52 40 2 28 73 8 3 1 32 44 29 33 15 70 14 16 37 31 55 6 48 21 4 62 25 24 42 36 67 53 66 13 73 14 39 30 49 46 2 65 43 55 69 22 35 70 63 4 59 58 24 1 42 37 27 67 38 69 5 60 70 73 14 4 72 42 47 50 31 6 24 73 14 48 43 55 61 75 74 18 30 8 10 3 29 33 17 52 19 49 42 57 56 6 65 63 28 73 16 67 71 69 62 14 7 11 72 42 66 17 1 73 37 13 14 36 40 42 19 39 70 59 48 64 49 4 65 19 1 32 44 9 24 18 67 26 75 30 5 37 31 35 50 63 56 20 48 34 69 11 43 45 40 46 2 73 47 70 49 57 50 54 4 3 65 28 32 24 64 19 1 44 46 60 67 58 8 15 69 25 68 27 2 56 21 70 29 37 19 48 28 32 59 33 14 55 49 38 4 23 66 12 53 41 5 11 15 24 71 67 10 40 51 62 69 61 75 74 19 65 44 46 42 72 68
And here is the base cipher after merging to 63 symbols with a unigram distance of 15228, a tad higher than the 340.
64 32 48 3 68 3 47 44 46 65 5 55 35 23 25 7 2 54 9 70 58 40 74 29 4 1 25 64 29 62 66 15 12 10 36 17 51 31 48 73 47 13 25 75 30 68 52 74 43 14 39 50 25 3 65 28 32 1 5 6 19 64 44 20 55 56 53 23 48 31 25 47 54 17 8 63 72 70 29 73 26 4 34 11 12 29 68 15 40 43 47 62 14 57 23 25 25 13 25 45 50 25 6 54 18 66 51 29 56 5 65 53 11 46 75 52 40 2 28 73 8 3 1 32 44 29 25 15 70 14 16 64 31 55 6 48 21 4 62 25 68 25 36 23 53 66 13 73 14 39 30 47 46 2 65 43 55 54 22 35 70 63 4 25 58 68 1 25 64 29 23 38 54 5 38 70 73 14 4 72 25 47 50 31 6 68 73 14 48 43 55 71 75 74 18 30 8 10 3 29 25 17 52 19 47 25 57 56 6 65 63 28 73 16 23 71 54 62 14 7 11 72 25 66 17 1 73 64 13 14 36 40 25 19 39 70 25 48 64 47 4 65 19 1 32 44 9 68 18 23 26 75 30 5 64 31 35 50 63 56 20 48 34 54 11 43 45 40 46 2 73 47 70 47 57 50 54 4 3 65 28 32 68 64 19 1 44 46 38 23 58 8 15 54 25 68 29 2 56 21 70 29 64 19 48 28 32 25 25 14 55 47 38 4 23 66 12 53 68 5 11 15 68 71 23 10 40 51 62 54 71 75 74 19 65 44 46 25 72 68
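A rough Python sketch of the merging step, in case someone wants to experiment with it. The mapping passed in is purely hypothetical (it is not the mapping I used for the cipher above), and the function names are made up for this post.

```python
def unigram_distance(cipher):
    """Sum of gaps between consecutive occurrences of each symbol."""
    last_seen, total = {}, 0
    for pos, sym in enumerate(cipher):
        if sym in last_seen:
            total += pos - last_seen[sym]
        last_seen[sym] = pos
    return total


def merge_symbols(cipher, merges):
    """Replace every symbol that is a key of `merges` with its target symbol."""
    return [merges.get(sym, sym) for sym in cipher]


# Toy base cipher with 3 symbols; each symbol spans 3 positions, total 9.
base = [1, 4, 2, 1, 4, 2]
merged = merge_symbols(base, {4: 1})   # hypothetical merge of symbol 4 into symbol 1

print(len(set(base)), unigram_distance(base))       # 3 symbols, distance 9
print(len(set(merged)), unigram_distance(merged))   # 2 symbols, distance 7
```

Since each symbol contributes its last position minus its first, merging two symbols replaces two spans by one span that runs from the earlier first occurrence to the later last occurrence. When the two spans overlap heavily the score drops, as in the toy example; when they barely overlap it can go up instead.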
Additional observations on unigram distances and locations, from some work I have been doing: the "middle" of the 340 seems to have anomalies in it besides the pivots, and it is the only place you see an A-B-A type pattern (+ b + on line 9).
I find it interesting that if you divide the 340 into thirds by lines (first 7, middle 7, last 6), the middle 7 lines use only 50 (!) of the 63 possible symbols, with two unique to those lines: the backwards B (b) appears 3 times and the square with the dot in it appears once. They do not appear in the first 7 or last 6 lines. I don’t know if limiting the number of symbols increases the likelihood of getting the pivot patterns?
Also, there are 11 symbols that appear only in the top 7 and/or bottom 6 lines, never in the middle 7:
W- 6 times
C- 5 times
)- 5 times (circle with horizontal line)
k- 5 times
>- 4 times
P- 3 times
H- 3 times
1- 3 times (bottom shaded circle)
:- 2 times (weird I with dot on right)
X- 2 times
%- 2 times* (right shaded square) *only appears in top
I would expect with a truly flat "homophonic" distribution, each symbol should appear ~5 times. The unigrams clearly are not distributed evenly dependent on the underlying text. Perhaps this could provide insight into the enciphering method or transposition methods, including which came first- enciphering or transposing. My guess is this suggests enciphering though I am not yet sold on homophonicism… (can I make up a word?).
Final note, 11 of the 24 + symbols are in the middle 7 lines. It may not be statistically relevant, but just another observation.
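If anyone wants to re-check these counts, here is a small Python sketch. It assumes the cipher is available as a list of 20 row strings with one character per symbol in whatever transcription you use; the function names and the default band boundaries (rows 8 to 14 as the middle) are just my way of writing down the split described above.

```python
from collections import Counter


def band_symbols(rows, start, stop):
    """Distinct symbols used in rows[start:stop] (0-based, stop exclusive)."""
    return set("".join(rows[start:stop]))


def only_outside_middle(rows, middle=(7, 14)):
    """Count symbols that occur in the outer rows but never in the middle band."""
    lo, hi = middle
    middle_syms = band_symbols(rows, lo, hi)
    outer_text = "".join(rows[:lo] + rows[hi:])
    return Counter(sym for sym in outer_text if sym not in middle_syms)


# Toy example with 4 short rows, treating rows 2 and 3 as the "middle".
rows = ["AB", "AC", "BD", "AE"]
print(len(band_symbols(rows, 1, 3)))        # 4 distinct symbols in the middle band
print(only_outside_middle(rows, (1, 3)))    # only E is absent from the middle band
```

With the real 20 rows, `len(band_symbols(rows, 7, 14))` would give the number of distinct symbols in the middle 7 lines, and `only_outside_middle(rows)` the symbols missing from them together with their counts.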
-marie
The problem when solved will be simple– Kettering
Imo it still might be a homophonic substitution but without repeating sequences (instead, at least partially sectional use of homophones). Having more symbols (not only letters) than letters in the alphabet somehow points to that conclusion, as you couldn’t read any cleartext consisting of symbols instead of letters.
The ) might be part of the E due to its high frequency in some parts of the cipher.
QT
*ZODIACHRONOLOGY*
Might, what a great word. Could be. E is a confusion, and w/o knowing plaintext. I just see too much fumbling to make it true.
JMHO.
-m
Additional observations on unigram distances and locations, from some work I have been doing: the "middle" of the 340 seems to have anomalies in it besides the pivots, and it is the only place you see an A-B-A type pattern (+ b + on line 9).
Related to the top-down symmetry of these symbols:
This is an interesting topic and I did some shuffle studies. I ran each cipher (z408, z340, Largo’s test cipher, and Jarlve’s two test ciphers) through ten million shuffles and generated stats for unigram distance:
Headings:
– Min: Smallest unigram distance observed over all shuffles
– Max: Largest unigram distance observed over all shuffles
– Mean: Average unigram distance observed over all shuffles
– Median: Median of all unigram distances observed over all shuffles
– Std dev: Standard deviation of unigram distance observed over all shuffles
– Actual: The actual unigram distance observed for the given cipher
– Sigma: Number of standard deviations the observed unigram distance is from the mean unigram distance over all shuffles
– Hits: Number of times during shuffles that a unigram distance was observed having an equal or better score than the measurement for the unshuffled cipher
– Shuffles per hits: Average number of shuffles needed to achieve the same or better measurement as the actual measurement
So, you can see that z340’s unigram distance is much more significant than z408’s unigram distance. Largo’s is better than z408’s but not as significant as z340. And Jarlve’s two ciphers both have very improbable unigram distances since their measurements were never reached during shuffles.
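For anyone who wants to reproduce this kind of shuffle study, here is a minimal Python sketch of the stats listed under the headings. It is not my actual test harness. I am assuming "equal or better" means a unigram distance greater than or equal to the actual one, and the names `shuffle_stats`, `shuffles` and `seed` are made up for this sketch. It uses far fewer shuffles by default than the ten million above.

```python
import random
import statistics


def unigram_distance(cipher):
    """Sum of gaps between consecutive occurrences of each symbol."""
    last_seen, total = {}, 0
    for pos, sym in enumerate(cipher):
        if sym in last_seen:
            total += pos - last_seen[sym]
        last_seen[sym] = pos
    return total


def shuffle_stats(cipher, shuffles=10_000, seed=0):
    """Compare the cipher's unigram distance with that of random shuffles of itself."""
    rng = random.Random(seed)            # seeded so runs are repeatable
    actual = unigram_distance(cipher)
    work = list(cipher)
    scores = []
    for _ in range(shuffles):
        rng.shuffle(work)
        scores.append(unigram_distance(work))
    mean = statistics.fmean(scores)
    std = statistics.pstdev(scores)
    hits = sum(score >= actual for score in scores)   # assumption: higher is "better"
    return {
        "min": min(scores),
        "max": max(scores),
        "mean": mean,
        "median": statistics.median(scores),
        "std_dev": std,
        "actual": actual,
        "sigma": (actual - mean) / std if std else 0.0,
        "hits": hits,
        "shuffles_per_hit": shuffles / hits if hits else None,
    }
```

A `shuffles_per_hit` of `None` corresponds to the case where the actual measurement was never reached during the shuffles.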
I wonder: If some step has diffused/muted the cycling effect (but not eliminated it altogether), would correctly unravelling the step result in higher cycle scores AND higher (or more statistically significant) unigram distances? It seems like it is easy to find rearrangements of Z340 that significantly affect the cycling scores. An example I plan to give in my talk next week is this simple swap of rows:
HER>pl^VPk|1LTG2d
Np+B(#O%DWY.<*Kf)
2<clRJ|*5T4M.+&BF
z69Sy#+N|5FBc(;8R
lGFN^f524b.cV4t++
|FkdW<7tB_YOB*-Cc
>MDHNpkSzZO8A|K;+
(G2Jfj#O+_NYz+@L9
d<M+b+ZR2FBcyA64K
-zlUV+^J+Op7<FBy-
U+R/5tE|DYBpbTMKO
By:cM+UZGW()L#zHJ
Spp7^l8*V3pO++RK2
_9M+ztjd|5FP+&4k/
yBX1*:49CE>VUZ5-+
|c.3zBK(Op^.fMqG2
RcT+L16C<+FlWB|)L
++)WCzWcPOSHT/()p
p8R^FlO-*dCkF>2D(
#5+Kq%;2UcXGV.zL|
One of my measurements (a computation of the average statistical significance of individual cycles) gives this rearrangement a score that is double that of the unmodified Z340. But unigram distance is 14396 which is a little smaller than that for the unmodified Z340. Maybe because this rearrangement is a false positive. Could unigram distance significance give us a way to filter out false positives?
Are you sure about the above rearrangement of rows? Because I tried it and got very slightly lower cycle score overall.
Thanks for checking. I think the reason the modified cipher scores highly for me is because its cycles are more statistically significant, rather than simply being more numerous.
Here’s my methodology:
1) Generate stats for a bunch of shuffles, for all cycles of length 2.
2) Stats are collected for the "runs" of cycles. Example: "AAA AB AB AB BAB" has a max run length of 3 because of [AB] [AB] [AB].
3) In the shuffles, average and standard deviation of max runs are computed for every pair of symbols.
4) Then, actual measurements of a target cipher’s cycles are compared to the shuffle stats.
5) The target cipher will have observed runs for each pair of symbols. The observed runs are compared to the mean during shuffles. Then I count how many standard deviations (sigma) the observed measurement is from the average runs observed during shuffles.
6) For each symbol pair, I compute sigma. So, this gives a relative sense of the significance of each observed cycle.
7) Then I compute the average of all the observed sigma values (over all symbol pairs).
I believe using standard deviation from the shuffle stats has a normalization effect on the cycle measurements. This helps filter out some of the noise of false cycles.
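Here is a stripped-down Python sketch of steps 1 to 7, not the code I actually ran. Two things in it are my simplifications for this post: a "run" is measured as the longest stretch of strict alternation between the two symbols, counted in complete pairs (so either starting symbol is accepted), and pairs whose max run never varies over the shuffles are skipped because sigma is undefined for them. All names are made up for the sketch.

```python
import random
import statistics
from itertools import combinations


def max_run(cipher, a, b):
    """Longest alternating a/b stretch in the subsequence of a and b, in complete pairs."""
    best = run = 0
    prev = None
    for sym in cipher:
        if sym != a and sym != b:
            continue                      # ignore all other symbols
        run = run + 1 if (prev is not None and sym != prev) else 1
        prev = sym
        best = max(best, run)
    return best // 2


def average_sigma(cipher, shuffles=200, seed=0):
    """Average, over all symbol pairs, of how many std devs the observed max run
    lies above the mean max run seen in shuffles of the same cipher."""
    rng = random.Random(seed)
    pairs = list(combinations(sorted(set(cipher)), 2))
    samples = {pair: [] for pair in pairs}
    work = list(cipher)
    for _ in range(shuffles):             # steps 1-3: collect shuffle stats per pair
        rng.shuffle(work)
        for a, b in pairs:
            samples[(a, b)].append(max_run(work, a, b))
    sigmas = []
    for a, b in pairs:                    # steps 4-6: sigma per pair
        mean = statistics.fmean(samples[(a, b)])
        std = statistics.pstdev(samples[(a, b)])
        if std > 0:
            sigmas.append((max_run(cipher, a, b) - mean) / std)
    return statistics.fmean(sigmas) if sigmas else 0.0   # step 7


print(max_run("AAAABABABBAB", "A", "B"))   # 3, the [AB] [AB] [AB] example from step 2
```

With 63 symbols there are 1953 pairs, so this straightforward version is slow for a full 340-character cipher with many shuffles. It is only meant to pin down what is being measured.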
Computed this way, the row-swapped Z340 has about double the average sigma as the unmodified Z340. You can see this looking at individual cycles, where the overall relative probabilities are, on average, less than those of the Z340. Here is the raw output of cycles of both ciphers for comparison:
https://docs.google.com/spreadsheets/d/ … sp=sharing
It shows all L2 cycles detected for both ciphers, side by side, in decreasing order by estimated probability. When you scroll down, you’ll notice an emerging trend for cycles to become more improbable in the row-swapped Z340. Look at the "How much more improbable" column. Positive values indicate an increase in improbability. There are also 172 extra cycles in the row-swapped Z340 compared to the original.
So, I guess this boils down to a philosophical question: Can a cipher be considered "more" homophonic if it doesn’t really show that many more long cycles BUT has more improbable cycles overall?
All I know is that the "average sigma" measurement scores very high for Z408 compared to Z340 (0.68 vs 0.21).
Good luck with your talk next week doranchak!
Thank you for the test, the results are very clear. Rearrangement of rows is a possible cause for the high unigram distance in the 340, as it could either decrease or increase the value. However, your test suggests that the value for the 340 is quite rare and therefore less likely to have been caused by rearrangement of rows, right? Your example cipher decreases the unique sequence length 17 repeats from 26 to 18, and since we know that is a very significant observation, perhaps you could use it to filter out false positives.
Here are my 2 most likely hypotheses for the high unigram distance in the 340:
1. The high unigram distance of the 340 is related to the group of symbols that do not appear in the middle 7 rows (as marie said).
2. A long key (as Largo said) is used in the homophonic substitution process and some of the most frequently occurring symbols are for whatever reason not part of it or are wildcards. Likely hypotheses for the symbols that were not taken up in the homophonic substitution or are wildcards are that these are 1:1 substitutes, plaintext nulls or wildcards (as smokie said). I would like to go as wide as possible with the interpretation of wildcards.
In your test my cipher jarlve2 very closely matches the 340, which is hypothesis 2. Some things to look into may be exotic cycling types (again), such as palindromic cycling, which also increases unigram distance, and not trying to repeat symbols in a certain view window as opposed to actively cycling homophones.
hi smokie nice work. maybe mark up the pivots in a separate spreadsheet for any visual patterns as well