Here is the conclusion of a large language test conducted on both the Zodiac 340 and 408.
1. From the Wortschatz library, for each listed language, a 1M (1 million) sentence file was downloaded if available. http://corpora2.informatik.uni-leipzig.de/download.html
2. A newly programmed generic ngram creator then processed all the sentence files in the same way and output 5-gram statistics for AZdecrypt to work with. All punctuation was removed, and a letter needed an occurrence of at least 0.2% (1 in 500) to be included in the ngram alphabet. Lowercase and uppercase letters were treated as different letters, since it would be very hard to map these out for all the foreign languages. This does not matter much though. Since 1M sentence files were used, at least 100 megabytes of language information went into each ngram file to meet a quality standard.
3. A cipher is run through all these languages with a newly programmed AZdecrypt identify language feature. AZdecrypt is a powerful substitution solver that can easily solve ciphers such as the Beale 2. First, 50 restarts are made with a randomized cipher to gather an average score. Then another 50 restarts are made without randomization while recording the highest score. The difference between the average and the highest score, reported as a percentage, then determines the score for that language, for example: 110.58%.
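The ngram creation in step 2 can be sketched roughly as follows. This is only a minimal illustration and not the actual ngram creator: the function names are made up, and it assumes that spaces and digits are dropped along with punctuation, that letters below the 0.2% threshold are simply skipped, and that 5-grams do not run across sentence boundaries.

```python
from collections import Counter


def build_alphabet(text, min_share=0.002):
    """Return the letters whose share of all letters is at least min_share (0.2%)."""
    # Keep letters only; case is preserved, so 'A' and 'a' count as different letters.
    letters = [ch for ch in text if ch.isalpha()]
    counts = Counter(letters)
    total = len(letters)
    return {ch for ch, n in counts.items() if n / total >= min_share}


def count_ngrams(sentences, n=5, min_share=0.002):
    """Count n-grams over the filtered letters of each sentence (hypothetical helper)."""
    alphabet = build_alphabet("".join(sentences), min_share)
    ngrams = Counter()
    for sentence in sentences:
        # Assumption: anything outside the alphabet is dropped, not replaced.
        kept = "".join(ch for ch in sentence if ch in alphabet)
        for i in range(len(kept) - n + 1):
            ngrams[kept[i:i + n]] += 1
    return alphabet, ngrams


if __name__ == "__main__":
    alphabet, ngrams = count_ngrams(["Hello there, hello world!"])
    print(len(alphabet), ngrams.most_common(3))
```

Running the same function unchanged over every language file is what makes the resulting statistics comparable, at the cost of the larger alphabets mentioned in point 4.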
Important to note:
4. Score inflation is going on with languages that have a larger alphabet. Therefore, the alphabet size of each language was included in brackets, for example: (26).
5. Some of these ngram files were tested individually and worked fine for solving difficult ciphers such as smokie’s purple haze message. It is possible, though, that some of the ngrams with a very large alphabet may not work as well.
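As a rough sketch of the scoring in step 3: the percentages look like the highest score of the normal restarts divided by the average score of the randomized restarts, times 100. That reading is an assumption, as is the function name, and the numbers in the example are made up rather than taken from a real run.

```python
from statistics import mean


def language_score(randomized_scores, plain_scores):
    """Best score on the real cipher as a percentage of the randomized average.

    Assumption: the reported value is best / average * 100, so a language
    that fits no better than chance lands close to 100%.
    """
    baseline = mean(randomized_scores)  # average over the randomized restarts
    best = max(plain_scores)            # highest score without randomization
    return 100.0 * best / baseline


if __name__ == "__main__":
    # Hypothetical solver scores for one language.
    randomized = [20100.0, 19950.0, 20050.0]
    plain = [21800.0, 22150.0, 21990.0]
    print(f"{language_score(randomized, plain):.2f}%")
```

Under this reading the randomized cipher acts as a per-language baseline, which is why a control run on a randomized 340 is a useful comparison.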
Interpretation of the results:
The results for the 408 show a preference for English, with the South African corpus scoring highest. The 340 shows no significant preference for any language at all; more or less, the ngrams with larger alphabets drift up (see point 4). I’m working to release a new AZdecrypt that includes the identify language feature.
AZdecrypt identify language for: 408.txt
--------------------------------------------------
english(southafrica): 141.72% (26)
english(newzealand): 139.87% (27)
english(unitedkingdom): 138.62% (27)
english(england): 138.32% (27)
english(canada): 138.05% (26)
english(austria): 137.69% (27)
english(europe): 136.78% (28)
english(fiji): 135.25% (31)
korean(wikipedia): 128.53% (69)
swedish(sweden): 120.59% (27)
icelandic(iceland): 120.05% (31)
ukrainian(ukraine): 119.52% (32)
czech(czechrepublic): 118.62% (29)
tamil(mixed): 118.23% (33)
belarusian(belarus): 117.75% (33)
hindi(india): 117.50% (45)
austriangerman(austria): 117.37% (37)
tatar(mixed): 117.36% (36)
hungarian(wikipedia): 117.30% (31)
german(switzerland): 117.13% (37)
czech(europe): 116.94% (28)
german(germania): 116.93% (39)
albanian(albania): 116.76% (28)
greek(greece): 116.68% (33)
indonesian(mixed): 116.49% (25)
indonesian(indonesia): 116.36% (31)
russian(moldavia): 116.20% (32)
russian(kazakhstan): 116.18% (32)
slovak(slovakia): 116.18% (30)
armenian(armenia): 116.10% (39)
russian(russia): 115.57% (31)
russian(azerbaijan): 115.26% (32)
polish(wikipedia): 115.01% (27)
danish(denmark): 114.81% (27)
french(france): 114.59% (28)
slovenian(slovenia): 114.56% (23)
esperanto(mixed): 113.91% (24)
spanish(ecuador): 113.76% (31)
spanish(honduras): 113.72% (29)
bulgarian(wikipedia): 113.60% (30)
traditionalchinese(china): 113.60% (61)
catalan(catalonia): 113.59% (30)
norwegianbokmal(norwegia): 113.57% (25)
japanese(japan): 113.53% (47)
spanish(costarica): 113.48% (30)
spanish(uruguay): 113.38% (32)
portuguese(brazil): 113.28% (32)
italian(mixed): 112.95% (23)
portuguese(portugal): 112.73% (29)
norwegian(norwegia): 112.71% (27)
spanish(columbia): 112.69% (30)
portuguese(wikipedia): 112.65% (30)
estonian(estonia): 112.63% (26)
spanish(guatemala): 112.56% (29)
spanish(wikipedia): 112.38% (32)
croatian(wikipedia): 112.23% (24)
bosnian(bosniaherzegovina): 112.00% (21)
cebuano(cebuano): 111.69% (22)
dutch(netherlands): 110.52% (24)
arabic(wikipedia): 110.50% (34)
azerbaijani(azerbaijan): 110.49% (29)
lithuanian(lithuania): 109.95% (21)
romanian(romania): 109.83% (23)
turkish(turkey): 109.75% (28)
vietnamese(vietnam): 109.67% (35)
hebrew(wikipedia): 109.47% (25)
latvian(latvia): 109.36% (21)
persian(wikipedia): 109.24% (31)
moldavian(moldavia): 109.23% (24)
african(southafrica): 108.92% (24)
persian(iran): 108.84% (33)
chinese(china): 108.55% (31)
finnish(finland): 107.78% (21)

AZdecrypt identify language for: 340.txt
--------------------------------------------------
japanese(japan): 110.58% (47)
hindi(india): 109.47% (45)
tamil(mixed): 109.47% (33)
korean(wikipedia): 109.26% (69)
tatar(mixed): 109.00% (36)
armenian(armenia): 108.42% (39)
indonesian(indonesia): 107.77% (31)
english(fiji): 107.29% (31)
english(newzealand): 107.25% (27)
polish(wikipedia): 107.24% (27)
belarusian(belarus): 107.14% (33)
czech(czechrepublic): 107.12% (29)
english(europe): 106.78% (28)
indonesian(mixed): 106.73% (25)
bulgarian(wikipedia): 106.68% (30)
norwegian(norwegia): 106.63% (27)
swedish(sweden): 106.62% (27)
austriangerman(austria): 106.41% (37)
icelandic(iceland): 106.38% (31)
catalan(catalonia): 106.23% (30)
albanian(albania): 106.21% (28)
danish(denmark): 106.19% (27)
ukrainian(ukraine): 106.03% (32)
spanish(ecuador): 106.01% (31)
russian(moldavia): 105.84% (32)
norwegianbokmal(norwegia): 105.83% (25)
slovak(slovakia): 105.74% (30)
russian(azerbaijan): 105.62% (32)
greek(greece): 105.61% (33)
english(unitedkingdom): 105.52% (27)
french(france): 105.48% (28)
english(canada): 105.43% (26)
czech(europe): 105.35% (28)
spanish(costarica): 105.33% (30)
spanish(wikipedia): 105.33% (32)
portuguese(brazil): 105.30% (32)
hungarian(wikipedia): 105.29% (31)
english(austria): 105.24% (27)
english(southafrica): 105.06% (26)
spanish(honduras): 105.06% (29)
english(england): 104.91% (27)
bosnian(bosniaherzegovina): 104.81% (21)
arabic(wikipedia): 104.79% (34)
russian(kazakhstan): 104.77% (32)
persian(iran): 104.68% (33)
italian(mixed): 104.67% (23)
portuguese(wikipedia): 104.67% (30)
portuguese(portugal): 104.63% (29)
spanish(uruguay): 104.63% (32)
german(germania): 104.61% (39)
spanish(guatemala): 104.59% (29)
dutch(netherlands): 104.49% (24)
slovenian(slovenia): 104.49% (23)
romanian(romania): 104.48% (23)
moldavian(moldavia): 104.46% (24)
esperanto(mixed): 104.44% (24)
german(switzerland): 104.34% (37)
russian(russia): 104.34% (31)
spanish(columbia): 104.29% (30)
hebrew(wikipedia): 104.18% (25)
turkish(turkey): 104.10% (28)
persian(wikipedia): 103.90% (31)
latvian(latvia): 103.87% (21)
lithuanian(lithuania): 103.83% (21)
vietnamese(vietnam): 103.78% (35)
traditionalchinese(china): 103.76% (61)
croatian(wikipedia): 103.74% (24)
azerbaijani(azerbaijan): 103.71% (29)
estonian(estonia): 103.66% (26)
cebuano(cebuano): 102.94% (22)
african(southafrica): 102.88% (24)
chinese(china): 102.32% (31)
finnish(finland): 102.28% (21)

Randomized 340 for control:

AZdecrypt identify language for: 340.txt
--------------------------------------------------
korean(wikipedia): 109.01% (69)
hindi(india): 106.36% (45)
tatar(mixed): 105.27% (36)
english(southafrica): 105.11% (26)
japanese(japan): 105.05% (47)
hungarian(wikipedia): 104.94% (31)
traditionalchinese(china): 104.39% (61)
icelandic(iceland): 104.25% (31)
persian(iran): 104.09% (33)
azerbaijani(azerbaijan): 104.00% (29)
danish(denmark): 103.84% (27)
german(switzerland): 103.80% (37)
polish(wikipedia): 103.79% (27)
french(france): 103.76% (28)
finnish(finland): 103.62% (21)
belarusian(belarus): 103.55% (33)
bulgarian(wikipedia): 103.48% (30)
english(fiji): 103.41% (31)
chinese(china): 103.33% (31)
african(southafrica): 103.22% (24)
russian(russia): 103.22% (31)
greek(greece): 103.13% (33)
norwegian(norwegia): 103.11% (27)
albanian(albania): 103.09% (28)
swedish(sweden): 102.99% (27)
czech(czechrepublic): 102.97% (29)
persian(wikipedia): 102.95% (31)
english(unitedkingdom): 102.88% (27)
spanish(columbia): 102.88% (30)
portuguese(brazil): 102.87% (32)
portuguese(wikipedia): 102.85% (30)
spanish(uruguay): 102.84% (32)
turkish(turkey): 102.83% (28)
norwegianbokmal(norwegia): 102.82% (25)
arabic(wikipedia): 102.81% (34)
cebuano(cebuano): 102.80% (22)
tamil(mixed): 102.80% (33)
vietnamese(vietnam): 102.79% (35)
spanish(wikipedia): 102.70% (32)
spanish(honduras): 102.67% (29)
spanish(costarica): 102.64% (30)
esperanto(mixed): 102.58% (24)
italian(mixed): 102.58% (23)
spanish(guatemala): 102.58% (29)
spanish(ecuador): 102.54% (31)
ukrainian(ukraine): 102.42% (32)
english(england): 102.41% (27)
bosnian(bosniaherzegovina): 102.33% (21)
estonian(estonia): 102.33% (26)
russian(kazakhstan): 102.33% (32)
armenian(armenia): 102.27% (39)
english(europe): 102.27% (28)
catalan(catalonia): 102.26% (30)
hebrew(wikipedia): 102.20% (25)
latvian(latvia): 102.12% (21)
slovak(slovakia): 102.09% (30)
portuguese(portugal): 102.06% (29)
dutch(netherlands): 102.03% (24)
russian(moldavia): 101.96% (32)
czech(europe): 101.89% (28)
english(newzealand): 101.76% (27)
slovenian(slovenia): 101.71% (23)
english(canada): 101.63% (26)
indonesian(indonesia): 101.63% (31)
austriangerman(austria): 101.59% (37)
russian(azerbaijan): 101.58% (32)
german(germania): 101.54% (39)
indonesian(mixed): 101.49% (25)
english(austria): 101.46% (27)
moldavian(moldavia): 101.30% (24)
lithuanian(lithuania): 101.00% (21)
romanian(romania): 101.00% (23)
croatian(wikipedia): 100.81% (24)
Really nice work, Jarlve. Thanks for building this very useful feature! Do you plan to include any other language-related statistics in the language detection, such as IoC, chi^2, etc?
Hey doranchak,
Probably not, as I don’t see how these statistics could increase its effectiveness. IoC is already built into the AZdecrypt solver as a weight. The main objective is of course to identify the language for substitution ciphers. I’ve tested it successfully against Danish, Dutch, Russian, French and German 340-style ciphers.
I have generalized this feature in AZdecrypt as "Batch ngrams" so that various ngram packages can be tested. Languages is one of them, another could be a package that includes odd stuff like names, spaces and numbers.
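For readers unfamiliar with the IoC mentioned above, here is a minimal sketch of the standard index of coincidence. It only shows the textbook formula; how AZdecrypt actually weights it inside the solver is not described in this thread, and the function name is just for illustration.

```python
from collections import Counter


def index_of_coincidence(symbols):
    """Probability that two symbols drawn without replacement are identical."""
    n = len(symbols)
    if n < 2:
        return 0.0  # not defined for fewer than two symbols; return 0 by convention here
    counts = Counter(symbols)
    return sum(c * (c - 1) for c in counts.values()) / (n * (n - 1))


if __name__ == "__main__":
    print(index_of_coincidence("aabb"))
    print(index_of_coincidence("thequickbrownfoxjumpsoverthelazydog"))
```

Because a simple substitution only relabels symbols, it leaves this value unchanged, so the statistic on its own says little about which plaintext language fits best.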
That’s great that your method is so effective without requiring more complex statistics. Well done!
It uses 5-gram letter statistics as intelligence.
doranchak,
I’d like to restate that I no longer want myself or any of my work to be documented on any of your websites or in your presentations, or by anyone else for that matter. I want to keep things local to the forum and in general enjoy a more casual feeling when working on the ciphers. If at any time I want to expose my work to a larger audience, I’ll do so myself. I’ve mentioned it in the other thread, but you have not replied, so you might have missed it.
I did see your post on the other thread but have been slow in catching up with my replies there.
Is it OK with you if I mention periodic bigram repeats and the non-repeating sequence peak of 26 at length 17 without talking about you specifically? I feel it is very important for the codebreaking community to know about those anomalies in Z340. They may be the very clues that someone might eventually use to find the solution once and for all. My talk will be rather naked without those important details.
doranchak,
Thanks for the reply. Yes, you can use it. I can’t ask you not to use observations and I won’t.
I appreciate that, Jarlve. Is there anything specific on my site that you wish for me to remove?
doranchak,
No. But I’d like to be informed when you push a change that contains my handle or a link to one of my posts, so that I can alter how my work is presented. A lot of the time I make quick and loose observations that I regard as bouncing ideas off each other. It makes me uneasy that my work can be documented at any time when I actually prefer to work in a very loose and casual style. Thanks for addressing my concerns.
Understood. Will do. Thanks.