Zodiac Discussion Forum


AZdecrypt code speedup suggestions

6 Posts
3 Users
0 Reactions
1,088 Views
(@beijinghouse)
Posts: 34
Eminent Member
Topic starter
 

[ This post is mostly meant for Jarlve ]

First off, I’m very impressed with your AZdecrypt software Jarlve! I’ve only read a bit of the forum so far so I’m not totally clear on how you’re implementing beam search or exactly how you’re generating and scoring candidate solutions, but I can certainly appreciate the results. Your integration of so many transformation and batching options is great too. Thanks for creating, sharing, and maintaining such a wonderful tool!

I’ve been using it a bit the past few weeks and besides the great results, I’ve also noticed a couple things that could probably give it dramatically faster loading and processing times.

(Of course, forgive me if my advice isn’t quite right. I’m obviously making many assumptions about how things are done in your code without having it in front of me. And I’m mainly a C/C++/Python developer, so I spent a few days re-learning BASIC so I could describe these modifications. Please forgive me if my FreeBASIC has syntax errors or something like that. I’m an experienced programmer and my general advice is correct, but if you see anything fishy with my FreeBASIC presentation, it’s probably a mistake and not something clever.)

FWIW, this feedback comes mainly because I’ve been building my own ngram files with a ~3TB corpus. That’s how I’ve run into these performance bottlenecks. I’m definitely glad your ngram format is so transparent and easily extensible. Hopefully I can share the files I’m building if anyone here is interested.

Anyway, it seems obvious that AZdecrypt allocates a full, flat array for the entire keyspace of each ngram it works with.

1-4 grams: This is a great choice. All the entries in these arrays have at least some weight. And these tables are small, fast, and easy to implement.

5-grams: Still an OK choice. To be fair, a large enough corpus can literally fill 100% of all possible 5-grams with non-zero logarithmic weight.

But let's consider the density of your largest Reddit n-gram files (distinct n-grams observed out of the 26^n possible):
5-grams:  74.4%  =     8,834,667 /      11,881,376
6-grams:  25.1%  =    77,544,234 /     308,915,776
7-grams:   2.0%  =   165,568,040 /   8,031,810,176

Or even my currently densest n-gram files:
5-grams: 100.0%  =    11,881,376 /      11,881,376
6-grams:  37.1%  =   114,659,798 /     308,915,776
7-grams:   2.3%  =   188,212,577 /   8,031,810,176
8-grams:   0.07% =   151,397,312 / 208,827,064,576

I saw over on the FreeBASIC forum that you’re implementing your own trie to reduce memory consumption?? That’s incredibly metal! You must be learning a lot! But while it’s definitely impressive, it does make me wonder.

As you may know, a programmer’s three greatest virtues are laziness, impatience, and hubris. I definitely respect the hell out of your hubris. I’m too wizened and weary to implement my own trie. But one programmer to another, I do question your lack of laziness and impatience. Where is your commitment to the 1st and 2nd virtues?? Don’t you want AZdecrypt to run faster?? Don’t you want to do less of your own programming??? There’s a lazier and more impatient version of you trapped within. Reach deep. You can summon him. I believe.

Incidentally, the guys over on the FreeBASIC forum giving you advice have amazing intuitions. Who suggests Judy Arrays?! Like, for anything?? As an outsider to your community over there, I wouldn’t have imagined FreeBASIC programmers just casually dealing in such obscure, hyper-optimized data structures.

But even as someone who shares your unhealthily high interest in the pure, platonic ideal of programming your own Judy Arrays, I can’t help but wonder whether the guys over on the FreeBASIC forums who suggested those and tries to you might have nudged you in a slightly less esoteric, less theoretically pure direction if they had realized just how sparse your data sets truly are (especially as n-gram size increases).

Sure, 1-5 grams can get (approximately) fully filled. But even the fattest 6-gram file should never contain over half of the total keyspace. Even if you stream the Library of Congress through it.

[ As an aside, most 6+ grams should be even sparser in theory than they end up being in practice. To understand why, imagine isolating all the "validgrams" in a keyspace — the set of all ngrams that can be reached by combining any number of English words. For instance, GAME + LOVER can form AMELO (a 5-valid-gram). But no combo of English words can reach XTQQZ. Even with the most liberal dictionary files that include most proper nouns, names, misspellings, Briticisms, etc… you still only end up with ~40% of the key space being validgrams at n=6… and less than 9% at n=7! I haven’t calculated 8-gram validgram density yet. It’s a touch more complicated to calculate but totally doable. Maybe once 8-grams become more usable, I’ll characterize that keyspace too and write up another post about it?]

Anyway, the best part about the flat array structure you’re using right now (beyond simplicity and raw lookup speed) is how you "save" memory by not storing the ngrams themselves. You implicitly store each ngram as its index into the array by enumerating your keyspace in a logical, sequential fashion. It looks like you’re already enumerating ngrams into unsigned longs, which is great.
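Just to make sure we’re picturing the same mapping, here’s a minimal sketch of that enumeration (assuming a plain A-Z alphabet; the function name is mine, and whether the first or last letter is treated as the most significant base-26 digit doesn’t matter as long as it’s consistent everywhere):

'illustrative sketch: map an ngram string onto its slot in the flat array
function ngram_to_index(byref g as string) as ulongint
   dim as ulongint idx = 0
   for i as integer = 0 to len(g) - 1
      idx = idx * 26 + (g[i] - asc("A"))   'each letter is a base-26 digit, A=0 .. Z=25
   next i
   return idx
end function

'e.g. ngram_to_index("HELLO") gives the slot whose score byte belongs to HELLO

A score lookup is then just one array read at that index, which is exactly why the flat layout is so fast.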

And this strategy works wonderfully up through 5-grams (and maybe even 6-grams). But it’s sluggish on 7-grams and can’t function at all on 8-grams unless you have 200GB of memory (26^8 is about 209 billion slots, even at a single byte per score). And even with 200GB of memory, I still need almost an hour to load an 8-gram file and then the cycle times are much too slow to be practical.

But a quick fix to make the current 8-gram implementation usable (assuming you have over 200GB of ram) and to speed up all the smaller grams as well is to simply allocate the memory more intelligently.

I noticed you posted this test harness version of your lookup table over on the FreeBASIC forums:

type nv
   ngram as ulongint
   score as ubyte
end type

dim as ulongint h,i,j,k,r,s
dim as ulongint lookups=10000
dim as ulongint array_size=1000000 'up to 500,000,000
dim shared as nv array(array_size)


Quick memory fix for 6-8 grams

Based on this and also the loading speed of 8-grams (and even 6 / 7 grams), it’s likely you’re not allocating your own memory for the array off the heap. Instead, FreeBASIC is allocating you memory off the stack. Obviously, you’re specifying the array size, but there are multiple ways to do memory allocation and the default one (which works well for 99% of cases) is unfortunately the least efficient when you’re dealing with really large chunks of data. If you don’t work with big data, this sort of memory management quirk doesn’t really come up.

It’s kind of inside baseball for programmers but the big problem with this is that FreeBASIC is going to end up doing two very wrong things:

1) FreeBASIC will allocate all your array memory on the stack. Stack memory is typically limited to 1MB per process. Exceeding this limit normally crashes programs, but sometimes it only slows them down to a crawl as a compromise between that and crashing. The OS’s memory manager is crying on the inside as this happens (it’s quiet, but you can hear it if you listen close). It’s sort of a miracle this much memory allocation off the stack succeeds at all. It’s probably a combination of FreeBASIC being slow enough and Windows being forgiving enough to allow this combo. If I did this with C / Linux, it would almost certainly allocate stack so fast that Linux would (mercifully) segfault to warn me that something was going wrong. This is probably also related to why you mentioned somewhere that ngram files should only have < 1MB per line. Why? Is it because AZdecrypt sometimes hangs or crashes if that limit is exceeded? That’s likely another manifestation of the same issue: stack memory grows too fast or too large, and the OS panics and throttles everything super slow as the file buffer continuously overflows and gets automatically re-allocated over and over by FreeBASIC’s memory management. (You could make load times much faster and remove this 1MB limitation by manually allocating your own file buffer. It’s a 2 line fix; there’s a sketch just after this list, and I’ll mention other tweaks along with this in my next post.)

2) FreeBASIC won’t actually give you all the memory you ask for by default. For example, currently, when you ask for 1,000,000 nv elements, behind the scenes FreeBASIC actually only allocates you 1 page of memory and gives you a pointer to it. That’s 4kb, aka 56 nv structures. So when you insert the 57th element into your array, FreeBASIC interrupts your program for 10,000-100,000 CPU cycles to grab you a 2nd page of memory. I’m not positive how pathologically bad FreeBASIC’s default stack allocation algorithm is, but most languages have atrociously naive stack expansion strategies (since they are used so infrequently by "normal sized" programs). It’s likely that a mere 50MB 8-gram file is currently being interrupted 3.7 billion times while it slowly loads 99.93% zeros into a ~200GB chunk of memory, 56 ngrams at a time.
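Since I promised a sketch of that file-buffer fix in point 1, here’s roughly the idea (untested FreeBASIC on my part; the filename is illustrative and error handling is omitted):

dim as integer f = freefile
open "8grams.txt" for binary access read as #f

dim as ulongint fsize = lof(f)
dim as ubyte ptr buf = allocate(fsize)   'one big heap buffer, so no 1MB-per-line limit

get #f, , *buf, fsize                    'single bulk read of the whole file
close #f

'... parse the ngram lines straight out of buf ...

deallocate(buf)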

Amazingly, the simplest way to fix the array allocation itself in FreeBASIC only requires 3 extra words[!]:

dim shared as nv array(array_size)


--- should become ----


Redim shared as nv array(array_size)

... '' process array

Erase array              

See the post below on the FreeBASIC forum for more color on this issue. It also includes three other, slightly more complex options for implementing your own memory management, which should also fix this stack / resizing issue. I expect one of these methods will work and give a brisk speed up, but just to be clear, I don’t actually use FreeBASIC regularly and haven’t personally tested this. So one of the other methods could possibly end up working better:
https://www.freebasic.net/forum/viewtopic.php?t=26709

Apologies for the long post. This only fixes the simplest performance bottleneck I found. I’ve also figured out and written up a solution to let ngram files load 5-20x faster, and found two drop-in replacements for your array structure that hold the 7/8-gram arrays in FreeBASIC with substantially less memory and presumably better processing speed. The processing speedup will be roughly proportional to data sparsity: on the order of ~5-10x less memory and a correlated speedup of roughly that size for 7-grams, and ~500-1000x less memory and a similarly scaled speedup for 8-grams. Happy to post those solutions too if there’s interest.

 
Posted : January 1, 2019 11:27 am
Jarlve
(@jarlve)
Posts: 2547
Famed Member
 

Hey beijinghouse,

Thanks for all the feedback and happy new year!

FWIW, this feedback comes mainly because I’ve been building my own ngram files with a ~3TB corpus. That’s how I’ve run into these performance bottlenecks. I’m definitely glad your ngram format is so transparent and easily extensible. Hopefully I can share the files I’m building if anyone here is interested.

That is insane, I thought my 788GB Reddit corpus was large!

Anyway, it seems obvious that AZdecrypt allocates a full, flat array for the entire keyspace of each ngram it works with.

Yep. The n-gram arrays in AZdecrypt are declared like this: "static shared as ubyte q5g()" and resized when needed. And as you have found out, I have also been working on another system that uses memory more efficiently, with support for up to 13-grams, that uses successive approximation and a trie to speed it up. I have 8-grams up and running this way. It still uses a lot of memory, but I have not worked on it much yet.

To summarize, I think you are suggesting the following:

1) Speed up loading times of n-grams
2) Speed up and reduce memory usage of the successive approximation approach (which is not live yet)

Correct?

AZdecrypt

 
Posted : January 1, 2019 4:59 pm
(@beijinghouse)
Posts: 34
Eminent Member
Topic starter
 

My advice is more specific. I’m saying you should make sure you switch to grabbing memory off the heap in 1 go, instead of letting FreeBASIC give it to you from the stack. Even if you have successfully deployed a trie and it’s using way less memory than the flat array, you’ll still get *much* better performance using your own heap memory.

I’m not positive that Redim / Erase is the absolute best FreeBASIC convention for you though. It’s technically for specifying dynamically sized arrays. Since your array isn’t resized (except on each load), I would also try using the New / Delete semantics for the array variable specification (this replaces "dim"). It may make the program "stutter" for a second when you grab a huge block of memory, but then it will run much faster for the rest of the time you are filling it.
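In case it helps, here’s the shape of what I mean (again untested on my end; the names and size are illustrative, and nv is the same type as in your harness):

type nv
   ngram as ulongint
   score as ubyte
end type

dim as ulongint table_size = 500000000      'illustrative size
dim as nv ptr ngrams = new nv[table_size]   'one up-front heap allocation, no incremental growth

'... load the ngram file into ngrams[0] .. ngrams[table_size - 1] ...

delete[] ngrams                             'hand it all back in one go

The point is that the whole block comes off the heap once, before you start filling it, so nothing has to grow behind your back during the load.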

My other suggestion I didn’t post yet (and perhaps it’s not needed anymore if you have the trie working super well). To be honest, I saw you talking about writing a custom trie back in May and assumed that it wasn’t working so well since you hadn’t updated AZdecrypt with it yet. Here was my proposal for a simpler method that may be easier and give you better performance:

Better data structure for 7+ grams

If you want to break through the performance bottleneck of larger ngram files and be maximally memory efficient, why not switch to sparse_hash?

It’s got a similar memory footprint to a Judy Array but is not as complex a data structure.

I won’t go into the minutia of why I suggest sparse_hash over other maps/hashes. But if you’re interested, I recommend:
https://research.neustar.biz/2011/11/27 … parsehash/
https://smerity.com/articles/2015/googl … ehash.html

FreeBASIC doesn’t have any native hashmap types at all… but that’s probably for the best. Even if you found one, it would inevitably be super slow if written in pure FreeBASIC. To be fair, even in a higher performance language, I would still only consider using a pure C (or asm) hash library. And you can do the same in FreeBASIC.

There’s a SWIG converter to translate C library headers into FreeBASIC headers. And if you compile a library correctly, you should be able to import it directly into your FreeBASIC code as a shared library that you call just like any other set of external functions.

Converting C Headers with swig
https://www.freebasic.net/forum/viewtopic.php?t=13829

Actual link to SWIG FB converter
https://www.freebasic-portal.de/dlfiles/70/swig_fb.zip

I used this converter already and made Google’s experimental C version of their sparse_hash headers into a FreeBASIC header/interface:
https://rockstarresearch.com/boinc/libchash.bi

I also found a partially converted interface for the Judy Arrays if you want to try it:
https://rockstarresearch.com/boinc/judy.bi

Two small caveats though:

1) I stomped the "FILE" typedef in the sparse_hash header just to make it compile quickly. "FILE" is an opaque typedef and I don’t know enough about it (or FreeBASIC typing in general) off the top of my head to make this work as intended. You might eventually want to try fixing this rough patch and actually use the 3 file handling functions that depend on it. But to get started, you can clearly use the rest of this hash implementation without it and just read and load your own data via the same FreeBASIC routines you’ve been using so far. But it’s definitely dreamy to imagine using those custom C hooks to read / write raw hash tables straight to and from disk for near-instant hash table fills.

2) The Judy interface was something I found online and the macros are totally borked. It seems like other parts are janky too. But I think the core functions work and the main negative is a bunch of boilerplate warnings when you call certain functions with odd casting. I’m not a FreeBASIC surgeon or I’d triage this file more. Maybe you can see what’s messed up and fix it easily.

Feel free to use these interfaces. You could also re-make your own with the instructions below or integrate a completely different C-enabled data structure:

Converting C Headers with swig
https://www.freebasic.net/forum/viewtopic.php?t=13829

Actual link to SWIG FB converter (it is 404 in tutorial above)
https://www.freebasic-portal.de/dlfiles/70/swig_fb.zip

I was able to compile both sparse_hash and Judy Arrays into libraries and link your FreeBASIC lookup harnesses against them and do a few calls to their interfaces without any issue. That said, I only did throw-away calls from within your test harness just to make sure it compiled correctly and could locate the right library symbols. Your algorithms are not implemented yet and I may well be passing pointers incorrectly in this test stub since I don’t know proper FreeBASIC pointer conventions. Hopefully this is still illustrative enough of how the calls to this library are supposed to work. If something is unclear, I can try to help more. There’s also a bunch of examples for each library in C that you can ape to figure out calling conventions, if the actual docs aren’t clear for either library.

Here’s some links and helpful commands to give you an idea of how I made these libraries on Linux and got them linking up with FreeBASIC code. This is mostly for sparse_hash. For Judy I just used the makefile. libjudy and libchash should both build in MinGW or whatever else you use to compile C libraries on Windows.

https://github.com/sparsehash/sparsehash/tree/master/experimental


# fetch Google's experimental plain-C hash implementation
wget https://raw.githubusercontent.com/sparsehash/sparsehash/master/experimental/libchash.c
wget https://raw.githubusercontent.com/sparsehash/sparsehash/master/experimental/libchash.h

# build it as a shared library
gcc -fPIC -o libchash.o -c libchash.c
gcc -shared -o libchash.so libchash.o -lm

# install it where the dynamic linker can find it
sudo cp libchash.so /usr/local/lib/
sudo ldconfig

# compile the FreeBASIC test harness against it
fbc lut_chash_test.bas -l chash




The Judy array library I compiled is the vanilla reference library (v1.05):
https://sourceforge.net/projects/judy/files/judy/Judy-1.0.5/




'lookup table test environment
'-----------------------------

#include "libchash.bi"

'AllocateHashTable returns a pointer to a new table
dim as HashTable ptr ht = AllocateHashTable(1, 0) '' value 1 byte, 0 don't copy keys

dim as HTItem ptr bck '' will hold HashFind results once a real lookup is wired in

type nv
   ngram as ulongint
   score as ubyte
end type

dim as ulongint h,i,j,k,r,s
dim as ulongint lookups=10000
dim as ulongint array_size=10000 'up to 500,000,000
dim shared as nv array(array_size)

declare sub quicksort(byval low as ulongint,byval high as ulongint)

randomize timer,4
screenres 640,480,32

print "Creating random entries.";
for i=1 to array_size

   'build a random 8-gram key as a base-26 number
   for j=1 to 8
      k+=int(rnd*26)*(26^(j-1))
   next j
   array(i).ngram=k
   array(i).score=int(rnd*256)
   HashInsert(ht, k, array(i).score) 'mirror the same entry into the sparse hash
   k=0
next i

print " Sorting.";
quicksort(1,array_size)

print " Retrieving random ngrams."
dim as double tmr=timer
for h=1 to lookups
   r=array(int(rnd*array_size)+1).ngram

   'search algorithm
   '--------------------------------------------->

   'throw-away probe of the hash, just to prove the library links and the
   'symbols resolve; a real solver would read the score via the returned HTItem
   HashFind(ht, r)

   'baseline lookup: linear scan of the flat array
   for i=1 to array_size
      if array(i).ngram=r then
         s+=array(i).score
         exit for
      end if
   next i

   '<---------------------------------------------

next h

print:print str(int(lookups/(timer-tmr)))+" look ups per second ("+str(s)+")"

sleep

'standard in-place quicksort over the shared array (by ngram value),
'standing in for the sort sub from the original harness
sub quicksort(byval low as ulongint,byval high as ulongint)
   dim as ulongint i=low, j=high
   dim as ulongint pivot=array((low+high)\2).ngram
   dim as nv tmp
   do
      while array(i).ngram < pivot
         i+=1
      wend
      while array(j).ngram > pivot
         j-=1
      wend
      if i <= j then
         tmp=array(i)
         array(i)=array(j)
         array(j)=tmp
         i+=1
         j-=1
      end if
   loop until i > j
   if low < j then quicksort(low,j)
   if i < high then quicksort(i,high)
end sub
 
Posted : January 2, 2019 2:22 am
Jarlve
(@jarlve)
Posts: 2547
Famed Member
 

Hey beijinghouse,

I was already using redim with the n-grams. The code you saw was something mashed up quickly that is not wholly representative of the underlying AZdecrypt code. I let FreeBASIC compile AZdecrypt with GCC and optimization flags.

Thank you for suggesting sparse_hash and all your work. I do not want to work on AZdecrypt right now but perhaps I could drop you an e-mail once I do? PM me your e-mail if that is okay with you.

AZdecrypt

 
Posted : January 2, 2019 7:23 pm
(@largo)
Posts: 454
Honorable Member
 

Hi beijinghouse,

I fully agree with you that AZDecrypt is a fantastic and very effective tool. Even though your post is mainly addressed to Jarlve, I would like to express my thanks. I think it’s great that you take so much time to post your suggestions. It’s always great when a lot of people work together. I worked as a software developer myself until 2017. Now I’m doing something completely different, but in my spare time I’m of course still involved with the topic. I don’t have much time at the moment, but now and then I also work on my own solver. Jarlve gave me a lot of valuable tips and advice for its implementation. If you have time and interest, you may take a look at it:

Translated with www.DeepL.com/Translator

 
Posted : January 3, 2019 5:47 pm
(@largo)
Posts: 454
Honorable Member
 

Oops, I forgot to post the links:

viewtopic.php?f=81&t=4040
https://github.com/Largo75/HomophonicSu … Solver.git

 
Posted : January 3, 2019 6:06 pm