|
In the last installment, I told you something about the shortcuts
and general flow of execution of the custom Visual Basic program I've
written to try to crack The Zodiac Killer's thus-far-unbroken cipher
known colloquially as "The 340 Cipher". This time around I'll tell you
a bit more about the program itself.
The program makes all of its
"deciphering" decisions based on what I'm calling a "gated scoring
algorithm" intended to expend the least possible effort determining
that a potential "decode" of the message really does look like a
decode. The algorithm works something like this:
- First, the program looks and counts the frequency of letters in
the "decoded" message. If the letters which appear most commonly in
normal English writings appear in approximately the right frequency in
the message, it generates a base score. If that base score is too low
(i.e., there are not enough of the most-common letters in the message),
scoring stips here before too many processor cycles are used.
- If the program finds "approximately" the right frequency of
letters in the message, it then gets more granular about the letters
it's looking for. It makes sure that in the decoded message it finds
the approximately-correct frequency of As, Bs, Cs, etc. Each letter
that is occurring in approximately the right frequency (+ or - 20% of
normal English text) gets a higher score than those which don't. Any
letter appearing "too often" deducts from the total score (e.g., lots
of "Zs" would drop the overall score). If letters appear in the
"decoded" message in approximately the right frequency, scoring
continues. Otherwise, it stops here before more cycles are wasted.
- If the frequency of individual letters looks good, the program
looks at the most common bigraphs in the English language and compares
these to the message. If it finds approximately the right percentage,
scoring moves on. Otherwise, it stops.
- If the frequency of bigraphs looks about right, it then looks at
a more granular list of bigraphs, trigraphs, and quadgraphs and scores
the message based on whether these seem to be appearing in about the
right amounts for a normal English text. If so, the score is
increased. If not, it isn't. If the score isn't sufficiently high
enough, no more scoring effort is performed.
- Assuming the breakdown of bigraphs, trigraphs, and quadgraphs is
within a reasonable tolerance from "normal" English text, the program
then pores through a 20,000-word English dictionary, going from the
longest to the shortest words. This dictionary provides a score for
each word, with added weight given to those words the killer used often
that don't occur normally in English writing (like the killer's
tendency to misspell "having" as "haveing"). This part of the scoring
process can take several seconds of elapsed time to complete, so it is
only done by the program when there is a very good chance of finding
lots of English words in the text.
I call this a "gated scoring algorithm" because the potential
"decode" of the message must achieve a certain predefined score before
it can get through the "gate" into a more time-consuming scoring
method. This method allows the program to "fly" past potential decodes
that are worthless (like something that generates nothing but "QZCZQ"
type text) and spends the most time on "decodes" that statistically
look like English text.
To guard against missing anything that
might be a valid decode, the program keeps a log that contains each
"high score" it encounters along the way and any score that "matches"
that score. A quick look at the log lets me know if there have been
any candidates that look worthy of further investigation. As I
mentioned earlier, walking through the entire search space for this
message's potential keys is a project that could take more than a
typical human lifetime. Therefore, I've tried to shortcut this
approach in every way possible so far. I've set multiple CPUs to work
running the program, I've copied the more interesting potential breaks
into a spreadsheet I've developed that allows me to manipulate the
message key by hand and see if I can make it any better, etc. And, if
it seems to get "stuck" during its execution, the program has been
written to detect that fact and "randomize" some of the characters in
the key so that will go through on its next iteration and potentially
head down a different path. Thus my approach through the message
space isn't by any means a systematic brute force one. For instance,
when I perform a fresh start of the software, it generates 100,000
random starting keys and begins its execution with the random key that
got the best score. From that random starting key, it goes
through the message 5 characters at a time in a brute force manner,
going from AAAAA to ZZZZZ through every combination in between. The
5-character combination that generated the best overall score is kept
and the next 5 are checked. Once the entire message has been gone
through 5 characters at a time, it's gone through 2 symbols at a time.
Each of the two symbols is run through all the potential values and the
best-scoring letters are kept in position. That is, we rotate through
the first symbol plus the second, then the first plus the third, etc.
Then the second symbol plus the third, etc. until every 2-symbol
combination has been tried and scored. Then the program continues walking through the solution space 1, 2, 3, 4, and 5 symbols at a time, looking for values that give the best overall score from the gated scoring method. At that point, it starts over again, trying to improve on the then-current decode. If it goes through the entire process twice and gets the same result, it randomizes a portion of the key to point itself in a different direction. The entire process described here literally takes days of elapsed runtime. To guard against crashes, power outages, etc., the program regularly stops to write out its current "best" key, what technique it was working with, etc., to a file. To pick up from where it left off, it can load this file and set itself to the exact point it was working on when it was stopped. (This also gives me a way to "tweak things" if I think the program is getting close but not quite getting where it needs to be. I can stop it, modify the save file, and resume the process.) As I write this blog article, the program has decided that the best decode it's seen so far for the 340 Cipher is the following (which you don't have to tell me looks like, and probably is, a lot of gibberish): ETRBIIMAZIAVACOTUHIEGBHETHYSDACDTNGSTFOEUHOYB NAHOETEIISMIGCAMIEEERDTHTOEOUTUATSZEZXIEIGRMS IENCUEISBTHBHTEDLTDTUFDOADOAABOTTTTHEEHHSOESA TUAOEREHRTSGFSAOXDNOIUAEMTEEISASGSNUERETUTAHS GIRCODETAFIRTACTCTODEZGSOOTESHEHATSGFBDGRIOSH MTTTTRDFATUEESGDVCTXTETBAUHTNEAFDMOGDBEIMDTOL OTRFCEATTEAESIYGANAEENYEOYFZEEECEBNIASIUYASUG HSEGCNEFBOHEHIIEOHEGAADDE Granted there are a few real words in there (like "code") that may mean something or may be just "interesting looking noise" in the result. (What I mean by this is you could write a program that just randomly pulls letters out of thin air and strings them together... sooner or later that program will spit out something that looks intelligible simply by the laws of chance.) I'm planning once a week for the next few weeks/months to post follow-up messages to this one, including the latest "best decode" I've gotten and any other tweaks or improvements I've made to the program. Naturally, if I should manage to decipher the message you'll find out about that also...
Related Blogs:
|