Search
Enter Keywords:
Home
Cracking the Zodiac Killer's "340 Cipher" Part 3 PDF Print E-mail
User Rating: / 3
PoorBest 
Written by Michael Salsbury   
Sunday, 07 May 2006

In the last installment, I told you something about the shortcuts and general flow of execution of the custom Visual Basic program I've written to try to crack The Zodiac Killer's thus-far-unbroken cipher known colloquially as "The 340 Cipher".  This time around I'll tell you a bit more about the program itself.

The program makes all of its "deciphering" decisions based on what I'm calling a "gated scoring algorithm" intended to expend the least possible effort determining that a potential "decode" of the message really does look like a decode.  The algorithm works something like this:

  • First, the program looks and counts the frequency of letters in the "decoded" message.  If the letters which appear most commonly in normal English writings appear in approximately the right frequency in the message, it generates a base score.  If that base score is too low (i.e., there are not enough of the most-common letters in the message), scoring stips here before too many processor cycles are used.
  • If the program finds "approximately" the right frequency of letters in the message, it then gets more granular about the letters it's looking for.  It makes sure that in the decoded message it finds the approximately-correct frequency of As, Bs, Cs, etc.  Each letter that is occurring in approximately the right frequency (+ or - 20% of normal English text) gets a higher score than those which don't.  Any letter appearing "too often" deducts from the total score (e.g., lots of "Zs" would drop the overall score).  If letters appear in the "decoded" message in approximately the right frequency, scoring continues. Otherwise, it stops here before more cycles are wasted.
  • If the frequency of individual letters looks good, the program looks at the most common bigraphs in the English language and compares these to the message.  If it finds approximately the right percentage, scoring moves on. Otherwise, it stops.
  • If the frequency of bigraphs looks about right, it then looks at a more granular list of bigraphs, trigraphs, and quadgraphs and scores the message based on whether these seem to be appearing in about the right amounts for a normal English text.  If so, the score is increased. If not, it isn't.  If the score isn't sufficiently high enough, no more scoring effort is performed.
  • Assuming the breakdown of bigraphs, trigraphs, and quadgraphs is within a reasonable tolerance from "normal" English text, the program then pores through a 20,000-word English dictionary, going from the longest to the shortest words. This dictionary provides a score for each word, with added weight given to those words the killer used often that don't occur normally in English writing (like the killer's tendency to misspell "having" as "haveing").  This part of the scoring process can take several seconds of elapsed time to complete, so it is only done by the program when there is a very good chance of finding lots of English words in the text.

I call this a "gated scoring algorithm" because the potential "decode" of the message must achieve a certain predefined score before it can get through the "gate" into a more time-consuming scoring method.  This method allows the program to "fly" past potential decodes that are worthless (like something that generates nothing but "QZCZQ" type text) and spends the most time on "decodes" that statistically look like English text.



To guard against missing anything that might be a valid decode, the program keeps a log that contains each "high score" it encounters along the way and any score that "matches" that score.  A quick look at the log lets me know if there have been any candidates that look worthy of further investigation.

As I mentioned earlier, walking through the entire search space for this message's potential keys is a project that could take more than a typical human lifetime.  Therefore, I've tried to shortcut this approach in every way possible so far.  I've set multiple CPUs to work running the program, I've copied the more interesting potential breaks into a spreadsheet I've developed that allows me to manipulate the message key by hand and see if I can make it any better, etc.  And, if it seems to get "stuck" during its execution, the program has been written to detect that fact and "randomize" some of the characters in the key so that will go through on its next iteration and potentially head down a different path.

Thus my approach through the message space isn't by any means a systematic brute force one.  For instance, when I perform a fresh start of the software, it generates 100,000 random starting keys and begins its execution with the random key that got the best score. 

From that random starting key, it goes through the message 5 characters at a time in a brute force manner, going from AAAAA to ZZZZZ through every combination in between.  The 5-character combination that generated the best overall score is kept and the next 5 are checked.

Once the entire message has been gone through 5 characters at a time, it's gone through 2 symbols at a time.  Each of the two symbols is run through all the potential values and the best-scoring letters are kept in position.  That is, we rotate through the first symbol plus the second, then the first plus the third, etc.  Then the second symbol plus the third, etc. until every 2-symbol combination has been tried and scored.

Then the program continues walking through the solution space 1, 2, 3, 4, and 5 symbols at a time, looking for values that give the best overall score from the gated scoring method.

At that point, it starts over again, trying to improve on the then-current decode.  If it goes through the entire process twice and gets the same result, it randomizes a portion of the key to point itself in a different direction.

The entire process described here literally takes days of elapsed runtime.  To guard against crashes, power outages, etc., the program regularly stops to write out its current "best" key, what technique it was working with, etc., to a file.  To pick up from where it left off, it can load this file and set itself to the exact point it was working on when it was stopped.  (This also gives me a way to "tweak things" if I think the program is getting close but not quite getting where it needs to be. I can stop it, modify the save file, and resume the process.)

As I write this blog article, the program has decided that the best decode it's seen so far for the 340 Cipher is the following (which you don't have to tell me looks like, and probably is, a lot of gibberish):

ETRBIIMAZIAVACOTUHIEGBHETHYSDACDTNGSTFOEUHOYB
NAHOETEIISMIGCAMIEEERDTHTOEOUTUATSZEZXIEIGRMS
IENCUEISBTHBHTEDLTDTUFDOADOAABOTTTTHEEHHSOESA
TUAOEREHRTSGFSAOXDNOIUAEMTEEISASGSNUERETUTAHS
GIRCODETAFIRTACTCTODEZGSOOTESHEHATSGFBDGRIOSH
MTTTTRDFATUEESGDVCTXTETBAUHTNEAFDMOGDBEIMDTOL
OTRFCEATTEAESIYGANAEENYEOYFZEEECEBNIASIUYASUG
HSEGCNEFBOHEHIIEOHEGAADDE

Granted there are a few real words in there (like "code") that may mean something or may be just "interesting looking noise" in the result.  (What I mean by this is you could write a program that just randomly pulls letters out of thin air and strings them together... sooner or later that program will spit out something that looks intelligible simply by the laws of chance.)

I'm planning once a week for the next few weeks/months to post follow-up messages to this one, including the latest "best decode" I've gotten and any other tweaks or improvements I've made to the program.

Naturally, if I should manage to decipher the message you'll find out about that also...





 


 


 


 


Related Blogs:

Last Updated ( Friday, 19 May 2006 )
< Previous   Next >

Main Menu
Home
Blog
Photos
Links
Search
Site Index
Feedback
Administrator
Featured Links
BlogInspiration
SpamToons
Shawn Prince's Blog
Jack Ludwig's Blog
Mike Cramer's Site
Fark
Slashdot
Woot!
Cigar Envy
John Kricfalusi's Blog
CigarBlog 101
Cigars 101 Forum
Sponsored Links


View Site Stats