Over the weekend, I wanted to get in some Lisp practice — especially with macros and debugging — so I decided to port my MD5 calculation code (written in C++) to the language. It has a lot of repetitious code in it that could usefully be handled by macros.
Lisp wasn’t intended for this kind of low-level binary manipulation. The best way to handle it in Lisp is to write a shared library in C (which is intended for it) and use Lisp’s FFI system to interface with it. But I decided to do it in Lisp anyway, to see how difficult it would be and to get practice with some parts of the language that I probably wouldn’t often use otherwise.
It took a while to figure out how to do the bit-manipulations in Lisp. My C++ version relied on the fact that C/C++ used a fixed number of bits to represent a number. In Lisp, there’s a fixnum
type, but the Lisp spec only guarantees that it’s a supertype of a signed 16-bit integer — there’s no way (that I was able to find) to control the actual width. If you just ask for a plain integer, it will use a variable-length integer system, which is wonderful for most true mathematical problems, but which plays havoc with left-shift operations that expect to be able to shift bits right out of the number, or additions that expect to automatically truncate to a certain number of bits. But with some extra operations (and several extra macros to implement them), it can be made to work.
It took me a few tries to get the macros the way I wanted them too. The final version uses six MD5-specific macros: two that are used by each of the others to handle the repetitious parts of the operations, and four that implement the four “rounds” defined in the MD5 specification, encapsulating the operations (as macrolet
s) and data (as three arrays of constant values) specific to each round.
The next problem was Unicode. Many Lisp implementations (including the one that I’m presently using, Steel Bank Common Lisp) fully support Unicode, so there’s a good chance that the string passed into the function may contain a Unicode character. But the MD5 algorithm is designed to work on binary data only; do I treat all strings as being composed of Unicode characters? If so, which Unicode characters — 16-bit, 32-bit, or one of the several others that are available? I finally decided to treat all strings as UTF-8; if I need to modify it to handle strings of Unicode characters later, it shouldn’t be difficult.
In addition to simply passing it a single string to calculate the MD5 hash of, I wanted to be able to calculate a hash of several strings, without concatenating them first. With a CRC, that’s easy — just keep calling the CRC function with additional data. But with MD5, as with other hashes that attempt to be cryptographically secure, it’s not as simple: there’s a padding operation with a number-of-bits-processed counter, which means that you either have to have all the data at once, or you have to keep the state of the process in between calls and have a separate “finish” call to add the padding and calculate the last parts of the hash. Of course, once you have those functions, you can put together a simple function for creating a hash of a single block of data too, which is what I did.
The final code (for the moment, at least) can be found here. It’s BSD-licensed; use or abuse it however you wish. I’m including my entire oakcircle-utilities package, with a few other small and useful macros, including my gensym-let
macro.
To use it: (md5 "abc")
should return "900150983cd24fb0d6963f7d28e17f72"
, the hash value of the string abc
. To get a binary version of it, call (md5 "abc" :as-binary t)
, which should return the number 202458070853820864278493583161023258920
(which, in hexadecimal, is 98500190B04FD23C7D3F96D6727FE128
— the same number as the return value without the :as-binary
flag but with the individual bytes of the 32-bit values that make it up reversed).
Have fun! 🙂