Wolf1 almost done, and Wolf2 proposed

I have been working on finishing the implementation of my first one-way hash algorithm, named Wolf1. Over the weekend, I devised a second algorithm, aptly named Wolf2, which will debut on my website at the same time Wolf1 appears. The general form of Wolf1 is as follows:

Wolf1 is a 256-bit one-way hash algorithm. It operates on eight 32-bit chunks and uses the following special constants C0..3, each consisting of eight 32-bit words:
C0 - 0x5538a3e5 0xf1ef632f 0x7df8a258 0xcef9c7ba 0x765f5cda 0x497fecf2 0xfa989ea3 0xd594974e
C1 - 0xc1eac8f1 0xc9858068 0xc9858d68 0xf489d67b 0x16a22887 0x998aa773 0x342df4ad 0x9169f4b5
C2 - 0xc3566e73 0xb8b539b1 0xf19dbaa9 0xb7252916 0x8b3156be 0x6b3163b7 0x59318bb2 0xa4e99a7a
C3 - 0xc477357d 0x7a69bc6c 0x6bca3744 0xb2a0d222 0xc9a8a8b3 0x613b5ad5 0x68588990 0xbc486357
An additional special constant R is defined as follows:
R = (C0 ^ C1) + (C2 ^ C3)
The following functions are defined:
F1 (B,K,C0) -> B = (B ^ K) + C0
F2 (K,B,C1) -> K = (K + B) ^ C1
F3 (C0,C2,K) -> C0 = (C0 ^ C2) + K
F4 (C1,C3,B) -> C1 = C1 ^ (B + C3)
F5 (C2,B,K) -> C2 = (B - C2) + K
F6 (C3,B,R) -> C3 = (C3 + B) - R
F7 (R,C0,C1,C2,C3) -> R = (C0 ^ C1) + (C2 ^ C3)
After the functions F1..7 are performed on the current block B, a data-dependent rotation is applied to each 32-bit chunk, with the rotation amount given by the value of the chunk's right-most 3 bits:
if the right-most bit is 1, rotate left by that amount (1, 3, 5, or 7 bits)
else, rotate right by that amount (0, 2, 4, or 6 bits)
Other special values:
A - the sum of the ASCII values of the file's contents, a 32-bit value
S - the size of the file in bits, a 32-bit value
Special rules:
the last block is the last chunk of the file, padded to exactly 192 bits with a repeating pattern of 10..
the last 64 bits of the last block are A followed by S, appended to the 192-bit chunk to make a 256-bit block
The hash signature is the final value of K
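To make the arithmetic concrete, here is a minimal Python sketch of F1..F7 and the data-dependent rotation, applied to a single 32-bit chunk. All additions and subtractions are modulo 2^32. Note the hedges: the specification above does not say how the eight chunks of each constant pair up per round, nor what K's initial value is, so this illustrates only the per-chunk operations, with single 32-bit words standing in for each value.

```python
MASK32 = 0xFFFFFFFF

def rotl32(x, n):
    """Rotate a 32-bit value left by n bits."""
    n &= 31
    return ((x << n) | (x >> (32 - n))) & MASK32

def rotr32(x, n):
    """Rotate a 32-bit value right by n bits."""
    n &= 31
    return ((x >> n) | (x << (32 - n))) & MASK32

def wolf1_round(B, K, C0, C1, C2, C3, R):
    """One pass of F1..F7 over single 32-bit words (chunk pairing is assumed)."""
    B  = ((B ^ K) + C0) & MASK32                 # F1
    K  = ((K + B) & MASK32) ^ C1                 # F2
    C0 = ((C0 ^ C2) + K) & MASK32                # F3
    C1 = C1 ^ ((B + C3) & MASK32)                # F4
    C2 = ((B - C2) + K) & MASK32                 # F5
    C3 = ((C3 + B) - R) & MASK32                 # F6
    R  = ((C0 ^ C1) + (C2 ^ C3)) & MASK32        # F7
    # Data-dependent rotation: the amount is the value of the low 3 bits,
    # so an odd B rotates left by 1/3/5/7 and an even B rotates right by 0/2/4/6.
    if B & 1:
        B = rotl32(B, B & 7)
    else:
        B = rotr32(B, B & 7)
    return B, K, C0, C1, C2, C3, R
```

In Python, masking with `& MASK32` after every add and subtract keeps the arbitrary-precision integers behaving like the 32-bit registers a C implementation would use.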

That’s quite a mouthful, but it describes the full algorithm. Wolf2’s specification will be published in a few days.

I am specifically looking for people to analyze the algorithm and tell me of any problems with it. Being just an amateur, I can't really predict the cryptographic strength of any algorithm I propose, though I can do some statistical tests. I don't understand linear or differential cryptanalytic techniques, though I would like to learn how they work. If anyone can comment on the algorithm's security (its one-wayness, collision resistance, and resistance to attack), I would greatly appreciate it.

"God may not play dice with the universe, but something strange is going on with the prime numbers." –Paul Erdős

More hashing stuff

Last night I went through Knuth's books and converted some of the provided constants into hexadecimal values. I then snagged 32 32-bit numbers and have used them as the initial constants in my new hash algorithm. Wolf1 provides a 256-bit signature after performing several cascading and avalanching activities on the blocks read in. These operations include rotations, xors, adds, and subtracts. The 256-bit hash is broken up into 8 32-bit chunks, and then mutilated, mangled, and masticated until it produces a spiffy 256-bit signature at the end.
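The conversion step described above, turning a published mathematical constant into a 32-bit hex word, can be sketched in a few lines of Python. Which constants were actually taken from Knuth isn't stated, so square roots serve as a stand-in here; this "nothing up my sleeve" technique is the same one SHA-256 uses for its initial hash values.

```python
import math

def frac_to_u32(x):
    """Return the first 32 bits of the fractional part of x as an integer."""
    frac = x - math.floor(x)
    return int(frac * (1 << 32)) & 0xFFFFFFFF

# The fractional bits of sqrt(2) give 0x6a09e667, the word SHA-256 uses for h0.
print(hex(frac_to_u32(math.sqrt(2))))  # 0x6a09e667
print(hex(frac_to_u32(math.sqrt(3))))  # 0xbb67ae85
```

Deriving constants this way lets anyone re-verify them, which helps reassure reviewers that the values weren't chosen to hide a weakness.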

A full algorithm specification will be posted in a few days, along with an initial implementation.

First hashing follow-up

In speaking with one of my professors this afternoon, and in re-reading the one-way hash chapter in Bruce Schneier’s excellent book Applied Cryptography, I’ve discovered that the only reasonable way of writing my hashing algorithm is to stick to chunks that are register-sized, or at most the size of all of the available registers (4 on IA32). Implementing such an algorithm, then, will limit me to 128-bit operations on an x86 processor, and having to join the two halves of the hash together to build the entire 256-bit signature. While this isn’t ideal, it’s substantially faster than the methods I wrote about yesterday, which amount to building a virtual machine for my algorithm to run inside of.

I am now reworking my algorithm a bit to comply with these new restrictions. Until somebody goes off and builds a processor for general use like the one I describe in my second Connexions paper (though IBM’s new Cell architecture is close), I’ll need to work within these restricted rules. I’m going to write reference implementations of the algorithm for x86, x86-64, and PowerPC architectures over the next several days, and publish them along with the design criteria and algorithm specification here on my website.
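Sticking to register-sized chunks doesn't rule out wide operations: a 256-bit circular rotate, for instance, can be composed entirely from 32-bit word operations, with no per-bit expansion. A sketch of the idea (the most-significant-word-first ordering is my assumption, not part of the spec):

```python
def rotl256(words, n):
    """Rotate a 256-bit value, held as eight 32-bit words (most significant
    word first), left by n bits using only 32-bit operations."""
    n %= 256
    word_shift, bit_shift = divmod(n, 32)
    # Rotating by a multiple of 32 is just a cyclic shuffle of whole words.
    w = words[word_shift:] + words[:word_shift]
    if bit_shift == 0:
        return w
    out = []
    for i in range(8):
        hi = (w[i] << bit_shift) & 0xFFFFFFFF      # bits kept from this word
        lo = w[(i + 1) % 8] >> (32 - bit_shift)    # bits carried in from the next
        out.append(hi | lo)
    return out
```

On a real register machine the same pattern maps onto shift and or instructions, which is why keeping the state in word-sized pieces is so much faster than the expanded-array approach.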

"Science is facts; just as houses are made of stones, so is science made of facts; but a pile of stones is not a house and a collection of facts is not necessarily science." –Henri Poincaré

Some open questions about hashing

I have been working on several one-way hash algorithms over the past couple weeks. None of them have yet reached the stage of ‘wow this really looks cool’, but the most recent iteration has gotten me pretty close. Now I’m looking at how to implement it in software. The algorithm I designed can create any size hash desired, with 256 and 512 bit versions specified in the algorithm description. Right now the algorithm relies on defined permutation ‘registers’, which need to be uniform across any implementation of the hash. As more details of the algorithm come together, I will be posting it here on my website. The algorithm is going to be released under an open license – GPL or similar – and I want anyone with more experience and skill in this area to provide feedback to me on the design.

As I mentioned, I'm planning on releasing this algorithm in 256- and 512-bit variants, with their associated permutation devices. Implementing the algorithm in dedicated hardware would be quite simple. However, implementing 256- or 512-bit circular shifts and permutations becomes complicated on machines whose built-in registers aren't long enough. As it stands right now, my method for doing this will be to read a 256-bit block from the file being signed, expand it into a 256-entry array, perform all of the operations (shifting, rotating, permuting) on the array, compact the 256 entries back down into 256 bits, and then move on to the next block. Technically, I don't need to compact after every block is read, if I want to hold the key information and the current block in memory in one of these large-entry arrays. However, there is still an expansion pass that must take place before anything else can be done.

I am presuming that expanding 256 bits into a 256-entry array will take 256 iterations through some loop, probably with a few statements inside it, adding up to, say, 1024 statements to be executed per expansion. If I only work with these expanded values, my algorithm, while still an O(n) tool, will be more closely approximated by 1024*n operations… and that's just to handle the expansions. When you factor in the additional (currently about) 1024 operations needed to maneuver the array around in memory (4 per bit is about right), my algorithm will take approximately 2048*n operations to finish. A 32 MB file contains ~1,000,000 32-byte (256-bit) blocks to be processed. If every operation takes just one clock cycle, processing such a file would take about 2 billion (2*10^9) clock cycles, or 2 full CPU-seconds on a 1 GHz processor. Of course, there will be other processes running at the same time, and the delay between processor and main memory (let alone between the processor and disk storage) is about 10-100 to 1. Let's be generous and say that it's only 10 times slower to access RAM than data on the processor. Let's also assume (for the sake of argument) that the entire 32 MB file to be processed is sitting in main memory. I am now looking at 20 billion (2*10^10) clock cycles, or 20 CPU-seconds, to finish my task. And this is only a 32 MB file. Heaven help you if you need to run the hash on a full-disc ISO image (~700 MB). That would take (again, under highly favorable conditions) about 22 times as long, or roughly 440 CPU-seconds on a 1 GHz processor.
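The estimate above works out the same way as a quick back-of-the-envelope script, under the same assumptions (2048 operations per 256-bit block, one operation per cycle):

```python
ops_per_block = 1024 + 1024          # expansion pass + array manipulation
blocks = (32 * 2**20) // 32          # a 32 MB file split into 32-byte blocks
cycles = ops_per_block * blocks      # best case: one operation per cycle

print(cycles)                        # 2147483648, ~2e9: about 2 s at 1 GHz
print(cycles * 10 / 10**9)           # ~21 s once RAM access is ~10x slower
```

The numbers are only order-of-magnitude, but they make it clear the expansion overhead, not the hashing itself, dominates the cost.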

Yuck! Hence the need to find a more efficient way of manipulating the data. There are a few things I could do that would directly exploit the 32-bit x86 architecture, but I don't want the algorithm to be hardware-dependent, so I won't use any fancy tricks that only work on one platform or another.

If you have any ideas about this, please contact me with any suggestions.