I have been working on several one-way hash algorithms over the past couple weeks. None of them have yet reached the stage of ‘wow this really looks cool’, but the most recent iteration has gotten me pretty close. Now I’m looking at how to implement it in software. The algorithm I designed can create any size hash desired, with 256 and 512 bit versions specified in the algorithm description. Right now the algorithm relies on defined permutation ‘registers’, which need to be uniform across any implementation of the hash. As more details of the algorithm come together, I will be posting it here on my website. The algorithm is going to be released under an open license – GPL or similar – and I want anyone with more experience and skill in this area to provide feedback to me on the design.
As I mentioned, I’m planning on releasing this algorithm in 256- and 512-bit variants, with their associated permutation devices. Implementing the algorithm in dedicated hardware would be quite simple. However, implementing 256 or 512 bit circular shifts and permutations becomes complicated on machines that do not have built-in registers long enough to do this. As it stands right now, my method for doing this will be to read in a 256-bitblock from the file being signed, expand that into a 256-entry array, perform all of the operations (shifting, rotating, permuting) on the array, and then compact the 256 entries back down into 256 bits, and then move on to the next block. Technically, I don’t need to compact after every block is read, if I want to hold key information and current block information in memory in one of these large-entry arrays. However, there is still a pass of expanding that must take place before anything else can be done.
I am presuming that expanding 256 bits into a 256-entry array will take 256 iterations through some loop, probably with a few statements inside it adding up to, say, 1024 statements to be executed per expansion. If I only work with these expanded values, my algorithm, while still a O(n) tool, will be more closely approximated by 1024*n operations… and that’s just to handle the expansions. When you factor in the (currently) about and additional 1024 operations (4 per bit is about right) needed to maneuver the array around in memory, my algorithm will take approximately 2048*n operations to finish. If you look at a 32Mb file, there will be ~1000000 32-byte (256-bit) blocks to be processed. If every operation takes just one clock cycle, processing such a file would take about 2 billion (2*10^9) clock cycles to run, or 2 full CPU-seconds on a 1Ghz processor. Of course, there will be other processes running at the same time, and the delay between processor and main memory (let alone between the processor and disk storage) is about 10-100 to 1. Let’s be generous and say that it’s only 10 times slower to access RAM than data on the processor. Let’s also assume (for sake of argument) that the entire 32Mb file to be processed is sitting in main memory. I am now running 20 billion (2*10^10) clock cycle, or 20 CPU-seconds to finish my task. And this is only a 32Mb file. Heaven help you if you need to run the hash on a full-disc ISO image (~700Mb). That would take (again, in highly favorable conditions) about 25 times as long, or 500 CPU-seconds on a 1Ghz processor.
Yuck! Hence the need to find a more efficient way of manipulating the data. There are a few things I can do that would directly exploit 32-bit x86 architecture, but I want the algorithm to not be hardware-dependant, so I don’t want to use any fancy tricks that only work on one platform or another.
If you have any ideas about this, please contact me with any suggestions.