Hints for the Final Project

Your design must work with messages of different sizes in bytes, where the number of bytes need not be a multiple of 4 (i.e., the message need not end on a word boundary). Also, you cannot hard-code the message size into your design.

As a test case, the message size in the current testbench is set to 120 bytes. After the padding step, the message will be expanded into three 512-bit message blocks.
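
For reference, the block count follows directly from the padding rule (a single '1' bit is appended, then '0' bits, then the 64-bit message length):

    120 bytes = 960 bits of message
    960 + 1 + 64 = 1025 bits that must fit into the padded message
    ceil(1025 / 512) = 3 blocks, i.e., 1536 bits in total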

A relatively straightforward implementation would process each 512-bit message block in three separate phases:

1. Read the 16 message words w[0] ... w[15] from memory (roughly 16 cycles, plus any memory latency).
2. Compute the expanded message words w[16] ... w[79] (roughly 64 cycles).
3. Perform the 80 rounds of hash updates on { a, b, c, d, e } and add the result into the running hash (roughly 80 cycles, plus a couple of cycles of overhead).

This translates to about 162 cycles per message block, and three message blocks would take about 486 cycles.

In the FPGA implementation paper cited on the final design project page, the author describes an implementation that processes each message block in 82 cycles. Effectively, his implementation interleaves the computation of w[i] with the updating of the hash { a, b, c, d, e }.
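
For concreteness, here is a rough Verilog sketch of the interleaving idea. It is not the design from the cited paper; the module name, port names, and control signals (start, mem_word, done, etc.) are placeholders, it assumes a message word is available every cycle for rounds 0..15, and the final addition of { a, b, c, d, e } into the running hash after round 80 is omitted.

// Rough sketch only: one SHA-1 round per cycle, with the schedule word for
// the current round computed in the same cycle.
module sha1_round_sketch (
    input  wire        clk,
    input  wire        start,      // load h0..h4 into a..e and restart the round counter
    input  wire [31:0] mem_word,   // message word for rounds 0..15 (assumed available every cycle)
    input  wire [31:0] h0, h1, h2, h3, h4,
    output wire        done
);
    reg [31:0] w [0:15];           // sliding window holding w[i-16] .. w[i-1]
    reg [31:0] a, b, c, d, e;
    reg [6:0]  round;

    function [31:0] rotl;          // rotate a 32-bit word left by n bits
        input [31:0] x;
        input [4:0]  n;
        rotl = (x << n) | (x >> (32 - n));
    endfunction

    // Schedule word for the current round: a fresh memory word for rounds
    // 0..15, otherwise the usual recurrence over the last 16 words.
    wire [31:0] wt = (round < 16) ? mem_word
                                  : rotl(w[13] ^ w[8] ^ w[2] ^ w[0], 1);

    // Round-dependent nonlinear function and constant.
    reg [31:0] f, k;
    always @(*) begin
        if      (round < 20) begin f = (b & c) | (~b & d);          k = 32'h5A827999; end
        else if (round < 40) begin f = b ^ c ^ d;                   k = 32'h6ED9EBA1; end
        else if (round < 60) begin f = (b & c) | (b & d) | (c & d); k = 32'h8F1BBCDC; end
        else                 begin f = b ^ c ^ d;                   k = 32'hCA62C1D6; end
    end

    integer j;
    always @(posedge clk) begin
        if (start) begin
            {a, b, c, d, e} <= {h0, h1, h2, h3, h4};
            round           <= 0;
        end else if (round < 80) begin
            // Hash update and schedule update happen in the same cycle.
            e <= d;
            d <= c;
            c <= rotl(b, 30);
            b <= a;
            a <= rotl(a, 5) + f + e + k + wt;
            for (j = 0; j < 15; j = j + 1)
                w[j] <= w[j + 1];  // shift the 16-word window
            w[15] <= wt;
            round <= round + 1;
        end
    end

    assign done = (round == 80);
endmodule

The key point is that the 16-entry window w[0:15] and the working variables a ... e are updated on the same clock edge, so no separate pass over w[16] ... w[79] is needed.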

In our problem setting, more than 82 cycles per message block will likely be required because of the extra cycles needed to read the 16 words from external memory. In this case, it may take around 100 cycles or more to process each message block, or around 300 cycles or more to process the three message blocks.

However, it may conceivably be possible to get below 82 cycles per message block if you combine, say, two rounds of hash updates into a single cycle, though the area will likely increase significantly.

Also, at the expense of additional area, you can conceivably read ahead to the next message block and thus hide the memory latency. For example, when processing the jth message block, read the 16 memory words for the (j + 1)th message block. This is likely to be quite tricky to do, but possible.
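
As a very rough illustration of the double buffering involved (this is only a sketch, not a complete design: the memory handshake signals are placeholders, and the control for the first and last blocks, and for memory running slower than the rounds, is omitted):

// Sketch only: double-buffered storage for the 16 message words, so that
// block j+1 can be fetched from memory while block j is being hashed.
module prefetch_sketch (
    input  wire        clk,
    input  wire [31:0] mem_data,        // next word arriving from memory (placeholder)
    input  wire        mem_data_valid,  // placeholder handshake signal
    input  wire        block_done       // the 80 rounds of the current block finished
);
    reg [31:0] wbuf [0:1][0:15];        // two 16-word banks
    reg        cur      = 1'b0;         // bank currently feeding the round logic
    reg [4:0]  fill_cnt = 5'd0;         // words of the next block received so far

    always @(posedge clk) begin
        if (block_done && fill_cnt == 5'd16) begin
            // Current block done and the next block fully prefetched:
            // swap banks and start filling the bank that was just freed.
            cur      <= ~cur;
            fill_cnt <= 5'd0;
        end else if (mem_data_valid && fill_cnt < 5'd16) begin
            wbuf[~cur][fill_cnt[3:0]] <= mem_data;   // store a word of the next block
            fill_cnt                  <= fill_cnt + 5'd1;
        end
    end
endmodule

The round logic would read its 16 words from wbuf[cur] while the other bank fills in the background.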

Some of these ideas may be good for the Delay metric, but possibly not good for the Delay x Area metric, so optimizing for the two metrics may well require two different designs.

In any case, getting below 100 cycles per message block on average will not be easy, and it is unclear what the impact on area would be.

Finally, here is another potential optimization idea. A straightforward implementation would employ an array of 80 32-bit words to implement w[0], w[1], ..., w[79]. In the computation of w[16] ... w[79], the logic is as follows:

w[i] <= leftrotate(w[i-3] ^ w[i-8] ^ w[i-14] ^ w[i-16]);   // leftrotate here is a rotate left by 1 bit

To compute w[i], it suffices to keep only the last 16 entries, w[i-1] ... w[i-16], in registers. To save area, it may be possible to implement just
reg [31:0] w[0:15];
with some additional logic that turns it into a shift register. We will leave it to you to figure out the logic required to make this work.
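
For what it is worth, here is one possible arrangement, shown only as a sketch and not necessarily the best one: rather than building a literal shift register, w[0:15] can be addressed as a circular buffer using the low four bits of the round counter. The names i, clk, and expanding below are placeholders for whatever control logic your design already has.

// Sketch only: w[0:15] used as a circular buffer instead of a literal
// shift register, so no registers need to physically shift each cycle.
reg [31:0] w [0:15];
reg [6:0]  i;                          // round counter (placeholder)

wire [31:0] xored = w[(i - 3)  & 15] ^ w[(i - 8)  & 15] ^
                    w[(i - 14) & 15] ^ w[(i - 16) & 15];
wire [31:0] wt    = {xored[30:0], xored[31]};   // rotate left by 1 bit

always @(posedge clk)
    if (expanding)                     // i runs from 16 to 79
        w[i & 15] <= wt;               // overwrites the slot holding w[i-16]

Either form avoids storing all 80 words; which one is smaller will depend on your synthesis results.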