Selecting the fast signature

§3.2 Designing a remote update algorithm 54

In reality, MD4 is “overkill” as the strong signature algorithm for rsync. It would be quite possible to use a cryptographically weaker but computationally less expensive algorithm, such as a simple polynomial-based algorithm. This wasn’t done because testing showed that the MD4 computation is not a significant bottleneck on modern CPUs. For example, on a 200 MHz Pentium processor the MD4 implementation achieved 6 MB/sec throughput, which is far in excess of the bandwidth of most local area networks. As rsync is aimed at low bandwidth networks, the computational cost of MD4 is insignificant.[7]

3.2.5 Selecting the fast signature

The role of the fast signature in rsync is to act as a filter, preventing excessive use of the strong signature algorithm. Its most important feature is that it can be computed very cheaply at every byte offset in a file.

The first fast signature algorithm that I investigated was just a concatenation of the first 4 and last 4 bytes of each block. Although this worked, it had the serious flaw that for certain types of structured data, such as tar files, sampling a subset of the bytes in a block provided a very poor signature algorithm. It was not uncommon to find a large proportion of byte offsets in a file that had the same fast signature, which resulted in an excessive use of the strong signature algorithm and very high computational cost.

This led to the investigation of fast signature algorithms which depended on all the bytes in a block, while requiring very little computation to find the signature values for every byte offset in a file. Perhaps the simplest such algorithm is

$$R(a) = \sum_i a_i$$

This would be very fast because it can be computed incrementally with one addition and one subtraction to “slide” the signature from one byte offset to the next.[8] The problem with this signature is that it is independent of the order of the bytes in the buffer. This means that, when sliding the buffer, if we add the same byte on the end of the buffer as we remove from the start then we end up with the same signature value.

The solution is to make the signature order dependent by introducing factors dependent on $i$ into the signature.

[7] The use of MD4 also seemed to be an important factor in convincing skeptical users of the safety of the file transfer algorithm. The fact that a failed transfer is equivalent to “cracking” MD4 allayed suspicions significantly and led to the faster adoption of the algorithm.

[8] I will sometimes refer to this property as the rolling property.
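
The weakness of the plain sum can be seen in a few lines of Python. This is only an illustrative sketch: the function names `sum_signature` and `slide_sum` and the sample data are assumptions made for the example, not part of rsync.

```python
def sum_signature(block):
    """R(a): the plain sum of the bytes in a block (no modulus, for clarity)."""
    return sum(block)


def slide_sum(sig, outgoing, incoming):
    """Slide the window by one byte: one subtraction and one addition."""
    return sig - outgoing + incoming


# Hypothetical periodic input, standing in for highly structured data.
data = b"abcabcab"
L = 3  # block length

# Roll the signature across every offset where a full block fits.
sig = sum_signature(data[0:L])
rolled = [sig]
for k in range(len(data) - L):
    sig = slide_sum(sig, data[k], data[k + L])
    rolled.append(sig)

# Recompute every signature from scratch for comparison.
direct = [sum_signature(data[k:k + L]) for k in range(len(data) - L + 1)]
```

Here the byte leaving the window always equals the byte entering it, so every offset receives the same signature, and any permutation of a block (for example `b"abc"` and `b"cba"`) collides as well.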

The algorithm that was chosen is defined by[9]

$$r_1(k, L) = \left( \sum_{i=0}^{L-1} a_{i+k} \right) \bmod M$$

$$r_2(k, L) = \left( \sum_{i=0}^{L-1} (L - i)\, a_{i+k} \right) \bmod M$$

$$r(k, L) = r_1(k, L) + M\, r_2(k, L)$$

where $r(k, L)$ is the signature at offset $k$ for a block of length $L$. $M$ is an arbitrary modulus, and was chosen to be $2^{16}$ for simplicity and speed. Note that this results in a 32 bit signature.

This signature can be computed incrementally as follows

$$r_1(k+1, L) = \left( r_1(k, L) - a_k + a_{k+L} \right) \bmod M$$

$$r_2(k+1, L) = \left( r_2(k, L) - L\, a_k + r_1(k+1, L) \right) \bmod M$$

$$r(k+1, L) = r_1(k+1, L) + M\, r_2(k+1, L)$$

This allows the computation of successive values of the fast signature with 3 additions, 2 subtractions, 1 multiplication and a shift, assuming that $M$ is a power of two.[10]
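
As a check on the recurrences, the following Python sketch computes the signature both directly from the definition and incrementally. It assumes $M = 2^{16}$ as stated in the text; the function names (`fast_signature`, `roll`, `combine`, `all_signatures`) are hypothetical and chosen only for this illustration, and the code is not taken from the rsync source.

```python
M = 1 << 16  # modulus 2^16, as chosen in the text


def fast_signature(data, k, L):
    """Compute (r1, r2) at offset k directly from the definition."""
    r1 = sum(data[k + i] for i in range(L)) % M
    r2 = sum((L - i) * data[k + i] for i in range(L)) % M
    return r1, r2


def roll(r1, r2, outgoing, incoming, L):
    """Advance (r1, r2) from offset k to k + 1.

    outgoing is a_k and incoming is a_{k+L}. Python's % always returns a
    non-negative result, so the subtractions need no special handling.
    """
    r1 = (r1 - outgoing + incoming) % M
    r2 = (r2 - L * outgoing + r1) % M  # uses the already updated r1
    return r1, r2


def combine(r1, r2):
    """Pack the two 16 bit halves into r = r1 + M * r2."""
    return r1 + (r2 << 16)


def all_signatures(data, L):
    """Return the signature at every offset where a full block fits."""
    if len(data) < L:
        return []
    r1, r2 = fast_signature(data, 0, L)
    out = [combine(r1, r2)]
    for k in range(len(data) - L):
        r1, r2 = roll(r1, r2, data[k], data[k + L], L)
        out.append(combine(r1, r2))
    return out
```

Comparing `all_signatures(data, L)` against `combine(*fast_signature(data, k, L))` at each offset `k` is a simple way to confirm that the rolling update matches the definition. Unlike the plain sum, the blocks `b"abc"` and `b"cba"` share the same $r_1$ but differ in $r_2$, so their combined signatures differ.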

3.2.6 The signature search algorithm