§3.2 Designing a remote update algorithm 52
This would work, but it is not practical because of the computational cost of computing a reasonable signature on every possible block. It could be made computationally feasible by making the signature algorithm very cheap to compute, but this is hard to do without making the signature too weak. A weak signature would make the algorithm unusable.
For example, the signature could be just the first 4 bytes of each block. This would be very easy to compute, but the algorithm would fail to produce the right result when two different blocks had their first 4 bytes in common.
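The failure mode described above can be illustrated with a minimal sketch. The function name `weak_signature` and the sample blocks are hypothetical, chosen only to show two different blocks that share their first 4 bytes.

```python
def weak_signature(block: bytes) -> bytes:
    """Toy signature from the text: just the first 4 bytes of the block."""
    return block[:4]


# Two hypothetical blocks that differ, but only after the first 4 bytes.
block_a = b"HEAD of the old file"
block_b = b"HEAD of another file"

same_signature = weak_signature(block_a) == weak_signature(block_b)
same_content = block_a == block_b

# The signatures agree although the contents differ, so an algorithm that
# trusted this signature would wrongly treat block_b as a copy of block_a.
print(same_signature, same_content)
```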
3.2.3 Two signatures
The solution and the key to the rsync algorithm is to use not one signature per block, but two. The first signature needs to be very cheap to compute for all byte offsets and
the second signature needs to have a very low probability of collision. The second, expensive signature then only needs to be computed by A at byte offsets where the
cheap signature matches one of the cheap signatures from B. If we call the two signatures R and H⁶ then the algorithm becomes:
1. B divides $b_i$ into N equally sized blocks $b'_j$ and computes signatures $R_j$ and $H_j$ on each block. These signatures are sent to A.
2. For each byte offset i in $a_i$, A computes $R'_i$ on the block starting at i.
3. A compares $R'_i$ to each $R_j$ received from B.
4. For each j where $R'_i$ matches $R_j$, A computes $H'_i$ and compares it to $H_j$.
5. If $H'_i$ matches $H_j$ then A sends a token to B indicating a block match and which block matches. Otherwise A sends a literal byte to B.
6. B receives literal bytes and tokens from A and uses these to construct $a_i$.
For this algorithm to be effective and efficient we need the following conditions:
• the signature R needs to be cheap to compute at every byte offset in a file;
• the signature H needs to have a very low probability of random collision; and
• A needs to perform the matches on all block signatures received from B very efficiently, as this needs to be done at all byte offsets.
Most of the rest of this chapter deals with the selection of the two signature algorithms, and the related problem of implementing the matching function efficiently.
⁶ I call them R and H for rolling checksum and hash respectively. Hopefully those names will become clear shortly.
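The six steps of the two-signature algorithm in this section can be sketched as follows. This is an illustration under stated assumptions, not the implementation developed later in the chapter. The cheap signature `R` is a plain byte sum recomputed from scratch at every offset, `H` uses MD5 from the standard library as a stand-in strong signature, and the block size, the dictionary lookup, the handling of a trailing partial block, and the choice to skip a whole block after a match are all hypothetical.

```python
import hashlib

BLOCK = 4  # hypothetical block size, kept tiny so the example is easy to trace


def R(block: bytes) -> int:
    """Stand-in cheap signature: byte sum modulo 2^16 (assumption)."""
    return sum(block) & 0xFFFF


def H(block: bytes) -> bytes:
    """Stand-in strong signature: MD5 digest (assumption)."""
    return hashlib.md5(block).digest()


def signatures(b: bytes) -> dict:
    """Step 1, run on B: map each R_j to the list of (j, H_j) pairs."""
    table = {}
    for j in range(len(b) // BLOCK):  # a trailing partial block is ignored
        block = b[j * BLOCK:(j + 1) * BLOCK]
        table.setdefault(R(block), []).append((j, H(block)))
    return table


def delta(a: bytes, table: dict) -> list:
    """Steps 2 to 5, run on A: emit block tokens and literal bytes."""
    out = []
    i = 0
    while i < len(a):
        block = a[i:i + BLOCK]
        match = None
        candidates = table.get(R(block)) if len(block) == BLOCK else None
        if candidates:
            h = H(block)  # the expensive signature is computed only here
            for j, h_j in candidates:
                if h == h_j:
                    match = j
                    break
        if match is not None:
            out.append(("token", match))
            i += BLOCK  # assumption: resume scanning after the matched block
        else:
            out.append(("literal", a[i:i + 1]))
            i += 1
    return out


def reconstruct(b: bytes, stream: list) -> bytes:
    """Step 6, run on B: rebuild a from tokens and literal bytes."""
    parts = []
    for kind, value in stream:
        if kind == "token":
            parts.append(b[value * BLOCK:(value + 1) * BLOCK])
        else:
            parts.append(value)
    return b"".join(parts)


old = b"the quick brown fox jumps"
new = b"the quick red fox jumps!"
stream = delta(new, signatures(old))
rebuilt = reconstruct(old, stream)
```

In this sketch the dictionary keyed on the cheap signature is one simple way to address the third condition above, since each offset on A needs only a single lookup rather than a comparison against every signature from B.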
3.2.4 Selecting the strong signature
The strong signature algorithm is the easier of the two. It doesn’t need to be particularly fast as it is only computed on block boundaries by B and at byte boundaries on A only when the fast signature matches.
The main property that the algorithm must have is that if two blocks are different they should have a very low probability of having the same signature. There are
many well known algorithms that have this property, perhaps the best known being the message digest algorithms commonly used in cryptographic applications. These
algorithms are believed to have the following properties, where b is the number of bits in the signature [Schneier 1996]:
• The probability that a randomly generated block has the same signature as a given block is $O(2^{-b})$.
• The computational difficulty of finding a second block that has the same signature as a given block is $\Omega(2^b)$.
• The individual bits in the signature are uncorrelated and have a uniform distribution.
These properties make a message digest algorithm ideal for rsync. The particular algorithm that is used for most of the results in this chapter is the 128 bit MD4 message digest [Rivest 1990]. This algorithm was chosen because of the ready availability of source code implementations and the high throughput compared to many other algorithms. Although MD4 is thought to be not as cryptographically strong as some later algorithms such as MD5 or IDEA, the difference is unimportant for rsync.
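The first property in the list above can be checked empirically with a small experiment. The sketch below is only an illustration: it uses MD5 from the standard library as a stand-in for MD4 (which is not reliably available in `hashlib`), truncates the digest to a hypothetical b = 8 bits so that collisions are frequent enough to count, and draws random 32-byte blocks from a seeded generator. The function names are assumptions.

```python
import hashlib
import random


def truncated_signature(block: bytes, bits: int) -> int:
    """Keep only the top `bits` bits of the 128 bit digest (MD5 stand-in)."""
    digest = hashlib.md5(block).digest()
    return int.from_bytes(digest, "big") >> (128 - bits)


def collision_rate(bits: int, trials: int, seed: int = 0) -> float:
    """Fraction of random blocks whose signature equals that of a given block."""
    rng = random.Random(seed)
    target = truncated_signature(b"a given block", bits)
    hits = 0
    for _ in range(trials):
        candidate = rng.getrandbits(256).to_bytes(32, "big")
        if truncated_signature(candidate, bits) == target:
            hits += 1
    return hits / trials


rate = collision_rate(bits=8, trials=20000)
# The observed rate should be close to 2^-8; for the full 128 bit signature
# the corresponding figure, 2^-128, is far too small to observe this way.
print(rate, 2 ** -8)
```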
In reality, MD4 is “overkill” as the strong signature algorithm for rsync. It would be quite possible to use a cryptographically weaker but computationally less expensive algorithm such as a simple polynomial based algorithm. This wasn’t done because testing showed that the MD4 computation does not present a significant bottleneck on modern CPUs. For example, on a 200 MHz Pentium processor the MD4 implementation achieved 6 MB/sec throughput, which is far in excess of the bandwidth of most local area networks. As rsync is aimed at low bandwidth networks the computational cost of MD4 is insignificant.⁷
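A throughput figure of this kind can be estimated with a short timing loop. The following is a rough sketch, not the measurement method used for the figure quoted above: it times MD5 from the standard library as a stand-in for MD4, and the function name, the 700 byte block size and the amount of data hashed are hypothetical. Results will vary by machine.

```python
import hashlib
import time


def digest_throughput_mb_per_sec(total_mb: int = 4, block_size: int = 700) -> float:
    """Hash `total_mb` megabytes in `block_size` byte blocks and return MB/sec."""
    block = bytes(block_size)  # a block of zero bytes is enough for timing
    n_blocks = (total_mb * 1024 * 1024) // block_size
    start = time.perf_counter()
    for _ in range(n_blocks):
        hashlib.md5(block).digest()  # MD5 stand-in for the strong signature
    elapsed = time.perf_counter() - start
    return (n_blocks * block_size) / (1024 * 1024) / elapsed


throughput = digest_throughput_mb_per_sec()
print(f"{throughput:.1f} MB/sec")
```

Comparing the printed figure with the bandwidth of the target network gives a quick indication of whether the strong signature is likely to be the bottleneck.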
3.2.5 Selecting the fast signature