§3.2 Designing a remote update algorithm 51
means that S cannot uniquely identify all possible files b_i, which means that the algorithm must have a finite probability of failure. We will look at this more closely after the algorithm has been fleshed out some more.³
3.2.1 First attempt
An example of a very simple form for the algorithm is

1. B divides b_i into N equally sized blocks b′_j and computes a signature S_j on each block.⁴ These signatures are sent to A.
2. A divides a_i into N blocks a′_k and computes S′_k on each block.
3. A searches for S_j matching S′_k for all k.
4. For each k, A sends to B either the block number j in S_j that matched S′_k or a literal block a′_k.
5. B constructs a_i using blocks from b_i or literal blocks from a_i.

This algorithm is very simple and meets some of the aims of our remote update algorithm, but it is useless in practice.⁵ The problem with it is that A can only find matches that are on block boundaries. If the file on A is the same as the file on B except that one byte has been inserted at the start of the file, then no block matches will be found and the algorithm will transfer the whole file.
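The five steps and their weakness can be illustrated with a minimal Python sketch. Everything named here is an assumption made for illustration, not part of the text: MD5 stands in for the unspecified signature S, a fixed block size of 64 bytes replaces "N equally sized blocks", and the function names are hypothetical.

```python
import hashlib

BLOCK = 64  # assumed block size, chosen only for this illustration


def signature(block):
    # Stand-in for the signature S; MD5 is an assumption, not the text's choice.
    return hashlib.md5(block).digest()


def split_blocks(data, block_size=BLOCK):
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def make_signatures(b, block_size=BLOCK):
    # Step 1 (on B): one signature S_j per block of b.
    return [signature(blk) for blk in split_blocks(b, block_size)]


def make_delta(a, sigs_b, block_size=BLOCK):
    # Steps 2-4 (on A): for each block of a, send a block number j or a literal.
    index = {}
    for j, s in enumerate(sigs_b):
        index.setdefault(s, j)
    delta = []
    for blk in split_blocks(a, block_size):
        j = index.get(signature(blk))
        delta.append(("literal", blk) if j is None else ("match", j))
    return delta


def reconstruct(b, delta, block_size=BLOCK):
    # Step 5 (on B): rebuild a from blocks of b and the literal blocks.
    blocks_b = split_blocks(b, block_size)
    out = bytearray()
    for kind, value in delta:
        out += blocks_b[value] if kind == "match" else value
    return bytes(out)


# Deterministic pseudo-random test data: 1024 bytes, i.e. 16 blocks.
b = b"".join(hashlib.md5(bytes([i])).digest() for i in range(64))
sigs = make_signatures(b)

# An in-place change of 4 bytes touches only one block.
a_edit = b[:100] + bytes(x ^ 0xFF for x in b[100:104]) + b[104:]
delta_edit = make_delta(a_edit, sigs)
print(sum(kind == "match" for kind, _ in delta_edit), "blocks matched")

# One byte inserted at the start shifts every block boundary.
a_shift = b"!" + b
delta_shift = make_delta(a_shift, sigs)
print(sum(kind == "match" for kind, _ in delta_shift), "blocks matched")
```

With the in-place change, all but one block are sent as block numbers. With the one-byte insertion, every block of the shifted file is sent as a literal, so the whole file is transferred.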
3.2.2 A second try
We can solve this problem by getting A to generate signatures not just at block boundaries, but at all byte boundaries. When A compares the signature at each byte boundary with each of the signatures S_j on block boundaries of b_i, it will be able to find matches at non-block offsets. This allows for arbitrary length insertions and deletions between a_i and b_i to be handled.
³ While there has been some earlier information theoretic work on the remote update problem [Orlitsky 1993], that research did not lead to a practical algorithm.
⁴ Think of the blocks as being a few hundred bytes long; we will deal with the optimal size later.
⁵ The above algorithm is the one most commonly suggested to me when I first describe the remote update problem to colleagues.
This would work, but it is not practical because of the computational cost of computing a reasonable signature on every possible block. It could be made computationally feasible by making the signature algorithm very cheap to compute, but this is hard to do without making the signature too weak. A weak signature would make the algorithm unusable. For example, the signature could be just the first 4 bytes of each block. This would be very easy to compute, but the algorithm would fail to produce the right result when two different blocks had their first 4 bytes in common.
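A minimal Python sketch of this second approach shows both the matching at arbitrary byte offsets and the failure of a weak signature. The names and choices are assumptions for illustration only: MD5 stands in for a "reasonable" signature, the block size is 64 bytes, and the search skips a whole block after a match, a simplification the text does not specify.

```python
import hashlib

BLOCK = 64  # assumed block size, chosen only for this illustration


def strong_sig(block):
    # Stand-in for a reasonable (expensive) signature; MD5 is an assumption.
    return hashlib.md5(block).digest()


def weak_sig(block):
    # The weak signature from the text: just the first 4 bytes of the block.
    return block[:4]


def make_delta(a, b, sig, block_size=BLOCK):
    # Index signatures of the full blocks of b, on block boundaries only.
    index = {}
    for start in range(0, len(b) - block_size + 1, block_size):
        index.setdefault(sig(b[start:start + block_size]), start // block_size)
    delta, literal, pos, computed = [], bytearray(), 0, 0
    # Try a signature at every byte offset of a.
    while pos + block_size <= len(a):
        computed += 1
        j = index.get(sig(a[pos:pos + block_size]))
        if j is None:
            literal.append(a[pos])  # no match here: emit one literal byte
            pos += 1
        else:
            if literal:
                delta.append(("literal", bytes(literal)))
                literal = bytearray()
            delta.append(("match", j))
            pos += block_size  # assumed simplification: skip the matched block
    literal += a[pos:]
    if literal:
        delta.append(("literal", bytes(literal)))
    return delta, computed


def reconstruct(b, delta, block_size=BLOCK):
    out = bytearray()
    for kind, value in delta:
        if kind == "match":
            out += b[value * block_size:(value + 1) * block_size]
        else:
            out += value
    return bytes(out)


# One byte inserted at the start: matches are now found at non-block offsets.
b = b"".join(hashlib.md5(bytes([i])).digest() for i in range(64))
a = b"!" + b
delta, computed = make_delta(a, b, strong_sig)
print(len(delta), "delta entries;", computed, "signatures computed")
print("strong signature correct:", reconstruct(b, delta) == a)

# Two different blocks sharing their first 4 bytes defeat the weak signature.
b2 = b"HEAD" + b"x" * 60
a2 = b"HEAD" + b"y" * 60
delta2, _ = make_delta(a2, b2, weak_sig)
print("weak signature correct:", reconstruct(b2, delta2) == a2)
```

In the first case the inserted byte is sent as a literal and every block of b is matched. In the worst case, when nothing matches, the loop computes a signature at almost every byte offset of a, which is the computational cost described above. In the second case the weak signature reports a match for a block that differs, so B reconstructs the wrong file.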
3.2.3 Two signatures