
§3.2 Designing a remote update algorithm

means that S cannot uniquely identify all possible files b_i, which means that the algorithm must have a finite probability of failure. We will look at this more closely after the algorithm has been fleshed out some more.³

3.2.1 First attempt

An example of a very simple form for the algorithm is

1. B divides b_i into N equally sized blocks b′_j and computes a signature S_j on each block.⁴ These signatures are sent to A.
2. A divides a_i into N blocks a′_k and computes S′_k on each block.
3. A searches for S_j matching S′_k for all k.
4. For each k, A sends to B either the block number j in S_j that matched S′_k or a literal block a′_k.
5. B constructs a_i using blocks from b_i or literal blocks from a_i.

This algorithm is very simple and meets some of the aims of our remote update algorithm, but it is useless in practice.⁵ The problem with it is that A can only find matches that are on block boundaries. If the file on A is the same as the file on B except that one byte has been inserted at the start of the file, then no block matches will be found and the algorithm will transfer the whole file.
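The five steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: MD5 stands in for the unspecified signature S, a fixed block size replaces "N equally sized blocks" (so the last block may be short), and the function names make_signatures, make_delta and apply_delta are hypothetical.

```python
import hashlib

def signature(block):
    # Stand-in for S; MD5 is an assumption made for this sketch only.
    return hashlib.md5(block).digest()

def split_blocks(data, block_size):
    # Cut data at block boundaries; the final block may be shorter.
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def make_signatures(b, block_size):
    # Step 1: B computes S_j for each block of its file and sends them to A.
    return [signature(blk) for blk in split_blocks(b, block_size)]

def make_delta(a, sigs, block_size):
    # Steps 2-4: A signs its own blocks and looks each one up in B's list.
    index = {}
    for j, s in enumerate(sigs):
        index.setdefault(s, j)
    delta = []
    for blk in split_blocks(a, block_size):
        j = index.get(signature(blk))
        delta.append(("match", j) if j is not None else ("literal", blk))
    return delta

def apply_delta(b, delta, block_size):
    # Step 5: B rebuilds A's file from its own blocks and the literals.
    blocks = split_blocks(b, block_size)
    out = bytearray()
    for kind, value in delta:
        out += blocks[value] if kind == "match" else value
    return bytes(out)

# B's file: 16 distinct blocks of 64 bytes each.
b = b"".join(bytes([i]) * 64 for i in range(16))
sigs = make_signatures(b, 64)

# One block overwritten in place: 15 of 16 blocks still match.
a_changed = b[:192] + b"Y" * 64 + b[256:]
delta = make_delta(a_changed, sigs, 64)
print(sum(kind == "match" for kind, _ in delta), "matches of", len(delta))

# One byte inserted at the start: every block boundary shifts, nothing matches.
a_shifted = b"X" + b
delta = make_delta(a_shifted, sigs, 64)
print(sum(kind == "match" for kind, _ in delta), "matches of", len(delta))
```

The second case is the failure described above: the result is still correct, but every block is sent as a literal, so nothing is saved over transferring the whole file.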

3.2.2 A second try

We can solve this problem by getting A to generate signatures not just at block boundaries, but at all byte boundaries. When A compares the signature at each byte boundary with each of the signatures S_j on block boundaries of b_i it will be able to find matches at non-block offsets. This allows for arbitrary length insertions and deletions between a_i and b_i to be handled.

This would work, but it is not practical because of the computational cost of computing a reasonable signature on every possible block. It could be made computationally feasible by making the signature algorithm very cheap to compute, but this is hard to do without making the signature too weak. A weak signature would make the algorithm unusable. For example, the signature could be just the first 4 bytes of each block. This would be very easy to compute, but the algorithm would fail to produce the right result when two different blocks had their first 4 bytes in common.

³ While there has been some earlier information theoretic work on the remote update problem [Orlitsky 1993], that research did not lead to a practical algorithm.
⁴ Think of the blocks as being a few hundred bytes long; we will deal with the optimal size later.
⁵ The algorithm in Section 3.2.1 is the one most commonly suggested to me when I first describe the remote update problem to colleagues.
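The byte-offset search of Section 3.2.2, and the way a weak signature breaks it, can be sketched as follows. The assumptions are this sketch's own: MD5 stands in for a "reasonable" signature, only whole blocks of B's file are signed, the search jumps a full block ahead after a match, and the names strong_sig, weak_sig, make_delta and apply_delta are hypothetical. make_delta also returns how many signatures A had to compute, since that count is the cost the text objects to.

```python
import hashlib

def strong_sig(block):
    # Stand-in for a reasonable signature; MD5 is an assumption of this sketch.
    return hashlib.md5(block).digest()

def weak_sig(block):
    # The deliberately weak example from the text: the first 4 bytes only.
    return block[:4]

def make_delta(a, b, block_size, sig):
    # B's side: signatures at block boundaries (whole blocks only).
    index = {}
    for j in range(len(b) // block_size):
        index.setdefault(sig(b[j * block_size:(j + 1) * block_size]), j)
    # A's side: try a signature at every byte offset.
    delta, literal = [], bytearray()
    pos = sig_count = 0
    while pos + block_size <= len(a):
        j = index.get(sig(a[pos:pos + block_size]))
        sig_count += 1
        if j is None:
            literal.append(a[pos])   # no match here: emit one literal byte
            pos += 1
        else:
            if literal:
                delta.append(("literal", bytes(literal)))
                literal = bytearray()
            delta.append(("match", j))
            pos += block_size        # sketch's choice: skip past the match
    literal += a[pos:]
    if literal:
        delta.append(("literal", bytes(literal)))
    return delta, sig_count

def apply_delta(b, delta, block_size):
    out = bytearray()
    for kind, value in delta:
        if kind == "match":
            out += b[value * block_size:(value + 1) * block_size]
        else:
            out += value
    return bytes(out)

# A one-byte insertion is now handled: one literal byte, then 16 matches.
b = b"".join(bytes([i]) * 64 for i in range(16))
a = b"X" + b
delta, _ = make_delta(a, b, 64, strong_sig)
print(apply_delta(b, delta, 64) == a)      # True

# Two different blocks that share their first 4 bytes fool the weak signature.
b_w = b"HEAD" + b"old contents"
a_w = b"HEAD" + b"new contents"
delta, _ = make_delta(a_w, b_w, 16, weak_sig)
print(apply_delta(b_w, delta, 16) == a_w)  # False: B silently keeps old data
```

When A's file shares nothing with B's, the loop computes one signature per byte offset, len(a) - block_size + 1 of them, each over a full block. That is the computational cost that makes this version impractical with a strong signature.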

3.2.3 Two signatures