The signature search algorithm

§3.2 Designing a remote update algorithm 55

the buffer as we remove from the start then we end up with the same signature value. The solution is to make the signature order dependent by introducing factors dependent on $i$ into the signature. The algorithm that was chosen is defined by [9]

$$
\begin{aligned}
r_1(k, L) &= \left( \sum_{i=0}^{L-1} a_{i+k} \right) \bmod M \\
r_2(k, L) &= \left( \sum_{i=0}^{L-1} (L - i)\, a_{i+k} \right) \bmod M \\
r(k, L) &= r_1(k, L) + M\, r_2(k, L)
\end{aligned}
$$

where $r(k, L)$ is the signature at offset $k$ for a block of length $L$. $M$ is an arbitrary modulus, and was chosen to be $2^{16}$ for simplicity and speed. Note that this results in a 32 bit signature.

This signature can be computed incrementally as follows

$$
\begin{aligned}
r_1(k+1, L) &= \left( r_1(k, L) - a_k + a_{k+L} \right) \bmod M \\
r_2(k+1, L) &= \left( r_2(k, L) - L\, a_k + r_1(k+1, L) \right) \bmod M \\
r(k+1, L) &= r_1(k+1, L) + M\, r_2(k+1, L)
\end{aligned}
$$

This allows the computation of successive values of the fast signature with 3 additions, 2 subtractions, 1 multiplication and a shift, assuming that $M$ is a power of two [10].
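The incremental form can be checked against the direct definition with a short sketch. The Python below is illustrative only and is not the thesis implementation: the names `fast_signature` and `roll` are hypothetical, and the block is assumed to lie entirely inside the data.

```python
M = 1 << 16  # modulus 2^16, as chosen in the text


def fast_signature(data, k, L):
    """Compute (r1, r2, r) for the block data[k:k+L] directly from the definition."""
    r1 = sum(data[k + i] for i in range(L)) % M
    r2 = sum((L - i) * data[k + i] for i in range(L)) % M
    return r1, r2, r1 + M * r2


def roll(r1, r2, out_byte, in_byte, L):
    """Advance the signature from offset k to k+1.

    out_byte is a_k (leaving the window), in_byte is a_{k+L} (entering it).
    """
    r1 = (r1 - out_byte + in_byte) % M
    r2 = (r2 - L * out_byte + r1) % M  # uses the already updated r1
    return r1, r2, r1 + M * r2


# Usage: slide a 16 byte window over some data, one byte at a time.
data = bytes(range(50)) * 3
L = 16
r1, r2, r = fast_signature(data, 0, L)
for k in range(len(data) - L):
    r1, r2, r = roll(r1, r2, data[k], data[k + L], L)
```

Because Python's `%` always returns a non-negative result for a positive modulus, the subtractions need no special handling; in a language with fixed-width unsigned integers the same effect would come from wrap-around and masking.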

3.2.6 The signature search algorithm

Once A has received the list of signatures from B, it must search $a_i$ for any blocks at any offset that match the signatures of some block from B. The basic strategy is to compute the fast signature for each block starting at each byte of $a_i$ in turn, and for each signature, search the list for a match. The search algorithm is very important to the efficiency of the rsync algorithm and also affects the scalability of the algorithm for large file sizes.

The basic search strategy used by my implementation is shown in Figure 3.1. The first step in the algorithm is to sort the received signatures by a 16 bit hash of the fast signature [11]. A 16 bit index table is then formed which takes a 16 bit hash value and gives an index into the sorted signature table which points to the first entry in the table which has a matching hash.

[Figure 3.1: The signature search algorithm (a 16 bit index pointing into a table of fast signatures and strong signatures)]

Once the sorted signature table and the index table have been formed the signature search process can begin. For each byte offset in $a_i$ the fast signature is computed, along with the 16 bit hash of the fast signature.

[9] This form of the fast signature for rsync was first suggested by Paul Mackerras. It is based on ideas from the Adler checksum as used in zlib [Gailly and Adler 1998]. It also bears some similarity to the hash used in the Karp-Rabin string matching algorithm [Karp and Rabin 1987].

[10] Some alternative fast signature algorithms are discussed in the next chapter.
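The sorting and indexing step can be sketched as follows. This is a minimal illustration under stated assumptions, not the author's code: `hash16` and `build_tables` are hypothetical names, the 16 bit hash is taken to be the 16 bit sum of the two 16 bit halves of the fast signature as the footnote on the hash describes, and the use of -1 to mark an unused hash value is an assumption of this sketch.

```python
def hash16(fast_sig):
    """16 bit sum of the two 16 bit halves of a 32 bit fast signature."""
    return ((fast_sig & 0xFFFF) + (fast_sig >> 16)) & 0xFFFF


def build_tables(signatures):
    """Build the sorted signature table and the 16 bit index table.

    signatures is a list of (fast_sig, strong_sig) pairs, where the list
    position is the block number in B's file.
    Returns (table, index): table holds (hash, fast_sig, strong_sig, block)
    entries sorted by hash, and index[h] is the position of the first entry
    whose hash is h, or -1 if no entry has that hash.
    """
    table = sorted(
        ((hash16(fast), fast, strong, block)
         for block, (fast, strong) in enumerate(signatures)),
        key=lambda entry: entry[0],
    )
    index = [-1] * (1 << 16)
    # Walk backwards so that the lowest position for each hash is kept.
    for pos in range(len(table) - 1, -1, -1):
        index[table[pos][0]] = pos
    return table, index


# Usage with three made-up signatures; blocks 0 and 2 share a 16 bit hash.
table, index = build_tables([
    (0x00010002, b"s0"),
    (0x00000001, b"s1"),
    (0x00020001, b"s2"),
])
```

Keeping entries with the same hash adjacent in the sorted table is what lets the later lookup be a single index read followed by a short linear scan.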
The 16 bit hash is then used to look up the signature index, giving the index in the signature table of the first fast signature with that hash. A linear search is then performed through the signature table, stopping when an entry is found with a 16 bit hash which doesn’t match. For each entry the current 32 bit fast signature is compared to the entry in the signature table, and if that matches then the full 128 bit strong signature is computed at the current byte offset and compared to the strong signature in the signature table. If the strong signature is found to match then A emits a token telling B that a match was found and which block in $b_i$ was matched [12]. The search then continues at the byte after the matching block. If no matching signature is found in the signature table then a single byte literal is emitted and the search continues at the next byte [13].

At first glance this search algorithm appears to be $O(n^2)$ in the file size $n$, because for a fixed block size the number of blocks with matching 16 bit hashes will rise linearly with the size of the file. This turns out not to be a problem because the optimal block size also rises with $n$. This is discussed further in Section 3.3 [14]; for now it is sufficient to know that the average number of blocks per 16 bit hash is around 1 for file sizes up to around $2^{32}$.

For file sizes much beyond that an alternative signature search strategy would be needed to prevent excessive cost in the signature matching. For example the signatures could be arranged in a tree giving $\log n$ average lookup cost at the expense of some extra memory and bookkeeping.

[11] The hash algorithm used is the 16 bit sum of the two 16 bit halves of the fast signature.

[12] These tokens are very amenable to simple run length encoding techniques. This is discussed further in the section dealing with stream compression of the rsync output.

[13] In fact, literal bytes are buffered and only emitted when a threshold is reached or a match is found. The literal bytes may also be compressed. This is discussed more in the next chapter.

[14] It rises as $\sqrt{n}$ or $n$ depending on the definition of “optimal”.
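The whole search loop described in this section can be put together in a small, self-contained sketch. It is a simplified illustration and makes several assumptions that are not in the text: all function names are hypothetical, BLAKE2b truncated to 128 bits stands in for the strong signature (the section does not say which algorithm is used), a trailing partial block of B's file is ignored, literals are emitted one byte at a time without buffering, and -1 marks an unused hash value in the index.

```python
import hashlib

M = 1 << 16  # modulus 2^16 from the text


def fast_sig(block):
    """Return the two 16 bit halves (r1, r2) of the fast signature of block."""
    L = len(block)
    r1 = sum(block) % M
    r2 = sum((L - i) * byte for i, byte in enumerate(block)) % M
    return r1, r2


def hash16(r1, r2):
    """16 bit sum of the two 16 bit halves of the fast signature."""
    return (r1 + r2) % M


def strong_sig(block):
    """128 bit strong signature; BLAKE2b is only a stand-in (assumption)."""
    return hashlib.blake2b(block, digest_size=16).digest()


def block_signatures(b, L):
    """Signatures B would send for each full block of b (a short tail is ignored)."""
    return [fast_sig(b[i:i + L]) + (strong_sig(b[i:i + L]),)
            for i in range(0, len(b) - L + 1, L)]


def search(a, sigs, L):
    """Return ("match", block_number) and ("literal", byte) tokens for a."""
    # Sort by 16 bit hash and record the first table position of each hash.
    table = sorted(((hash16(r1, r2), r1, r2, strong, n)
                    for n, (r1, r2, strong) in enumerate(sigs)),
                   key=lambda entry: entry[0])
    index = [-1] * M
    for pos in range(len(table) - 1, -1, -1):
        index[table[pos][0]] = pos

    tokens = []
    k = 0
    sig = None  # (r1, r2) at offset k, or None if it must be recomputed
    while k + L <= len(a):
        if sig is None:
            sig = fast_sig(a[k:k + L])
        r1, r2 = sig
        h = hash16(r1, r2)
        matched = None
        strong_here = None  # computed lazily, at most once per offset
        pos = index[h]
        while 0 <= pos < len(table) and table[pos][0] == h:
            _, t1, t2, strong, n = table[pos]
            if (t1, t2) == (r1, r2):
                if strong_here is None:
                    strong_here = strong_sig(a[k:k + L])
                if strong_here == strong:
                    matched = n
                    break
            pos += 1
        if matched is not None:
            tokens.append(("match", matched))
            k += L  # continue at the byte after the matching block
            sig = None
        else:
            tokens.append(("literal", a[k]))
            if k + L < len(a):  # roll the signature forward by one byte
                r1 = (r1 - a[k] + a[k + L]) % M
                r2 = (r2 - L * a[k] + r1) % M
                sig = (r1, r2)
            k += 1
    tokens.extend(("literal", byte) for byte in a[k:])  # tail shorter than a block
    return tokens


# Usage: B holds three 4 byte blocks; A's file contains two of them, shifted.
L = 4
b = b"abcdefghijkl"
a = b"XXefghYabcdZ"
tokens = search(a, block_signatures(b, L), L)
```

With these inputs the token list contains two matches (blocks 1 and 0 of B's file) separated by single byte literals, and B can rebuild A's file from the tokens and its own blocks. The strong signature is only computed when a fast signature matches, which is what keeps the per-byte cost of the scan low.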

3.2.7 Reconstructing the file