The xdelta algorithm efficient algorithms for sorting and synchronization 1999

Chapter 5 Further applications for rsync

This chapter describes some alternative uses of the ideas behind rsync. Since starting work on rsync and distributing a free implementation the number of uses for the algorithm has grown enormously, well beyond the initial idea of efficient remote file update. From new types of compression systems to differencing algorithms to incremental backup systems the basic rsync algorithm has proved to be quite versatile. This chapter describes some of the more interesting uses for the rsync algorithm.

5.1 The xdelta algorithm

The xdelta algorithm is a binary differencing algorithm developed by Josh MacDonald [MacDonald 1998]. It is based on the rsync algorithm but has been extensively modified to optimize for high speed and small emitted differences. It takes advantage of the fact that both files being differenced are available locally, which allows the algorithm to ignore the cost of sending the signatures.

Binary differencing algorithms are extremely useful for transmitting differences over non-interactive links or broadcasting differences to many sites. A company which wishes to provide a more up-to-date binary for its large software product, for example, might distribute a binary patch which can be applied by users without requiring them to download the whole product a second time.

Using rsync as a binary differencing algorithm is quite simple. The fast and slow signatures are calculated as usual and the hash tables are formed in the same way that B does in the rsync algorithm, although the block size is set much smaller than would normally be set for rsync. The file being differenced is then searched for block matches in the same way that A searches for matches in rsync, with the literal data and token matches stored in the “difference” file.

Reconstructing the new file from the difference file and the old file works in the same way as reconstructing the new file on B in the rsync algorithm, although it is wise to add additional checks in the form of file signatures to ensure that the file that the difference is being applied to is the same as the original file. Unlike the text-based “diff” algorithm, xdelta cannot apply changes to files other than the original file.

name        diff size   diff time (seconds)   xdelta size   xdelta time (seconds)
Linux            4566                  21.8          1148                    10.5
Samba        15532295                   130        883824                    10.6
Netscape           na                    na       7066917                    25.4

Table 5.1: xdelta performance compared to GNU diff
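The differencing and reconstruction steps described in this section can be sketched in a few lines of Python. This is only an illustration of using the rsync idea as a local binary differencer, not the xdelta implementation itself. The function names (`make_delta`, `apply_delta`), the in-memory layout of the difference, the default block size of 16 bytes, and the use of MD5 as the slow signature are all assumptions made for the example.

```python
import hashlib

MOD = 1 << 16  # modulus for the two halves of the fast (rolling) signature


def weak_sum(block):
    """Fast signature of a block, returned as the pair (a, b)."""
    a = sum(block) % MOD
    b = sum((len(block) - i) * x for i, x in enumerate(block)) % MOD
    return a, b


def roll(a, b, out_byte, in_byte, n):
    """Slide a window of length n forward by one byte."""
    a = (a - out_byte + in_byte) % MOD
    b = (b - n * out_byte + a) % MOD
    return a, b


def make_delta(old, new, block_size=16):
    """Describe `new` as block tokens into `old` plus literal data."""
    # Hash table over the non-overlapping blocks of the old file.
    table = {}
    for start in range(0, len(old) - block_size + 1, block_size):
        block = old[start:start + block_size]
        entry = (hashlib.md5(block).digest(), start // block_size)
        table.setdefault(weak_sum(block), []).append(entry)

    ops, literal, pos, sums = [], bytearray(), 0, None
    while pos + block_size <= len(new):
        window = new[pos:pos + block_size]
        if sums is None:
            sums = weak_sum(window)
        match = None
        if sums in table:
            # Confirm a fast-signature hit with the slow signature.
            strong = hashlib.md5(window).digest()
            match = next((i for s, i in table[sums] if s == strong), None)
        if match is not None:
            if literal:
                ops.append(("literal", bytes(literal)))
                literal = bytearray()
            ops.append(("copy", match))
            pos += block_size
            sums = None
        else:
            literal.append(new[pos])
            if pos + block_size < len(new):
                sums = roll(sums[0], sums[1], new[pos],
                            new[pos + block_size], block_size)
            pos += 1
    literal.extend(new[pos:])  # tail shorter than one block
    if literal:
        ops.append(("literal", bytes(literal)))
    # The whole-file signature lets apply_delta reject the wrong old file.
    return {"old_md5": hashlib.md5(old).digest(),
            "block_size": block_size, "ops": ops}


def apply_delta(old, delta):
    """Rebuild the new file from the old file and the difference."""
    if hashlib.md5(old).digest() != delta["old_md5"]:
        raise ValueError("difference was not generated against this file")
    n = delta["block_size"]
    out = bytearray()
    for kind, value in delta["ops"]:
        out += old[value * n:(value + 1) * n] if kind == "copy" else value
    return bytes(out)


old = b"The quick brown fox jumps over the lazy dog. " * 4
new = old[:50] + b"INSERTED" + old[50:]
delta = make_delta(old, new, block_size=8)
assert apply_delta(old, delta) == new
```

Because both files are local, nothing in this sketch pays for transmitting signatures, which is why a very small block size is affordable here. The whole-file check in `apply_delta` corresponds to the additional file signature recommended above.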

As with rsync it is possible to further improve the algorithm by using stream compression on the generated difference file, possibly using the enhanced compression algorithm described in Section 4.3. The block match tokens in the difference file can also be run-length encoded in the same way that was described in Section 4.2.1.

Table 5.1 shows the performance of xdelta on the three test data sets from Chapter 3. (Version 1.0 of xdelta as distributed by Josh MacDonald was used for these tests.) The results show a vast improvement in both the execution time and the resulting difference size between GNU diff and xdelta. The small size of the differences from xdelta is partly a result of xdelta being able to deal efficiently with the differences in the binary headers in the Linux and Samba tar files.

It is also interesting to note that xdelta produced a considerably smaller difference file than the number of bytes of literal data sent by the rsync algorithm in the results from Table 3.2, despite the fact that both xdelta and rsync use what is essentially the same algorithm. The difference is primarily a result of the very small block size used by xdelta, allowing for much finer grained matches. Using such a small block size when applying rsync over a network link would result in an excessive amount of signature data being sent from B to A.
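The cost of a small block size over a network link can be illustrated with some simple arithmetic. The sketch below assumes that each block contributes a 4-byte fast signature and a 16-byte slow signature to the data sent from B to A. The function name `signature_bytes`, the 10 MiB file size, and the block sizes tried are hypothetical values chosen for illustration and are not taken from the test data.

```python
import math


def signature_bytes(file_size, block_size, fast_bytes=4, strong_bytes=16):
    """Signature data for one file, under the assumed per-block sizes."""
    blocks = math.ceil(file_size / block_size)
    return blocks * (fast_bytes + strong_bytes)


file_size = 10 * 1024 * 1024  # hypothetical 10 MiB file
for block_size in (16, 64, 700, 4096):
    sig = signature_bytes(file_size, block_size)
    print(f"block size {block_size:5d}: {sig:9d} signature bytes "
          f"({sig / file_size:.1%} of the file)")
```

Under these assumptions a 16-byte block size produces more signature data than the file itself contains, which defeats the purpose of a remote update but costs nothing when both files are local.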