§3.5 Practical performance 63
of particular interest from the point of view of rsync because the source tree was substantially rearranged between the two releases. This means there are
a large number of changes, but many of them are data movements rather than insertions or deletions.[22]
• The binary distribution of Netscape Communicator professional version 4.06 to
version 4.07, as distributed by Netscape in the original compressed .exe format.
3.5.1 Choosing the format
A problem with the first two data sets was choosing an appropriate format in which to process the files. These files are normally distributed as compressed tar files, but both the
compression and the tar format lead to problems in making reasonable comparisons between algorithms.
The compression algorithm, usually gzip, has the property that a small change in the uncompressed data may cause a change in all of the compressed file beyond that
point. This can be dealt with using the techniques discussed in the next chapter, but doing so would unnecessarily complicate the discussion of the initial rsync algorithm and would
render comparison with text-based differencing algorithms impossible. I therefore decided to use these archives in uncompressed form.
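The sensitivity of compressed output to small input changes can be illustrated with a minimal sketch. It assumes Python's standard zlib module, which uses the same DEFLATE method as gzip but a different wrapper; the helper names make_text and compare_compressed and the synthetic test data are hypothetical, not part of the experiments described here. How far the change propagates depends on the data, so the sketch only measures it and makes no claim about a particular value.

```python
import random
import zlib
from typing import Tuple


def make_text(n_words: int, seed: int = 0) -> bytes:
    """Reproducible, compressible pseudo-text (a hypothetical stand-in for source code)."""
    rng = random.Random(seed)
    vocab = [b"int", b"char", b"return", b"static", b"void",
             b"if", b"else", b"for", b"while", b"struct"]
    return b" ".join(rng.choice(vocab) for _ in range(n_words))


def compare_compressed(original: bytes, modified: bytes) -> Tuple[int, float]:
    """Return (offset of the first differing compressed byte,
    fraction of compressed bytes from that offset onwards that differ)."""
    a, b = zlib.compress(original), zlib.compress(modified)
    n = min(len(a), len(b))
    first = next((i for i in range(n) if a[i] != b[i]), n)
    if first == n:
        return first, 0.0
    differing = sum(1 for i in range(first, n) if a[i] != b[i])
    return first, differing / (n - first)


if __name__ == "__main__":
    original = make_text(50_000)
    # A one-byte insertion near the start of the uncompressed data.
    modified = original[:100] + b"X" + original[100:]
    first, fraction = compare_compressed(original, modified)
    print(f"first differing compressed byte: {first}")
    print(f"fraction of later compressed bytes that differ: {fraction:.2f}")
```

A block-matching algorithm applied directly to the two compressed streams would find few matches wherever that fraction is high, which is the difficulty the uncompressed form avoids.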
The second problem is the tar format. The tar files as distributed for Linux, and to a lesser extent for Samba, have varying dates for a large number of files that have
the same content in both versions. In the Linux case these header changes dominate the differences between the two versions. The obvious and most commonly used
solution is to run the tools with an option to recurse into directories, so that only files containing differences are actually processed. While both my rsync implementation
and the GNU diff tools have this option, it presents some significant problems in analyzing the results. In particular, the appropriate block size for rsync is quite
different for each of the files in the archive, which makes a simple analysis of the results difficult.
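The header problem can be reproduced on a small scale with the following sketch, which assumes Python's standard tarfile module. The file names, contents and timestamps are invented for illustration, and the helpers build_tar and concatenate_members are hypothetical. The order in which members are concatenated (here, archive order) is also an assumption, since the text does not specify one.

```python
import io
import tarfile
from typing import Dict


def build_tar(files: Dict[str, bytes], mtime: int) -> bytes:
    """Build an uncompressed tar archive in memory, stamping every member with `mtime`."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name in sorted(files):
            info = tarfile.TarInfo(name=name)
            info.size = len(files[name])
            info.mtime = mtime
            tar.addfile(info, io.BytesIO(files[name]))
    return buf.getvalue()


def concatenate_members(archive: bytes) -> bytes:
    """Concatenate the contents of all regular files, discarding the tar headers."""
    parts = []
    with tarfile.open(fileobj=io.BytesIO(archive), mode="r") as tar:
        for member in tar:
            if member.isfile():
                parts.append(tar.extractfile(member).read())
    return b"".join(parts)


if __name__ == "__main__":
    files = {"a.c": b"int main(void) { return 0; }\n", "b.h": b"#define X 1\n"}
    old = build_tar(files, mtime=900_000_000)
    new = build_tar(files, mtime=910_000_000)
    print("archives identical:", old == new)
    print("concatenated contents identical:",
          concatenate_members(old) == concatenate_members(new))
```

The two archives differ only in their headers, while a header-free view of the same data shows no difference at all.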
To overcome this problem I decided to transform the files into a format that is more amenable to analysis. This was done by concatenating all files in the Linux and Samba
archives into single files. This produces a file containing all of the data changes of the original but without any of the complications of the tar format.[23]
[22] The commonly used “diff” algorithm cannot encode data movements, whereas rsync can.
name        size A      size B      diff
Linux       22577611    22578079    4566
Samba       9018170     5849994     15532295
Netscape    17742163    17738110    na
Table 3.1: Sizes of the test data sets
Table 3.1 shows the size of the two versions of each of the three sets of data. The
column marked “diff” is the size of a GNU diff file taken between the two versions. As diff cannot operate correctly on binary files, no diff size is shown for the Netscape
data set. One interesting thing to note is the enormous size of the diff on the Samba data
set, which is larger than the combined size of the two versions of the file. The size reflects the fact that diff encodes only insertions and deletions, so the source code rearrangement made
between these two versions cannot be efficiently coded by diff.
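The cost of a data movement to an insert/delete encoding can be seen in a small sketch. It assumes Python's standard difflib module as a stand-in for GNU diff (the two tools produce similar but not identical output), and the helper names move_block and diff_lines and the synthetic input are hypothetical. A block of lines moved from the start of a file to its end appears in the diff twice, once as a deletion and once as an insertion.

```python
import difflib
from typing import List


def move_block(lines: List[str], start: int, length: int) -> List[str]:
    """Return a copy of `lines` with lines[start:start + length] moved to the end."""
    block = lines[start:start + length]
    return lines[:start] + lines[start + length:] + block


def diff_lines(old: List[str], new: List[str]) -> List[str]:
    """Unified diff between two lists of newline-terminated lines."""
    return list(difflib.unified_diff(old, new, fromfile="A", tofile="B"))


if __name__ == "__main__":
    old = [f"line {i:04d} of the source file\n" for i in range(100)]
    new = move_block(old, start=0, length=20)
    diff = diff_lines(old, new)
    removed = sum(1 for d in diff if d.startswith("-") and not d.startswith("---"))
    added = sum(1 for d in diff if d.startswith("+") and not d.startswith("+++"))
    moved_bytes = sum(len(line) for line in old[:20])
    diff_bytes = sum(len(d) for d in diff)
    print(f"lines removed: {removed}, lines added: {added}")
    print(f"moved block: {moved_bytes} bytes, diff: {diff_bytes} bytes")
```

No data was created or destroyed in this example, yet the diff carries more than twice the size of the moved block. When much of a source tree is rearranged, the same effect can push the diff towards, or beyond, the size of the files themselves.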
3.5.2 Speedup