
§3.5 Practical performance

of particular interest from the point of view of rsync because the source tree was substantially rearranged between the two releases. This means there are a large number of changes, but many of them are data movements rather than insertions or deletions [22].

• The binary distribution of Netscape Communicator professional version 4.06 to version 4.07, as distributed by Netscape in the original compressed .exe format.

3.5.1 Choosing the format

A problem with the first two data sets was choosing the appropriate format in which to process them. These files are normally distributed as compressed tar files, but both the compression and the tar format lead to problems in making reasonable comparisons between algorithms.

The compression algorithm (usually gzip) has the property that a small change in the uncompressed data may cause a change in all of the compressed file beyond that point. This can be dealt with using the techniques discussed in the next chapter, but it unnecessarily complicates the discussion of the initial rsync algorithm and would render comparison with text-based differencing algorithms impossible. I therefore decided to use these archives in uncompressed form.

The second problem is the tar format. The tar files as distributed for Linux, and to a lesser extent for Samba, have varying dates for a large number of files that have the same content in both versions. In the Linux case these header changes dominate the differences between the two versions.

The obvious and most commonly used solution is to run the tools with a recursion-into-directories option so that only files containing differences are actually processed. While both my rsync implementation and the GNU diff tools have this option, it presents some significant problems in analyzing the results. In particular, the appropriate block size for rsync is quite different for each of the files in the archive, which makes a simple analysis of the results difficult.

To overcome this problem I decided to transform the files into a format that is more amenable to analysis. This was done by concatenating all files in the Linux and Samba archives into single files. This produces a file containing all of the data changes of the original but without any of the complications of the tar format [23].

name       size A     size B     diff
Linux      22577611   22578079   4566
Samba      9018170    5849994    15532295
Netscape   17742163   17738110   n/a

Table 3.1: Sizes of the test data sets

Table 3.1 shows the size of the two versions of each of the three sets of data. The column marked “diff” is the size of a GNU diff file taken between the two versions. As diff cannot operate correctly on binary files, no diff size is shown for the Netscape data set. One interesting thing to note is the enormous size of the diff on the Samba data set, which is larger than the sum of the two versions of the file. The size reflects the fact that diff encodes only insertions and deletions, so the source code rearrangement made between these two versions cannot be efficiently coded by diff.

[22] The commonly used “diff” algorithm cannot encode data movements, whereas rsync can.
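
The cost of a data movement under an insert/delete encoding can be illustrated with a small sketch. It uses Python's difflib as a stand-in for GNU diff (an assumption: the two tools use different matching algorithms, but both express changes only as insertions and deletions). The function name edit_cost and the sample lines are hypothetical.

```python
import difflib


def edit_cost(old, new):
    """Return (deleted, inserted) lines of an insert/delete edit script.

    difflib is only a stand-in for GNU diff here; like diff, it has no
    way to express "move this block", only deletions and insertions.
    """
    matcher = difflib.SequenceMatcher(a=old, b=new, autojunk=False)
    deleted, inserted = [], []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag in ("delete", "replace"):
            deleted.extend(old[i1:i2])
        if tag in ("insert", "replace"):
            inserted.extend(new[j1:j2])
    return deleted, inserted


# Hypothetical source lines: two blocks whose order is swapped.
block_a = [f"int a{i};" for i in range(6)]
block_b = [f"int b{i};" for i in range(3)]
old = block_a + block_b
new = block_b + block_a  # block_b has moved to the front, nothing else changed

deleted, inserted = edit_cost(old, new)
print("deleted: ", deleted)
print("inserted:", inserted)
```

No line was created or destroyed, yet the moved block is listed once as a deletion and once as an insertion. A real diff file also carries its own markup for each hunk, so when much of a source tree is rearranged the diff can grow beyond the size of the inputs, as seen for Samba in Table 3.1.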

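The tar header problem and the concatenation transform of Section 3.5.1 can be sketched as follows. This is a minimal illustration, not the procedure actually used to prepare the test data: the file names, contents, and timestamps are hypothetical, and concatenating members in archive order is an assumption, since the text does not state the order.

```python
import io
import tarfile


def make_tar(files, mtime):
    """Build an uncompressed ustar archive in memory; all members share mtime."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w", format=tarfile.USTAR_FORMAT) as tar:
        for name, data in files:
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            info.mtime = mtime
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def concat_members(tar_bytes):
    """Concatenate the contents of all regular files, in archive order (assumed)."""
    out = bytearray()
    with tarfile.open(fileobj=io.BytesIO(tar_bytes), mode="r") as tar:
        for member in tar:
            if member.isfile():
                out += tar.extractfile(member).read()
    return bytes(out)


def differing_offsets(a, b):
    """Byte offsets at which two equal-length byte strings differ."""
    return [i for i, (x, y) in enumerate(zip(a, b)) if x != y]


# Hypothetical tree: identical content in both "releases", different dates.
files = [
    ("src/a.c", b"int main(void) { return 0; }\n"),
    ("src/b.h", b"#define B 1\n"),
]
tar_v1 = make_tar(files, mtime=900000000)
tar_v2 = make_tar(files, mtime=910000000)

diffs = differing_offsets(tar_v1, tar_v2)
print("tar archives differ at", len(diffs), "byte offsets")
print("concatenated data identical:",
      concat_members(tar_v1) == concat_members(tar_v2))
```

In this sketch every differing byte falls in the mtime or checksum field of a 512-byte member header, while the concatenated file contents are byte-for-byte identical. That is the effect the transform is meant to achieve: header-only changes disappear and only real data changes remain.
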
3.5.2 Speedup