

4.4 Data transformations

One of the weaknesses of the rsync algorithm is that there are some cases where seemingly small modifications do not result in common blocks between the old and new files which can be matched by the algorithm. The two situations where this is particularly common are compressed files and files where systematic global editing has been done. In both cases the rsync algorithm can be modified to include a data transformation stage to allow these files to be updated efficiently.

4.4.1 Compressed files

Imagine a situation where a server stores text files such as source code or documents in a compressed form, using a common compression algorithm such as gzip, bzip or zip. A property of these compression algorithms is that any change in one part of the uncompressed file results in changes throughout the rest of the compressed file. This means that if a change is made towards the start of the uncompressed file and the new file is stored in a compressed format then an rsync update of the compressed file to a machine containing the old compressed file will end up sending almost all of the file.

A solution to this problem is to uncompress the files at both ends before the rsync algorithm is applied, and then re-compress the files after the transfer is complete. This simple data transformation allows the efficient update of compressed files.

This solution isn’t ideal, however. The problem is that many compression formats are not symmetric, so that when a file is uncompressed then re-compressed you do not necessarily end up with identical files. This is particularly the case when different versions of the compression utility are used. Even if the data formats are compatible there is no guarantee that the compressed data will be identical.

In some cases this isn’t a problem as it is the uncompressed content of the compressed file that is important, rather than the compressed file itself, but there are important situations where the exact bytes in the compressed file must be preserved.

One such case is where a site is distributing software[9] that has been digitally signed by the authors to ensure that no tampering is possible. It is common practice to distribute PGP or MD5 fingerprints of the compressed archives. Unless the exact contents of the compressed archives are preserved these fingerprints become useless.

[9] One of the most common uses for rsync is as an anonymous rsync server for distributing and mirroring software archive sites on the Internet.

Although it would be possible to distribute fingerprints of the uncompressed files rather than the compressed archive this may be inconvenient for users who wish to confirm the contents of downloaded archives before they are unpacked. This is especially true if a self-extracting compression format is used as any malicious content in the file could be executed during the uncompression process.
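
The two effects described in this section can be illustrated with a short Python sketch. It is not taken from rsync itself: the names `matched_fraction`, `fingerprint` and `BLOCK` are hypothetical, the test data is synthetic, a brute-force search stands in for rsync's rolling-checksum matching, zlib's deflate stands in for gzip, and SHA-256 stands in for the PGP or MD5 fingerprints mentioned above. The sketch first compares how many blocks of the new file can be found in the old file, before and after compression, and then shows that recompressing with different settings preserves the content but not the bytes.

```python
import hashlib
import zlib

BLOCK = 64  # illustrative block size, much smaller than rsync would normally use


def matched_fraction(old: bytes, new: bytes, block: int = BLOCK) -> float:
    """Fraction of full blocks of `new` found at any byte offset in `old`."""
    windows = {old[i:i + block] for i in range(len(old) - block + 1)}
    blocks = [new[i:i + block] for i in range(0, len(new) - block + 1, block)]
    return sum(b in windows for b in blocks) / len(blocks)


def fingerprint(data: bytes) -> str:
    """SHA-256 digest, standing in for a published MD5 or PGP fingerprint."""
    return hashlib.sha256(data).hexdigest()


# Synthetic "source file" and a small edit near its start (both are assumptions).
old_text = "".join(f"line {i}: value = {i * i % 977}\n" for i in range(2000)).encode()
new_text = old_text.replace(b"line 3:", b"LINE 3!", 1)

# What the server and the receiving machine would actually store.
old_z = zlib.compress(old_text, 6)
new_z = zlib.compress(new_text, 6)

# Block matching on the uncompressed data versus on the compressed data.
plain_frac = matched_fraction(old_text, new_text)
packed_frac = matched_fraction(old_z, new_z)
print(f"matched blocks, uncompressed: {plain_frac:.1%}")
print(f"matched blocks, compressed:   {packed_frac:.1%}")

# The data transformation, with the transfer step itself omitted:
# uncompress the new file, then recompress it on the receiving side.
# Level 9 models a receiver whose compression tool uses different settings.
recompressed = zlib.compress(zlib.decompress(new_z), 9)

print("content preserved:    ", zlib.decompress(recompressed) == new_text)
print("bytes preserved:      ", recompressed == new_z)
print("fingerprint preserved:", fingerprint(recompressed) == fingerprint(new_z))
```

The exact matched-block figures depend on the zlib build and on the data, so none are quoted here. The last three lines show the trade-off discussed above: the uncompressed content survives the round trip, but the recompressed bytes, and therefore any fingerprint computed over them, need not.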

4.4.2 Compression resync