
§4.4 Data transformations

tribute PGP or MD5 fingerprints of the compressed archives. Unless the exact contents of the compressed archives are preserved, these fingerprints become useless. Although it would be possible to distribute fingerprints of the uncompressed files rather than the compressed archive, this may be inconvenient for users who wish to confirm the contents of downloaded archives before they are unpacked. This is especially true if a self-extracting compression format is used, as any malicious content in the file could be executed during the uncompression process.

4.4.2 Compression resync

A different solution to using rsync with compressed files, which overcomes these problems, is to use file compression algorithms which do not propagate changes throughout the rest of the compressed file when changes are made. Unfortunately, most commonly used compression algorithms use adaptive coding techniques which result in a complete change in the compressed file after the first change in the uncompressed file.

It is, however, quite easy to modify almost any existing compression algorithm to limit the distance that changes propagate without greatly reducing the compression ratio. If the designers of compression algorithms were to include such a modification in future algorithms then this would significantly enhance the utility of the rsync algorithm.

The modification is quite simple [10]:

1. a fast rolling signature is computed for a small window around the current point in the uncompressed file;
2. stream compression progresses as usual;
3. when the rolling signature equals a pre-determined value the compression tables are reset and a token is emitted indicating the start of a new compression region.

[10] Although I have found no reference to this compression modification idea in the literature I would be surprised if something similar has not been proposed before.

This works because the compression will be “synchronized” as soon as the rolling signature matches the pre-determined value. Any changes in the uncompressed file will propagate an average of $2^b c$ bytes through the compressed file, where $b$ is the effective bit strength of the fast rolling signature algorithm and $c$ is the compression ratio. The value for $b$ can be chosen as a tradeoff between the propagation distance of changes in the compressed file and the cost in terms of reduced compression of resetting the internal tables in the compression algorithm.
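For a sense of scale, the propagation estimate can be evaluated for sample values. The numbers below are illustrative assumptions rather than measurements. Here $c$ is taken to mean compressed size divided by uncompressed size, and the $b$ signature bits are assumed to be uniformly distributed, so that a reset occurs on average once every $2^b$ uncompressed bytes:

```latex
\[
  d = 2^{b} c, \qquad
  b = 12,\ c = 0.5 \;\Rightarrow\; d = 4096 \times 0.5 = 2048 \text{ bytes}, \qquad
  b = 13,\ c = 0.5 \;\Rightarrow\; d = 8192 \times 0.5 = 4096 \text{ bytes}.
\]
```

Under these assumptions each extra bit of $b$ doubles both the expected propagation distance and the spacing between table resets.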
For compression algorithms designed to be used for the distribution of files which are many megabytes in size, a propagation distance of a few kilobytes would be appropriate [11]. In that case the first few bits of the fast signature algorithm used in rsync could be used to provide a weak fast signature algorithm.
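The three steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation the text has in mind. The rolling checksum is modelled on the two-part fast signature used by rsync, and only the low bits of its second component are tested. Resetting the compression tables is approximated with zlib's `Z_FULL_FLUSH`, whose flush marker stands in for the token. The names `resync_compress`, `WINDOW`, `BITS` and `MAGIC`, and their values, are hypothetical.

```python
import zlib

WINDOW = 32   # bytes covered by the rolling signature (assumed value)
BITS = 10     # b: number of signature bits that are tested (assumed value)
MAGIC = 0     # pre-determined value that triggers a reset (assumed value)


def resync_compress(data: bytes, window: int = WINDOW,
                    bits: int = BITS, magic: int = MAGIC) -> bytes:
    """Raw-deflate `data`, resetting the compressor at content-defined points."""
    mask = (1 << bits) - 1
    comp = zlib.compressobj(wbits=-15)   # raw deflate: no header or checksum
    out = []
    a = b = 0     # rolling sums over the last `window` bytes, kept mod 2**16
    start = 0     # first byte not yet handed to the compressor
    for i, x_in in enumerate(data):
        # Step 1: update the rolling signature for the window ending at i.
        if i >= window:
            x_out = data[i - window]
            a = (a - x_out + x_in) & 0xFFFF
            b = (b - window * x_out + a) & 0xFFFF
        else:
            a = (a + x_in) & 0xFFFF
            b = (b + a) & 0xFFFF
        # Step 3: on a signature hit, compress the pending region, then do a
        # full flush, which discards the compressor's history.
        if i + 1 >= window and (b & mask) == magic:
            out.append(comp.compress(data[start:i + 1]))   # step 2
            out.append(comp.flush(zlib.Z_FULL_FLUSH))
            start = i + 1
    out.append(comp.compress(data[start:]))
    out.append(comp.flush())
    return b"".join(out)
```

Decompression needs no changes in this sketch: `zlib.decompress(blob, -15)` recovers the input, since the flush markers are ordinary deflate blocks. Because the reset points depend only on the bytes inside the window, two inputs that differ near the start should produce compressed streams that agree again after the first shared reset point.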

4.4.3 Text transformation