
name       file size (bytes)   gzip    bzip2    rzip
Emacs      47462400            3.66     4.62    5.00
Linux      47595520            4.24     5.22    5.54
Samba      41584640            3.50     4.78    8.93
archive    27069647            3.64     4.97    5.48
spamfile   84217482            8.43    14.23   26.11

Table 5.2: Compression ratios for gzip, bzip2 and rzip

5.3 Incremental backup systems

One of the first applications for rsync that sprang to mind after it was initially developed was as part of an incremental backup system. Although this has not been implemented yet, it is interesting to examine how such a system would work.

Imagine that you need to back up a number of large files to tape, and that these files change regularly but each change only affects a small part of each file. An example of an application that might generate such files is a large database [9].

A typical backup system would either back up all files or back up any files that had changed in any way since the last backup. If the time taken to do backups is important, as it often is, then this is an inefficient way of doing backups. The solution is to back up only the changes in the files rather than the whole file every time. The problem with doing this is that normal differencing algorithms need access to the old file in order to compute the differences. This is where rsync comes in.

To back up incrementally with rsync, the backup application needs to store the fast and strong signatures used in the rsync algorithm along with the file. When the time comes to perform an incremental backup these signatures are read and the rsync matching algorithm is applied to the new file, generating a stream of literal blocks and block match tokens. This delta stream is then stored to tape as the incremental backup.

Note that the backup system does not need to read the old file from tape in order to compute the differences; the stored signatures are sufficient.

[9] I should note that rsync seems to work particularly well with large database files. It is used by several large companies, including Nokia and Ford, to synchronize multi-gigabyte engineering databases over slow network links, and I use it myself for a number of smaller databases. I didn't include a database file in the example data sets because I couldn't find one that didn't contain any proprietary data; a proprietary data set would have prevented other researchers from reproducing the results.
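To make the signature-and-delta flow concrete, the following is a minimal sketch in Python, assuming a fixed block size and MD5 as the strong checksum; it is not the thesis implementation or rsync's actual code, and all names in it are illustrative. The first pass computes per-block fast and strong signatures to be stored alongside the full backup; the second pass matches a new version of the file against those signatures and emits the delta stream of literal runs and block-match tokens that would be written to tape as the incremental backup.

    # Sketch of an rsync-style incremental backup (illustrative only).
    import hashlib

    BLOCK = 4096  # assumed block size

    def weak_sum(chunk: bytes) -> int:
        """Adler-style fast signature of one block."""
        a = sum(chunk) & 0xFFFF
        b = sum((len(chunk) - i) * c for i, c in enumerate(chunk)) & 0xFFFF
        return (b << 16) | a

    def make_signatures(data: bytes, block: int = BLOCK):
        """(fast, strong) signature per block, stored with the full backup."""
        return [(weak_sum(data[off:off + block]),
                 hashlib.md5(data[off:off + block]).digest())
                for off in range(0, len(data), block)]

    def make_delta(new_data: bytes, sigs, block: int = BLOCK):
        """Match the new file against stored signatures and build the delta.

        Emits ("match", block_index) tokens for blocks already present in
        the old file and ("literal", bytes) runs for everything else.
        """
        by_weak = {}
        for i, (w, s) in enumerate(sigs):
            by_weak.setdefault(w, []).append((i, s))

        out, literal, pos = [], bytearray(), 0
        while pos < len(new_data):
            chunk = new_data[pos:pos + block]
            candidates = by_weak.get(weak_sum(chunk), ())
            strong = hashlib.md5(chunk).digest() if candidates else None
            match = next((i for i, s in candidates if s == strong), None)
            if match is not None:
                if literal:
                    out.append(("literal", bytes(literal)))
                    literal = bytearray()
                out.append(("match", match))   # token referring to old block
                pos += len(chunk)
            else:
                literal.append(new_data[pos])  # no match: copy one byte
                pos += 1
        if literal:
            out.append(("literal", bytes(literal)))
        return out

    if __name__ == "__main__":
        old = b"a" * 10000 + b"unchanged tail " * 100
        new = old[:5000] + b"small edit" + old[5000:]
        sigs = make_signatures(old)       # stored alongside the full backup
        delta = make_delta(new, sigs)     # stored to tape as the incremental
        print(sum(len(d) for k, d in delta if k == "literal"), "literal bytes")

For clarity this sketch recomputes the fast checksum at every byte offset; the real algorithm rolls it forward incrementally from the previous offset, which is what makes scanning the new file cheap.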