Friday, April 17, 2009

Episode #24: Copying and Synchronizing (Remote) Directories

Hal Says:

Copying files and directories around is a pretty common task, and there are all sorts of ways to accomplish this in Unix. But I have a strong preference for using the rsync command because (a) it allows me to copy files either within a single machine or between systems over the network, (b) it can keep directories in sync, preserve timestamps, ownerships, etc., and (c) it's smart enough to only copy files that have changed, which is a big savings on time and bandwidth.

Basic rsync usage is straightforward:

$ rsync -aH dir1/ dir2/                             # copy dir1 to dir2 on same machine
$ rsync -aH remotehost:/path/to/dir/ localdir/      # copy remote dir to local machine
$ rsync -aH dir/ remotehost:/path/to/dir/           # copy local dir to remote machine
$ rsync -aH dir/ someuser@remotehost:/path/to/dir/  # copy dir to remote host as someuser

The "-a" option combines the most common rsync options for copying and preserving directories: "-r" for recursive, "-ptog" to preserve permissions/timestamps/owner/group owner, "-l" to copy symlinks as symlinks, and "-D" to preserve device and special files. The one option I routinely add that "-a" does not include is "-H", which preserves hard links. Even though "-H" adds extra processing time, I almost always use it, just to be on the safe side.

By the way, the trailing "/" characters on the source directory paths are significant. I could explain it in words, but it's easier to just show you an example:

$ ls source
bar.c baz.c foo.c
$ rsync -aH source dir1
$ ls dir1
source
$ rsync -aH source/ dir2
$ ls dir2
bar.c baz.c foo.c

Without the trailing "/", rsync copies the entire directory, including the top directory itself. With the trailing "/", the contents of the source directory are put into the top of the destination directory.

If you want to keep two directories in sync, you probably want to include the "--delete" option, which deletes files in the target directory that aren't present in the source:

$ rsync -aH source/ target
$ rm source/foo.c
$ rsync -aH source/ target
$ ls target
bar.c baz.c foo.c
$ rsync -aH --delete source/ target
$ ls target
bar.c baz.c

Sometimes there are certain files that you don't want to copy:

$ ls source
bar.c bar.o baz.c baz.o foo.c foo.o
$ rsync -aH --delete --exclude=\*.o source/ target
$ ls target
bar.c baz.c foo.c

Here I'm using "--exclude" to not copy the object files, just the source files. You can put multiple "--exclude" options on a single command line, but after a while it gets annoying to keep typing the same set of excludes over and over again. So there's also an "--exclude-from=<file>" option to read in a list of exclude patterns from the file "<file>".
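
As a rough sketch of "--exclude-from", the following builds a scratch source directory and an exclude file, then syncs. The file name excludes.txt and the sample file names are just assumptions for the example; the file holds one pattern per line:

```shell
# Work in a throwaway directory so nothing real is touched.
cd "$(mktemp -d)"
mkdir source
touch source/foo.c source/foo.o source/bar.c source/bar.o "source/notes.txt~"

# excludes.txt is a hypothetical name; put one exclude pattern on each line.
cat > excludes.txt <<'EOF'
*.o
*~
EOF

# Same effect as passing --exclude='*.o' --exclude='*~' on the command line.
rsync -aH --delete --exclude-from=excludes.txt source/ target
ls target
```

Only the .c files should land in target; the object files and the editor backup file are skipped.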

rsync also has a "-n" option like the "make" command, which shows you what would happen without actually copying any files. You usually want to combine "-n" with "-v" (verbose) so you can actually see what rsync would be doing:

$ rsync -aHnv --delete --exclude=\*.o source/ newdir
sending incremental file list
created directory newdir
./
bar.c
baz.c
foo.c

sent 90 bytes received 24 bytes 228.00 bytes/sec
total size is 0 speedup is 0.00 (DRY RUN)
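
A dry run is especially handy before turning "--delete" loose on a directory you care about. Here is a minimal sketch, again in a scratch directory with made-up file names, that previews which files a "--delete" run would remove:

```shell
# Work in a throwaway directory so nothing real is touched.
cd "$(mktemp -d)"
mkdir source
touch source/bar.c source/baz.c source/foo.c
rsync -aH source/ target      # initial copy
rm source/foo.c               # foo.c now exists only in target

# -n makes this a preview; -v lists the planned actions.
# grep keeps just the lines announcing deletions.
rsync -aHnv --delete source/ target | grep '^deleting'

# Nothing was actually removed, so foo.c is still in target.
ls target
```

Once the preview looks right, rerun the same command without "-n" to make the changes for real.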

Ed Joyfully Responds:

You thought you had me with your fancy shmancy rsync, didn't ya, Hal? Well... ha! A couple of years ago, you'd have been right, because there was no good built-in Windows tool for synchronizing directories. The xcopy command is a decent file copy tool, but it doesn't really police file updates to maintain synchronicity (and yes, that oblique reference to "The Police" was intentional). A couple of years ago, you'd have to install a separate tool for synchronizing. A really nice tool for doing that is robocopy, available in various resource kits, such as the Windows 2003 kit here.

But, here's the really good news: robocopy is built into Vista and Windows Server 2008! It's a really nice synchronization tool, and its name is far cooler than the rather pedestrian "rsync". "Robocopy" sounds like a cool mechanized buddy, or perhaps even a robotized superhero law enforcement officer.

So, how can we use robocopy to achieve what Hal does with rsync above? Well, to copy one local directory, subdirectories and all, to another local directory, you could run:

C:\> robocopy dir1 dir2 /s

The /s makes robocopy recurse into subdirectories (it skips empty ones; use /e if you want those too). You could put the /s up front, but, generally speaking, it's best to put the source and destination directories first with robocopy, especially when you start to define file and directory exclusions. Note that robocopy cannot copy hard links, so we lose them (i.e., there is no rsync -H equivalent). Note also that robocopy works the same way whether you specify dir1 or dir1/, unlike rsync. That's ok with me even though it is slightly less flexible, as there is less of a chance that I'll screw something up.

To use robocopy to replicate something to or from a remote directory, just refer to the directory as \\[machine]\[share]\[dir], as you might expect, as in:

C:\> robocopy plans_for_world_domination \\backup\rainbowsANDunicorns /s /z

Another nice feature of using robocopy across a network involves what happens when network connectivity is lost. Invoked with the /z option, robocopy runs in restartable mode: it maintains the status information it needs to pick up where it left off when a copy is interrupted.

If you want to keep directories in sync (removing files from the destination that have been deleted from the source), you can use the /MIR option (/MIR actually means /E plus /PURGE). As with rsync, robocopy will copy all files by default. To omit certain files, we have a huge number of exclusion options, such as /XF to exclude files and /XD to exclude directories, both of which support wildcards with *.

Thus, to mimic Hal's fu above, we could run:

C:\> robocopy source target /S /MIR /XF *.o

Oh, and to do a dry run, just printing out information about what would be copied, instead of actually doing the copies, we can invoke robocopy with the /L option, as in:

C:\> robocopy source target /L /S /MIR /XF *.o

If you'd like to copy all file attributes, ownership information, and file permissions, invoke robocopy with /COPYALL.

The output of robocopy is quite nice as well, showing detailed statistics about what was copied, what was skipped (because it was already there, or was excluded), detailed times, and a timestamp of invocation and completion:

C:\> robocopy source target /L /S /MIR /XF *.o

-------------------------------------------------------------------------------
ROBOCOPY :: Robust File Copy for Windows

-------------------------------------------------------------------------------

Started : Sun Apr 12 07:54:51 2009

Source : c:\test\source\
Dest : c:\test\target\

Files : *.*

Exc Files : *.o

Options : *.* /L /S /E /COPY:DAT /PURGE /MIR /R:1000000 /W:30

------------------------------------------------------------------------------

4 c:\test\source\
New File 4 bar.c
New File 4 baz.c
New File 4 foo.c
1 c:\test\source\hello\

------------------------------------------------------------------------------

               Total    Copied   Skipped  Mismatch    FAILED    Extras
    Dirs :         2         0         2         0         0         0
   Files :         5         3         2         0         0         0
   Bytes :        24        12        12         0         0         0
   Times :   0:00:00   0:00:00                       0:00:00   0:00:00

Ended : Sun Apr 12 07:54:51 2009

Yeah, robocopy!