Friday, June 5, 2009

Episode #45: Removing Empty Directories

Hal answers the mail:

Loyal reader Bruce Diamond sent some email to the "suggestions" box with the following interesting problem:

I have a directory structure several layers deep not too dissimilar to:

Top (/)
|----/foo 1
|    |----/bar1
|    |----/bar2
|    |----/bar3
|         |----<files>
|----/sna
|    |----/fu1
|    |----/fu2
|    |----/fu3
|
|----/kil
|    |----/roy1
|    |    |----<files>
|    |
|    |----/roy2
|    |    |----<files>
|    |
|    |----/roy3


My problem is, I wish to identify (and then delete) the directories AND directory trees that are really, truly empty. So, in the above example, /foo/bar1, /foo/bar2, /kil/roy3, and ALL of /sna, being empty of files, would be deleted.


The Unix answer to this challenge turns out to be very straightforward if you happen to know about the "-depth" option to the "find" command. Good old "-depth" means do "depth-first traversal": in other words, dive down to the lowest level of each directory you're searching and then work your way back up, so that the contents of a directory are always handled before the directory itself. To show you how this works, here's the "find -depth" output from Bruce's sample directory structure:

$ find . -depth -type d
./kil/roy1
./kil/roy3
./kil/roy2
./kil
./sna/fu1
./sna/fu3
./sna/fu2
./sna
./foo/bar3
./foo/bar1
./foo/bar2
./foo
.

To remove the empty directories, all we have to do is add "-exec rmdir {} \;" to the end of the above command:

$ find . -depth -type d -exec rmdir {} \;
rmdir: ./kil/roy1: Directory not empty
rmdir: ./kil/roy2: Directory not empty
rmdir: ./kil: Directory not empty
rmdir: ./foo/bar3: Directory not empty
rmdir: ./foo: Directory not empty
rmdir: .: Invalid argument

"find" is calling "rmdir" on each directory in turn, from the bottom up. Directories that are completely empty are removed silently. Non-empty directories generate an error message from "rmdir" and are not removed. Since the sub-directories containing files will not be removed, their parent directories can't be removed either. On the other hand, directories like "sna" that contain only empty subdirectories will be completely cleaned up. This is exactly the behavior we want.

By the way, if you don't want to see the error messages you could always redirect the standard error to /dev/null like so:

$ find . -depth -type d -exec rmdir {} \; 2>/dev/null
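
To see why "-depth" matters, here is a minimal sketch that rebuilds something like Bruce's sample tree in a scratch directory and runs the command both with and without it. The make_tree and prune_empty helpers, the file names, and the scratch paths are all made up for this illustration.

```shell
#!/bin/sh
# Sketch only: every path below is throwaway test data under mktemp.

make_tree() {
    # $1 = root of the scratch tree to create
    mkdir -p "$1/foo/bar1" "$1/foo/bar2" "$1/foo/bar3" \
             "$1/sna/fu1" "$1/sna/fu2" "$1/sna/fu3" \
             "$1/kil/roy1" "$1/kil/roy2" "$1/kil/roy3"
    touch "$1/foo/bar3/file" "$1/kil/roy1/file" "$1/kil/roy2/file"
}

prune_empty() {
    # $1 = tree root, $2 = "-depth" or nothing for find's default order.
    # The subshell keeps the caller's working directory untouched.
    ( cd "$1" && find . $2 -type d -exec rmdir {} \; 2>/dev/null )
    return 0   # rmdir failures on non-empty directories are expected
}

work=$(mktemp -d)
make_tree "$work/with_depth"
prune_empty "$work/with_depth" -depth
make_tree "$work/without_depth"
prune_empty "$work/without_depth"

# Show what survived at the top level of each copy.
ls "$work/with_depth" "$work/without_depth"
```

Without "-depth", find hands "sna" to rmdir before its children, so the rmdir fails and an empty "sna" is left behind. With "-depth" the children go first and "sna" disappears along with them.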

Mr. Bucket points out that GNU find has a couple of extra options that make this challenge even easier:

$ find . -type d -empty -delete

The "-empty" option matches either empty files or directories, but since we're specifying "-type d" as well, we'll only match empty directories (though you could leave off the "-type d" and remove zero length files as well, and possibly clean up even more directories as a result). The "-delete" option removes any matching directories. What's cool about "-delete" is that it automatically enables the "-depth" option so that we don't have to specify it on the command line.
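
One wrinkle worth a quick sketch, assuming GNU find: a dry run with "-print" in place of "-delete" only lists directories that are empty right now, while the real "-delete" run also removes parents that become empty along the way. The scratch tree and the "-mindepth 1" guard (another GNU extension, used here to keep the top directory itself out of the results) are additions for this illustration.

```shell
#!/bin/sh
# Sketch, assuming GNU find; -empty, -delete and -mindepth are not POSIX.
work=$(mktemp -d)
mkdir -p "$work/sna/fu1" "$work/sna/fu2" "$work/kil/roy1" "$work/kil/roy3"
touch "$work/kil/roy1/file"

# Preview: only fu1, fu2 and roy3 are empty at this moment.
# sna still holds two subdirectories, so it is not listed.
preview=$(cd "$work" && find . -mindepth 1 -type d -empty -print | LC_ALL=C sort)
printf 'would delete:\n%s\n' "$preview"

# Real run: -delete implies -depth, so by the time find tests sna its
# children are already gone and it is removed as well.
( cd "$work" && find . -mindepth 1 -type d -empty -delete )

remaining=$(cd "$work" && find . -mindepth 1 | LC_ALL=C sort)
printf 'left over:\n%s\n' "$remaining"
```

So the preview is a lower bound on what will be removed, not an exact list.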

Why do I have the feeling that this is another one of those "easy for Unix, hard for Windows" challenges?

Ed responds:
Awesome question, Bruce! Thanks for writing in.

And, it turns out that this one isn't too bad in Windows after all. When I first saw it, I thought it might get kinda ugly, especially after reading Hal's comments above. But, then, I pulled a little trick using the sort command, and it all worked out ok. But let's not get ahead of ourselves.

You see, we can get a directory listing using our trusty old friend, the dir command, as follows:

C:\> dir /aD /s /b .
[dir]\foo
[dir]\kil
[dir]\sna
[dir]\foo\bar1
[dir]\foo\bar2
[dir]\foo\bar3
[dir]\kil\roy1
[dir]\kil\roy2
[dir]\kil\roy3
[dir]\sna\fu1
[dir]\sna\fu2
[dir]\sna\fu3

This command tells dir to list all entities in the file system under our current directory (.) with the attribute of directory (/aD), recursing subdirectories (/s), with the bare form of output (/b) -- which we use to make the dir command show the full path of each directory. You could leave off the . in the command, but I put it in there as a placeholder showing where you'd add any other directory you'd like to recurse this command through.

Nice! But we can't just delete the directories in the order they're listed here. We need some method of doing a depth-first search, simulating the behavior of the Linux find command with its -depth option. Well, dir doesn't do that. I pondered this for about 15 seconds, when it hit me. We can just pipe our output through "sort /r" to reverse it. Because sort does its work alphabetically, a forward sort always puts a parent directory ahead of anything beneath it (the parent's path is a prefix of the child's). When we do a reverse sort, the longer (i.e., deeper) paths come before the shorter paths of their parents, so the output is effectively a depth-first listing! Nice!

C:\> dir /aD /s /b . | sort /r
[dir]\sna\fu3
[dir]\sna\fu2
[dir]\sna\fu1
[dir]\sna
[dir]\kil\roy3
[dir]\kil\roy2
[dir]\kil\roy1
[dir]\kil
[dir]\foo\bar3
[dir]\foo\bar2
[dir]\foo\bar1
[dir]\foo

Now that we have that workable component, let's make it delete the directories that are empty. We'll wrap the above dir & sort combo in a FOR /F loop to iterate over each line of its output, feeding it into the rmdir command to remove directories. If you ever want to run a command to process each line of output of another command, FOR /F is the way to do it, specifying your original command inside of single quotes in the in () component of the FOR /F loop. Like Hal, we'll rely on the fact that rmdir will not remove directories that have files in them, but will instead write a message to standard error. Truly empty directories, however, will be silently deleted. The result is:

C:\> for /F "delims=?" %i in ('dir /aD /s /b . ^| sort /r') do @rmdir "%i"

I put the "delims=?" in the FOR /F loop to dodge a bit of ugliness with the default parsing of FOR /F. You see, if any of the directory names in the output of the dir command has a space in it, the FOR /F loop will split the name at the space and assign the %i variable only the text before it. We'd only have part of the directory name, which, as Steve Jobs would say, is a bag of hurt. We need a way to turn off the default space-based parsing of FOR /F. We can do that by specifying a custom delimiter of a character that can't be used in a file's name. In Windows, the characters that can't appear in a name are / \ : * ? " < > |, although \ and : are no good as delimiters here because they show up in the full paths that dir prints. I chose to use a ? here, because no directory name should have that. Thus, %i will get the full directory name, spaces and all.

The ^ before the | is also worthy of a bit of discussion. FOR /F loops can iterate over the output of a command by specifying that command inside of single quotes in the "in ()" part of the FOR loop declaration. But, if the command has any funky characters, including commas, quotation marks, or pipe symbols, we have to put a ^ in front of the funky symbol as an escape so FOR handles it properly. The other option we have is to put the whole command inside of single quote double quote combinations, as in:

... in ('"dir /aD /s /b . | sort /r"')... 

That's a single quote followed by a double quote up front, and a double quote single quote at the end.

If I have only one funky character in my FOR /F command, I usually just pop in a ^ in front of it. If I have several of them, rather than escaping each one with a ^, I use the single-quote double-quote trick.

Going back to our original command, we'll see an error message of "The directory is not empty." any time we try to rmdir a directory with files in it. We can get rid of that message by simply taking standard error and throwing it into nul by appending 2>nul to the overall command above.
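
For readers following along on the Unix side, the same reverse-sort idea can be sketched in shell. This is only a rough analogue of the dir, sort /r and FOR /F combination, not a replacement for Hal's "find -depth" one-liner, and the directory names are invented to include spaces. It assumes no directory name contains a newline.

```shell
#!/bin/sh
# Sketch: list directories, reverse-sort them, and rmdir each whole line.
work=$(mktemp -d)
mkdir -p "$work/my docs/old stuff" "$work/my docs/kept" "$work/sna/fu1"
touch "$work/my docs/kept/file"

# sort -r puts every child path ahead of its parent, because the parent's
# path is a prefix of the child's and therefore sorts first going forward.
# "IFS= read -r" keeps each line intact, spaces and all, which is the job
# that "delims=?" does in the FOR /F loop.
( cd "$work" && find . -type d | LC_ALL=C sort -r |
    while IFS= read -r dir; do
        rmdir "$dir" 2>/dev/null   # non-empty directories just fail quietly
    done )

# Show what is left under the scratch root.
( cd "$work" && find . | LC_ALL=C sort )
```

Here "my docs/old stuff" and all of "sna" are removed, while "my docs" survives because "kept" still holds a file.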

Tim Medin (aka, our PowerShell "Go-To" Guy) adds:
The PowerShell version of the command is very similar to Ed's command, with one notable exception: length.

As Hal explained, we need a list of the directories sorted in depth-first order. Unfortunately, there isn't an option like "-depth" to do it for us, so we have to do it the long way. This command will retrieve a depth-first list of directories:

Short Version:
PS C:\> gci -r | ? {$_.PSIsContainer} | sort -desc fullName

Long Version:
PS C:\> Get-ChildItem -Recurse | Where-Object {$_.PSIsContainer} |
Sort-Object -Descending FullName

The first portion of the command retrieves a recursive directory listing. The second portion filters for containers (directories) only. The directories are then sorted in reverse order so we end up with a listing similar to that retrieved by Hal.

For those of you not familiar with PowerShell, the names of these commands might seem a little odd. The reason for the odd names is that these commands are very generic. The Get-ChildItem command works like the dir command, but it can do much more. It can be used to iterate through anything with a hierarchical structure, such as the registry. The PSIsContainer property is true for generic container objects such as directories or registry keys. The $_ variable refers to the "current pipeline object." Back to our regularly scheduled programming...

So we have a depth-first directory listing similar to this:
C:\sna\fu3
C:\sna\fu2
C:\sna\fu1
C:\sna
C:\kil\roy3
C:\kil\roy2
C:\kil\roy1
C:\kil
C:\foo1\bar3
C:\foo1\bar2
C:\foo1\bar1
C:\foo1


Now we need to check if our current directory is empty, so we can later delete it.

!(gci)


This command will return True if there are no items in the current directory. We can use it in a "where-object" command to filter our results.
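
For comparison, a rough shell counterpart of that emptiness test might look like the sketch below. The is_empty_dir name and the scratch directories are made up for the example. One difference to keep in mind: "ls -A" counts dotfiles, whereas Get-ChildItem skips hidden items unless -Force is given, so the two tests can disagree about a directory that holds only hidden entries.

```shell
#!/bin/sh
# Sketch of an "is this directory empty?" test, similar in spirit to !(gci).

is_empty_dir() {
    # Succeeds (exit 0) when "$1" is a directory with no entries at all.
    # ls -A includes dotfiles but leaves out "." and "..".
    [ -d "$1" ] && [ -z "$(ls -A "$1")" ]
}

work=$(mktemp -d)
mkdir "$work/empty" "$work/full" "$work/hidden"
touch "$work/full/file" "$work/hidden/.dotfile"

for d in empty full hidden; do
    if is_empty_dir "$work/$d"; then
        echo "$d: empty"
    else
        echo "$d: not empty"
    fi
done
```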

Finally, we pipe the results into rm (Remove-Item). Our final command looks like this:

Short Version:
PS C:\> gci -r | ? {$_.PSIsContainer} | sort -desc FullName |
? {!(gci $_.FullName)} | rm

Long Version:
PS C:\> Get-ChildItem -Recurse | Where-Object {$_.PSIsContainer} |
Sort-Object -Descending FullName | Where-Object {!(Get-ChildItem $_.FullName)} |
Remove-Item

Looks like Hal's and Ed's Kung Fu are much shorter, and as they say, "size does matter."

-Tim Medin