Tuesday, December 21, 2010

Episode #126: Cleaning Up The Dump

Hal's directories are bloated

It's not politically correct to say, but sometimes in Unix your directories just get fat. And like most of us, as your directories get fat, they also get slow. This is because in standard Unix file systems, directories are implemented as sequential lists of file names. They aren't even sorted, so you can't binary search them.

For example, suppose you'd just been dumping your logs into a single directory for years. You could end up with a big pile of stuff that looks like this:

# ls -ld logs
drwxr-xr-x 2 root root 266240 Dec 18 15:49 logs
# ls logs | wc -l
# ls logs

Almost 7200 files-- and as you can see the directory itself has grown to be about a quarter of a megabyte! In our example, the file names are "<log>.YYYYMMDD" with an optional ".gz" extension on the older log files that have been compressed to save space.

Well I want my directories to be fit and lean again, so I decided to move the files into a tree structure based on year and month. So I'll need to move each file to a new location such as "YYYY/MM/<log>.YYYYMMDD". That should prevent any single sub-directory from getting too bloated.

I think there are a lot of ways you could attack this one, but I decided to make some noise with sed:

# cd logs
# for file in *; do
dir=$(echo $file | sed 's/.*\.\([0-9][0-9][0-9][0-9]\)\([0-9][0-9]\).*/\1\/\2/');
mkdir -p $dir;
mv $file $dir;

Yep, that sed expression sure is noisy-- as in "line noise". What's going on here? Well I'm taking the file name as input and using sed to pull out the YYYY and the MM and reformatting them into a subdirectory name like "YYYY/MM". First I match "anything followed by a literal dot", aka ".*\.". Then I match four digits-- four instances of the set "[0-9]"-- followed by two digits. However, I enclose both groups of digits in parens-- "\( ... \)"-- so that I can use the matched values on the righthand side of the substitution. On the RHS, "\1" is the four-digit year we matched in the first parenthesized expression and "\2" is the month we matched second. So "\1\/\2" is the year and the month with a literal slash in between-- "YYYY/MM". Obvious, right?

But the sed is the hard part. Once that's over, it's a simple task to make the directory and move the file. And now our directory should be nice and skinny:

# ls
2007 2008 2009 2010
# ls -ld .
drwxr-xr-x 6 root root 266240 Dec 18 15:58 .

Wait a minute! We've only got four top-level directories under our logs directory, but the logs directory itself hasn't shrunk at all. Unfortunately, this is normal behavior for Unix-- once a directory gets big, it never loses the weight.

So how do we stop our directory from looking like Jabba the Hut? In Unix, you make a new directory and wipe out the old one:

# mkdir ../newlogs
# mv * ../newlogs
# cd ..
# rmdir logs
# mv newlogs logs
# ls -ld logs
drwxr-xr-x 6 root root 4096 Dec 18 16:09 logs

It's liposuction via cloning! A miracle of the modern age! OK, really it's a lame mis-feature of the Unix file system. But at least you now know what to do about it.

And now I want to see Tim push his big directories around. Hey Tim, your directory is so fat...
Tim feels bloated from all the Christmas food:

Hal, yo directories is so fat, when they floated around the ocean Spain claimed them as a new world.

Ok, so the joke is terrible, but the problem is real. Directories with a lot of files can really be a pain.

On Windows there isn't one directory that contains all the logs. Each service typically has its own subdirectory under C:\Windows\System32\LogFiles\. For example, the subdirectory W3SVC1 would contain the logs for the first instance of an IIS webserver. Also, with older version of Windows C:\Windows is replaced with C:\WinNT.

This LogFiles directory is used by Microsoft products and some third-party products, but of course the third-party products can put their log files in all sorts of other weird locations. For the sake of this article, we'll assume we are looking at IIS logs.

By default IIS log files are created daily with the naming convention of exyymmdd.log. Microsoft doesn't put the full four digit year, so we'll assume 20XX. Why assume post 2000? Because if you are running an IIS server from the last millennium it probably isn't your server any more (see pwned).

Let's start off by getting the names for our directories, and then we'll build on that. According to Microsoft's IIS Log File Naming Syntax, no matter what file format or regular rotation interval (month, week, day, hour), the format always is always:

<some chars describing format><YY><MM><other numbers as used in date format>.log
We can build a regular expression replace pattern to derive directory names from the file names:

PS C:\Windows\System32\LogFiles\W3SVC1> ls *.log | % { [regex]::Replace($, '[^0-9]*([0-9]{2})([0-9]{2}).*', '20$1\$2') }
We use a ForEach-Object (alias %) loop on the output of our directory listing (Get-ChildItem is aliased as ls). Inside the loop we use .Net to call the static Replace method in the Regex class. The Replace method takes three arguments: the input, the search pattern, and the replacement string. The input is the name of the file. The search pattern is slightly more complicated. Here is how the search pattern maps to the portions of the log created on January 16th of 2009, ex090116.log.

[^0-9]*    = ex (all the non-digits at the beginning of the file name)
([0-9]{2}) = 09 (two digit year)
([0-9]{2}) = 01 (two digit month)
.* = 16.log (the rest of the name)
We then use the replacement string to build the directory name, where $1 represents the first grouping (year) and $2 represents the second grouping. Each grouping is designated by parenthesis. For more information on .Net and Regular Expression Replacement, see this article.

Notice, in our command above we used single quotes. That is because PowerShell will expand any strings inside double quotes before our Replace method had a chance to do any replacing. This means that PowerShell would try to convert $1 into a variable and not pass the literal string to the Replace method. Here is what I mean:

PS C:\> echo "Here is my string $1"
Here is my string

PS C:\> echo 'Here is my string $1'
Here is my string $1
We could use double quotes, but we would have to add a backtick (`) before the dollar sign. The resulting command would look like this:

PS C:\Windows\System32\LogFiles\W3SVC1> ls *.log | % {
[regex]::Replace($, '[^0-9]*([0-9]{2})([0-9]{2}).*', "20`$1\`$2") }

So now we have the directory name, let's create the directory structure and move some files! I'm not going to show the full prompt so the command is less cluttered.

> Get-ChildItem *.log | ForEach-Object {
$dir = [regex]::Replace($_.Name, '[^0-9]*([0-9]{2})([0-9]{2}).*', "20`$1\`$2");
mkdir $dir -ErrorAction SilentlyContinue;
Move-Item $_ $dir }
Wow, that is a rather large command, so let's trim it down with aliases and shortened parameter names. We can't have a big ol' fat command with our nice lean directories.

> ls *.log | % {
$dir = [regex]::Replace($, '[^0-9]*([0-9]{2})([0-9]{2}).*', "20`$1\`$2");
mkdir $dir -ea;
move $_ $dir }
Inside our ForEach-Object loop we set $dir equal to the new directory name. We then create the directory. The ErrorAction (ea for short) switch tells the shell not to show us an error message or stop processing if there is a problem. In our case, we want to make sure the command continues to run even if the directory already exists. After the directory is created we move the file, which is represented by $_.

PS C:\Windows\System32\LogFiles\W3SVC1> ls

Directory: Microsoft.PowerShell.Core\FileSystem::C:\Windows\System32\LogFiles\W3SVC1

Mode LastWriteTime Length Name
---- ------------- ------ ----
d---- 12/19/2010 12:28 AM <DIR> 2008
d---- 12/19/2010 12:28 AM <DIR> 2009
d---- 12/19/2010 12:28 AM <DIR> 2010

So now we can enter the new year with leaner and meaner directories. And yes, they are meaner. Directories get pretty ticked off when you trim their children.