Tuesday, October 12, 2010

Episode #116: Stop! Haemer Time!

Hal can't touch this:

In response to our "blegging" in last week's Episode, we got a bunch of good ideas from our readers for future Episodes. But don't let that stop you from sending those cards, letters, and emails for topics you'd like to see us cover in the blog!

This week's Episode comes from long-time friend of the blog, Jeff Haemer. Not only did he send us a problem, he also sent the solution-- at least for the Unix side of the house. So, easy week for me as I sit back and explain Jeff's problem and solution.

Jeff's situation is that he's got a bunch of software build directories tagged with a software revision number and a date:

$ ls
1.2.00.00_devel-20100906 1.2.00.00_devel-20100910 2.0.00.00_devel-20100909
1.2.00.00_devel-20100907 2.0.00.00_devel-20100906 2.0.00.00_devel-20100910
1.2.00.00_devel-20100908 2.0.00.00_devel-20100907
1.2.00.00_devel-20100909 2.0.00.00_devel-20100908

The problem is that Jeff wants to clean up his build area by removing all but the last two date-stamped directories for each of the different software versions.

There are really two pieces to solving this problem and Jeff's solution is a nice little bit of "divide and conquer". The first problem is figuring out the different software version numbers that are present in the directory:

$ ls | cut -d- -f1 | uniq
1.2.00.00_devel
2.0.00.00_devel

Here we're just taking the directory listing, using cut to chop off the date stamps after the "-" and then uniq-ifying the list to get just one instance of each version number. Normally you would call sort before uniq, but in this case the ls command is sorting the directory listing for us.
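If the names ever come from a source that isn't already sorted, uniq on its own will miss duplicates, because it only collapses adjacent lines. Here's a minimal sketch of the sorted variant, fed with made-up names from printf rather than a real directory listing (the function name "versions" is just for illustration):

```shell
# Hypothetical helper: print the distinct version prefixes from names on stdin.
# "sort -u" sorts and de-duplicates in one step, so input order doesn't matter.
versions() {
    cut -d- -f1 | sort -u
}

# Deliberately unsorted sample input
printf '%s\n' 2.0.00.00_devel-20100906 1.2.00.00_devel-20100906 2.0.00.00_devel-20100907 | versions
# prints:
#   1.2.00.00_devel
#   2.0.00.00_devel
```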

The next problem is, for each version number, figure out the directories we need to remove-- i.e., everything but the two most recently date-stamped directories. The naive approach might be to start with a directory listing like this:

$ ls -d 1.2.00.00_devel*
1.2.00.00_devel-20100906
1.2.00.00_devel-20100907
1.2.00.00_devel-20100908
1.2.00.00_devel-20100909
1.2.00.00_devel-20100910

The directories we want to delete are everything except for the last two directories. You could try some tricks using head piped into tail, but that gets complicated pretty quickly. An easier approach is to just invert the problem:

$ ls -dr 1.2.00.00_devel*
1.2.00.00_devel-20100910
1.2.00.00_devel-20100909
1.2.00.00_devel-20100908
1.2.00.00_devel-20100907
1.2.00.00_devel-20100906

The "-r" flag reverses the sort order of ls. So now our problem is to extract everything except for the first two lines. And that's easy:

$ ls -dr 1.2.00.00_devel* | tail -n +3
1.2.00.00_devel-20100908
1.2.00.00_devel-20100907
1.2.00.00_devel-20100906

Notice that the correct syntax for tail is "-n +3"-- "start output at line three of the input", which skips the first two lines. If you were thinking "-n +2", well let's just say you were probably in good company.
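If the off-by-one bothers you, it can be hidden behind a tiny wrapper. This is only a sketch, and "skip_first" is a made-up name, not a standard command:

```shell
# Hypothetical helper: drop the first N lines of stdin.
# "tail -n +K" starts output AT line K, so skipping N lines means K = N + 1.
skip_first() {
    tail -n "+$(( $1 + 1 ))"
}

printf '%s\n' one two three four five | skip_first 2
# prints:
#   three
#   four
#   five
```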

So now we know how to extract the various software versions, and how to get the names of all but the two most recent directories. The final solution is just a matter of putting those two ideas together:

$ for v in $(ls | cut -d- -f1 | uniq); do
ls -dr $v* | tail -n +3
done

1.2.00.00_devel-20100908
1.2.00.00_devel-20100907
1.2.00.00_devel-20100906
2.0.00.00_devel-20100908
2.0.00.00_devel-20100907
2.0.00.00_devel-20100906

In the for loop itself, I'm using our expression to obtain the directory version numbers inside of "$(...)", which is essentially the same thing as using backticks. However, the "$(...)" construct is preferable for reasons which we'll see in a moment. Then for each version number I'm using the expression we developed to output the names of the directories we want to remove.
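For anybody who wants to reuse the loop, here's one way it might be wrapped up as a function. This is a sketch rather than Jeff's actual code: the name "old_builds" and its two arguments are invented for the example, and like the original it assumes directory names contain no whitespace. It runs against a throwaway directory created with mktemp, so nothing real is touched:

```shell
# Hypothetical helper: print all but the newest $2 (default 2) builds of each
# version found in directory $1 (default: current directory).
# Assumes names look like VERSION-DATE and contain no whitespace.
old_builds() {
    (
        cd "${1:-.}" || exit 1
        keep=${2:-2}
        for v in $(ls | cut -d- -f1 | sort -u); do
            # Newest first, then skip the first $keep lines
            ls -dr "$v"-* | tail -n "+$((keep + 1))"
        done
    )
}

# Demo in a scratch directory
demo=$(mktemp -d)
for d in 20100906 20100907 20100908; do
    mkdir "$demo/1.2.00.00_devel-$d" "$demo/2.0.00.00_devel-$d"
done
old_builds "$demo"
# prints:
#   1.2.00.00_devel-20100906
#   2.0.00.00_devel-20100906
rm -rf "$demo"
```

The subshell parentheses keep the cd from changing the caller's working directory.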

Great! We're now outputting the names of all the directories we want to remove, so the next step is to actually remove them (note that it's always best to do this sort of confirmation before you do a dangerous operation like rm). There are a lot of different ways we could go here. I choose xargs:

$ !! | xargs rm -rf
for v in $(ls | cut -d- -f1 | uniq); do ls -dr $v* | tail -n +3; done | xargs rm -rf
$ ls
1.2.00.00_devel-20100909 2.0.00.00_devel-20100909
1.2.00.00_devel-20100910 2.0.00.00_devel-20100910

Whoa Nelly! What just happened there? Well, I used a quick command-line history substitution, namely "!!", to repeat the previous command (my for loop) and pipe the output into xargs.
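If you'd like one more safety check before the real thing, a common trick is to put echo in front of rm so that xargs prints the command line it would have run. A small sketch with a couple of sample names standing in for the loop output ("dry_run_rm" is a made-up name):

```shell
# Hypothetical helper: show the rm command xargs would build, without running it.
dry_run_rm() {
    xargs echo rm -rf
}

printf '%s\n' 1.2.00.00_devel-20100908 1.2.00.00_devel-20100907 | dry_run_rm
# prints: rm -rf 1.2.00.00_devel-20100908 1.2.00.00_devel-20100907
```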

Another alternative would be to use command output substitution:

rm -rf $(for v in $(ls | cut -d- -f1 | uniq); do ls -dr $v* | tail -n +3; done)

Constructs like this are why you want to use "$(...)" instead of backticks. If you tried doing the above command line with backticks, you'd get a syntax error because the shell doesn't parse "nested" backticks the way you want: the second backtick it sees closes the first one, so the inner pair would have to be escaped with backslashes. On the other hand, "$(...)" nests quite nicely.
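To see the difference side by side, here is a toy example (the variable names are arbitrary). Both lines produce the same string, but the backtick version only works because the inner backticks are escaped:

```shell
# Nested $(...) needs no special treatment
modern=$(echo "outer $(echo inner)")

# The backtick equivalent: the inner pair must be backslash-escaped
legacy=`echo "outer \`echo inner\`"`

echo "$modern"
echo "$legacy"
# both lines print: outer inner
```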

The only problem with the second solution is that if the number of directories we need to remove is large, you could theoretically exceed the limit on the length of a single command line. Using xargs protects you from that problem, since it splits the argument list across as many rm invocations as it needs.
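You can watch xargs do that splitting by forcing an artificially small batch size with "-n". This sketch uses invented names (d1 through d5, and the function "batched_rm_preview") and only echoes the commands instead of running them:

```shell
# Hypothetical helper: print the rm commands xargs would run,
# with at most $1 names per command.
batched_rm_preview() {
    xargs -n "$1" echo rm -rf
}

printf '%s\n' d1 d2 d3 d4 d5 | batched_rm_preview 2
# prints:
#   rm -rf d1 d2
#   rm -rf d3 d4
#   rm -rf d5
```

Without "-n", xargs picks the batch size itself based on the system's command-line length limit.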

Anyway, thanks for the interesting problem/solution, Jeff! It looks like Tim's got his gold parachute pants on and he's ready to rock...

Tim busts the funky lyrics

I did not wear gold parachute pants in the 80s, at least there is no proof of it. And Hal, sorry to correct you on your 80s fashion, but Hammer pants are WAY different from parachute pants, and besides, I wore the silver ones.

Let's fast-forward 20 years and PowerShell this moth-ah. Similar to Hal's approach, we'll divide and conqu-ah.

PS C:\> Get-ChildItem | Sort-Object Name -Descending

Directory: Microsoft.PowerShell.Core\FileSystem::C:\

Mode LastWriteTime Length Name
---- ------------- ------ ----
d---- 10/10/2010 10:10 PM <DIR> 2.0.00.00_devel-20100910
d---- 10/10/2010 10:10 PM <DIR> 2.0.00.00_devel-20100909
d---- 10/10/2010 10:10 PM <DIR> 2.0.00.00_devel-20100908
d---- 10/10/2010 10:10 PM <DIR> 2.0.00.00_devel-20100907
d---- 10/10/2010 10:10 PM <DIR> 2.0.00.00_devel-20100906
d---- 10/10/2010 10:10 PM <DIR> 1.2.00.00_devel-20100910
d---- 10/10/2010 10:10 PM <DIR> 1.2.00.00_devel-20100909
d---- 10/10/2010 10:10 PM <DIR> 1.2.00.00_devel-20100908
d---- 10/10/2010 10:10 PM <DIR> 1.2.00.00_devel-20100907
d---- 10/10/2010 10:10 PM <DIR> 1.2.00.00_devel-20100906


Ok, so nothing exciting. All we did was get our directory listing and sort it by name in reverse order. Now, to group them by the software version:

PS C:\> Get-ChildItem | Sort-Object Name -Descending | Group-Object { $_.Name.split("_")[0] }
Count Name Group
----- ---- -----
5 2.0.00.00 {2.0.00.00_devel-20100910, 2.0.00.00_devel-20100909, ...
5 1.2.00.00 {1.2.00.00_devel-20100910, 1.2.00.00_devel-20100909, ...


In this example, the Group-Object cmdlet uses a script block to define how the groups are created. The groupings are created by taking the Name property of the current object ($_.Name), splitting it using the underscore as a delimiter, and then using the first item (actually zeroth item, remember base 0) in the resulting array. This gives us groups of directories where the group is based on the software version.

So now we have two groups. But what does the group contain? Remember, in PowerShell everything is an object. So the groups are just collections of the objects. As such, the items in the groups can be treated the same way as a directory, since the items are the directories.

We can now use the ForEach-Object cmdlet to iterate through each item in each group.

PS C:\> Get-ChildItem | Sort-Object Name -Descending | Group-Object { $_.Name.split("_")[0] } |
ForEach-Object { Select-Object -InputObject $_ -ExpandProperty Group | Select-Object -Skip 2 }


Directory: Microsoft.PowerShell.Core\FileSystem::C:\

Mode LastWriteTime Length Name
---- ------------- ------ ----
d---- 10/10/2010 10:10 PM <DIR> 2.0.00.00_devel-20100908
d---- 10/10/2010 10:10 PM <DIR> 2.0.00.00_devel-20100907
d---- 10/10/2010 10:10 PM <DIR> 2.0.00.00_devel-20100906
d---- 10/10/2010 10:10 PM <DIR> 1.2.00.00_devel-20100908
d---- 10/10/2010 10:10 PM <DIR> 1.2.00.00_devel-20100907
d---- 10/10/2010 10:10 PM <DIR> 1.2.00.00_devel-20100906


Let's look at the command inside the ForEach-Object script block, since that is the guts of the command.

Select-Object -InputObject $_ -ExpandProperty Group | Select-Object -Skip 2


The Current Pipeline Object, represented as $_, contains a group. So we need to take that group, and remove the first two directories from the collection. We are then left with just the directories we want to delete. The Select-Object cmdlet is used to expand the group back into directory objects. That output is piped into another Select-Object cmdlet which removes (skips) the first two items, and leaves us with the directories to be deleted.

Now we have the directories we want to delete, so we can pipe the whole thing into Remove-Item. But before we do, let's make sure we have the correct directories, so we use the -WhatIf switch.

PS C:\> Get-ChildItem | Sort-Object Name -Descending | Group-Object { $_.Name.split("_")[0] } |
ForEach-Object { Select-Object -InputObject $_ -ExpandProperty Group | Select-Object -Skip 2 } |
Remove-Item -WhatIf

What if: Performing operation "Remove Directory" on Target "C:\2.0.00.00_devel-20100908".
What if: Performing operation "Remove Directory" on Target "C:\2.0.00.00_devel-20100907".
What if: Performing operation "Remove Directory" on Target "C:\2.0.00.00_devel-20100906".
What if: Performing operation "Remove Directory" on Target "C:\1.2.00.00_devel-20100908".
What if: Performing operation "Remove Directory" on Target "C:\1.2.00.00_devel-20100907".
What if: Performing operation "Remove Directory" on Target "C:\1.2.00.00_devel-20100906".


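If you want to perform the deletion, simply remove the -WhatIf switch. If the directories aren't empty, Remove-Item will prompt for confirmation on each one unless you also add the -Recurse switch.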

Ok, so that was the long version. What if we break it down using aliases and such to make it shorter?

PS C:\> ls | sort Name -desc | group { $_.Name.split("_")[0] } |
% { select -input $_ -expand Group | select -skip 2 } | rm


And that's how we roll, PowerShell Style!

You really can't touch this!

Davide Brini has done it again:

ls -r | awk -F- '$1!=v{c=0; v=$1} {c++} c>2' | xargs rm -rf

Let me explain that awk for the two or three of you out there who may be having problems decoding it:


  • The "-F-" tells awk to split its input on hyphen ("-") instead of white space. So for each line of input from the ls command, $1 will be the version string and $2 will be the date stamp.

  • The awk code uses two variables: "c" is a line count, and "v" is the current version string we're working on.

  • The first section of code, "$1!=v{c=0; v=$1}", checks to see if the version string in the current line of input is different from the last version string we saw ("$1!=v"). If so, then the code block gets executed and the line counter variable is reset to zero and v is set to the new version string ("{c=0; v=$1}").

  • The next bit of code, "{c++}", is executed on every line of input and just increments the line counter.

  • The last expression, "c>2", means match the case where the line counter is greater than 2-- in other words when we're on the third or higher line of ls output for a particular version string (remember c gets reset every time the version string changes). Because there's no code block after the logical expression, "{print}" is assumed and the line gets output.


So the net result is that the awk expression outputs the directories we want to remove, and we just pipe that output into xargs like we did with the output of the for loop in the original solution.
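If you want to poke at the awk logic without going anywhere near rm, here's a sketch that wraps it in a function and makes the "keep" count a parameter. The function name "prune_list" and the "keep" variable are additions for this example, not part of Davide's one-liner, and the input is made-up names in newest-first order like "ls -r" would produce:

```shell
# Hypothetical wrapper around the same awk logic.
# Reads newest-first names on stdin, prints everything past the first
# $1 (default 2) entries of each version.
prune_list() {
    awk -F- -v keep="${1:-2}" '$1 != v { c = 0; v = $1 } { c++ } c > keep'
}

printf '%s\n' b-3 b-2 b-1 a-2 a-1 | prune_list
# prints: b-1
```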

Easy as pie...