Tuesday, November 8, 2011

Episode #161: Cleaning up the Joint

Hal's got email

Apparently tired of emailing me after we post an Episode, Davide Brini decided to write us with a challenge based on a problem he had to solve recently. Davide had a directory full of software tarballs with names like:

package-foo-10006.tar.gz
package-foo-10009.tar.gz
package-foo-8899.tar.gz
package-foo-9998.tar.gz
package-bar-3235.tar.gz
package-bar-44328.tar.gz
package-bar-4433.tar.gz
package-bar-788.tar.gz

As the packages accumulate in the directory, Davide wanted to be able to get rid of everything but the most recent three tarballs. The trick is that we're only allowed to rely on the version number that's the third component of the file pathname, and not file metadata like the file timestamps. And of course our final solution should work no matter how many packages are in the directory or what their names are, and no matter how many versions of each package currently exist in the directory.

The code I used to create my test cases is actually longer than my final solution. Here's the quickie I tossed off to create a directory of interesting test files:

$ for i in one two three four five; do 
for j in {1..5}; do
touch pkg-$i-$RANDOM.tar.gz;
done;
done

$ ls
pkg-five-20690.tar.gz pkg-four-6945.tar.gz pkg-three-29078.tar.gz
pkg-five-22215.tar.gz pkg-one-16581.tar.gz pkg-three-31807.tar.gz
pkg-five-24754.tar.gz pkg-one-18962.tar.gz pkg-two-1461.tar.gz
pkg-five-27332.tar.gz pkg-one-25712.tar.gz pkg-two-14713.tar.gz
pkg-five-3200.tar.gz pkg-one-5325.tar.gz pkg-two-23569.tar.gz
pkg-four-12855.tar.gz pkg-one-8421.tar.gz pkg-two-28329.tar.gz
pkg-four-14868.tar.gz pkg-three-11196.tar.gz pkg-two-526.tar.gz
pkg-four-17282.tar.gz pkg-three-15935.tar.gz
pkg-four-19436.tar.gz pkg-three-25092.tar.gz

The outer loop creates the different package names, and the inner loop creates five instances of each package. To get a wide selection of version numbers, I just use $RANDOM to get a random value between 1 and 32K.

The tricky part about this challenge is that tools like "ls" will sort the file names alphabetically rather than numerically. In the output above, for example, you can see that "pkg-two-526.tar.gz" sorts at the very end of the list, even though numerically version number 526 is the earliest version in the "pkg-two" series of files.

We can use "sort" to list the files in numeric order by version number:

$ ls | sort -nr -t- -k3 
pkg-three-31807.tar.gz
pkg-three-29078.tar.gz
pkg-two-28329.tar.gz
pkg-five-27332.tar.gz
pkg-one-25712.tar.gz
pkg-three-25092.tar.gz
...

Here I'm doing a descending ("reversed") numeric sort ("-nr") on the third hypen-delimited field ("-t- -k3"). All the package names are mixed up, but at least the files are in numeric order.

Now all I have to do is pick out the the fourth and later copies of any particular package name. For this there's awk:

$ ls | sort -nr -t- -k3 | awk -F- '++a[$1,$2] > 3' 
pkg-five-20690.tar.gz
pkg-three-15935.tar.gz
pkg-four-12855.tar.gz
pkg-three-11196.tar.gz
pkg-one-8421.tar.gz
pkg-four-6945.tar.gz
pkg-one-5325.tar.gz
pkg-five-3200.tar.gz
pkg-two-1461.tar.gz
pkg-two-526.tar.gz

The "-F-" option tells awk to split its input on the hyphens. I'm using "++a[$1,$2]" to count the number of times I've seen a particular package name. When I get to the fourth and later entries for a given package, then my conditional statement will be true. Since I don't specify an action to take, the default assumption is "{print}" and the file name gets printed. Stick that in your awk pipe and smoke it, Davide!

Removing the files instead of just printing their names is easy. Just pipe the output into xargs:

$ ls | sort -nr -t- -k3 | awk -F- '++a[$1,$2] > 3' | xargs rm -f
$ ls
pkg-five-22215.tar.gz pkg-four-19436.tar.gz pkg-three-29078.tar.gz
pkg-five-24754.tar.gz pkg-one-16581.tar.gz pkg-three-31807.tar.gz
pkg-five-27332.tar.gz pkg-one-18962.tar.gz pkg-two-14713.tar.gz
pkg-four-14868.tar.gz pkg-one-25712.tar.gz pkg-two-23569.tar.gz
pkg-four-17282.tar.gz pkg-three-25092.tar.gz pkg-two-28329.tar.gz

I've used the "-f" option here just so that we don't get an error message when we run the command and there end up being no files that need to be removed.

And that's my final answer, Regis... er, Davide! Thanks for a fun challenge! To make things really interesting for Tim, I think we should make him do this one in CMD.EXE, don't you?

Tim thinks Hal is mean

Not only does Hal throw down the gauntlet and request CMD.EXE, but he makes this problem more difficult by making this two challenges in one. Not being one to turn down a challenge (even though I should), we start off with PowerShell by creating the test files:

PS C:\> foreach ($i in "one","two","three","four","five" ) {
foreach ($j in 1..5) {
Set-Content -Path "pkg-$i-$(Get-Random -Minimum 1 -Maximum 32000).tar.gz" -Value ""
} }


PS C:\> ls

Mode LastWriteTime Length Name
---- ------------- ------ ----
-a--- 11/1/2011 1:23 PM 2 pkg-five-19410.tar.gz
-a--- 11/1/2011 1:23 PM 2 pkg-five-21426.tar.gz
-a--- 11/1/2011 1:23 PM 2 pkg-five-26739.tar.gz
-a--- 11/1/2011 1:23 PM 2 pkg-five-27296.tar.gz
-a--- 11/1/2011 1:23 PM 2 pkg-five-6618.tar.gz
-a--- 11/1/2011 1:23 PM 2 pkg-four-18533.tar.gz
-a--- 11/1/2011 1:23 PM 2 pkg-four-25925.tar.gz
-a--- 11/1/2011 1:23 PM 2 pkg-four-31089.tar.gz
-a--- 11/1/2011 1:23 PM 2 pkg-four-511.tar.gz
-a--- 11/1/2011 1:23 PM 2 pkg-four-8343.tar.gz
-a--- 11/1/2011 1:23 PM 2 pkg-one-13225.tar.gz
-a--- 11/1/2011 1:23 PM 2 pkg-one-24343.tar.gz
-a--- 11/1/2011 1:23 PM 2 pkg-one-2835.tar.gz
-a--- 11/1/2011 1:23 PM 2 pkg-one-308.tar.gz
-a--- 11/1/2011 1:23 PM 2 pkg-one-4484.tar.gz
-a--- 11/1/2011 1:23 PM 2 pkg-three-13226.tar.gz
-a--- 11/1/2011 1:23 PM 2 pkg-three-15026.tar.gz
-a--- 11/1/2011 1:23 PM 2 pkg-three-23830.tar.gz
-a--- 11/1/2011 1:23 PM 2 pkg-three-30553.tar.gz
-a--- 11/1/2011 1:23 PM 2 pkg-three-4311.tar.gz
-a--- 11/1/2011 1:23 PM 2 pkg-two-12923.tar.gz
-a--- 11/1/2011 1:23 PM 2 pkg-two-27368.tar.gz
-a--- 11/1/2011 1:23 PM 2 pkg-two-27692.tar.gz
-a--- 11/1/2011 1:23 PM 2 pkg-two-28727.tar.gz
-a--- 11/1/2011 1:23 PM 2 pkg-two-3888.tar.gz


Similar to what Hal did, we use multiple loops to create the files. Set-Content is used to create the file. The filename is a little crazy as we need to use the output of Get-Random in our path. The $() is used to wrap the cmdlet and only return the output.

I feel a big like a ditch digger who is tasked with filling in the ditch he just dug, but that's the challange. We have files and some need to be deleted.

We start off grouping the files based on their package and sorting them by their version.

PS C:\> ls | sort {[int]($_.Name.Split("-.")[2])} -desc |
group {$_.Name.Split("-.")[1]}


Count Name Group
----- ---- -----
5 four {pkg-four-31089.tar.gz, pkg-four-25925.tar.gz, pkg-four-1853...
5 three {pkg-three-30553.tar.gz, pkg-three-23830.tar.gz, pkg-three-1...
5 two {pkg-two-28727.tar.gz, pkg-two-27692.tar.gz, pkg-two-27368.t...
5 five {pkg-five-27296.tar.gz, pkg-five-26739.tar.gz, pkg-five-2142...
5 one {pkg-one-24343.tar.gz, pkg-one-13225.tar.gz, pkg-one-4484.ta...


The package and version number are retrieved by using the Split method using dots and dashes as delimiters. The version is the 3rd item (index 2, remember, base zero) and the package is the 2nd (index 1). The version is used to sort and the package name is used for grouping.

At this point we have groups that contain the files sorted, in descending order, by the version number. Now we need to get all but the first two items.

PS C:\> ls | sort {[int]($_.Name.Split("-.")[2])} -desc |
group {$_.Name.Split("-.")[1]} | % { $_.Group[2..($_.Count)]}


Mode LastWriteTime Length Name
---- ------------- ------ ----
-a--- 11/1/2011 1:23 PM 2 pkg-four-18533.tar.gz
-a--- 11/1/2011 1:23 PM 2 pkg-four-8343.tar.gz
-a--- 11/1/2011 1:23 PM 2 pkg-four-511.tar.gz
-a--- 11/1/2011 1:23 PM 2 pkg-three-15026.tar.gz
-a--- 11/1/2011 1:23 PM 2 pkg-three-13226.tar.gz
-a--- 11/1/2011 1:23 PM 2 pkg-three-4311.tar.gz
-a--- 11/1/2011 1:23 PM 2 pkg-two-27368.tar.gz
...


The ForEach-Object cmdlet (alias %) is used to operate on each group. As you will remember, the items in the group are sorted in descending order by the version number. We need to select the 3rd through the last item, and this is accomplished by using the Range operator (..) with our collection of objects. The Range of 2..($_.Count) gives us everything but the first two items. Technically, I have an off-by-one issue with the upper bound, but PowerShell is kind enough not to barf on me. I did this to save a few key strokes; although, I am using a lot more key strokes to justify my laziness. Ironic? Yes.

All we have to do now is pipe it into the Remove-Item (alias del, erase, rd, ri, rm, rmdir).

PS C:\> ls | sort {[int]($_.Name.Split("-.")[2])} -desc |
group {$_.Name.Split("-.")[1]} | % { $_.Group[2..($_.Count)]} | rm


PS C:\> ls

Mode LastWriteTime Length Name
---- ------------- ------ ----
-a--- 11/1/2011 1:23 PM 2 pkg-five-26739.tar.gz
-a--- 11/1/2011 1:23 PM 2 pkg-five-27296.tar.gz
-a--- 11/1/2011 1:23 PM 2 pkg-four-25925.tar.gz
-a--- 11/1/2011 1:23 PM 2 pkg-four-31089.tar.gz
-a--- 11/1/2011 1:23 PM 2 pkg-one-13225.tar.gz
-a--- 11/1/2011 1:23 PM 2 pkg-one-24343.tar.gz
-a--- 11/1/2011 1:23 PM 2 pkg-three-23830.tar.gz
-a--- 11/1/2011 1:23 PM 2 pkg-three-30553.tar.gz
-a--- 11/1/2011 1:23 PM 2 pkg-two-27692.tar.gz
-a--- 11/1/2011 1:23 PM 2 pkg-two-28727.tar.gz


Not too bad, but now it is time for the sucky part.

CMD.EXE

Here is the file creator:

C:\> cmd /v:on /c "for /r %i in (pkg-one pkg-two pkg-three pkg-four pkg-five) do
@for /l %j in (1,1,5) do @echo "" > %i-!random!.tar.gz"


Similar to the previous examples, this uses two loops to write our file.

Now for the beast to nuke the old packages...

C:\> cmd /v:on /c "for /f %a in ('^(for /f "tokens=2 delims=-." %b in ^('dir /b *.*'^) do
@echo %b ^) ^| sort') do @set /a first=0 > NUL & @set /a second=0 > NUL & @(for /f "tokens=1,2,3,*
delims=-." %i in ('dir /b *.* ^| find "%a"') do @set /a v=%k > NUL & IF !v! GTR !first! (del
%i-%j-!second!.tar.gz && set /a second=!first! > NUL && set /a first=!v! > NUL) ELSE (IF !v! GTR
!second! (del %i-%j-!second!.tar.gz && set /a second=!v! > NUL) ELSE (del %i-%j-!v!.tar.gz)))"


C:\> dir
Volume in drive C has no label.
Volume Serial Number is DEAD-BEEF

Directory of C:\

11/01/2011 01:23 PM <DIR> .
11/01/2011 01:23 PM <DIR> ..
11/01/2011 01:23 PM 2 pkg-five-26739.tar.gz
11/01/2011 01:23 PM 2 pkg-five-27296.tar.gz
11/01/2011 01:23 PM 2 pkg-four-18533.tar.gz
11/01/2011 01:23 PM 2 pkg-four-8343.tar.gz
11/01/2011 01:23 PM 2 pkg-one-13225.tar.gz
11/01/2011 01:23 PM 2 pkg-one-24343.tar.gz
11/01/2011 01:23 PM 2 pkg-three-23830.tar.gz
11/01/2011 01:23 PM 2 pkg-three-30553.tar.gz
11/01/2011 01:23 PM 2 pkg-two-27692.tar.gz
11/01/2011 01:23 PM 2 pkg-two-28727.tar.gz
10 File(s) 20 bytes
2 Dir(s) 1,234,567,890 bytes free


As this command is barely decipherable, I'm not going to go through it in great detail, but I will describe it at a high level.

We start off by enabling delayed variable expansion so we can set and immediately use a variable. We then use a trusty For loop (actually, I don't trust the sneaky bastards) to find the package names. We then use another For loop to work with each file that matches the current package by using a directory listing plus the Find command. Now is where it get really hairy...

We need to keep the two files with the highest version number. To do this we use two variables, First and Second, to hold the two highest version numbers. Both variables are initialized to zero. Next we need to do some crazy comparisons.

1. If the version number of the current file for the current package is greater than First, we delete the file related to Second, move First to Second, and set First equal to the current version.

2. If the version number of the current file for the current package is less than First but greater than Second, we delete the file related to Second and set Second equal to the current version.

3. If the version number of the current file for the current package is less than both First and Second then the file is deleted.

Ok, Hal, you have your CMD.EXE. I would say "TAKE THAT", but I'm pretty sure I'm the one that was taken.