Wednesday, March 18, 2009

Episode #12 - Deleting Related Files

Hal Says:

I had been deliberately holding back on this problem because I didn't want to make things too tough on Ed and that poor excuse for a command shell he's been saddled with. But since he had the temerity to suggest that Unix wasn't a "real operating system" back in Episode #11 (who needs to track file creation times anyway?), the gloves have come off.

So today's problem is as follows: Delete all files whose contents match a given string AND ALSO delete a related, similarly named file in the same directory. For example, you've got a lot of spam in your Sendmail /var/spool/mqueue directory and you need to match the spammer's email address in the qf<queueID> file and then delete both the qf<queueID> file (header and delivery info) and the df<queueID> file (message contents).

Getting the matching file names is just a matter of using "grep -l", and "cut" will peel the queue ID values off the front of those file names:

# grep -l spammer@example.com qf* | cut -c3-
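
To see what each stage of that pipeline produces, here is a minimal sketch meant for a scratch directory. The queue IDs and file contents are made up for illustration; real Sendmail queue files hold far more than a single header line.

```shell
# Build a throwaway "queue" directory (hypothetical queue IDs and contents)
demo=$(mktemp -d)
cd "$demo" || exit 1
printf 'From: spammer@example.com\n' > qfAAA111
printf 'From: friend@example.org\n'  > qfBBB222
printf 'spam body\n' > dfAAA111
printf 'real body\n' > dfBBB222

# grep -l prints only the names of the files that match
matches=$(grep -l spammer@example.com qf*)
echo "$matches"

# cut -c3- drops the first two characters, leaving just the queue ID
ids=$(echo "$matches" | cut -c3-)
echo "$ids"
```

Only the qf file that mentions the spammer should come out of the first stage, and only its queue ID out of the second.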

Add a tight loop and you're done:

# for i in `grep -l spammer@example.com qf* | cut -c3-`; do rm qf$i df$i; done

And, finally, I'll administer the coup de grace by using xargs instead of a loop:

# grep -l spammer@example.com qf* | cut -c3- | xargs -I'{}' rm qf{} df{}
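
Before pointing this at a live mail queue, it may be worth rehearsing in a sandbox. The sketch below assumes made-up queue IDs in a temporary directory, and it adds two things that are not in the command above: an "echo" in front of rm for a dry run, and grep's -F option so the dots in the address are matched literally rather than as regular expression wildcards.

```shell
# Throwaway queue directory (hypothetical queue IDs and contents)
demo=$(mktemp -d)
cd "$demo" || exit 1
printf 'From: spammer@example.com\n' > qfAAA111
printf 'From: friend@example.org\n'  > qfBBB222
printf 'spam body\n' > dfAAA111
printf 'real body\n' > dfBBB222

# Dry run: echo prints the rm command instead of executing it
planned=$(grep -lF spammer@example.com qf* | cut -c3- | xargs -I'{}' echo rm qf{} df{})
echo "$planned"

# Real run: remove the matching qf file and its df partner
grep -lF spammer@example.com qf* | cut -c3- | xargs -I'{}' rm qf{} df{}

# Whatever is left should be the untouched legitimate message
remaining=$(ls)
echo "$remaining"
```

If the dry run prints anything surprising, nothing has been deleted yet and the pattern can be tightened first.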

So, Skodo, think you're ready to play with the big-time shells?

Ed (aka Skodo) responds:

Hal says he "Didn't want to make things too tough on Ed..." Well, thank you for your niceties, but easy-to-use and sensical command shells are for wimps. "Big-time shells..." I wonder: if we counted the copies of cmd.exe in the universe and compared that to the number of bash shells, which would come out "big-time"? Still, I do have to confess, cmd.exe is about the most uglified and frustrating shell ever devised by man. But, I can take care of your so-called challenge with the following trivial-to-understand command:

C:\> cmd.exe /v:on /c "for /f %i in ('findstr /m spammer@example.com qf*') do @set stuff=%i & del qf!stuff:~2! & del df!stuff:~2!"

Although an explanation of this really straightforward command probably isn't necessary (it's pretty obvious, no?), I'll go ahead and insert one just for completeness. I'll start in the middle, work my way through the end, and wrap around to the beginning.

Putting all sarcasm aside, I'm doing a bunch of gyrations in this command to get really flexible string parsing beyond what I can get with normal Windows FOR loops. I start out in the middle by running the findstr command with the /m option, which makes it print only the names of files that contain the string "spammer@example.com" at least once. I'm looking only through files called qf*. The output of the findstr command will be one qf file name per line. The findstr command will run inside the FOR /F loop because I put it inside of forward single quotes (' '), with the iterator variable %i taking on the value of each line of findstr's output in turn.

So far, so good. But, now we get to the fu part here, and I really mean FU. Originally, I considered parsing %i using another FOR /F loop to rip it apart as a string, so I could peel off the qf in front to get the unique part of the file name. However, that won't work nicely, because FOR /F parsing cannot do substrings. So, I briefly thought about defining the letters q and f as delimiters in my FOR /F so I could parse them off, but the remainder of the file name may have those letters in it as well, which means I would miss some files with my over-exuberant FOR /F q and f delimiters. There must be another way, one that lets us get substrings.

Clearly, we need better parsing of the %i variable. What to do? Well, we can't apply substring parsing directly to iterator variables of FOR loops, because substring parsing is only available for environment variables. I wish we could just sub-stringify %i, but it doesn't work. Instead, we can assign its value to an environment variable, which I've called "stuff". Then, we can parse stuff to snip off the first two characters (the q and the f) using !stuff:~2!. I then delete the files referred to with qf!stuff:~2! and df!stuff:~2!.
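
For comparison with the Unix side of this post, the same snip-off-the-front step can be sketched with shell parameter expansion. This is only an analogy to !stuff:~2!, not part of the cmd.exe command; the file name is a made-up example, and the variable name stuff is borrowed from above.

```shell
# Hypothetical qf file name, standing in for one line of findstr output
stuff=qfAAA111

# ${stuff#qf} strips a leading "qf", much as !stuff:~2! skips two characters
id=${stuff#qf}

# Rebuild the pair of related file names from the queue ID
echo "qf$id df$id"
```

The difference is that the shell form removes a named prefix, while the cmd.exe form skips a fixed number of characters; for file names that always start with qf the result is the same.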

But, what's that monstrosity up front with the cmd.exe /v:on /c? Well, cmd.exe does immediate environment variable expansion by default, expanding a variable once, when the command line is read, before our loop has run at all. We want delayed expansion, so that stuff can take on different values as our loop iterates, which is also why stuff is wrapped in ! rather than %. We get that by first invoking a cmd.exe with /v:on to tell it to do delayed environment variable expansion, and to execute a command for us (/c), with that command being our FOR loop. All of that nonsense, just to get flexible variable parsing. Still, this parsing is pretty useful, especially when combined with FOR /F string parsing. But don't get me started on that.

So, there you have it. Lots of fun little gems in this one. Thanks for the challenge, Hal. Inspired by your post, I'm now going to install sendmail on a Windows box and write an anti-spam tool using the above command.... NOT!

Special Guest Fu from @jaykul:

@jaykul, a PowerShell master, provided this useful PowerShell command to implement a solution to Hal's challenge:

#PowerShell> sls spammer@example.com -list -path qf* | rm -path {$_.Path -replace "\\qf", "\[qd]f"}

@jaykul helpfully notes that sls stands for select string.

Ed comments: It's amazing how much simpler and more elegant PowerShell is compared to cmd.exe. I only wish we had it 10 years ago, and could rely on it being widely deployed now! Faster, please!