Tuesday, September 13, 2011

Revisiting Episode #151: Readers' Revenge!

Hal's a football widow

Well it's the start of football season here in the US, and Tim's locked himself in his "Man Cave" to catch all of the action. For our readers outside the US, our version of football is played with a "ball" that isn't at all round and which rarely touches the players' feet. We just called it football to confuse the rest of the world.

Since football really isn't my sport, I figured I'd spend some time this weekend catching up on reader responses to some of our past Episodes. Back in Episode #151 I sort of threw down the gauntlet at the end of my solution when I stated, "I'm sure I could accomplish the same thing with some similar looking awk code, but it was fun trying to do this with just shell built-ins." I figured that mention of an awk solution would bring an email from Davide Brini, and in this I was not disappointed.

Davide throws down

Let's just get straight to the awk, shall we:

echo -n $PATH | awk 'BEGIN { RS = ":" }; 
!a[$0]++ { printf "%s%s", s, $0; s = RS };
END { print "" }'

There's some sneaky clever bits here that bear some explanation:


  • In the BEGIN block, Davide is setting "RS"-- the "record separator" variable-- to colon. That means awk will treat each element of our input path as a separate record, automatically looping over each individual element and evaluating the statement in the middle of the example above.


  • That statement begins with a conditional operator, "!a[$0]", combined with an auto-increment, "++". In the conditional expression, "a" is an associative array that's being indexed with the elements of our $PATH. "$0" is the current "record" in the path that awk is giving us. So "!a[$0]" is true if we don't already have an entry for the current $PATH element in the array "a".


  • True or false, however, the auto-increment operator is going to add one to the value in "a[$0]", ensuring that if we run into a duplicate later in $PATH then the "!a[$0]" condition will return false.


  • If "!a[$0]" is true (it's the first time we've encounted a given directory in $PATH), then we execute the block after the conditional. That prints the value of variable "s" followed by the directory name, "$0". The first time through the loop, "s" will be null and we just print the directory. However, the second statement in the loop sets "s" to be colon (the value of "RS"), so in future iterations we'll print a colon before the directory name, so that everything gets nicely colon-separated.


  • In the end block, we output a null value. But this has the side effect of spitting out a newline at the end of our output, which makes things more readable.


Phew! That's some fancy awk, Davide. Thanks for the code!

Who let the shells out?

What I wasn't expecting was a note from loyal reader Daniel Miller, who took my whole "just shell built-ins" comment quite seriously. I had some sed mixed up in my final solution, but Daniel provided the following shell-only solution:

$ declare -A p
$ for d in ${PATH//:/ }; do [[ ${p[$d]} ]] || u[$((c++))]=$d; p[$d]=1; done
$ IFS=:
$ echo "${u[*]}"
/bin:/usr/bin:/usr/X11R6/bin:/usr/local/bin:/sbin:/usr/sbin:/usr/local/sbin:/usr/games:/home/hal/bin
$ unset IFS

I am limp with admiration. Daniel replaces the sed in my loop with the shell variable substitution operator "${var/.../...}" that we've used in previous Episodes. The clever bit, though, is that he's added a new array called "u" to the mix to keep track of the unique directory names, in order, as we progress through the elements of $PATH.

Inside the loop we check our associative array "p" as before to see whether we've encountered a given directory, $d, or not. If this is the first time we've seen $d, then "[[ ${p[$d]} ]]" will be false, and so we'll execute the statement after the "||", which adds the directory name to our array "u". The clever bit is the "$((c++))" in the array index, which uses "c" as an auto-incrementing counter variable to keep extending the "u" array as necessary to add new directory names.

You'll notice, however, that we're not outputting anything inside the loop. After the loop is finished, Daniel uses "echo "${n[*]}"" to output all of the elements of "n" with a single statement. The neat thing about the "${n[*]}" syntax is that it uses the first character of IFS to separate the array elements as they're being printed. So Daniel sets IFS to colon before the echo statement-- and then unsets it afterwards because having IFS set to colon is surely going to mess up later commands! In fact, Daniel suggests putting all of this mess into a shell function where you can declare IFS as a local variable and not mess up other commands.

Anyway, thanks as always to our readers for their efforts to improve our humble shell efforts. I'll see if I can drag Tim out of the Man Cave in time for next week's Episode...