Tuesday, June 28, 2011

Episode #151: Pathological PATHs

Hal gets some email

Spring is now officially over, but apparently it's never too late to think about Spring Cleaning... of your PATH that is. Jeff Haemer writes in to say that he often adds new directories to his search path with the "PATH+=:/some/new/dir" idiom (aka "PATH=$PATH:/some/new/dir"). But the problem is that if you do this frequently and indiscriminately, you can end up with lots of redundancy in your search path:

$ echo $PATH
/bin:/usr/bin:/usr/X11R6/bin:/usr/local/bin:/sbin:/usr/sbin:/usr/local/sbin:/usr/local/sbin:
/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/home/hal/bin:/sbin:/usr/sbin:/bin:
/usr/bin:/sbin:/usr/sbin:/bin:/usr/bin:/sbin:...
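
One way to avoid piling up duplicates in the first place is to check before appending. Here's a minimal sketch of that idea. The function name "path_append" is made up for illustration and isn't something from Jeff's email:

```shell
# path_append is a hypothetical helper name.
# It appends a directory to PATH only when it is not already listed.
path_append() {
    local dir=$1
    case ":$PATH:" in
        *":$dir:"*) ;;                    # already present, do nothing
        *) PATH="${PATH:+$PATH:}$dir" ;;  # append, adding ":" only if PATH is non-empty
    esac
}

path_append /some/new/dir
path_append /some/new/dir   # second call is a no-op
```

Wrapping $PATH in colons before matching means the first and last entries are compared the same way as the ones in the middle.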

For bash, the redundancy here isn't really a huge factor since the shell automatically caches executable locations (see the "hash" built-in), specifically to avoid having to traverse the entire search path every time you run a program. Still, all those duplicates do make it difficult to see if a specific directory you're looking for is already included in $PATH or not.

So Jeff's email got me thinking about ways to reduce $PATH to be just the unique directory entries. My first idea ran along these lines:

$ echo $PATH | tr : \\n | sort -u | tr \\n : | sed 's/:$/\n/'
/bin:/home/hal/bin:/sbin:/usr/bin:/usr/games:/usr/local/bin:/usr/local/sbin:/usr/sbin:/usr/X11R6/bin

I first use "tr" to change the colon separators to newlines, essentially forcing each element of $PATH onto its own line. Then it's just a simple matter of using "sort -u" to reduce the list to only the unique elements. I then use "tr" again to turn the newlines back into colons. The only problem is that the very last newline also ends up becoming a colon, which isn't really what we want. So I added one last "sed" statement to the end in order to take care of that problem.
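
If the trailing colon fixup bothers you, a possible variation is to let "paste" do the rejoining, since it only puts the delimiter between lines. This is just a sketch; the function name "sorted_unique_path" and the sample path are invented for the example:

```shell
# sorted_unique_path is a hypothetical helper name.
# Split on colons, keep the unique entries, then rejoin with paste,
# which does not leave a trailing delimiter behind.
sorted_unique_path() {
    printf '%s\n' "$1" | tr ':' '\n' | sort -u | paste -sd: -
}

sorted_unique_path "/usr/bin:/bin:/usr/bin:/bin"
```

On the sample input this prints "/bin:/usr/bin".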

This definitely gives us only the unique path elements, but unfortunately it reorders them as well. Since directory order can be very significant when it comes to your search path, it seems like a different solution is warranted. So I decided to take matters into my own hands:

$ declare -A p
$ for d in $(echo $PATH | sed 's/:/ /g'); do
[[ ${p[$d]} ]] || echo -n $d:;
p[$d]=1;
done | sed 's/:$/\n/'

/bin:/usr/bin:/usr/X11R6/bin:/usr/local/bin:/sbin:/usr/sbin:/usr/local/sbin:/usr/games:/home/hal/bin

First I use "declare -A" to initialize a new associative array (available in bash 4.0 and later), that is, an array indexed with strings rather than numbers. I'll use the array "p" to track the directories I've already seen.

At the top of my for loop, I'm using sed to convert the colons in my path to spaces so that the loop will iterate over each separate directory element in $PATH. Inside the loop, I check to see if I've already got an entry in my array "p" for the given path element. If I don't then I output the new directory followed by a colon, but I make sure to use "echo -n" so I don't output a newline. I also make sure to update "p" to indicate that I've already seen and output the directory name.

Like my last example, however, this is going to give me final output that's terminated by a colon, but no newline. So I use the same "sed" fixup I did before so that the output looks nice.
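
For what it's worth, the same loop can be packaged as a function that needs no sed at all. Treat this as a sketch rather than the canonical way to do it: the name "dedupe_path" is invented here, it assumes bash 4 or later, and unlike the loop above it splits on colons only (so directory names containing spaces survive) and drops empty entries:

```shell
# dedupe_path is a hypothetical function name; needs bash 4+ for associative arrays.
dedupe_path() {
    local -A seen=()            # directories we have already emitted
    local -a dirs=() out=()
    local d
    IFS=: read -r -a dirs <<< "$1"   # split on colons only
    for d in "${dirs[@]}"; do
        # Skip empty entries (which mean "current directory" in a search
        # path) and anything we have already seen.
        [[ -n $d && -z ${seen[$d]} ]] || continue
        seen[$d]=1
        out+=("$d")
    done
    local IFS=:
    printf '%s\n' "${out[*]}"   # join on ":" with no trailing colon
}

dedupe_path "$PATH"
```

To actually apply the result you would assign it back, for example PATH=$(dedupe_path "$PATH").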

It's a little scripty, but it gets the job done. I'm sure I could accomplish the same thing with some similar looking awk code, but it was fun trying to do this with (mostly) shell built-ins.
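
For the curious, the awk approach might look something like the following. This is only one possible sketch, not necessarily what Hal had in mind, and the function name "dedupe_path_awk" is made up:

```shell
# dedupe_path_awk is a hypothetical function name.
# Setting RS=: makes awk treat each colon-separated directory as one record.
dedupe_path_awk() {
    printf '%s' "$1" | awk -v RS=: '
        $0 != "" && !seen[$0]++ {   # true only the first time a directory appears
            printf "%s%s", sep, $0  # sep is empty before the first entry
            sep = ":"
        }
        END { print "" }            # finish the line with a newline'
}

dedupe_path_awk "$PATH"
```

The "!seen[$0]++" test plays the same role as the associative array "p" in the shell loop, and printing the separator before each entry (instead of after) avoids the trailing colon.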

Tim, how's your late Spring Cleaning going?

Tim forgot to clean

Silly Hal and his cleanliness. Doesn't he know that we nerds don't like to be clean? Of course, we tend to be a bit anal retentive, and we will definitely need to clean up our Path before picking up all the pizza boxes off the floor. Let's see what my path looks like:

PS C:\> $env:path
%SystemRoot%\system32\WindowsPowerShell\v1.0\;C:\Windows\system32;C:\Windows;
C:\Windows\system32;C:\Windows;C:\Windows\System32\Wbem;C:\Windows\system32;
C:\Windows\System32\WindowsPowerShell\v1.0\;C:\Windows\system32;
C:\Program Files (x86)\QuickTime\QTSystem\


Uh oh, it's a little ugly and we can clean up the redundant bits. The easiest cleaning method is similar to what Hal did: split, sort, remove duplicates, rejoin.

PS C:\> ($env:path.split(';') | sort -Unique) -join ";"
%SystemRoot%\system32\WindowsPowerShell\v1.0\;
C:\Program Files (x86)\QuickTime\QTSystem\;C:\Windows;C:\Windows\system32;
C:\Windows\System32\Wbem;C:\Windows\System32\WindowsPowerShell\v1.0\


We take the path (a string object) and use the Split method to split on the semicolons. The results are piped into Sort-Object (alias sort) where the Unique switch is used to remove duplicates. Finally, the array of objects is passed to the -join operator to combine the items, placing a semicolon between each pair. Of course, we end up with the same problem that Hal had: a path that is out of order.

Fortunately, we can use a little trick with the Group-Object cmdlet (alias group) to find and remove duplicates.

PS C:\> $env:path.split(';') | group

Count Name
----- ----
1 %SystemRoot%\system32\WindowsPowerShell\v1.0\
4 C:\Windows\system32
3 C:\Windows
1 C:\Windows\System32\Wbem
1 C:\Windows\System32\WindowsPowerShell\v1.0\
1 C:\Program Files (x86)\QuickTime\QTSystem\


Notice that the items stay in order, so all we have to do is output the Name property and recombine the items.

PS C:\> ($env:path.split(';') | group | % {$_.Name}) -join ';'


The ForEach-Object cmdlet (alias %) is used to iterate through each group and output the Name property. The resulting array of strings is again joined via the -join operator and we have our fixed path.

Yeah, all clean. Now to figure out what to do with all these pizza boxes.