Tuesday, October 19, 2010

Episode #117: Know When to Stop

Hal doesn't want to know:

We received an interesting challenge from loyal reader Joel Dinel:
My coworker has a directory which contains other sub-directories. These sub-directories contain files, but might also contain other sub-directories. My buddy would like to find, for each subdir of the current working directory, the most recent file. He would like to ignore any directories contained in sub-dirs.

Why Joel's "friend" wants to do this is a question perhaps best left unanswered. Ours is not to question why here at Command Line Kung Fu, we're merely here as a public resource for your oddest command-line challenges.

I pondered Joel's problem for a while trying to figure out how best to get started. I thought about using find to locate all of the regular files from the current directory downwards, but that won't give them to me in date-sorted order. I thought about looping over each directory and parsing the output of "ls -t", but filtering out the sub-sub-directories (if they happened to be the most recent object) could get complicated.

Then I thought that maybe I was over-thinking the problem and I should just try the naive approach:

$ ls -t */*
sub1/file2
sub2/file3
sub3/file1
sub3/file2
sub3/file3
sub1/file3
sub2/file1
sub2/file2
sub1/file1

sub3/otherdir:

sub2/subdir:

sub1/subsub:
subsubsub
file1

This actually looks surprisingly promising. The regular files from all of the sub-directories are listed first in date-sorted order, and then we get listings of the sub-sub-directories separated by blank lines. So all I have to do here is pull out the first listed file from each sub-directory and stop when I hit the first blank line.

After getting shown up by Davide Brini last weekDavide Brini's brilliant solution to last week's challenge, I had awk on the brain and I stole a page or two from Davide's playbook:

$ ls -t */* | awk -F/ '/^$/ {exit} !seen[$1] {seen[$1]=1; print}'
sub1/file2
sub2/file3
sub3/file1

The first part of the awk code, "/^$/ {exit}", simply terminates the program when we hit a blank line. We want to do this first, so we terminate the program immediately and don't produce any output when we hit the blank line.

The interesting stuff is happening in the rest of the awk expression. I'm splitting my input on slashes ("-F/"), so $1 in this case will be set to the name of the sub-directory. I'm keeping an array called "seen", which is indexed by sub-directory name. If there is not yet an array entry for a particular sub-dir, then this must be the first (and therefore most recent) file listed for that subdirectory. So we output that line ("print"), and update the "seen" array to indicate that we've output the most recent file for the sub-directory. I won't say it's obvious, but it's very compact and clean.

Honestly, this works fine on my test directory, but I was a little worried that the initial shell glob, "ls -t */*", would choke and die on a large directory structure. However, when I tested my solution in directories like /usr, /usr/include, and /proc and couldn't make it fail.

During my testing I did notice one issue though:

$ ls -l
total 224
drwxr-xr-x 2 root root 53248 2010-09-27 09:22 bin
drwxr-xr-x 2 root root 4096 2010-05-01 06:56 games
drwxr-xr-x 53 root root 12288 2010-09-27 07:55 include
drwxr-xr-x 231 root root 69632 2010-10-02 15:26 lib
drwxr-xr-x 39 root root 32768 2010-09-27 07:54 lib32
lrwxrwxrwx 1 root root 3 2009-11-30 17:25 lib64 -> lib
drwxr-xr-x 14 root root 4096 2010-09-29 11:53 local
drwx------ 2 root root 16384 2009-11-30 17:24 lost+found
drwxr-xr-x 2 root root 12288 2010-10-02 15:26 sbin
drwxr-xr-x 332 root root 12288 2010-09-27 09:10 share
drwxrwsr-x 6 root src 4096 2010-06-05 08:11 src
$ sudo ls -t */* | awk -F/ '/^$/ {exit}; !seen[$1] {seen[$1]=1; print}'
lib64/libgdiplus.so
lib/libgdiplus.so
bin/xulrunner
sbin/a2dismod
lib32/libbz2.so.1.0
include/ldap_features.h
games/same-gnome

Notice that lib64 is a symlink to lib. The glob happily follows the symlink and reports that the file libgdiplus.so is the most recent file in each of these two "directories" when really we're talking about the same file in the same directory twice.

If that's a problem for you, then we'll have to do this the hard way:

# for d in $(ls -F | egrep '/$'); do 
echo $d$(ls -tF $d | egrep -v '[@/]$' | head -1);
done | egrep -v '/$'

bin/vmss2core
games/same-gnome
include/ldap_features.h
lib/libgdiplus.so.0.0.0
lib32/libbz2.so.1.0.4
sbin/vmware-authd

The "ls -F" command puts a special character after each file name. Directories are tagged with a trailing "/", which I match with egrep to get a list of only the sub-directories of the current directory. On Linux I could have just used "find . -maxdepth 0 -type d", but "-maxdepth" is not supported on all versions of find. In any event, I'm taking my list of directories and iterating over it with a for loop.

Now "ls -t <somedir> | head -1" is a useful idiom for getting the name of the most recently modified object in a directory. But that object might be a sub-directory and we don't want that in this case. So I'm once again using the "-F" option with ls to indicate the type of file and then using egrep to filter out both directories ("/") and symlinks ("@") before I pipe things into head. This will give me the name of the most recently modified file (or device or named pipe or whatever-- you could filter those out too if you wanted), but I need to prepend the directory name.

The problem is that in some cases the output of my ls pipeline can be null-- like when the directory is empty or contains only sub-directories and/or symlinks. So I'm using the funny echo construct you see in the loop above rather than something like "echo -n $d; ls -tF ..." just so that I can guarantee that we get a terminating newline. However in the cases where the output of the ls pipeline is null I'll just have a sub-directory name and a trailing slash but no actual file name. So I added one more egrep after the loop to filter out these unnecessary lines of output.

In fact you'll notice that the output of my loop solution is considerably different from the output of the awk solution. Adding a "| xargs ls -l" to each expression makes the issues clearer:

# ls -t */* | awk -F/ '/^$/ {exit}; !seen[$1] {seen[$1]=1; print}' | xargs ls -l
lrwxrwxrwx 1 root root 27 2010-09-27 07:58 bin/xulrunner -> /etc/alternatives/xulrunner
-r-xr-sr-x 1 root games 103344 2010-02-18 00:01 games/same-gnome
-rw-r--r-- 1 root root 1890 2010-07-29 17:50 include/ldap_features.h
lrwxrwxrwx 1 root root 15 2010-09-27 07:54 lib32/libbz2.so.1.0 -> libbz2.so.1.0.4
lrwxrwxrwx 1 root root 19 2010-10-02 15:26 lib64/libgdiplus.so -> libgdiplus.so.0.0.0
lrwxrwxrwx 1 root root 19 2010-10-02 15:26 lib/libgdiplus.so -> libgdiplus.so.0.0.0
lrwxrwxrwx 1 root root 7 2010-09-27 07:55 sbin/a2dismod -> a2enmod
# for d in $(ls -F | egrep '/$'); do
echo $d$(ls -tF $d | egrep -v '[@/]$' | head -1);
done | egrep -v '/$' | xargs ls -l

-rwxr-xr-x 1 root root 857848 2010-09-27 07:37 bin/vmss2core
-r-xr-sr-x 1 root games 103344 2010-02-18 00:01 games/same-gnome
-rw-r--r-- 1 root root 1890 2010-07-29 17:50 include/ldap_features.h
-rw-r--r-- 1 root root 70076 2010-09-10 14:36 lib32/libbz2.so.1.0.4
-rw-r--r-- 1 root root 410184 2010-09-23 08:26 lib/libgdiplus.so.0.0.0
-rwsr-xr-x 1 root root 871296 2010-09-27 07:35 sbin/vmware-authd

The awk solution doesn't filter out symlinks like my for loop version does. I can fix things up a little bit by employing the same "ls -F" trick we used in the loop version:

# ls -tF */* | awk -F/ '/^$/ {exit}; !/@$/ && !seen[$1] {seen[$1]=1; print}'
bin/vmss2core*
sbin/vmware-authd*
lib64/libgdiplus.so.0.0.0
lib/libgdiplus.so.0.0.0
lib32/libbz2.so.1.0.4
include/ldap_features.h
games/same-gnome*

Do you see that I've added an extra pattern match in the awk code-- "!/@$/"-- to filter out the symlinks? Of course now I've got the trailing "*" markers for executable files. A little sed will clean that up:

# ls -tF */* | awk -F/ '/^$/ {exit}; !/@$/ && !seen[$1] {seen[$1]=1; print}' | 
sed 's/\*$//' | sort

bin/vmss2core
games/same-gnome
include/ldap_features.h
lib32/libbz2.so.1.0.4
lib64/libgdiplus.so.0.0.0
lib/libgdiplus.so.0.0.0
sbin/vmware-authd
# for d in $(ls -F | egrep '/$'); do
echo $d$(ls -tF $d | egrep -v '[@/]$' | head -1);
done | egrep -v '/$'

bin/vmss2core
games/same-gnome
include/ldap_features.h
lib/libgdiplus.so.0.0.0
lib32/libbz2.so.1.0.4
sbin/vmware-authd

I also threw in a sort command to make it easier to compare the output of the new awk expression with the the for loop output. The only real discrepancy now is that our glob in the awk version is still following the symlink and giving us output about the lib64 directory when it probably shouldn't. So you could say that the for loop version is more correct, but the awk solution "works" in most cases and is easier to type.

Two solutions from me this week. Why am I thinking that coming up with just one is going to be rough for Tim this time around?

Tim only digs so deep

Silly Hal, this one isn't bad at all. Here is what I came up with:

PS C:\> ls | ? { $_.PSIsContainer} | % { (ls $_ | ? { -not $_.PSIsContainer} | sort LastWriteTime -Desc)[0] }

Directory: Microsoft.PowerShell.Core\FileSystem::C:\dir1

Mode LastWriteTime Length Name
---- ------------- ------ ----
-a--- 10/17/2010 12:59 PM 12 ccc.txt


Directory: Microsoft.PowerShell.Core\FileSystem::C:\dir2

Mode LastWriteTime Length Name
---- ------------- ------ ----
-a--- 10/17/2010 1:07 PM 12 zzz.txt
The first part of this command is just a directory listing that filters for directories (containers).

PS C:\> ls | ? { $_.PSIsContainer}

Directory: Microsoft.PowerShell.Core\FileSystem::C:\temp

Mode LastWriteTime Length Name
---- ------------- ------ ----
d---- 10/17/2010 12:59 PM <DIR> dir1
d---- 10/17/2010 1:07 PM <DIR> dir2
The next portion is where the magic happens. We use the ForEach-Object cmdlet (alias %) to iterate through each subdirectory under our working directory. The iterator is the current pipeline object ($_). In our case, the current pipeline object will first contain dir1, and then dir2.

Inside our ForEach-Object script block, we get a listing of the contents in our subdirectory, and filter out objects that are not containers (directories). The objects (files) are then sorted by the LastWriteTime in reverse order. Finally, we select the first object. Remember, we are working in base zero, so the first object has an index of 0.

That's it, not so bad is it Hal!