Tuesday, March 22, 2011

Episode #139: Text or Video... or Both?

Hal's in the mailbag again

Recently we got a request from Seth Feldman. Seth's trying to organize his directory of conference videos, which is structured like:

./BHDC2011/Larimer/info.txt
./BHDC2011/Larimer/talk.avi
./SANS2011/Pomeranz/Rumba.avi
./SANS2011/Medin/Jitterbug.txt
./SANS2011/Skodo/AchyBreaky.avi
./SANS2011/Skodo/Lambada.txt
./SANS2011/Skodo/SalsaDancing.txt
./SANS2011/Skodo/SalsaDancing.wav
./Shmoo2011/Coyne/notes.txt
./Shmoo2011/Coyne/talk.mpeg

In each "leaf" directory there will be one or more video files. Each video file should have a corresponding *.txt file that describes the video. Usually the text file shares the same base file name as the video, but not always (as you can see above). Seth wants to find all of the videos in his (extensive) collection that don't yet have a descriptive text file and/or cases where he's created the text file but hasn't had a chance to download the video.

If everything were like the SANS2011 directory in our example-- where the video file and the text description share the same base file name-- then we could go with a simpler solution:

$ find SANS2011 -type f | sed 's/\.[^.]*$//' | sort | uniq -u
SANS2011/Medin/Jitterbug
SANS2011/Pomeranz/Rumba
SANS2011/Skodo/AchyBreaky
SANS2011/Skodo/Lambada

Just find all the files, strip off the file extension, and then use "uniq -u" to find all the base file names that only appear once in the output. Unfortunately this solution fails on directories like the BHDC2011 and Shmoo2011 dirs where the files have different names, giving you a false-positive.
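To see what each stage contributes, it can help to run the pipeline by hand on a few names. The sketch below feeds three of the Skodo paths from the listing through the same sed expression, first with "uniq -c" so the counts are visible and then with "uniq -u". The variable names are only illustrative, and no real files are needed.

```shell
#!/bin/sh
# Three paths copied from the example listing.
paths='SANS2011/Skodo/Lambada.txt
SANS2011/Skodo/SalsaDancing.txt
SANS2011/Skodo/SalsaDancing.wav'

# Strip the final extension, then count how often each base name occurs.
counts=$(printf '%s\n' "$paths" | sed 's/\.[^.]*$//' | sort | uniq -c)
printf '%s\n' "$counts"

# "uniq -u" keeps only the names that occur exactly once (the orphans).
orphans=$(printf '%s\n' "$paths" | sed 's/\.[^.]*$//' | sort | uniq -u)
printf '%s\n' "$orphans"
```

SalsaDancing shows up with a count of 2 and drops out, while Lambada has a count of 1 and is the only name "uniq -u" reports.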

I could make a small mod to help the situation in this case:

$ find . -type f | sed -r 's/\/(info|notes)\.txt$/\/talk.txt/; s/\.[^.]*$//' | 
sort | uniq -u

./SANS2011/Medin/Jitterbug
./SANS2011/Pomeranz/Rumba
./SANS2011/Skodo/AchyBreaky
./SANS2011/Skodo/Lambada

If you can count on the text file being named "info.txt" or "notes.txt" and the corresponding video to be called "talk.*", then just tweaking the sed to "rename" the *.txt file as I'm doing above will work. But I'm not sure we can count on this pattern being repeated throughout the entire directory structure.

So I went with an uglier approach:

$ find . -type d -links 2 | 
while read dir; do
a=$(ls "$dir" | wc -l);
t=$(ls "$dir"/*.txt | wc -l);
o=$(($a - $t));
[[ $o == $t ]] || echo $dir - $t txt, $o vid;
done 2>/dev/null

./SANS2011/Pomeranz - 0 txt, 1 vid
./SANS2011/Medin - 1 txt, 0 vid

Here I'm using "find . -type d -links 2" to find all of the "leaf" directories. Why does this work? First, the minimum link count on any directory is two because there's the pointer to the directory from its parent plus the "." link in the directory which points back to itself. Any time you make a subdirectory, however, that subdirectory contains a ".." link that points back to its parent, increasing the parent's link count by one. So directories with link count 2 must have no subdirs, and thus they are "leaf" directories.
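One caveat worth hedging: the trick only holds on filesystems that keep traditional directory link counts. btrfs, for instance, reports a link count of 1 for every directory, so "-links 2" would match nothing there. A minimal fallback sketch, which is a different technique from the one above, tests each directory for subdirectories directly. It assumes GNU or BSD find (the -mindepth and -maxdepth options are not POSIX), and the temporary tree and variable names are made up for the demo.

```shell
#!/bin/sh
# Build a small throwaway tree (directory names borrowed from the example).
root=$(mktemp -d)
mkdir -p "$root/SANS2011/Skodo" "$root/SANS2011/Medin" "$root/Shmoo2011/Coyne"

# A directory is a leaf if it has no immediate subdirectories.
leaves=$(find "$root" -type d | while IFS= read -r d; do
    sub=$(find "$d" -mindepth 1 -maxdepth 1 -type d | head -n 1)
    if [ -z "$sub" ]; then
        # Print the path relative to the temporary root.
        printf '%s\n' "${d#"$root"/}"
    fi
done | LC_ALL=C sort)
printf '%s\n' "$leaves"

rm -rf "$root"
```

This runs one extra find per directory, so it is slower than the link-count test, but it does not depend on how the filesystem counts links.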

Next I loop over all of the leaf directories I find. Inside the loop I calculate the total number of files in each directory and the number of *.txt files. Then I subtract the number of text files from the total number of files, giving me the number of non-text (or "other") files. If the number of text files doesn't equal the number of other files, I output some information about the directory.

Unfortunately, while our original solution gave us false-positives, this version ends up giving us false-negatives. The ./SANS2011/Skodo directory has an orphan *.txt file and an orphan video file. But since there are the same number of orphans of each type, our "count the file types" solution doesn't flag this directory as a problem.

So which do you prefer, false-positives or false-negatives? In this case, I'm going to go with getting some false-positives, because frankly the first version of the command is a lot easier to type. But your mileage, as always, may vary.

Now if Tim can leave off his Jitterbugging for a moment, we'll see what he's got up his sleeve.

Tim jitterbugs the night away

Hal took two approaches to this episode, so I will too. We'll start off with the false-negative approach, which just compares the number of txt files and the number of non-txt files. Here is the command:

PS C:\videos> Get-ChildItem -Recurse | ? {
$_.PSIsContainer -and
($_.GetFiles("*.txt").Count * 2) -ne $_.GetFiles().Count
} | Select-Object FullName


FullName
--------
C:\videos\SANS2011\Medin
C:\videos\SANS2011\Pomeranz


I broke the command into multiple lines for readability, but of course this could all be on one line. We start off with the basic recursive directory listing, followed by a Where-Object filter which has two parts:
1) Directories only
2) The number of files in the directory does NOT equal double the number of text files.

The last portion of the filter may not make sense at first glance, so let me explain it a bit further. The Pomeranz directory should have 1 text file, and 1 video file. The number of txt files in the directory is 0, and doubled is still 0. This is compared to the total number of files (1). The result is not equal so the object is passed down the pipeline.

Similarly, my (Medin) directory has 1 txt file, and doubled is 2. There should be two total files in the directory, but there aren't. The values are Not Equal (ne) and the object is passed down the pipeline.

Now let's look at the SANS2011\Skodo directory. There are 2 txt files, and double that is 4. This is compared to the total number of files in the directory (4) and the results are equal. Since we want Not Equal (ne) results, this object is not passed down the pipeline. Note that non-leaf directories will have 0 txt files and 0 non-txt files and will therefore not make it through our filter.
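The arithmetic behind the filter can be sketched in a few lines of shell. This is only an illustration of the "double the text files" comparison, not a translation of the PowerShell command: the flagged function is a hypothetical helper, and the counts are typed in by hand from the example listing.

```shell
#!/bin/sh
# flagged TXT TOTAL: succeeds when twice the txt count differs from the
# total file count, i.e. when the directory fails the count check.
flagged() {
    [ $(( $1 * 2 )) -ne "$2" ]
}

# name, txt count, total count (SANS2011 itself is a non-leaf: 0 files).
result=""
while read -r name txt total; do
    if flagged "$txt" "$total"; then
        result="$result $name"
    fi
done <<'EOF'
Pomeranz 0 1
Medin 1 1
Skodo 2 4
SANS2011 0 0
EOF
echo "flagged:$result"
```

Only Pomeranz and Medin are flagged. Skodo (2 * 2 = 4) and the empty non-leaf directory (0 * 2 = 0) both pass the comparison, which is exactly the false negative discussed above.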

Of course, upon closer inspection of the SANS2011\Skodo directory we see that while Ed's directory does have the right number of files, it does not have matching pairs. If we try to match the video and text file names then we can use this command:

PS C:\videos> Get-ChildItem -Recurse | ? { !$_.PSISContainer } |
Group-Object -Property BaseName,PSParentPath -NoElement | ? { $_.Count -ne 2 }


Count Name
----- ----
1 info, Microsoft.PowerShell.Core\FileSystem::C:\videos\BHDC2011\Larimer
1 talk, Microsoft.PowerShell.Core\FileSystem::C:\videos\BHDC2011\Larimer
1 Jitterbug, Microsoft.PowerShell.Core\FileSystem::C:\videos\SANS2011\Medin
1 Rumba, Microsoft.PowerShell.Core\FileSystem::C:\videos\SANS2011\Pomeranz
1 AchyBreaky, Microsoft.PowerShell.Core\FileSystem::C:\videos\SANS2011\Skodo
1 Lambada, Microsoft.PowerShell.Core\FileSystem::C:\videos\SANS2011\Skodo
1 notes, Microsoft.PowerShell.Core\FileSystem::C:\videos\Shmoo2011\Coyne
1 talk, Microsoft.PowerShell.Core\FileSystem::C:\videos\Shmoo2011\Coyne


The output isn't great, but it does get our results. The "Name" property contains the base name followed by the full parent path, including the provider (FileSystem). We'll clean up the results in a bit, but let's go over the command first.

We start off getting a recursive directory listing and we filter out the directories, so we are left with a collection of all the files. We then group the objects based on their path and the base name (filename without the extension). Any group that has exactly two members is filtered out, since that is a matching pair; every other group is passed down the pipeline.
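For readers following along on the Unix side, roughly the same grouping can be sketched with sort, "uniq -c", and awk. This is an analogue under stated assumptions rather than an equivalent of the PowerShell command: it works on the ten paths from the example listing instead of a real directory tree, and it assumes the paths contain no whitespace, since awk splits fields on it.

```shell
#!/bin/sh
# The ten paths from the example listing; no real files are needed.
paths='./BHDC2011/Larimer/info.txt
./BHDC2011/Larimer/talk.avi
./SANS2011/Pomeranz/Rumba.avi
./SANS2011/Medin/Jitterbug.txt
./SANS2011/Skodo/AchyBreaky.avi
./SANS2011/Skodo/Lambada.txt
./SANS2011/Skodo/SalsaDancing.txt
./SANS2011/Skodo/SalsaDancing.wav
./Shmoo2011/Coyne/notes.txt
./Shmoo2011/Coyne/talk.mpeg'

# Group by directory + base name, count each group, and keep only the
# groups whose size is not exactly 2.
unpaired=$(printf '%s\n' "$paths" | sed 's/\.[^.]*$//' | LC_ALL=C sort |
    uniq -c | awk '$1 != 2 { print $1, $2 }')
printf '%s\n' "$unpaired"
```

Unlike "uniq -u", which only reports names that occur once, the "!= 2" test would also flag a base name shared by three or more files.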

We have two minor problems. First, the output is messy. Second, we have duplicate directories, since each mismatched file produces its own line of output. We'll first strip out the Provider portion of the path (Microsoft.PowerShell.Core\FileSystem::) so that we have nicer output.

... | Select-Object @{Name="Directory";Expression={$_.Name -replace '.*::', ''}}

Directory
---------
C:\temp\BHDC2011\Larimer
C:\temp\BHDC2011\Larimer
C:\temp\SANS2011\Medin
C:\temp\SANS2011\Pomeranz
C:\temp\SANS2011\Skodo
C:\temp\SANS2011\Skodo
C:\temp\Shmoo2011\Coyne
C:\temp\Shmoo2011\Coyne


We then remove duplicates like this:

... | Get-Unique -AsString

Directory
---------
C:\temp\BHDC2011\Larimer
C:\temp\SANS2011\Medin
C:\temp\SANS2011\Pomeranz
C:\temp\SANS2011\Skodo
C:\temp\Shmoo2011\Coyne


Honestly, I understand that you need the -AsString switch, but I don't understand why it isn't smarter. Here is the relevant section from the help page (Get-Help Get-Unique):
"[-AsString] Treats the data as a string. Without this parameter, data is treated as an object, so when you submit a collection of objects of the same type to Get-Unique, such as a collection of files, it returns just one (the first). You can use this parameter to find the unique values of object properties, such as the file names." Ok, whatever.

We can shorten our last command by using positional parameters, aliases, and short parameter names:

PS C:\videos> ls -r | ? { !$_.PSISContainer } | group BaseName,PSParentPath |
? { $_.Count -ne 2 } | select @{n="Directory";e={$_.Name -replace '.*::', ''}} |
unique -a


Now back to my jitterbug.