Tuesday, April 5, 2011

Episode #141: Bonus Fu

Tim reads the fine print

The mail we received last week included a second part. The sender, Mr Thomas the Anonymous, asked:

Bonus question: we have random files all over the place created by Excel/Word that were exactly eight characters in length and have no file extension such as D0F5KLM3. What's the syntax to use for looking? If I do a "DIR ??????" on my root drive, it appears to be showing all directory names six characters in length and shorter, when I am only looking for those that are exactly six characters in length?

One important piece that Tom the Masked One missed is the /A:-D option to filter out results that have the directory attribute set. This will leave us with only files. Here is the resulting command with the bare format (/B) and recursive (/S) options:

C:\> dir /b /s /a:-d ??????
C:\Program Files\Wireshark\manuf
C:\Program Files\Wireshark\etc\gtk-2.0\gtkrc
C:\Program Files\Wireshark\help\toc
...


The output isn't quite what we wanted, as we are getting files that are shorter than 6 characters.

As it turns out, the "?" wildcard character represents any character...including no character. That means ?????? will match a, aa, aaa, aaaa, aaaaa, and aaaaaa. However, since you can't have "no character" at the beginning (??.txt won't match a.txt) or in the middle (a??a.txt won't match aa.txt) of a string, this special case only occurs when the question mark is at the end of the search string.

We can use our initial command to narrow down our search, but we'll need to use a more aggressive filter to get exactly what we want. FindStr's regular expression filtering to the rescue.

C:\> dir /b /s /a:-d ?????? | findstr /E "\\[A-Z0-9][A-Z0-9][A-Z0-9][A-Z0-9][A-Z0-9][A-Z0-9]"


C:\Users\theawesomeone\Desktop\randomdir\D0F5KL
...


The /E option matches the string at the end of the line. The regular expression looks for a backslash (requires an escape character, which is a backslash), followed by 6 characters of A-Z and/or 0-9. Unfortunately, this command is a bit verbose since FindStr doesn't allow us to use [A-Z0-9]{6} to specify 6 A-Z and/or 0-9 characters. To do that we have to use...

PowerShell

Unlike CMD.EXE, PowerShell's version of the "?" wildcard requires a character, so it will not match 5 character file names. The first pass of our command is:

PS C:\> Get-ChildItem -Recurse -Include ??????


This command does a recursive listing and only returns objects with a 6 character name; however, the results still include directories, as well as 6 character file names that include a dot. We'll have to do more filtering further down the pipeline. We could do all of it there, but the earlier we filter, the faster the command will be.

Here is our final command:

PS C:\> Get-ChildItem -Recurse -Include ?????? | 
Where-Object { -not $_.PSIsContainer -and $_.Name -cmatch '^[A-Z0-9]{6}$' }


This further filters our results to objects that are not containers (that is, files) and whose names match our regular expression. The regular expression, anchored at both ends, looks for names made up of exactly 6 characters from A-Z and/or 0-9. We use the case-sensitive -cmatch operator because the plain -match operator ignores case and would also accept lower-case letters.

Per usual, we can shorten up this command significantly.

PS C:\> ls -r -i ?????? | 
? { !$_.PSIsContainer -and $_.Name -cmatch '^[A-Z0-9]{6}$' }


Now Tom the Hunter of Office Files can clean up some space.

Looks like I scored some bonus points, and in two categories even. Hal, any scoring in *n?x land?

Hal is blue

Given that I'm on the road and only seeing my wife about once every other week, there's not much scoring going on at all. And thanks for rubbing salt into that particular wound, partner.

I'm also a little confused about the challenge. Mystery Tom starts out saying he's looking for eight character file names, and then faster than you can say "original Blade Runner theatrical release", we're suddenly looking for six character file names. Well, let's continue with six character file names because it makes our examples shorter. I'm sure you'll be able to figure out how to do eight character file names without much trouble.

In Unix-land, the "?" actually matches exactly one character. So we can do:

$ find testing/ -type f -name '??????'
[...]
testing/Mac-PropertyList-1.33/t/dict.t
testing/Mac-PropertyList-1.33/t/time.t
testing/Mac-PropertyList-1.33/t/load.t
testing/Mac-PropertyList-1.33/README
testing/Mac-PropertyList-1.33/examples/README
[...]
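
If you want to convince yourself of the "exactly one character" behavior without crawling a directory tree, the shell's case statement uses the same family of glob patterns. Here's a minimal sketch; is_six_chars is a made-up helper name for illustration, not a standard command:

```shell
# is_six_chars NAME: succeed only when NAME is exactly six characters long.
# (Hypothetical helper for illustration; it tests a string, not the disk.)
is_six_chars() {
    case $1 in
        ??????) return 0 ;;   # six "?" means six characters, no more, no fewer
        *)      return 1 ;;
    esac
}

# Print a verdict for a few sample names.
for name in README dict.t READM READMES; do
    if is_six_chars "$name"; then
        echo "$name: six characters"
    else
        echo "$name: not six characters"
    fi
done
```

Both README and dict.t pass, which shows that the pattern counts a dot as just another character.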

The only problem is that "?" matches any character, including dots and other punctuation. If we really only want to match files containing upper-case letters and numbers, then we have to use an expression like "[A-Z0-9]". And just as with FindStr, the glob patterns that find's -name option accepts won't let us write a repeat count like "[A-Z0-9]{6}". So we're left with:

$ find testing/ -type f -name '[A-Z0-9][A-Z0-9][A-Z0-9][A-Z0-9][A-Z0-9][A-Z0-9]'
[...]
testing/Mac-PropertyList-1.33/README
testing/Mac-PropertyList-1.33/examples/README
[...]
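
If typing the bracket expression six (or eight) times feels error-prone, one option is to let the shell build the pattern. This is only a sketch: repeat_class is a hypothetical helper name, and it assumes you want the same [A-Z0-9] class used above.

```shell
# repeat_class N: print the bracket expression [A-Z0-9] repeated N times.
# (Hypothetical helper for illustration; not part of find.)
repeat_class() {
    n=${1:-0}
    out=''
    while [ "$n" -gt 0 ]; do
        out="${out}[A-Z0-9]"
        n=$((n - 1))
    done
    printf '%s\n' "$out"
}
```

It can then be dropped into the find command. Setting LC_ALL=C is an extra precaution on my part, since in some locales a range like A-Z can also match lower-case letters:

$ LC_ALL=C find testing/ -type f -name "$(repeat_class 6)"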

But rather than doing all that typing, let's just use egrep instead:

$ find testing/ -type f | egrep '/[A-Z0-9]{6}$'
[...]
testing/Mac-PropertyList-1.33/README
testing/Mac-PropertyList-1.33/examples/README
[...]

Notice that I'm using "/" at the front of the regex, and "$" at the end to ensure that what I'm matching against is only the final file name component of each line.
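
For the eight character version that the original question started with, the same idea can be wrapped in a small function that takes the length as an argument. Treat this as a sketch rather than a finished tool: find_exact_name is a hypothetical name, it uses "grep -E" (the same thing as egrep), and it assumes no file names contain newlines.

```shell
# find_exact_name DIR N: list regular files under DIR whose final path
# component is exactly N upper-case letters and/or digits.
# (Hypothetical helper for illustration.)
find_exact_name() {
    # "/" pins the match to the start of the last path component and "$"
    # pins it to the end of the line. LC_ALL=C keeps A-Z to ASCII upper case.
    find "$1" -type f | LC_ALL=C grep -E "/[A-Z0-9]{$2}\$"
}
```

Hunting for names like D0F5KLM3 then becomes:

$ find_exact_name testing/ 8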

So there are my two (OK, three) solutions. Score!