Tuesday, June 15, 2010

Episode #99: The .needle in the /haystack

Tim is on the road:

This week I'm at the SANS Penetration Testing & Vulnerability Assessment Summit hanging out with Ed. And no, I don't get any money for saying that. Although, Ed did give me some money to stay away from him. Come to think of it, Hal did the same thing before. It must be that they just can't stand being next to the most handsome one of the trio, and it has nothing to do with my love of onions, garlic, and German Brick Cheese*.

Back in the regular world, I had a bunch of files to review and search, but I didn't have any idea what types of files were in the mix. I whipped up some quick PowerShell to give me an overview of the file types in the directory tree. Once I knew what types of files I was dealing with, I was better able to pick the right tool to review the documents. Here is the command:

PS C:\> ls mydir -Recurse | ? { -not $_.PSIsContainer } | group Extension -NoElement | sort count -desc

Count Name
----- ----
  145 .pdf
   19 .rtf
   16 .doc
    7 .xml
    7 .docx
    4 .xls
    1
    1 .xlsx

We start off by getting a recursive directory listing. The Where-Object cmdlet (alias ?) is used to remove directories from the listing. The PSIsContainer property is a good way to differentiate files from folders: directories are containers and files aren't. Next, we use Group-Object (alias group) to group based on file extension. The NoElement switch tells the Group-Object cmdlet not to include the collection of all the file objects in the output. Finally, we sort in descending order based on the count in each group. By the way, any parameter or switch name can be shortened as long as it is not ambiguous. We could use -Des for -Descending, but not -D or -De, since those would match both -Descending and -Debug.
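
For example, here is the same pipeline with a few more parameter names shortened (just an illustration; -NoE and -Des are unambiguous abbreviations for -NoElement and -Descending):

PS C:\> ls mydir -Recurse | ? { -not $_.PSIsContainer } | group Extension -NoE | sort Count -Des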

I have to say, I have a bit of envy for the Linux "file" command. Still, since Windows relies so heavily on file extensions, extension-based identification typically works well unless someone is trying to hide something.

Let's see what Ed and Hal have cooking.

*Warning: Never try German Brick Cheese, it tastes like sewage smells. It makes Limburger smell like flowers. Seriously, don't try it. I bought some in college as a joke and we could smell it through the Ziploc bag in the fridge. Bleh! Oh, and sorry to Eric and Kevin for tricking you into trying it.

Ed's On the Road Too
So, like, when Tim initially proposed this article, he was all like, “Yeah, just count the number of files of a given file suffix on a partition. This will be hard for Ed.” And, I was like, “Uh… Dude… as if. I mean, just totally do this:

C:\> dir /b /s C:\*.ini | find /c /v ""

And you’ll have the total number of ini files. Lather, rinse, and repeat for any other suffix, ya know.”

Tim responded, “Oh yeah. Never mind. I’ll write my part first.”

AND THEN… my esteemed colleague unfurls something that automatically figures out which extensions are on the partition and creates a beautiful summary of their counts. I’m not saying that he set me up. But, well, come to think of it, I think he set me up. It’s a good thing that I’m not very busy this week, or else that would have been a problem.

Well, Mr. Medin, if that is your real name, my German-brick-cheese-eating friend, put this in your cmd.exe pipe and smoke it:
C:\> cmd.exe /v:on /c "set directory=c:\windows\system32& (for /f "delims=" %i in 
('dir /a-D /L /s /b !directory!') do @set name=%i& echo !name:~-4,4! >>
c:\suffix.txt) & sort c:\suffix.txt > c:\sortsuf.txt & (set previous= & for /f
%j in (c:\sortsuf.txt) do @set current=%j& if NOT !current!==!previous! (echo
%j >> c:\uniqsuf.txt) & set previous=!current!) & for /f %k in (c:\uniqsuf.txt)
do @echo %k: & dir /b /s !directory!\*%k | find /c /v """ & del c:\suffix.txt
c:\sortsuf.txt c:\uniqsuf.txt
.acm:
10
.acs:
1
.bak:
2
.bat:
1
.bin:
4
.bmp:
1
.btr:
1
.bud:
2
---SNIP---
To make this whole shebang go, all ya have to do is put the appropriate directory in the “set directory=” part. Note that there is no space between the directory name and the &, which is important. When I first showed that command to Tim, he responded, “That, is art. Not so much Rembrandt as Salvador Dali.” You know, I’ve been considering growing one of those Dali-style mustaches.

As for the command itself, I think this is all pretty self-explanatory. Right? I mean, it just kinda rolls off the fingers and is the obvious way to do this. Easy as pie.

Well, if you insist, I’ll give you a synopsis of what’s going on here, followed by the details. My command can be broken into three phases, along with a preamble up front and a clean-up action at the end. First, I isolate the last four characters of each file name (which should be the suffix, letting me catch stuff like “.xls” and “xlsx”), storing the result in a file called c:\suffix.txt. Second, I jump into my uniquifier (virtually identical to the uniq command I implemented in Episode #91), which sorts the c:\suffix.txt file and plucks out the unique entries. Third, I go through each of my unique suffixes and count the number of files that have each given suffix. There is a bit of a downside: if a file doesn’t have a three-or-four-character suffix, my command will give a “File Not Found” message, but that’s not so bad.

Those are the highlights of what I’ve wrought. Let’s jump into the details. In my preamble, I start by invoking delayed variable expansion (cmd.exe /v:on /c), because I’m gonna have a metric spit-ton of variables whose values will have to float as my command runs. Next, I set a variable called directory to the directory we’re going to count in. I didn’t have to do it this way, but without it, our dear user would have to type the directory name in a couple of different places. We care about human factors and ease of use here at CommandLineKungFuBlog. The directory name is immediately followed by an &, without a space. That way, the variable value itself won’t pick up an extra space at the end.
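
If you’re curious why that missing space matters, here’s a quick illustration (using a made-up c:\temp directory); the brackets make the stray trailing space visible:

C:\> cmd.exe /v:on /c "set directory=c:\temp & echo [!directory!]"
[c:\temp ]

C:\> cmd.exe /v:on /c "set directory=c:\temp& echo [!directory!]"
[c:\temp]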

With that preliminary planning done, I move into Phase 1. I have a FOR /F loop, with the default parsing on spaces and tabs turned off (“delims=”) and an iterator variable of %i. I’m iterating over the output of a dir command, with options set to not show directories (/a-D) and to display all file names in lower case (/L). I’ve gotta use the lowercase option here, or else we’d have separate counts for .doc, .DOC, .Doc, and so on. I want to recurse subdirectories (/s) and have the bare form of output (/b) so that I just get full file paths. And, of course, I want to do all of this for !directory!, using !’s instead of %’s for the variable because I want the delayed expanded value of that sucker. In the body of my FOR loop, for each file, I take its name out of the iterator variable (%i) and stick it into the variable “name” so I can do substring operations on it (you can’t do substring operations on iterator variables, so you have to slide them into a regular variable). I then drop the last four characters of the name (!name:~-4,4!, a substring starting at offset -4 from the end of the string and running for 4 characters) into a temporary file called c:\suffix.txt. I’ve now snagged all of my file suffixes.
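
The substring syntax is easier to see in isolation. A quick sketch, with a made-up file name:

C:\> cmd.exe /v:on /c "set name=report.docx& echo !name:~-4,4!"
docx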

In Phase 2, I make a list of unique suffixes, again using the technique I described in detail in Episode #91. I start by sorting my suffix list into a separate temporary file (sort c:\suffix.txt > c:\sortsuf.txt). I then create a variable called previous, which I set to a space just to get started (set previous= ). Then I have a FOR /F loop, which iterates over my sorted suffixes using an iterator variable of %j: “for /f %j in (c:\sortsuf.txt)”. In the body of my do clause, I store the current suffix (%j) in a variable called current so I can do compares against it. You can’t do compares on iterator variable values, so I’ve gotta tuck %j into the “current” variable. Using an IF statement, I then check to see if my current value is NOT equal to my previous value (if NOT !current!==!previous!). If it isn’t, it means this suffix is unique, so I drop it into a third temporary file, called c:\uniqsuf.txt. I then set my new previous to my current value (set previous=!current!), and iterate. Phase 2 is now done, and I have a list of unique file suffixes.
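
The compare idiom at the heart of that loop looks like this on its own (hypothetical values, just to show the delayed-expansion comparison):

C:\> cmd.exe /v:on /c "set previous=.doc& set current=.pdf& if NOT !current!==!previous! echo unique"
unique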

Finally, in Phase 3, I simply invoke my third FOR /F loop, with an iterator variable of %k, iterating over the contents of my uniqsuf.txt file. For each unique suffix, the do clause of this loop first echoes the suffix name followed by a colon (echo %k: ). Then, I run something very similar to my original plan for this episode: a dir /b /s command to get a bare form of output (one line per file), recursing subdirectories, looking for files with the name of *%k. I pipe that output into a little line counter I’ve used in tons of episodes (find /c /v “”), which counts (/c) the number of lines that do not have (/v) nothing (“”). The number of lines that do not have nothing is the number of lines. The output of the find command is displayed on the screen.
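
Run by hand against the same directory as the sample output above, one iteration of that Phase 3 loop would look something like this (the 10 matches the .acm count we saw earlier):

C:\> dir /b /s c:\windows\system32\*.acm | find /c /v ""
10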

After this finishes, I’ve got some clean-up to do. I use the del command to remove the three temporary files I’ve created (c:\suffix.txt, c:\sortsuf.txt, and c:\uniqsuf.txt). And, voila! I’m done.

See, I told you it was straightforward!

For once Hal isn't on the road

While Tim and Ed are whooping it up in Baltimore, I'm relaxing here in the Fortress of Solitude. They're killing brain cells partying it up with all the hot InfoSec pros, while I'm curled up with my Unix command-line to keep me company. No sir, I sure don't envy them one bit.

Since Tim mentions the file command, I suppose I better discuss why I didn't use it for this week's challenge. The problem with file for this case is that the program is almost too smart:

$ file index.html 01_before_pass.JPG Changelog.xls For508.3_4.*
index.html: HTML document text
01_before_pass.JPG: JPEG image data, JFIF standard 1.01
Changelog.xls: CDF V2 Document, Little Endian, Os: Windows, Version 6.0, Code page: 1252,
Author: Kimie Reuarin, Last Saved By: Robin, Name of Creating Application: Microsoft Excel,
Last Printed: Wed Aug 13 21:22:28 2003, Create Time/Date: Mon Aug 11 00:16:07 2003,
Last Saved Time/Date: Wed Jan 6 00:27:30 2010, Security: 0
For508.3_4.pptx: Zip archive data, at least v2.0 to extract
For508.3_4.pdf: PDF document, version 1.6

The output of file gives me a tremendous amount of information about each type of file. However, the output is so irregular that it would be difficult to sort all of the similar file types together.

So I'm going with file extensions, just like the Windows guys. First, you can easily look for a specific extension just by using find:

$ find ~/Documents -type f -name \*.jpg
/home/hal/Documents/My Pictures/IMAGE_00004.jpg
/home/hal/Documents/My Pictures/kathy-web-cropped.jpg
/home/hal/Documents/My Pictures/hal-headshot.jpg
[...]

Here I'm finding all regular files ("-type f") whose name matches "*.jpg". Since the "*" is a special character, it needs to be backwhacked to protect it from being interpolated by the shell (I could have used quotes here instead if I had wanted).
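
For instance, the quoted form works just as well as the backslash:

$ find ~/Documents -type f -name '*.jpg'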

Of course, some of my JPEG files might be named ".jpeg" or even ".JPG" or ".JPEG", so perhaps some egrep is in order:

$ find ~/Documents -type f | egrep -i '\.jpe?g$'
[...]

But the real challenge here is to enumerate all of the file extensions under a given directory. I'm able to extract the extensions using a little sed fu:

$ find ~/Documents -type f | sed 's/.*\.\([^\/]*\)$/\1/'
pdf
pdf
pdf
doc
gif
db
[...]
/home/hal/Documents/Manuals/aaa14612
[...]

The first part of the sed regex, ".*\.", matches everything up to the last dot in the pathname, because the "*" operator is "greedy" and will consume as many characters as possible while still allowing the regex to match. Then the remainder, "\([^\/]*\)$", matches all non-slash characters up to the end of the line. I specifically wrote the expression this way so I wouldn't match things like "/fee/fie/fo.fum/filename". We use the sed substitution operator ("s/.../\1/") to replace the file name we get as input with the extension that we matched in the "\(...\)" grouping operator.
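
To see both behaviors at once, here's a quick sanity check with a couple of made-up paths: the first has a real extension, the second only has a dot in a directory name, so the regex refuses to match it:

$ echo '/fee/fie/fo.fum/filename.tar.gz' | sed 's/.*\.\([^\/]*\)$/\1/'
gz
$ echo '/fee/fie/fo.fum/filename' | sed 's/.*\.\([^\/]*\)$/\1/'
/fee/fie/fo.fum/filename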

The only problem is that some of my files don't have extensions, or any dot at all in the file name. In that case the regex doesn't match, the substitution doesn't happen, and you just get the full, unaltered file path as output, as you saw above. So what I'm going to do is add another sed expression that simply changes any file names containing "/" to just be "other":

$ find ~/Documents -type f | sed 's/.*\.\([^\/]*\)$/\1/; s/.*\/.*/other/'
pdf
pdf
pdf
doc
gif
db
[...]
other
[...]

At this point, getting the summary by file extension is just a matter of a little sort and uniq action:

$ find ~/Documents -type f | sed 's/.*\.\([^\/]*\)$/\1/; s/.*\/.*/other/' | \
sort | uniq -c | sort -nr

1156 jpg
877 ppt
629 doc
315 html
213 other
[...]
56 JPG
[...]
6 html~
[...]
1 html?openidserver=1
[...]

Here I'm using the first sort to group all the extensions together, then counting them with "uniq -c", and finally doing a reverse numeric sort of the counts ("sort -nr") to get a nice listing.

As you can see, however, there are a few problems in the output. First, I'm counting "jpg" and "JPG" files separately, when they should probably be counted as the same. Also, there are some file extensions with funny trailing characters that should probably be filtered off. The fix for the first problem is to just use tr to fold everything to lowercase before processing. Fixing the second problem can be done by adjusting our first sed expression a bit:

$ find ~/Documents -type f | tr A-Z a-z | \
sed 's/.*\.\([a-z0-9]*\)[^\/]*$/\1/; s/.*\/.*/other/' | \
sort | uniq -c | sort -nr

1212 jpg
878 ppt
631 doc
322 html
213 other
[...]

Now, inside the "\(...\)" grouping in my sed expression, I'm explicitly matching only alphanumeric characters (I only have to match lower-case letters here because tr has already shifted all the upper-case characters to lower-case). Everything else after the alphanumeric characters just gets thrown away. Note that when I'm matching "everything else", I'm still being careful to match only non-slash characters.

I realize the sed expressions end up looking pretty gnarly here. But it's really not that difficult if you build them up in pieces. Other than that, the solution is nice and straightforward, and uses idioms that we've seen in plenty of other Episodes.

For those of you who don't like all the sed-ness here, loyal reader Jeff Haemer suggests the following alternate solution:

$ find ~/Documents -type f | while read f; do echo ${f##*/*.}; done | grep -v / | 
sort | uniq -c | sort -nr

1156 jpg
877 ppt
629 doc
321 html
147 pdf
[...]

The trick here is the "${f##*/*.}" construct, which strips the longest prefix of "$f" that matches the shell glob "*/*.". The "##" in the middle of the expression means "match as much as possible", so it emulates the greedy "maximal matching" behavior that we were relying on in our sed example.
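
In isolation (with a made-up path), the expansion behaves like this:

$ f=/home/hal/Documents/archive.tar.gz
$ echo ${f##*/*.}
gz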

You'll notice that Jeff's example doesn't do the fancy mapping to "other" for files that don't have an extension. Here he's just using "grep -v" to filter out any pathnames that end up still having a slash in them. We could use a little sed to fix that up:

$ find ~/Documents -type f | while read f; do echo ${f##*/*.}; done | 
sed 's/.*\/.*/other/' | sort | uniq -c | sort -nr

1156 jpg
877 ppt
629 doc
321 html
214 other
[...]

Jeff's code also doesn't deal with the "funny trailing characters" issue, but that's not a huge deal here. Nice work, Jeff!