Monday, June 8, 2009

Episode #46: Counting Matching Lines in Files

IMPORTANT ANNOUNCEMENTS!
As much as we enjoy beating each other silly with our command line kung fu, we are going to tweak things around here on this blog a bit. Instead of our relentless 3-times per week posting, we're going to move to a once per week posting rate here. We'll have new fu for you each Tuesday, 5 AM Eastern. That way, you'll be able to schedule a weekly date with our hearty band of shell warriors. You could set aside lunch every Tuesday with us... or Wednesday... or an hour each weekend. Or, some evening during the week, you could cook a nice meal, light up some soft candles, put on some romantic music, and then enjoy your meal spending time reading our weekly missive. Yeah! That's the ticket!

Oh, and one more little announcement. I'll be teaching my Windows Command-Line Kung Fu course via the SANS@Home system in a couple of weeks (June 22 and 23). The class runs on-line from 7 until 10 PM EDT, and I'll be your live instructor. It's a fun class that is designed to give security and sys admin people practical advice for using the Windows command-line to do their jobs better. It's not a regurgitation of what's on this blog, but instead gets really deep into useful commands such as WMIC, sc, tasklist, and much more. If you like this blog, the course is ideal for you.

The course normally costs about a thousand dollars live and $825 via @Home. But, SANS is offering a big discount for friends and readers of this blog. The course is $412.50 if you use the discount code of "Foo". Sign up at https://www.sans.org/athome/details.php?nid=19514

Thanks!
--Ed Skoudis.

And now... back to your regularly scheduled fu... Here's a fun episode for you!

Hal's back at it again:

I had another one of those counting problems come up recently, similar to our earlier Browser Count Torture Test challenge. This time my customer needed me to count the number of instances of a particular string in each of several dozen files in a directory. In my case I was looking for particular types of test cases in a software regression test suite, but this is also useful for looking for things like IP addresses in log files, vulnerabilities in assessment tool reports, etc.

For a single file, it would be easy enough to just:

$ grep TEST file1 | wc -l
11

But we want to operate over a large number of files, which means we somehow need to associate the name of the file with the output of "wc -l".

So I created a loop that does the main part of the work, and then piped the output of the loop into awk for some pretty-printing:

$ for f in *; do echo -n "$f  "; grep TEST "$f" | wc -l; done | \
awk '{t = t + $2; print $2 "\t" $1} END {print t "\tTOTAL"}'

11 file1
8 file2
14 file3
31 file4
12 file5
7 file6
3 file7
25 file8
19 file9
22 file10
19 file11
22 file12
10 file13
203 TOTAL

Inside the loop we're first spitting out the filename and a couple of spaces, but no newline. This means that the output of our "grep ... | wc -l" will appear on the same line, immediately following the filename and the spaces.
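Here's a minimal sketch of just that loop body, pulled out on its own so you can see the name and the count landing on one line. Two things in it are my assumptions, not part of the original command: the helper name name_and_count is made up for illustration, and printf stands in for "echo -n" because not every shell's echo understands -n.

```shell
#!/bin/sh
# Sketch only: name_and_count is a hypothetical helper name.
# printf replaces "echo -n" here (an assumption made for portability).
name_and_count() {
    pattern=$1
    shift
    for f in "$@"; do
        [ -f "$f" ] || continue            # skip directories and other non-files
        printf '%s  ' "$f"                 # file name, two spaces, no newline
        grep -- "$pattern" "$f" | wc -l    # the count prints on the same line
    done
}

# Demo on throwaway sample files
demo_dir=$(mktemp -d)
printf 'TEST one\nTEST two\nother\n' > "$demo_dir/file1"
printf 'nothing here\n' > "$demo_dir/file2"
(cd "$demo_dir" && name_and_count TEST *)
rm -rf "$demo_dir"
```

Note that some versions of wc pad the count with leading spaces, which is one more reason to tidy things up with awk afterwards.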

The only problem I had with the basic loop output was that the file names had very irregular lengths (unlike the sample output above) and it was difficult to read the "wc -l" data because it wasn't lined up neatly in a column. So I decided to do some post-processing with awk. The main part of the awk code keeps a running total of the values we've read in so far (you saw me using this idiom previously in Browser Count Torture Test). But you'll also notice that it reverses the order of the two columns and also inserts a tab to make things line up nicely ('print $2 "\t" $1'). In the "END" block we output the "TOTAL" once the entire output from the loop has been processed.

I love the fact that the shell lets me pipe the output of a loop into another tool like awk for further processing. This lets me grind up a bunch of data from many different sources into a single stream and then operate on this stream. It's an idiom I use a lot.
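For comparison, here is a sketch of a different route to the same report, one that is not what the loop above does: grep's -c option prints the count of matching lines itself, and with more than one file it prefixes each count with "filename:". The helper name count_matches is hypothetical, and the sketch assumes file names contain no colons.

```shell
#!/bin/sh
# Sketch only: count_matches is a hypothetical helper name.
# Uses "grep -c" instead of the "grep | wc -l" loop.
count_matches() {
    pattern=$1
    shift
    # The extra /dev/null argument makes grep print "filename:count" even
    # when only one real file is given; awk then drops the /dev/null line.
    grep -c -- "$pattern" "$@" /dev/null |
        awk -F: '$1 != "/dev/null" {t += $2; print $2 "\t" $1}
                 END {printf "%d\tTOTAL\n", t}'
}

# Demo on throwaway sample files
demo_dir=$(mktemp -d)
printf 'TEST a\nTEST b\nother\n' > "$demo_dir/file1"
printf 'TEST c\n' > "$demo_dir/file2"
printf 'nothing here\n' > "$demo_dir/file3"
(cd "$demo_dir" && count_matches TEST *)
rm -rf "$demo_dir"
```

The demo should report counts of 2, 1, and 0 for the three files, followed by a TOTAL of 3. The trade-off is one grep process for the whole directory instead of one per file, at the cost of the colon assumption.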

Paul Chimes In:


That's some pretty sweet command kung fu! When I first read this I immediately put it to good use, with some modifications of course. I frequently find myself needing to search through 28,000+ files and look for certain strings. My modifications are as follows:

$ for f in *; do echo -n "$f "; grep -i xss "$f" | wc -l; done | awk '{t = t + $2; print $2 "\t" $1} END {print t "\tTOTAL"}' | egrep -v '^0' | sort -n

I really didn't care about files that did not contain at least one occurrence of my search string, so I sent the output to egrep with "-v", which shows only the lines that do NOT match the pattern. My regular expression "^0" matches lines that begin with 0, so combined with "-v" it removes all lines that begin with 0 (that is, every file with a count of zero). Now, I could have used a filter with awk, but the syntax was not cooperating (i.e. awk /[regex]/ {[code]}). Then I wanted to see a sorted list so I ran it through "sort -n".
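For the curious, here is a rough sketch of what the awk-only filter could look like. It is not the command above: the helper name nonzero_counts is hypothetical, it reads the "filename count" lines that the loop produces on stdin, and it totals after sorting so the TOTAL line always ends up last.

```shell
#!/bin/sh
# Sketch only: nonzero_counts is a hypothetical helper name.
# Input:  lines of "filename count" (file names without whitespace).
# Output: "count<TAB>filename" for counts above zero, sorted, then TOTAL.
# The bare awk pattern ($2 > 0) does the job of "egrep -v '^0'".
nonzero_counts() {
    awk '$2 > 0 {print $2 "\t" $1}' |
        sort -n |
        awk '{t += $1; print} END {printf "%d\tTOTAL\n", t}'
}

# Demo with canned loop output
printf 'file1 3\nfile2 0\nfile3 1\n' | nonzero_counts
```

With the canned input, file2 drops out, file3 sorts ahead of file1, and the TOTAL is 4.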

Ed retorts:
Gee, 28,000 files, Paul? Where did ya get that number? Sounds suspiciously like... I dunno... Nessus plug-ins. But, I digress.

OK, Sports Fans... Hang on to your hats, because I'm gonna match Hal's functionality here in cmd.exe, and it's gonna get ugly. Real ugly. But, when we're done, our command will do what Hal wants. And, in the process, it'll take us on an adventure through some interesting and useful features of good ol' cmd.exe, tying together a lot of fu that we've used in piece-parts in previous episodes. It's gonna all come together here and now. Let's dive in!

We start out simple enough:

C:\> find /c "TEST" * 2>nul | find /v ": 0"
---------- FILE1: 11
---------- FILE2: 8
---------- FILE3: 14

Here, I've used the /c option of the find command to count the number of lines inside of each file in my current directory that have the string "TEST". I throw away error messages (2>nul) to avoid cruft about directories in my output. I do a little more post processing by piping my output into find again, to search for lines that do not have (/v) the string ": 0" in them, because we don't want to display files that have our string in them zero times.

That's pretty close to what we want right there. So, we could call it a day and just walk away.

But, no.... we're kinda nuts around here, if you haven't noticed. We must press on to get closer to Hal's insanity.

The --------- stuff that find /c puts in our output is kinda ugly. Let's get rid of that with a little parsing courtesy of FOR /F:

C:\> for /f "delims=-" %i in ('"find /c "TEST" * 2>nul | find /v ": 0""') do @echo %i
FILE1: 11
FILE2: 8
FILE3: 14

Here, I'm using a FOR /F loop to parse the output of my previous command. I'm defining custom-parsing with a delimiter of "-" to get rid of those characters in my output.

Again, we could stop here, and be happy with ourselves. We've got most of what Hal wants, and our output is kinda pretty. Heck, our command is almost typable.

But we must press on. Hal's got totals, and we want them too. We could do this in a script, but that's kinda against our way here, as we strive to do all of our kung fu fighting in single commands. We'll need to add a little totaller routine to our above command, and that's where things are going to get a little messy.

The plan will be to run the component we have above, followed by another command that counts the total number of lines that have TEST in them and displays that total on the screen. We'll have to create a variable called total that we'll track at each iteration through our new counting loop. The result is:

C:\> (for /f "delims=-" %i in ('"find /c "TEST" * 2>nul | find /v ": 0""') do @echo %i) & set total=0 & (for /f "tokens=3" %a in ('"find /c "TEST" * 2>nul"') do @set /a total+=%a > nul) & echo. & cmd.exe /v:on /c echo TOTAL: !total!

FILE1: 11
FILE2: 8
FILE3: 14

TOTAL: 33

Although what I'm doing here is probably obvious to everyone except Hal and Paul (yeah, right!), please bear with me for a little explanation. You know, just for Hal and Paul.

I've taken my original command from above and surrounded it in parens (), so that it doesn't interfere with the new totaller component I'm adding. My totaller starts by setting an environment variable called total to zero (set total=0). I then add another component in parens (). These parens are very important, lest the shell get confused and blend my commands together, which would kinda stink as my FOR loops would bleed into each other and havoc would ensue.

Next, I want to get access to the line count output of my find /c command to assign it to a variable I can add to my total. In cmd.exe, if you want to take the output of a command and assign its value to a variable, you can use a FOR /F loop to iterate on the output of the command. I do that here by running FOR /F to iterate over "find /c "TEST" * 2>nul". To tell FOR /F that my command is really a command, I have to wrap it in single quotes (' '). But, because my command has special characters in it (the > in particular), I have to wrap the command in double quotes too (" "). The result is wrapped in single and double quotes (' " " '), a technique I use a lot such as in Episodes #34 and #45. My FOR /F loop is set to tokenize around the third element of output of this command, which will be the line count I'm looking for (default FOR /F parsing occurs on spaces as delimiters, and the output of ----- [filename]: [count] has the count as the third item).

Thus, %a now holds my interim line count of the occurrences of TEST for a given file. I then bump my total variable by that amount (set /a total+=%a) using the set /a command we discussed in "My Shell Does Math", Episode #25. I don't want to display the results of this addition on the output yet, so I throw them away (> nul). When my adding loop is done (note that all-important close paren), I then echo a blank line (echo.).

Now for the ugly part. I want to display the value of my total variable. But, as we've discussed in previous episodes, cmd.exe does immediate variable expansion. When you run a command, your environment variables are expanded to their values right away. Thus, if I were to simply use "echo %total%" at the end here, it would display the total value that existed when I started the command, if such a value was even defined. But, we want to see the total value after our loop finishes running. For this, we need to activate delayed environment variable expansion, a trick I used in Episode #12 in a slightly different way.

So, with my total variable set by my loop, followed by an extra carriage return from echo. to make things look pretty, I then invoke another cmd.exe with /v:on, which enables delayed variable expansion. I ask that cmd.exe to run a command for me (/c), which is simply displaying the word TOTAL followed by the value !total!. But, what's with the bangs? Normal variables are expanded using %var%, not !var!. Well, when you use delayed variable expansion, you get access to the variable's value using !var!. The bangs are an artifact of delayed variable expansion.

And, for the most part, we've matched Hal's functionality. Our command reverses the file name and counts from Hal's fu, although we could go the other way if we want with some additional logic. I prefer filename first myself, so that's what we'll go with here.

And, our descent into insanity is pretty much done for now. :)