Wednesday, May 20, 2009

Episode #38: The Browser Count Torture Test

Hal just can't resist:

One of my customers was interested in some stats on which browsers were hitting their web site most often. My first thought was to use a Perl script to parse the user agent strings out of the Apache access_log files. But, hey, I write for Command Line Kung Fu, so how far can I get just using standard Unix tools? Besides, trying to do the same thing in the Windows shell will be a pain in the... humiliating defeat... interesting learning experience for Ed, and I'm committed to the growth of my friends.

First let me show you what I came up with, then I'll explain it in detail:

# grep 'POST /login/form' ssl_access_log* | \
sed -r 's/.*(MSIE [0-9]\.[0-9]|Firefox\/[0-9]+|Safari|-).*/\1/' | \
sort | uniq -c | \
awk '{ t = t + $1; print} END { print t " TOTAL" }' | \
sort -nr | \
awk '/TOTAL/ { t = $1; next }; { $1 = sprintf("%.1f%%\t", $1*100/t); print}'

46.4% Firefox/3
27.0% MSIE 7.0
14.3% Safari
5.3% MSIE 6.0
3.0% Firefox/2
2.4% -
1.2% MSIE 8.0
0.3% Firefox/1

Here's the line-by-line interpretation:


  1. I didn't want to count every single page access, but was instead more interested in counting browsers by "user session". Since the site requires a user login for access, I used posting the secure login form as a proxy for recognizing individual sessions. Close enough for jazz.

  2. Now, to pull out the browser name/version from the user agent string. The data here is annoyingly irregular, so it looks like a good task for sed. Notice that I'm using the "-r" option in GNU sed to use extended regular expression syntax: not only does this allow me to use "|" in the regexp, but it also means I don't need to backwhack my parens to create sub-expressions.

    The regular expression itself is interesting. I'm creating a sub-expression match on either "MSIE <vers>.<sub>", "Firefox/<vers>", or "Safari" (I don't find tracking Firefox sub-versions or Safari version numbers that interesting, but as always "your mileage may vary"). Anything that doesn't match one of these browser patterns ends up matching a hyphen ("-") character, which are plentiful in Apache access_log entries.

    I place ".*" before and after the sub-expression, which matches the rest of the line before and after the browser string. However, since that text is not included in the sub-expression, when I replace the matching line with the sub-expression then the rest of the text is dropped. That leaves us with an output stream of just the browser info, or "-" for lines that don't match one of the major browsers we're tracking.

  3. Now that we've got a data stream with the browser info, it's time to count it. "... | sort | uniq -c" is the common idiom for this, and we end up with output like:

        290 -
    34 Firefox/1
    363 Firefox/2
    5534 Firefox/3
    632 MSIE 6.0
    3207 MSIE 7.0
    139 MSIE 8.0
    1708 Safari

  4. The next line is a common awk idiom for totalling a column of numbers. We print out each line as it's processed, but also keep a running total in the variable "t". After all the input has been processed, we use an "END" block to output the total. Now our output looks like:

        290 -
    34 Firefox/1
    363 Firefox/2
    5534 Firefox/3
    632 MSIE 6.0
    3207 MSIE 7.0
    139 MSIE 8.0
    1708 Safari
    11907 TOTAL

  5. The next "sort -nr" not only puts our data into numerically sorted order, but also has the side-effect of moving the "TOTAL" column up to the first line of output. We're going to make use of this in the awk expression on the next line.

  6. The last awk expression is a little psychotic, so let's take it piece by piece. The first section, "/TOTAL/ { t = $1; next }", matches our initial "TOTAL" line and puts the total number of entries into the variable "t". The "next" causes awk to skip on to the next line without printing the current line ("TOTAL").

    The other portion of the awk code will handle all of the other lines in the output. What we're doing here is replacing the raw count number in the first column with a percentage. The "sprintf(...)" format string looks a little weird, but it means a floating point value with one decimal place ("%.1f"), followed by a literal percent character ("%%"), followed by a tab ("\t"). The numeric value we plug in is the raw count from column 1 of the output, times 100, divided by the "TOTAL" value we extracted from the first line of output.


And there you have it. The agonized squealing you're hearing is Ed wondering how he's going to even get close to this in the Windows shell. I can't wait to see what he comes up with.

Ed responds:
Wow! That's some serious fu there, Hal. And, I mean both serious and fu.

Thanks for the interesting learning opportunity, kind sir. How delightful!

As you know, we're kinda hampered with cmd.exe in that we get regex support from findstr, which cannot do extended regular expressions like sed -r. Therefore, we cannot do the funky "|" in the regex. Our resulting command will have to include more piece-parts for each browser.

And, as we discussed in Episode # 25: My Shell Does Math, we have access to simple integer math in cmd.exe via "set /a", but floating point and division cause problems.

Still, we can get some useful output that tells us the number of each kind of browser and a handy total like this:

C:\> echo MSIE > browser.txt & echo Firefox >> browser.txt & echo Safari
>> browser.txt & echo - >> browser.txt & (for /f %i in (browser.txt) do
@echo %i & type ssl_access_log | find "POST /login/form" | find /c "%i" & echo.)
& del browser.txt
MSIE
873

Firefox
1103

Safari
342

-
2327

In this command, I'm first building a little file called browser.txt containing the different browsers strings that I'd like to count. I'll then iterate over that file using a FOR /F loop. I'd much rather do this by iterating over a string containing "MSIE FIREFOX SAFARI -", but unfortunately, FOR /F parses strings into a series of variables all in one FOR /F iteration, making it useful for parsing a string into different variables (like %i %j %k, etc.). But, FOR /F used with a string does not pull apart a string into pieces that vary at each iteration through the loop. Boo, FOR /F! So, we compensate by building a little file with one browser per line, and then we iterate over that.

For each browser in browser.txt, we display the browser name (echo %i), and scrape through our ssl_access_log using the plain old find command to look for lines with "POST /login/form". I then take the output of that, pipe it through find with a /c option to count the number of occurrences of the %i iterator, which is the name of each browser. Note that the - will total all browsers, since their log entries have a dash in them. After my little looping escapade, I delete the temporary browser.txt file that I created at the beginning.

The output, while not as beautiful as Hal's, still is useful -- you see the number of POST login actions per browser, and the total. Why, you could even add a little "& calc.exe" at the end to pop up a calculator to do your percentages. :)