Tuesday, April 20, 2010

Episode #91: How Much Per Day?

Hal has an issue

In another instalment of "True Stories of the Shell Patrol", here's a useful little bit of shell fu I recently threw together for a customer. This customer had an application that wrote copious log files, automatically generating a new log file for each day. The trick was that if an individual log file reached 10M in size, the app would start a new log file. So some days there would be just a log file named "YYYYMMDD.log", and yet on other days there would be not just "YYYYMMDD.log" but also "YYYYMMDD-01.log", "YYYYMMDD-02.log", and so on.

The customer was interested in knowing how many bytes of logs the app had written each day. I thought about it for a minute and came up with:

$ for d in `ls | cut -c1-8 | uniq`; do
 echo -en "$d\t"
 cat $d* | wc -c
done
20100414 9110137
20100415 23232501
20100416 34485619
20100417 6052615
...

Here we're simply using cut to pull off the date strings from the front of each file name, and then using uniq to filter out the duplicates that we'll get from days where there are multiple log files. Inside the loop, we use echo to output the date string and a tab. The "-e" option gets echo to recognize "\t" as a tab, and the "-n" means "don't output a newline", so that way our byte count appears on the same line as the date string. Next we push all of the files for the given date through "wc -c" to count the number of bytes for each day. Note that "wc -c $d*" would not work, since this would give us the byte counts for the individual log files for each day (plus a "total" line when there is more than one file) rather than a single number per day.
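
If you want to poke at the loop without a directory full of real logs, here is a minimal sketch of the same idea wrapped in a function. The name bytes_per_day and the optional directory argument are my own inventions for illustration, and I've swapped "echo -en" for printf, which behaves more consistently across shells.

```shell
# Sketch only: bytes_per_day is a hypothetical helper name, and the
# optional directory argument is an assumption (the original loop runs
# in the current directory).
bytes_per_day() (
    cd "${1:-.}" || exit 1
    # ls output is sorted, so files for the same day are adjacent and
    # uniq collapses them to one date prefix per day.
    for d in $(ls | cut -c1-8 | uniq); do
        # cat joins every file for the day, so wc -c reports one total.
        # tr strips the padding that some wc implementations add.
        total=$(cat "$d"* | wc -c | tr -d ' ')
        printf '%s\t%s\n' "$d" "$total"
    done
)
```

Calling "bytes_per_day /path/to/logs" should print one tab-separated "date, bytes" line per day, just like the loop above.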

We ended up getting interested in the top ten days by log file size. But once I had created the loop above, answering the "Top 10" question was easy:

$ for d in ... done | sort -nr -k2 | head

Just use our original loop and pipe the output to sort. Here we're doing a descending ("reversed" or "-r") numeric ("-n") sort on the second column ("-k2"). The head command pulls off the first 10 lines.
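
As a sketch, the sort-and-head stage can be pulled out into its own little filter. The name top_days and the optional count argument are hypothetical; the function just expects "date, bytes" lines on standard input, whatever produced them.

```shell
# Sketch only: top_days is a hypothetical helper name. It reads
# "date<whitespace>bytes" lines on stdin and keeps the N largest days
# (default 10, matching head's default).
top_days() {
    # -k2,2nr: sort on the second field only, numerically, descending.
    sort -k2,2nr | head -n "${1:-10}"
}
```

For example, "for d in ... done | top_days 3" would show just the three biggest days.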

Ah, so easy for the Unix shell! CMD.EXE and PowerShell? Let's see what the guys can come up with...

Tim has lots of issues

I've had to do the same thing a few times. This command can be very helpful. Here is the command we can use in PowerShell:

PS C:\> ls | group {$_.Name.Substring(0,8)} | select Name,@{Name="Size";
Expression={ ($_.group | Measure-Object Length -Sum).Sum }}

Name           Size
----        -------
20100414    9110137
20100415   23232501
20100416   34485619
20100417    6052615
...

We take the directory listing and pipe it into Group-Object where the grouping is based on the first eight characters of the Name property. To get the first eight characters we use the Substring method. The Substring method is available on any string, and the Name property is a string.

The groups are piped into Select-Object where we use a calculated property to compute the size of all the files in the group. To specify a calculated property we create a hash table using the @{} syntax. Inside the curly braces we need two elements: Name and Expression. Inside the Expression's script block we pipe the group into Measure-Object where the Length property is summed. We now have a new property named Size.
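
For readers coming from the Unix side, the same group-then-sum idea can be sketched with awk standing in for Group-Object and Measure-Object. This is only an illustration of the technique, not a translation of the PowerShell: the name group_sizes and its directory argument are made up for the example.

```shell
# Sketch only: group_sizes is a hypothetical helper name.
# Step 1 emits "first-8-chars<TAB>size" for every regular file.
# Step 2 (awk) groups on the prefix and sums the sizes per group.
group_sizes() (
    cd "${1:-.}" || exit 1
    for f in *; do
        [ -f "$f" ] || continue
        prefix=$(printf '%s\n' "$f" | cut -c1-8)
        size=$(wc -c < "$f" | tr -d ' ')
        printf '%s\t%s\n' "$prefix" "$size"
    done |
    awk -F'\t' '{ sum[$1] += $2 }
                END { for (k in sum) printf "%s\t%.0f\n", k, sum[k] }' |
    sort    # awk's for-in order is unspecified, so sort by prefix
)
```

The awk array plays the role of the groups, and the running "+=" is the Sum.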

While getting the size is a little goofy, retrieving the top 10 is pretty easy. All we need to do is use Sort-Object followed by Select-Object to grab the top 10.

... | sort size -Descending | select -First 10

Now what does Ed have for us?

Ed's Issues Go Far, Far Deeper

When I first saw Hal’s challenge this time around, I winced. Whenever he wants, Hal can pull “uniq” or “wc -c” out of his butt and use them. Me? Well, I don’t have the luxury of such useful commands. And, given our rule around here for only using built-in commands and features, I’ve often gotta make what I don’t have, using only my bare hands, spit, bisquick, duct tape, chewing gum, and copious time. Adhering to the old “teach a person to fish & feed him for life” adage, let me show you how I built the two piece parts (“uniq” and “wc -c”) that I needed to make this one work.

For peeling off the file name’s unique components, we can use the following command:

C:\> cmd.exe /v:on /c "set previous= & for %i in (*) do @set name=%i & 
     set current=!name:~0,8!& if NOT !previous!==!current! (echo !current! & 
     set previous=!current!)"
20100414
20100415
20100416

I start here by invoking a cmd.exe with delayed variable expansion (cmd.exe /v:on /c) so that my variables can change values as my command runs. Then, I clear out a variable called “previous” (set previous= ), where I will later store the previous value of the file name component as I loop through my directory. I then invoke a FOR loop with an iterator variable of %i. Note that I just want to iterate through files in my current directory, so I just use a plain, vanilla FOR loop, without /L, /F, /R, or /D. I’m going to iterate through all files in my current directory (in (*)).

At each iteration through my loop, I turn off command echo (@) and use the set command to store the current file’s name (%i) in a variable called name. I have to do this, because we can’t perform substring operations on iterator variables. I then use set again to do my substring operation, pulling in the first 8 characters of the file name by starting at offset zero and taking 8 characters, storing the results in a variable called “current” (set current=!name:~0,8!&). It’s really important to leave no space between that ! and the &. If you put a space there, that space will show up at the end of your current prefix, and will cause problems later on when we need to measure total file size.

Then, I have an IF statement, checking to see if my previous name value is the same as my current name value. If they are NOT the same, it means I’ve not encountered this name previously, so it’s unique. I then simply echo it out, and set my new value of previous to current. Then, I iterate around. The result is a list of the unique eight-character prefixes of the file names. Note that this approach is dependent on the names being sorted so that file names with the same prefix are adjacent to each other, which the FOR loop does automatically (in alphabetical order).
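
The previous/current comparison is not specific to cmd.exe. Here is a rough sketch of the same logic in Unix shell, which is essentially a hand-rolled “uniq” on the first eight characters. The function name adjacent_unique is hypothetical, and like the cmd.exe version it assumes its input is already sorted.

```shell
# Sketch only: adjacent_unique is a hypothetical helper name.
# Reads file names on stdin (assumed sorted) and prints each
# 8-character prefix the first time it differs from the one before.
adjacent_unique() (
    previous=
    while IFS= read -r name; do
        current=$(printf '%s\n' "$name" | cut -c1-8)
        if [ "$previous" != "$current" ]; then
            printf '%s\n' "$current"
            previous=$current
        fi
    done
)
```

Feeding it “ls | adjacent_unique” should give the same list as “ls | cut -c1-8 | uniq”.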

Now that we’ve spewed out the unique name prefixes, for the next part we have to calculate the total size of everything that starts with that prefix (roughly mimicking “wc -c”, at least focused on file sizes). My first approach to doing this involved simply having the dir command itself calculate this size, as follows:

C:\> cmd.exe /v:on /c "set previous= & for %i in (*) do @set name=%i & 
     set current=!name:~0,8!& if NOT !previous!==!current! (echo !current! & 
     dir !current!* | find "File(s)") & set previous=!current!"
20100414
              3 File(s)      2,080,000 bytes
20100415
              1 File(s)        389,120 bytes
20100416
              2 File(s)         72,704 bytes

Here, after the IF statement of my uniquifier command, I simply echo out the current prefix (echo !current!) followed by a dir command to show the directory listing of !current!*. That should be all of the files that start with that prefix. I then pipe that dir output through the find command to locate the line with the string “File(s)” in it, because that shows the total size of everything that matched the dir wild-card search. I have to follow this up with setting my previous to my current prefix, so that my home-brewed uniq still works. I really kinda like the format of this output, as it shows the file counts plus the full size.

But, we aim to please here, matching Hal’s command output as closely as we can. To get a step closer, I’ll simply do a little parsing on the output of the dir command:

C:\> cmd.exe /v:on /c "set previous= & for %i in (*) do @set name=%i & 
     set current=!name:~0,8!& if NOT !previous!==!current! (for /f "tokens=3" 
     %a in ('dir !current!* ^| find "File(s)"') do @echo !current! %a) & 
     set previous=!current!"
20100414 2,080,000
20100415 389,120
20100416 72,704

Here, I’ve simply placed a FOR /F loop after my uniquifier IF, parsing to pull out the third item (“tokens=3”) into iterator variable %a from that dir command. In my do clause, I display the current prefix followed by the %a value, which is the total size.
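
For comparison only, the same “grab the third token from the File(s) line” parse can be sketched with awk on a Unix box, say against dir output saved to a file. The name dir_total_bytes is hypothetical, and the sketch assumes the summary line looks exactly like the samples above, with commas as thousands separators.

```shell
# Sketch only: dir_total_bytes is a hypothetical helper name.
# Reads saved dir output on stdin, finds the "File(s)" summary line,
# and prints its third whitespace-separated field with commas removed.
dir_total_bytes() {
    awk '/File\(s\)/ { gsub(/,/, "", $3); print $3 }'
}
```

Given the line “3 File(s) 2,080,000 bytes” on standard input, this should print the size with no commas.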

I know what you are thinking… You are thinking that I’ve got commas in my sizes, and Hal doesn’t. Sigh… You always did love Hal more. Ever since we were kids, it was always “Hal, Hal, Hal.” I could never understand why you favored him, Mom. That’s exactly what I told my shrink on the couch in our last session, and… uh… never mind.

Anyway, we can get rid of the commas by taking a different approach to calculating the size. Instead of parsing through dir output, we could use a FOR loop to iterate through file names associated with our prefix, and then use %~za to represent the file size, which we’ll total up. Sounds crazy, I know, but YOU were the one that wanted to lose the commas.

So, without further ado, I give you the command that mimics Hal’s, including his output format:

C:\> cmd.exe /v:on /c "set previous= & for %i in (*) do @set name=%i & 
     set current=!name:~0,8!& if NOT !previous!==!current! (set /a totalsize=0 >nul 
     & (for %a in (!current!*) do @set /a totalsize=!totalsize! + %~za >nul) 
     & echo !current! !totalsize!) & set previous=!current!"
20100414 2080000
20100415 389120
20100416 72704

Here, after my IF statement, I set a variable of totalsize to zero (set /a totalsize=0). I use set /a, because I want to do math here, not string assignment. The set /a command displays its result on Standard Out, which I don’t want to see now, so I throw it away (>nul). I then run a FOR loop, again one that’ll iterate through file names in my current directory. I’ll use an iterator variable of %a here. The files I want to iterate through are my current prefix, followed by a *. At each iteration through the loop, I want to do some math, setting my totalsize to its previous value plus %~za, which will expand into the number of bytes in the file represented by %a.

After that loop is done, I then echo my current prefix plus the totalsize of everything with that prefix.
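
If the running-total trick is easier to follow outside of cmd.exe syntax, here is a loose sketch of the same accumulate-per-file loop in Unix shell. The name prefix_total is hypothetical, and “wc -c < file” stands in for %~za as the way to get one file’s size.

```shell
# Sketch only: prefix_total is a hypothetical helper name.
# Adds up the sizes of every regular file in the current directory
# whose name starts with the given prefix, one file at a time.
prefix_total() (
    totalsize=0
    for f in "$1"*; do
        [ -f "$f" ] || continue
        size=$(wc -c < "$f" | tr -d ' ')
        totalsize=$((totalsize + size))
    done
    printf '%s %s\n' "$1" "$totalsize"
)
```

Run from the log directory, “prefix_total 20100414” should print the prefix followed by the combined byte count of that day’s files.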

Voila! Easy as pie, you see! As long as you’ve got enough bisquick.

Now, my sort in cmd.exe doesn't sort numerically at all, and there's no built-in way to do it. I would likely just dump my results in a file and then open them in a spreadsheet, where I'd do the sort.