Tuesday, November 29, 2011

Episode #163: Pilgrim's Progress

Tim checks the metamail:

I hope everyone had a good Thanksgiving. I know I did, and I sure have a lot to be thankful for.

Today we receive mail about mail. Ed writes in about Rob VandenBrink writing in:

Gents,

Rob VandenBrink sent me a cool idea this morning. It's for printing out a text-based progress indicator in cmd.exe. The idea is that if you have a loop that's doing a bunch of stuff, without any indication to the user, you can just echo a dot on the screen at each iteration of the loop to show that you are still alive and have processed another iteration. The issue in cmd.exe is echoing the dot without a CRLF, so that it goes nicely across the screen. Here's Rob's approach (which uses set /p very cleverly to define a variable, but without using that variable). On a few episodes, I used set /a because of its nice property of doing math without a CRLF. Here, Rob uses set /p to avoid the CRLF.

C:\> for /L %i in (1,1,5) do @timeout 1 >nul & <nul (set /p z=.)
..... <-- Progress Dots


Just replace the timeout command with something useful, and vary the FOR iterator loop to something that makes sense.

Worthy of an episode?


Well Ed, you can tell Rob that it is. Rob, feel free to send more cool suggestions to Ed and Ed can send them to us. I'll pass along what I think is worthy of an episode to Hal, so Rob talk to Ed, Ed talk to Tim, Tim talks to Hal, Hal responds to Tim, Tim responds to Ed, Ed responds to Rob. Unless Ed is unavailable, then we Rob should find Tim who will check for Ed, then...

Ok, we'll work out those details later. We certainly need to keep a strict flow of information, or else it could confusing, and we would hate that.

The trick with this command is using the /P switch with the Set command. The /P switch is used to prompt a user for input. The standard syntax looks like this:

SET /P variable=[promptString]


We are using a dummy variable Z to receive input, and the promptString is our dot. We feed NUL into the Set command so it doesn't hang while waiting for input. Since we didn't provide a carriage return the prompt is not advanced to the next line. That way we can output multiple dots on the same line. To prevent extra spaces between the dots we need to make sure there are no spaces between the dot and the next character, whether it be a closing parenthesis or a input redirection (<).

I typically write it a little differently so it is a little clearer that the NUL is being fed into Set, but the effect is the same.

C:\> (set /P z=.<NUL) & (set /P z=.<NUL)
..


PowerShell

One of the best practices of PowerShell is to write each command so the output can be used as input to another command. This means that dots would mess up our nice object-type output. That's no skin off our back, as we have a cmdlet to keep track of progress for us, Write-Progress. However, it does require a bit of knowledge as to the number of iterations we will go through. Not usually a big deal though, but it may require that some input be preloaded so this calculation can be preformed. There are all sorts of cool things we can do with this cmdlet. Examples of coolness include: multiple progress bars, displaying time remaining, and displaying extra information on the current operation.

 Test
Working
[ooooooooooooooooooooooooooooooooooooo ]

PS C:\> 1..100 | % { sleep 1; Write-Progress -Activity Test -Status Working -PercentComplete $_ }


The Activity and Status parameters are used to provide additional information. The Activity is usually used to provide a high level description of the process and the Status switch is used to describe the current operation, assuming there are multiple. Similar to our the cmd.exe command, replace the sleep 1 with something useful.

The SecondsRemaining parameter can be used to display the estimated time remaining. This time must be calculated by the author of the script or command, and since these calculations are never close to correct, I personally refuse to ever even try to calculate the remaining time. So enough of my rant and back to the task at hand.

Multiple progress bars can be used by using a unique ID for each. The default ID is 0, so we can use 1 for the second progress bar.

 Testing
Outer
[oooooooooooooooooo ]
Testing
Inner
[oooooooooooooooooooooooooooooooooooooooooooooooooooooo ]

PS C:\> 1..100 | % { Write-Progress -Activity Testing -Status Outer -PercentComplete $_;
1..100 | % { sleep 0.5; Write-Progress -Activity Testing -Status Inner -PercentComplete $_ -ID 1 }
}


Another bonus is that the progress bar is displayed at the top of the screen so it doesn't interfere with the most recent output. To make it even better, it disappears after the command has completed. We have a progress display and we don't have any messy output to clean up, awesome!

The cmd.exe output is functional, but not great, and the PowerShell version is really pretty. My bet is Hal and his *nix fu is going to snuggle up between these two. Hal, snuggle away.

Hal emerges from a food coma

Don't even get me started about "snuggling" with Tim. Mostly he just rolls over and goes to sleep, leaving me with the aftermath. He never cares about my feelings or what's important to me...

Oh, sorry. I forgot we were talking command line kung fu here. I'm not going to be getting much "snuggling" on that front either, as it turns out. The Linux options pretty much emulate the two choices that Tim presented on the Windows side.

The portable method uses shell built-ins and looks a lot like the CMD.EXE solution. Here's an example using the while loop from last week's Episode:

paste <(awk '{print; print; print; print}' users.txt) passwords.txt |
while read u p; do
mount -t cifs -o domain=mydomain,user=$u,password=$p \
//myshare /mnt >/dev/null 2>&1 \
&& echo $u/$p works && umount /mnt
(( $((++c)) % 100 )) || echo -n . 1>&2
done >working-passwords.txt

The code I've added in bold face is the part that prints dots to show progress. I've got a variable "$c" that gets incremented each time through the loop. We then take the value of that variable modulo 100. Every hundred iterations, that expression will have the value zero, so the echo statement after the "||" will get executed and print a dot. I use "echo -n" so we don't get a newline after the dot.

Notice also the "1>&2" after the echo expression. This causes the dots coming out of the echo command heading for the standard output to go to the standard error instead. That way I'm able to redirect the normal output of the loop-- the usernames and passwords I'm brute-forcing-- into a file using ">working-passwords.txt" at the end of the loop and still see the progress dots on the standard error.

You can slip this code into any loop you care to. And by adjusting the value on the right-hand side of the modulus operator you can cause the dots to be printed more or less frequently, depending on the size of your input. If you're reading a log file that's hundreds of thousands of lines long, you might want to do something like "... % 10000" so your screen doesn't just fill up with dots. On the other hand, you want the dots to appear frequently enough that it looks like something is happening. You just have to play around with the number until you're happy.

While this approach is very portable and easy to use, it only works inside an explicit loop. There are lots of tasks where we're processing data using a series of commands in a pipeline with no loops at all. For example, there's pipelines like the one from Episode #38:

grep 'POST /login/form' ssl_access_log* | 
sed -r 's/.*(MSIE [0-9]\.[0-9]|Firefox\/[0-9]+|Safari|-).*/\1/' |
sort | uniq -c |
awk '{ t = t + $1; print} END { print t " TOTAL" }'

Oh sure, I could force a loop at the front of the pipeline just to get some dots:

while read line; do
echo $line
(( $((++c)) % 10000 )) || echo -n . 1>&2
done <ssl_access_log* | grep 'POST /login/form' | ...

But let's face it, this is gross, inefficient, and silly. What bash is lacking is a built-in construct like PowerShell's Write-Progress cmdlet.

Happily, there's an Open Source utility called "pv" (pipe viewer) that kicks Write-Progress' butt through the flimsy walls of our command line dojo. Unhappily, it's not a built-in utility, so strictly speaking it's not allowed by the rules of our blog. But sometimes it's fun to bring a bazooka to a knife fight.

In it's simplest usage, pv just replaces the silly while loop that I forced onto the beginning of our pipeline:

# pv -c ssl_access_log* | 
grep 'POST /login/form' |
sed -r 's/.*(MSIE [0-9]\.[0-9]|Firefox\/[0-9]+|Safari|-).*/\1/' |
sort | uniq -c | awk '{ t = t + $1; print} END { print t " TOTAL" }' >counts.txt

83.4MB 0:00:05 [16.2MB/s] [=================================>] 100%

pv reads our input files and sends their content to the standard output-- just like the cat command. But it also creates a progress bar on the standard error. The "-c" option tells pv to use "curses" style cursor positioning sequences to update the progress bar more efficiently.

I'm redirecting the actual pipeline output with the browser counts into a file (">counts.txt") so it's easier to focus in on the progress bar. I've captured the output after the loop has completed, so you're seeing the 100% completion bar, but notice that the left-hand side of the bar tracks the total data read and the amount of time taken.

What's really fun, however, is using multiple instances of pv inside a complicated pipeline:

# pv -cN Input ssl_access_log* | 
grep 'POST /login/' | pv -cN grep |
sed -r 's/.*(MSIE [0-9]\.[0-9]|Firefox\/[0-9]+|Safari|-).*/\1/' | pv -cN sed |
sort | uniq -c | awk '{ t = t + $1; print} END { print t " TOTAL" }' >counts.txt

grep: 7.5MB 0:00:04 [1.51MB/s] [ <=> ]
sed: 259kB 0:00:05 [51.8kB/s] [ <=> ]
Input: 83.4MB 0:00:04 [17.1MB/s] [======================>] 100%

You'll notice that I've added two more pv invocations in the middle of our pipeline: one after the grep and one after the sed command. I'm also using the "-N" ("name") flag to assign a unique name to each instance of pv. This name appears in front of each progress bar so you can tell them apart.

What's fun about this mode is that it shows you how much you're reducing the data as it goes through each command. The total "Input" size is 83MB of access logs, which grep winnows down to 7.5MB of matching lines. Then sed removes everything except the browser name and major version number, leaving us with only 260KB of data.

pv is widely available in various Linux distros, though it's not typically part of the base install. There's a BSD Ports version available and it's even in the MacOS HomeBrew system. Solaris folks can find it at Sunfreeware. Everybody else gets to build it from source. But it's a useful tool in your command line toolchest.

Consider this your early Xmas present. And you didn't even have to brave the pepper spray at Wal*mart to get it.