Tuesday, June 23, 2009

Episode #48: Parse-a-Palooza

Hal takes a step back:

Over the last several weeks we've been parsing a lot of different kinds of data using all sorts of different shell idioms. Ed pointed out that it might not be obvious to you, our readers, why we might pick one tool for a certain job and then use a completely different tool for another task. So he suggested a "meta-post" where we talk about the various parsing tools available in our shells and why we might pick one over the other for certain kinds of problems. Great idea Ed!

Then he suggested that I take a whack at starting the article. Bad idea Ed! Though I'm sure Ed did, and still does, think he was brilliant twice with this. Honestly, I've been solving these problems in the shell for so long that my parsing tool choices have become unconscious (some scurrilous wags would suggest that it's always been that way), and it was actually a little difficult to come up with a coherent set of rules for choosing between cut, awk, and sed. But here goes:


  1. cut is definitely the tool of choice if you need to extract a specific range of characters from your lines of input. An example of this would be extracting the queue IDs from file names in the Sendmail message queue, as I did in Episode 12:

    grep -l spammer@example.com qf* | cut -c3- | xargs -I'{}' rm qf{} df{}

    cut is the only tool that allows you to easily pull out ranges of characters like this.


  2. cut is also useful when your input contains strongly delimited data, like when I was parsing the /etc/group file in Episode 43:

    cut -f1,4 -d: /etc/group


  3. awk, of course, is the best tool to use when you're dealing with whitespace delimited data. The canonical example of this is using awk to pull out process ID info from the output of ps:

    ps -ef | awk '{print $2}'


  4. awk also has a variety of built-in matching operators, which makes it ideal when you only want a subset of your input lines. In fact, because awk has the "-F" option to specify a different delimiter than just whitespace, there are times when I prefer to use awk instead of cut on strongly delimited input. This is typically when I only want the data from some of the input lines and not others. Paul learned this in Episode 13:

    awk -F'|' '/CVE-2008-4250/ {print $1}' | sort -u

    Remember, if you find yourself piping grep into cut, you probably should be using awk instead.


  5. I use sed when I need to parse and modify data on the fly with the flexible matching power of regular expressions. For example, there's this bit of sed fu to extract browser names from Apache access_log files that I concocted for Episode 38:

    sed -r 's/.*(MSIE [0-9]\.[0-9]|Firefox\/[0-9]+|Safari|-).*/\1/' access_log*


Rules, of course, are made to be broken. YMMV, and so on. But I hope these will help all of you when you're trying to figure out the best way to solve a parsing problem in the Unix shell.

Ed jumps in:
You know, Hal, I have a confession to make, just between the two of us and the growing multitude of faithful readers of this blog. I envy you. There I said it. And, my envy is not just because of your thick, luxurious hair (which is a certain sign of a deal with the devil, if you ask me... but I digress). No, my envy is based primarily on all of the great options you and your beloved bash have for parsing text. The cmd.exe parsing options pale in comparison to the splendors of cut, awk, and sed. But, as they say, you gotta dance with the one that brung ya, so here is how I approach parsing my output in cmd.exe.

Following your lead, Hal, here are the rules, followed in order, that I apply when parsing output in cmd.exe:

  1. Starting from first principles, I see if there is a line of output of a command that already has what I want in it, and whether that line is acceptable for me to use by itself. If so, I just compose the appropriate syntax for the find or findstr commands to locate the line(s) that I'm interested in. For example, I did this in Episode #6 to create a command-line ping sweeper with this command:

    C:\> FOR /L %i in (1,1,255) do @ping -n 1 10.10.10.%i | find "Reply"

    Because the only output lines I'm looking for have the text "Reply" in them, no parsing is necessary. This is a lucky break, and in cmd.exe, we take all the lucky breaks we can. If I need regex, I use findstr instead of the find command.

  2. When the output I'm looking for won't work "as-is", and I need to change the order, extract, or otherwise alter output fields, I've gotta do some more intricate parsing. Unlike the various options bash users have, we don't really have a lot to brainstorm through with cmd.exe. Most of our parsing heavy lifting is done with the FOR /F command. This is the most flexible of all FOR commands in Windows, allowing us to parse files, strings, and the output of a given command. In fact, if you ever want to assign a shell variable with the value of all or part of the output of a command in cmd.exe, you are going to likely turn to FOR /F loops to do it, whether you want to parse the output or not.

    The syntax of a FOR /F loop that iterates over the output of [command1], running [command2] on whatever you've parsed out looks like this:

    C:\> FOR /F ["options"] %i in ('[command1]') do @[command2]

    FOR /F takes each line of output from [command1], and breaks it down, assigning portions of its output to the iterator variable(s) we specify, such as %i in this example. The "options", which must be surrounded in double quotes, are where we get to specify our parsing. One of our options is how many lines we want to skip in the output of [command1] before we want to start our parsing operation. By indicating "skip=2", we'd skip the first two lines of output, often column titles and a blank line or ------ separators.

    Once we skip those lines, FOR /F will parse each line based on additional options we specify. By default, FOR /F breaks lines down using delimiters of spaces and tabs. To use something else, such as commas and periods, you could specify "delims=,.". All such characters will be sucked out of our output, and the remaining results will be assigned to the variables we specify. If I have output text that includes one or more annoying characters that I want to go away, I'll make it a delimeter. I did this in Episode #43 to get rid of a stray * that the "net localgroup" command put in its output.

    Once we specify our skip and delims (with syntax like "skip =2 delims=,."), we then have the option of associating variables with output elements using the "tokens=" syntax. If we just want the first item that is not part of our delimiters to be assigned to our %i variable, we don't have to specify "tokens=[whatever]" at all. But, suppose we wanted the first and third elements of the output of command1 to be assigned to our iterator variables. We'd then specify "tokens=1,3". Note that we don't have to actually create any more variables beyond the initial that we specify (%i in a lot of my examples), because the FOR /F command will automatically make the additional variable associated with our tokens declaration by simply going up the alphabet. That is, if we specify FOR /F "tokens=1,3" %i ... the first component of our output other than our delimiters will be assigned to variable %i, and the third component will be assigned to %j. The FOR /F loop auto creates %j for us. We can also specify ranges of tokens with "tokens=1-5", which will automatically assign our first component of output to our first variable (such as %i) and auto create %j, %k, and so on for us, up to five variables in this example. And, finally, if you want the first element of your output dumped into your first variable, and everything else, including delimiters, dumped into a second variable, you could specify "tokens=1,*" as I did in Episode #26.

    Whew! That's an ugly form of parsing, but it usually does what we need.

  3. But not always... sometimes, we need one extra kicker -- the ability to do substring operations. Our cmd.exe shell supports creating and displaying substrings from variables, using the syntax %var:~N,M%. This causes cmd.exe to start displaying var at an offset of N characters, printing out M characters on the screen. We start counting offsets at zero, as any rational person would do. They are offsets, after all, so the first character is at offset zero. Let's look at some examples using a handy predefined environment variable: %date%:

    C:\> echo %date%
    Mon 06/22/2009

    If we want only the day of the week, we could start at the zero position and print out three characters like this:
    C:\> echo %date:~0,3%
    Mon

    If you want the year, you could start at the end and go back four characters with a -4 as our offset into the variable:
    C:\> echo %date:~-4,4%
    2009

    If you only put in the start character, it'll display your variable from that offset all the way through the end:
    C:\> echo %date:~4%
    06/22/2009
    C:\> echo %date:~-4%
    2009
    Now, there's one more little twist here. When doing heavy parsing, we often want to perform substring operations on a variable that we've parsed out of a command using a FOR /F loop. For example, suppose that I want to run the command "dir c:\temp" in a FOR /F loop so I can parse each line of its output, and I want to select the time field. But let's also assume that I only want the single minutes digit of the time field. In other words, when dir c:\ shows this:

    C:\> dir c:\temp
    Volume in drive C has no label.
    Volume Serial Number is 442A-03DE

    Directory of C:\temp

    05/19/2009 10:16 AM .
    05/19/2009 10:16 AM ..
    05/19/2009 10:02 AM 22 stuff1.txt
    05/19/2009 10:02 AM 22 stuff1.txt.bak
    05/19/2009 10:03 AM 43 stuff2.txt
    05/19/2009 10:03 AM 43 stuff2.txt.bak
    4 File(s) 130 bytes
    2 Dir(s) 17,475,887,104 bytes free

    What's I'd really like to see is:
    6
    6
    2
    2
    3
    3

    Why would I want this? I have no idea. Perhaps I just like to parse for parsing's sake. OK?

    Anyway, you might try the following:

    C:\> FOR /F "skip=4 tokens=2" %i in ('dir c:\temp') do @echo %i:~-1,1%

    That doesn't work, because we can only do substring operations on environment variables, not the iterator variables of a FOR loop. You'll just get the ugly ":~-1,1%" displayed after each timestamp, because %i is expanded to its value without any substring operation taking effect. OK, you might then reason... I'll just save away %i in an environment variable called a, and then perform substring operations on that, as follows:
    C:\> FOR /F "skip=4 tokens=2" %i in ('dir c:\temp') do @set a=%i & echo %a:~-1,1%

    No love there either. You'll just see either a hideous "%a:~-1,1%" displayed on your output if the environment variable a isn't already set. Or, if a was already set before this command ran, you'd see the last character of what it was already set to, a constant value on your screen.

    The culprit here is that cmd.exe by default does immediate environment variable expansion, so that your echo command is immediately expanding %a to its value right when you hit ENTER. It never changes while the command is running, because its value is fixed at the time the command was started. We want %a's value to change at each iteration through the loop, so we need delayed environment variable expansion for our command. To achieve this, we can launch a new cmd.exe, with the /v:on option to perform delayed environment variable expansion, and the /c option to make it run our FOR command. Oh, and when we do this, we have to refer to all variables whose expansion we want delayed as !var! instead of %var%. The result is:

    C:\> cmd.exe /v:on /c "FOR /F "skip=4 tokens=2" %i in ('dir c:\temp') do
    @set a=%i & echo !a:~4,1!"

    That delayed environment variable expansion is annoying, but I periodically resort to it, as you've seen in Episodes #12 and #46.
Using these different options of FOR /F loops and substring operations, we can have fairly flexible parsing. It's not as easy as the parsing options offered in bash, but we can usually make do.

Ed finishes by musing about PowerShell a bit:
So, we've seen Hal's parsing tips in bash, along with my approach to parsing in cmd.exe. In my adventures with PowerShell, I've noticed something quite interesting (at least for an old shell guy like me). I almost never need to parse!

In most shells, such as bash and cmd.exe, we have streams of output and/or error from one command that we have to format properly so another command can handle it as input. In PowerShell, the pipeline between cmdlets carries objects, often with several dozen properties and methods. We shuffle these objects and their associated stuff from cmdlet to cmdlet via pipes, selecting or refining whole objects along the way. We usually don't need to parse before we pipe, because we want the entire objects to move through the pipeline, with all their glorious properties and methods. If we want output, we can almost always display exactly the properties we want, in the order we want, and typically in a format that we want. It's very cool and efficient. Dare I say it? Could it be I'm falling in love with PowerShell?

Anyway, it occurred to me that when it comes to the need for parsing, cmd.exe and bash are far more similar to each other than either is to PowerShell.

Just sayin'.