Tuesday, March 1, 2011

Episode #136: Reporting for Duty

Hal dips into the mailbag:

We got an interesting challenge from Juan Cortes this week. Juan's got a text report that looks like this:

User Report                                         Date: 02/16/2011 09:57:14
All Users Page: 1 of 27

User Name Default Login Name Default Shell Name
-> Token Serial No./ Replacement Last Login Orig. Token Type/Auth with

Karim Abdul-Jabbar kajabbar
-> 000403861445 02/12/2011 00:30:01 Key Fob/Passcode
Larry Byrd lbyrd
-> 000203863210 09/27/2010 15:28:11 Key Fob/Passcode
System Admin administrator

LaBron James ljames
-> 000303861288 02/15/2011 15:52:21 Key Fob/Passcode

User Report Date: 02/16/2011 09:57:14
All Users Page: 2 of 27

User Name Default Login Name Default Shell Name
-> Token Serial No./ Replacement Last Login Orig. Token Type/Auth with

Derek Jeter djeter

Satchel Page spage
-> 000234203706 02/16/2011 12:28:40 Key Fob/Passcode
...

So we're looking at a lot of headers and other useless text, plus entries that are split across two lines. Juan wanted to filter out the useless text and empty records (like Derek Jeter) and create one-line records for the useful information, specifically "<name> <username> <serial#> <date>".

I can do that with a couple of lines of awk:

$ awk '/-> [0-9]/ { print n, u, $2, $3 }; 
{ u = $NF; $NF = ""; n = $0 }' report.txt

Karim Abdul-Jabbar kajabbar 000403861445 02/12/2011
Larry Byrd lbyrd 000203863210 09/27/2010
LaBron James ljames 000303861288 02/15/2011
Satchel Page spage 000234203706 02/16/2011
...


Every single line in the report is going to get processed by the second block, which sets the "u" (username) and "n" (full name) variables. The username is set to the last (whitespace-delimited) field on the line, which is correct on the lines where the user's full name and username exist. We then null out the last field and set the full name variable to be the remainder of the line.

Now obviously the values in these variables are going to be erroneous on most of the lines in the report, but that doesn't matter because we're only outputting n and u in the first block of awk. And that block is only triggered on lines that match "-> followed by a space and a digit" ("/-> [0-9]/"), i.e., the second of the two lines in each user record. Because the first block runs before the second block overwrites the variables, n and u still hold the values from the previous line, which is the line that contains the user's full name and username. All we have to do at this point is select the values we care about from the second line of each record and output everything on one line. Matching on the "->" also allows us to easily eliminate the empty records that don't have serial number and date information. Note that if you want things to line up in nice, pretty columns you could use printf instead of print here.
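Here's a rough sketch of what the printf variant might look like. The column widths (20 and 10) and the format_report wrapper function are my own illustrative choices, not anything Juan asked for. I've also added a sub() call to trim the trailing space that blanking out $NF leaves behind, since stray whitespace is more noticeable once you start padding columns.

```shell
# Sketch: same two-block awk idea, but with printf for fixed-width columns.
# The widths (20 and 10) and the function name are illustrative assumptions.
format_report() {
    awk '/-> [0-9]/ { printf "%-20s %-10s %s %s\n", n, u, $2, $3 }
         { u = $NF; $NF = ""; n = $0; sub(/ +$/, "", n) }' "$@"
}

# Try it on a trimmed, inline copy of the report
format_report <<'EOF'
Karim Abdul-Jabbar kajabbar
-> 000403861445 02/12/2011 00:30:01 Key Fob/Passcode
System Admin administrator

Larry Byrd lbyrd
-> 000203863210 09/27/2010 15:28:11 Key Fob/Passcode
EOF
```

The "%-20s" conversion left-justifies the name in a 20-character field, so the usernames and serial numbers start in the same column on every line. Pick widths that suit your longest names.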

This "accumulate values and output on trigger" approach is very useful when you're collecting data fields that span multiple lines. You can use this idea when processing XML files, Windows *.ini files, and many other file formats.
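To make that a bit more concrete, here's a minimal sketch of the same pattern applied to an *.ini-style file. The sample data and the flatten_ini function name are made up for illustration, and I'm assuming a simple format with "[section]" headers, "key=value" lines, and comments that start with ";" or "#". The section name is the accumulated value, and each key=value line is the trigger.

```shell
# Sketch: accumulate the current [section], output it on each key=value line.
# flatten_ini and the sample data below are hypothetical, for illustration only.
flatten_ini() {
    awk '
        { sub(/\r$/, "") }                                   # strip a Windows CR, if present
        /^\[.*\]$/ { section = substr($0, 2, length($0) - 2); next }   # accumulate
        /^[^;#].*=/ { print section "." $0 }                 # trigger
    ' "$@"
}

flatten_ini <<'EOF'
[network]
host=example.local
port=8080
; this comment is skipped
[auth]
user=admin
EOF
# Output:
# network.host=example.local
# network.port=8080
# auth.user=admin
```

Once every setting is on its own self-describing line, the usual tools (grep, sort, diff) become a lot more useful against the file.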

Let's see what kind of PowerShell trickery Tim has up his sleeve this week...

Tim puts the mailbag in the dip:

This is a cool little challenge, but you silly Linux guys always want text parsing. I'll follow along and give you text results, but you guys need to up your shell so you can use objects. Text parsing it is...

We first need to find the relevant lines in the report, and we can do that with the PowerShell equivalent of grep, Select-String.

PS C:\> Select-String -Path report.txt -Pattern '-> \d' -Context 1,0

report.txt:7:Karim Abdul-Jabbar kajabbar
> report.txt:8:-> 000403861445 02/12/2011 00:30:01 Key Fob/Passcode
report.txt:9:Larry Byrd lbyrd
> report.txt:10:-> 000203863210 09/27/2010 15:28:11 Key Fob/Passcode
report.txt:13:LaBron James ljames
> report.txt:14:-> 000303861288 02/15/2011 15:52:21 Key Fob/Passcode
report.txt:24:Satchel Page spage
> report.txt:25:-> 000234203706 02/16/2011 12:28:40 Key Fob/Passcode


Similar to Hal's search, our search string looks for '->' followed by a space and a digit (\d = digit). The Context parameter is used to grab the line before our match so we can get the name in addition to the token information.

The Context parameter takes one or two numbers. If we give it one number, this is the number of lines captured before AND after the match. If we specify two numbers, the first specifies how many lines before the match to capture, the second specifies how many lines after the match to capture. We just want the line before the match and that is why we specified 1,0.

The default output of Select-String shows us which file and line number contained the match. The greater-than sign marks the line that matched, so it is easy to see which lines are matches and which are context.

We have the lines we want, but now we need to figure out what to do with them. When I got to this point, I wasn't sure where to go next, so I used Get-Member (gm) to figure out what properties and methods were available for the output object. So let's see what's available.

PS C:\> Select-String -Path report.txt -Pattern '-> \d' -Context 1,0 | gm


TypeName: Microsoft.PowerShell.Commands.MatchInfo

Name MemberType Definition
---- ---------- ----------
Equals Method bool Equals(System.Object obj)
GetHashCode Method int GetHashCode()
GetType Method type GetType()
RelativePath Method string RelativePath(string directory)
ToString Method string ToString(), string ToString(string directory)
Context Property Microsoft.PowerShell.Commands.MatchInfoContext Context {get;set;}
Filename Property System.String Filename {get;}
IgnoreCase Property System.Boolean IgnoreCase {get;set;}
Line Property System.String Line {get;set;}
LineNumber Property System.Int32 LineNumber {get;set;}
Matches Property System.Text.RegularExpressions.Match[] Matches {get;set;}
Path Property System.String Path {get;set;}
Pattern Property System.String Pattern {get;set;}


The Context property is what we are interested in, so let's look at that further.

PS C:\> Select-String -Path report.txt -Pattern '-> \d' -Context 1,0 | % { $_.Context } | gm

TypeName: Microsoft.PowerShell.Commands.MatchInfoContext

Name MemberType Definition
---- ---------- ----------
Clone Method System.Object Clone()
Equals Method bool Equals(System.Object obj)
GetHashCode Method int GetHashCode()
GetType Method type GetType()
ToString Method string ToString()
DisplayPostContext Property System.String[] DisplayPostContext {get;set;}
DisplayPreContext Property System.String[] DisplayPreContext {get;set;}
PostContext Property System.String[] PostContext {get;set;}
PreContext Property System.String[] PreContext {get;set;}


The PreContext property is an array of strings which represents multiple lines of PreContext. In this case we just have one, so we can access it via index 0.

Next, let's juice up our search filter. If we tighten the regular expression so that it matches only the token and the date, we can pull that text out of the Matches property of each result.

PS C:\> Select-String -Path report.txt -Pattern '\d{12} \d{2}/\d{2}/\d{4}' -Context 1,0


The Matches property will now contain a Match object representing the token and the date. Matches is an array, so it can hold more than one Match object (for example, when the AllMatches switch is used), but our new search produces only one per line (index 0).

We now have all the pieces, so all we have to do is put them together:

PS C:\> Select-String -Path report.txt -Pattern '\d{12} \d{2}/\d{2}/\d{4}' -Context 1,0 |
% { Write-Host $_.Context.PreContext[0] $_.Matches[0] }

Karim Abdul-Jabbar kajabbar 000403861445 02/12/2011
Larry Byrd lbyrd 000203863210 09/27/2010
LaBron James ljames 000303861288 02/15/2011
Satchel Page spage 000234203706 02/16/2011


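We just use Write-Host to output the data. Write-Host writes directly to the console rather than to the pipeline, so to send the results to a file we would instead build a single string, emit it with Write-Output, and pipe that to the Out-File cmdlet.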

A quick side note: If you don't understand the Regular Expressions used here, I highly recommend you start reading about them as they are powerful and very useful. Even a basic understanding of them will help a lot when it comes to automating tasks and parsing text.