Tuesday, January 18, 2011

Episode #130: Eenie, Meanie, Miney, Mo...

Hal gets selective

Recently I was doing an audit of USB devices that had been connected to numerous Windows machines on a network. The input data I had looked like this:

KeyStore XP2G
080909524d94e5
Tue Jun 30 22:42:07 2009
Apple iPod
000A270010C4E86E
Fri Jan 16 23:24:15 2009
M-Sys Dell Memory Key
086086412140E1C2
Fri Jan 16 23:15:30 2009
OLYMPUS u810/S810
000J55024022
Wed Jan 14 19:03:58 2009
...

The input file was a collection of three line "records" for each device. The first line was the "friendly name" of the device, the second line was the device serial number, and the third line was the date the device was last connected. I had one input file per machine.

What I needed to do was to extract just the serial numbers from each input file. Since the serial numbers of USB devices can have widely varying formats, I couldn't easily write a regular expression to match them. Instead I needed to extract the lines by line number-- the 2nd, 5th, 8th lines and so on. awk is actually really useful for this:

$ awk '!((FNR - 2) % 3)' *
080909524d94e5
000A270010C4E86E
086086412140E1C2
000J55024022
...

Recall from Davide Brini's solution a few weeks ago that FNR is the "record number" in the current file. With awk's default record separator, which is newline, FNR corresponds to the line number in the current file. So I'm subtracting two from the line number and then doing a "modulo 3" operation-- so the expression will be zero on lines 2, 5, 8, 11, ... Therefore "not" that expression, aka "!(...)", will evaluate to true on those lines only. Since there's no command block after the expression, "{print}" is assumed, and I output the lines I want.
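Here is a minimal sketch of why FNR, rather than NR, is the right counter when reading several files at once. The two sample files (alpha.txt and bravo.txt), the stray blank line at the end of the first one, and the "size" and "want" variables are all made up for illustration; they are not part of the original command.

```shell
# Build two tiny sample files in a scratch directory (names and contents are made up).
dir=$(mktemp -d)
# alpha.txt ends with a stray blank line, so it has 4 lines instead of 3.
printf '%s\n' 'Apple iPod' '000A270010C4E86E' 'Fri Jan 16 23:24:15 2009' '' > "$dir/alpha.txt"
printf '%s\n' 'M-Sys Dell Memory Key' '086086412140E1C2' 'Fri Jan 16 23:15:30 2009' > "$dir/bravo.txt"

# size = lines per record, want = which line of each record to keep.
# FNR restarts at 1 for every file, so each file is counted on its own.
by_fnr=$(awk -v size=3 -v want=2 'FNR % size == want % size' "$dir"/*.txt)

# NR keeps counting across files, so the stray line shifts everything in bravo.txt.
by_nr=$(awk -v size=3 -v want=2 'NR % size == want % size' "$dir"/*.txt)

echo "FNR picks:"; echo "$by_fnr"
echo "NR picks:";  echo "$by_nr"
rm -r "$dir"
```

With FNR both serial numbers come out. With NR the second file is off by one and the friendly name is printed instead of the serial number.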

Now the reason I was pulling out the serial numbers is that I wanted to see which devices had been connected to multiple systems. The easy way to do this is to just slap on a little sort and uniq action:

$ awk '!((FNR - 2) % 3)' * | sort | uniq -d
080909524d94e5
086086412140E1C2
...

But just to save Davide the trouble of sending me an email, here's the "awk-only" version:

$ awk '!((FNR - 2) % 3) && (++a[$1] == 2)' * 
086086412140E1C2
080909524d94e5
...

By adding another clause after a logical "and" ("&&") I ensure that the second clause only gets executed on the lines containing serial numbers. In the second clause I'm creating values in an array that is indexed by the serial number. The values are a count of the number of times we've seen each serial number-- "++a[$1]" adds one to the value of a[$1] before evaluating the conditional. After we've incremented our accumulator value, we check to see if the value is 2, meaning that this is the second time we've encountered a given serial number. If that's true then our implicit "{print}" happens again and we output the serial number. I don't care if the serial number appears in more than two different files, just that it appears in at least two.
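One caveat: the counter counts occurrences, not machines, so a serial number listed twice in a single file would also be reported. If that can happen in your data, one possible variant is to count each serial number at most once per file. The sketch below assumes the same three-line record layout. The sample files, the serial numbers, and the array names "seen" and "files" are made up for illustration.

```shell
dir=$(mktemp -d)
# alpha.txt lists the same (made-up) serial twice; BBB222 appears in both files.
printf '%s\n' 'Stick A' 'AAA111' 'Fri Jan 16 23:24:15 2009' \
              'Stick A' 'AAA111' 'Sat Jan 17 08:00:00 2009' \
              'Stick B' 'BBB222' 'Sat Jan 17 09:00:00 2009' > "$dir/alpha.txt"
printf '%s\n' 'Stick B' 'BBB222' 'Wed Jan 14 19:03:58 2009' > "$dir/bravo.txt"

# Original counter: also reports AAA111, although only one file contains it.
per_line=$(awk '!((FNR - 2) % 3) && (++a[$1] == 2)' "$dir"/*.txt)

# Variant: bump the count only the first time a serial shows up in a given file.
per_file=$(awk '!((FNR - 2) % 3) && !seen[FILENAME, $1]++ && (++files[$1] == 2)' "$dir"/*.txt)

echo "per line:"; echo "$per_line"
echo "per file:"; echo "$per_file"
rm -r "$dir"
```

The extra clause relies on the same short-circuit behavior of "&&": the per-file counter is only touched when the (file, serial) pair has not been seen before.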

Of course the output of the "sort | uniq -d" version is in a different order than the awk-only version because of the "sort" in the first solution. But since the serial number format varies widely anyway, I'm not sure sorting is that useful in any case. You could always pipe the awk output into sort if sorting is important to you.

I'm pretty sure Tim can knock this one out of the park using Powershell. I wonder what a CMD.EXE solution would look like..?

Tim is selective too

Hal likes to taunt me with digs at CMD.EXE, and this week I'll have to concede. I got about 90% of the way through writing the big ol' command and decided it was getting ridiculous. If you want to know what it would have looked like, go check out Ed's final episode. The command would have worked, but it would have been completely impractical and no one would have used it. In short, a circus act. Instead, let's do something practical. On to PowerShell...

The PowerShell cmdlet used to read a file is Get-Content (aliases cat, gc, and type). But we don't just need to read a file, we need to read a file and get every third line. Here's how to do just that:

PS C:\> Get-Content -Path * -ReadCount 3 | % { $_[1] }
080909524d94e5
000A270010C4E86E
086086412140E1C2
000J55024022
...


Get-Content's ReadCount parameter "specifies how many lines of content are sent through the pipeline at a time." The default value is 1, so normally each line is sent down one at a time. Setting this value to 3 means that 3 lines at a time will be sent down the pipeline.

Inside the ForEach-Object's scriptblock, the current pipeline object ($_) is an array containing the three lines passed down the pipeline. In this group of three lines we are looking for the second item, which has an array index of 1. Remember, array indexes start at 0, so the first item is 0, the second is 1, and the third is 2.

Now that we have the serial numbers, let's look for duplicates. This is really easy with the Group-Object cmdlet.

PS C:\> Get-Content -Path * -ReadCount 3 | % { $_[1] } | Group-Object -NoElement

Count Name
----- ----
1 080909524d94e5
2 000A270010C4E86E
2 086086412140E1C2
...


The NoElement switch means that the original objects are not stored, so we have just the count and the serial number.

In addition, we can filter for counts greater than 1.

PS C:\> Get-Content -Path * -ReadCount 3 | % { $_[1] } |
Group-Object -NoElement | Where-Object { $_.Count -gt 1 }


Count Name
----- ----
2 000A270010C4E86E
2 086086412140E1C2
...


But that is a long command, and I like to be brief, so this is the short method:

PS C:\> gc * -r 3 | % { $_[1] } | group -n | ? { $_.Count -gt 1 }


So we did what Hal did, now let's up the ante and turn the contents of these files into objects.

PS C:\> gc * -r 3 | select @{Name="Name"; Expression={$_[0]}},
@{Name="Serial";Expression={$_[1]}},@{Name="Date";Expression={$_[2]}}


Name Serial Date
---- ------ ----
KeyStore XP2G 080909524d94e5 Tue Jun 30 22:42:07 2009
Apple iPod 000A270010C4E86E Fri Jan 16 23:24:15 2009
...


The Select-Object cmdlet uses hashtables to allow you to manually create properties. Hashtables use key-value pairs. To create a property, the Name key is [obviously] used to set the name of the property, and the Expression key is used to set its value. So we have objectified everything, but the date is still a string. We can use .NET to convert the string to a native DateTime object.

PS C:\> gc * -r 3 | select @{Name="Name"; Expression={$_[0]}},
@{Name="Serial";Expression={$_[1]}},
@{Name="Date";Expression={[datetime]::ParseExact($_[2], "ddd MMM dd HH:mm:ss yyyy", $null)}}


Name Serial Date
---- ------ ----
KeyStore XP2G 080909524d94e5 6/30/2009 10:42:07 PM
Apple iPod 000A270010C4E86E 1/16/2009 11:24:15 PM
M-Sys Dell Memory Key 086086412140E1C2 1/16/2009 11:15:30 PM
...


Now we can do whatever we want with it, such as exporting the results to a CSV to use in Excel, filtering based on the properties, or looking for duplicates as we did above.

One thing that might be useful is knowing into which computers the USB devices were plugged. Assuming the file name is descriptive we can use it to add an additional property to our objects.

PS C:\> ls | % { $file = $_.Name; gc $_ -r 3 |
select @{Name="Name"; Expression={$_[0]}},
@{Name="Serial";Expression={$_[1]}},
@{Name="Date";Expression={[datetime]::ParseExact($_[2], "ddd MMM dd HH:mm:ss yyyy", $null)}},
@{Name="File";Expression={$file}}}


Name Serial Date File
---- ------ ---- ----
KeyStore XP2G 080909524d94e5 6/30/2009 10:42:07 PM Alpha.txt
Apple iPod 000A270010C4E86E 1/16/2009 12:20:00 PM Alpha.txt
...
Apple iPod 000A270010C4E86E 1/21/2009 11:24:15 PM Bravo.txt
M-Sys Dell Memory Key 086086412140E1C2 1/16/2009 11:15:30 PM Bravo.txt
...


This command starts with Get-ChildItem (alias ls) to get each file in the directory. The files are piped into ForEach-Object, where the filename is stored in a variable for use later. The rest of the command is very similar to our original command; the only difference is that the input to Get-Content is specified by the current pipeline object instead of our wildcard (*) as before.
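For anyone following along on the Unix side, awk exposes the name of the current input file as FILENAME, so a rough counterpart of this per-file view can be sketched as below. The sample files and the tab-separated output layout are assumptions for illustration, and the date stays a plain string here rather than becoming a date object.

```shell
dir=$(mktemp -d)
# One record per sample file, using values from the listings in this episode.
printf '%s\n' 'KeyStore XP2G' '080909524d94e5' 'Tue Jun 30 22:42:07 2009' > "$dir/Alpha.txt"
printf '%s\n' 'M-Sys Dell Memory Key' '086086412140E1C2' 'Fri Jan 16 23:15:30 2009' > "$dir/Bravo.txt"

# Remember the first two lines of each record, then emit one
# tab-separated row (name, serial, date, file) on the third line.
rows=$(cd "$dir" && awk -v OFS='\t' '
    FNR % 3 == 1 { name = $0 }
    FNR % 3 == 2 { serial = $0 }
    FNR % 3 == 0 { print name, serial, $0, FILENAME }
' *.txt)

echo "$rows"
rm -r "$dir"
```

Because the rows are tab-separated, later filtering can use something like awk -F'\t' '$2 == "000A270010C4E86E"' on the output.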

Now we can do all sorts of filtering.

PS C:\> ... | ? { $_.Serial -eq "000A270010C4E86E" }
Name Serial Date File
---- ------ ---- ----
Apple iPod 000A270010C4E86E 1/16/2009 12:20:00 PM Alpha.txt
Apple iPod 000A270010C4E86E 1/21/2009 11:24:15 PM Bravo.txt
...


Put that in your pipeline and smoke it, Hal!