Tuesday, January 25, 2011

Episode #131: Subject of Attachment

Because Hal sed So:

We got an interesting challenge in the mailbag this week from Ryan Tuthill:

Given the example logs:

562C938C3.A9069|attachment = Interview Results.xls
B1CE33BED.A58D0|subj="Who?" | subj="Who?"
67BD53BED.AF311|subj="January 14.docx.doc" | attachment = January 14.docx.doc
19B4D27D2.A8CE7|subj="FW: Two New Bud Light Commercial's" | attachment = Bud_Lite_Pick_Up_Line.mpg


What I am curious to discover is a way to print only lines that contain subj and attachment, not just one or the other. From there, I would like to print the lines with the same subj and attachment titles.


Looking at the problem, I thought to myself, "Hmmm, irregular pattern matching. Seems like a job for sed." Then I started wondering if I could solve the problem entirely with sed. We haven't spent much time on the blog talking about sed, but you must understand that it's a fully capable programming language in its own right. For proof of that assertion, I refer you to sedtris-- yep, that's a Tetris game written entirely in sed.

Anyway, it turns out there is a sed-only solution to Ryan's problem. Assuming we've got our sample input in a file called "input", simply do this:

$ sed 'h; s/.*subj="\(.*\)".*attachment = \1$/\1/; t hit; d; :hit g' input
67BD53BED.AF311|subj="January 14.docx.doc" | attachment = January 14.docx.doc

I know the syntax is terse and mysterious here, so let's take this slowly. The first thing you need to understand is that sed has two different data spaces you can work with. The one you normally use is the "pattern space", which is the line you just read in. When you do an operation like "s/.../.../", you're operating on the pattern space. However, sed also has a "hold space" where you can store stuff for later. And there are operators that let you copy or append data into the hold space from the pattern space and vice-versa.

With that in mind, let's walk through our sed "program". The first command we give sed is "h", which means "copy the pattern space into the hold space, overwriting the previous value in the hold space". We're using this to store the original version of the line we just read in, so that we can print it out later if it matches our criteria.
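
Here's a small sketch of the hold space in action, in case it helps. The two input words are made up for the demo and the variable names are just placeholders. The first command saves each line with "h", clobbers the pattern space, and then restores it with "g"; the second skips the restore so you can see the difference.

```shell
# Hypothetical two-line input; "h" saves each line, "g" brings it back.
restored=$(printf 'one\ntwo\n' | sed 'h; s/.*/mangled/; g')
printf '%s\n' "$restored"

# Same thing without the "g": the mangled pattern space is what gets printed.
mangled=$(printf 'one\ntwo\n' | sed 'h; s/.*/mangled/')
printf '%s\n' "$mangled"
```

The first command should print the original lines untouched, while the second prints "mangled" once per input line.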

Once we've saved a copy of the original line, we can start mangling the copy in the pattern space. The next sed operation we perform is a substitution. The regular expression on the lefthand side matches 'subj="<string>"' anywhere on the line, followed by 'attachment = <string>' at the end of the line. The trick that makes this work is the backreference: sed lets you capture a string early in the regular expression with "\(...\)" and then test for that same captured string later in the same regex with "\1". If our pattern matches, then we end up replacing the entire original line with just the value of the <string> we matched.
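
To see the capture-then-compare idea in isolation, here's a minimal sketch. The two input lines are invented for illustration (they are not from Ryan's logs): one where the subject and attachment agree, and one where they don't.

```shell
# "\(...\)" captures the subject; "\1" must then match that same text again.
# Both input lines are hypothetical.
same=$(printf 'subj="a.doc" | attachment = a.doc\nsubj="a.doc" | attachment = b.doc\n' |
  sed -n '/subj="\(.*\)".*attachment = \1$/p')
printf '%s\n' "$same"
```

Only the first line should come out, since the second line's attachment name differs from what was captured.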

This is where it starts to get tricky. It turns out that the replacement we do on the RHS of the "s/.../.../" expression doesn't matter. I'm really only using the substitution to verify that we have a line that matches our criteria. This, in turn, allows me to use sed's branching operator to control whether or not we output the original line. You see, sed has a goto-like operator "t <label>" which will jump to the specified <label> if and only if there has been a successful "s/.../.../" operation since the current line was read in (or since the last "t" operation if there's been more than one).

So if our substitution was successful, then this is a line with matching "subj" and "attachment" titles and we want to print out the line. In this case our "t" operation tells sed to jump to the label "hit" and start executing statements from there. If the substitution didn't work, then we just fall through to the next statement after the "t".
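
If the branching is hard to picture, this little sketch may help. The input words and the label name "done" are made up for the demo, and I've given each command its own "-e" so the label sits on its own line of script.

```shell
# "t done" jumps to the label only if the first substitution succeeded.
# Lines that fall through get a note appended instead.
branched=$(printf 'apple\nbanana\n' |
  sed -e 's/apple/APPLE/' -e 't done' -e 's/$/ (no substitution)/' -e ':done')
printf '%s\n' "$branched"
```

The "apple" line takes the branch and skips the second substitution; the "banana" line falls through and picks up the note.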

So when we don't match, the next command we hit is "d", which simply means "discard the current pattern space and move on to the next line". Think of it like "continue" or "next" in many popular programming languages. The next thing in our sed program is the label ":hit", which is where we'll jump to if the "s/.../.../" operator worked. The sed command we invoke here is "g", which overwrites the current pattern space with the contents of the hold space. Remember how we saved the original line into the hold space back at the beginning of our program? Well now we bring that back. And finally, there's an implicit "print the contents of the pattern space" when sed reaches the end of the script for each line (unless you run it with "-n"), which takes care of actually outputting the line for us.
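
For anyone who wants to replay the walk-through, here is the same program sketched with one "-e" per command, run against the four sample lines from Ryan's first email. Splitting the commands up this way is my own choice for readability, and the temp file and variable names are just placeholders.

```shell
# Recreate the four sample log lines in a temporary file.
input=$(mktemp)
cat > "$input" <<'EOF'
562C938C3.A9069|attachment = Interview Results.xls
B1CE33BED.A58D0|subj="Who?" | subj="Who?"
67BD53BED.AF311|subj="January 14.docx.doc" | attachment = January 14.docx.doc
19B4D27D2.A8CE7|subj="FW: Two New Bud Light Commercial's" | attachment = Bud_Lite_Pick_Up_Line.mpg
EOF

# h: save line; s: test for matching titles; t: branch on success;
# d: drop non-matching lines; :hit / g: restore the saved line for printing.
matched=$(sed -e 'h' \
  -e 's/.*subj="\(.*\)".*attachment = \1$/\1/' \
  -e 't hit' -e 'd' -e ':hit' -e 'g' "$input")
printf '%s\n' "$matched"
rm -f "$input"
```

Only the "January 14.docx.doc" line should survive, same as in the one-liner.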

Pretty neat, huh? I was feeling really good about myself until I got a follow-up email from Ryan with a few more sample log entries:

BE1342109.A1C1F|attachment = ATT49396.txt | subj="FW: snow"
21E3430E7.A8583|attachment = Scan001.pdf | subj="FW: PROJECT/TASK"
8657C30E7.AC7A5|attachment = IMG00005.jpg | subj="IMG00005.jpg"

Yep, it turns out that the "subj" and "attachment" can appear in either order. Well, back to the drawing board:

$ sed 'h; 
s/.*subj="\([^"]*\)".*attachment = \1$/\1/; t hit;
s/.*attachment = \(.*[^ ]\) *|.*subj="\1"$/\1/; t hit; d;
:hit g' input

67BD53BED.AF311|subj="January 14.docx.doc" | attachment = January 14.docx.doc
8657C30E7.AC7A5|attachment = IMG00005.jpg | subj="IMG00005.jpg"

The new code isn't actually all that different from the original solution-- I've split things onto multiple lines for greater readability and tightened the subject match to '[^"]*' so it stops at the closing quote. We have the "h" operator, a substitution that checks for 'subj="<string>"' followed by 'attachment = <string>', and the "t" operator just as before. However, this time if we fail to get a hit, then we try looking for 'attachment = <string>' followed by 'subj="<string>"' instead. Only if both of those operations fail do we give up and do "d". If either operation succeeds, then we jump to ":hit" and do the operations to print out the line.

The pattern match for the 'attachment = <string>' is a little tricky in the second case because there are no quotes around the attachment name and the attachment name can contain spaces. So we have to explicitly match "attachment<space>=<space>", followed by "some stuff that ends in a non-space character", followed by some spaces, and then a pipe character ("|") which is our field separator. Yeesh!
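
If the one-liner is getting hard to read, one option is to put the program in a script file with comments and run it with "sed -f". This is only a sketch of that idea: the temp file and variable names are placeholders, and I'm feeding it just three of the sample lines (one for each order, plus one that should be dropped).

```shell
# Write the sed program to a temporary script file, one command per line.
script=$(mktemp)
cat > "$script" <<'EOF'
# Save the original line in the hold space.
h
# Case 1: subj="X" comes first, then attachment = X at end of line.
s/.*subj="\([^"]*\)".*attachment = \1$/\1/
t hit
# Case 2: attachment = X, optional spaces, a pipe, then subj="X" at end of line.
s/.*attachment = \(.*[^ ]\) *|.*subj="\1"$/\1/
t hit
# Neither order matched: drop the line.
d
:hit
# Restore the original line; sed prints it when the script ends.
g
EOF

both=$(printf '%s\n' \
  '67BD53BED.AF311|subj="January 14.docx.doc" | attachment = January 14.docx.doc' \
  'BE1342109.A1C1F|attachment = ATT49396.txt | subj="FW: snow"' \
  '8657C30E7.AC7A5|attachment = IMG00005.jpg | subj="IMG00005.jpg"' |
  sed -f "$script")
printf '%s\n' "$both"
rm -f "$script"
```

The "FW: snow" line should be dropped, and the other two should print in their original form.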

Anyway, this has been your first high-speed introduction to the kinky joy that is sed. Tim, I bet you don't have anything this dirty in your PowerShell arsenal!

Tim feeds his expressions prunes, so they stay regular:

PowerShell may not have sed, but it does have built-in regular expressions. These expressions are way cooler than their incontinent brethren of cmd.exe.

Similar to Hal's approach, we first read the file and look for strings containing a subject. This is accomplished by piping Get-Content (alias gc) into Select-String where we filter using a regular expression.

PS C:\> gc log.txt | Select-String 'subj="(?<Subject>[^"]+)'
B1CE33BED.A58D0|subj="Who?" | subj="Who?"
67BD53BED.AF311|subj="January 14.docx.doc" | attachment = January 14.docx.doc
19B4D27D2.A8CE7|subj="FW: Two New Bud Light Commercial's" | attachment = Bud_Lite_Pick_Up_Line.mpg
...


This regular expression matches lines containing something like the following:
subj="<text that doesn't contain a double quote>

The content of the text that doesn't contain double quotes is part of a named group and contains the subject we would like to use later. The syntax for using a named group is:

(?<GroupName>regex)


Until I prepped for this episode, I didn't realize that the Select-String cmdlet will populate the $matches variable; I thought that was only done via the -Match operator. This little trick makes our command much shorter and easier to read than using the -Match operator.

As regular readers (badabing!) might remember, the $matches variable contains the groups matched by our regular expression. Since we used a named group, it makes it much easier to access the group we want by simply using $matches.GroupName. This variable can be used further down the pipeline, and we can use this little trick to get our final command.

PS C:\> cat .\log.txt | Select-String 'subj="(?<Subject>[^"]+)' | 
Select-String "attachment = $($matches.subject)"


67BD53BED.AF311|subj="January 14.docx.doc" | attachment = January 14.docx.doc
8657C30E7.AC7A5|attachment = IMG00005.jpg | subj="IMG00005.jpg"


The results from our first command are piped into Select-String, which filters for strings containing "attachment = <subject name>", and that gives us the output we are looking for.