Whew! Just got done with another week of teaching, this time at SANS Baltimore. I even got a chance to give my "Return of Command Line Kung Fu" talk, so I got a bunch of shell questions.
One of my students had a very interesting challenge. To help analyze malicious PDF documents, he was trying to parse the output of Didier Stevens' pdf-parser.py and create an input file for GNUplot that would show a graph of the object references in the document. Here's a sample of the kind of output we're dealing with:
$ pdf-parser.py CLKF.pdf
PDF Comment '%PDF-1.3\n'
PDF Comment '%\xc7\xec\x8f\xa2\n'
obj 5 0
Type:
Referencing: 6 0 R
Contains stream
[(1, '\n'), (2, '<<'), (2, '/Length'), (1, ' '), (3, '6'), (1, ' '), (3, '0'), (1, ' '), (3, 'R'), (2, '/Filter'), (1, ' '), (2, '/FlateDecode'), (2, '>>'), (1, '\n')]
<<
/Length 6 0 R
/Filter /FlateDecode
>>
obj 6 0
Type:
Referencing:
[(1, '\n'), (3, '678'), (1, '\n')]
...
obj 4 0
Type: /Page
Referencing: 3 0 R, 11 0 R, 12 0 R, 13 0 R, 5 0 R
...
The lines like "obj 5 0" give the object number and version of a particular object in the PDF. The "Referencing" lines below show the objects referenced. A given object can reference any number of objects from zero to many.
To make the chart with GNUplot, we need to create an input file that shows "obj -> ref;" for all references. So for object #5, we'd have one line of output that shows "5 -> 6;". There would be no output for object #6, since it references zero objects. And we'd get 5 lines of output for object #4, "4 -> 3;", "4 -> 11;", and so on.
This seems like a job for awk. Frankly, I thought about just calling Davide Brini and letting him write this week's Episode, but he's already getting too big for his britches. So here's my poor, fumbling attempt:
$ pdf-parser.py CLKF.pdf |
awk '/^obj/ { objnum = $2 };
/Referencing: [0-9]/ \
{ max = split($0, a);
for (i = 2; i < max; i += 3) { print objnum" -> "a[i]";" }
}'
5 -> 6;
...
4 -> 3;
4 -> 11;
4 -> 12;
4 -> 13;
4 -> 5;
...
The first line of awk matches the "obj" lines and puts the object number into the variable "objnum". The second awk expression matches the "Referencing" lines, but notice that I added a "[0-9]" at the end of the pattern match so that I only bother with lines that actually include referenced objects.
When we hit a line like that, then we do the stuff in the curly braces. split() breaks our input line, aka "$0", on white space and puts the various fields into an array called "a". split() also returns the number of elements in the array, which we put into a variable called "max". Then I have a for loop that goes through the array, starting with the second element-- this is the actual object number that follows "Referencing:". Notice the loop update code is "i += 3", which allows me to just access the object number elements and skip over the other crufty stuff I don't care about. Inside the loop we just print out the object number and current array element with the appropriate punctuation for GNUplot.
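If you want to poke at the awk without a PDF handy, here is a minimal sketch that feeds the same program a few lines of canned output. The function names sample_output and obj_refs are made up for this illustration, and the sample text is just a trimmed copy of the listing above, so treat it as a test harness under those assumptions rather than anything pdf-parser.py provides.

```shell
# Hypothetical stand-in for "pdf-parser.py CLKF.pdf": prints a trimmed
# copy of the sample output shown earlier in this post.
sample_output() {
cat <<'EOF'
obj 5 0
Type:
Referencing: 6 0 R
obj 6 0
Type:
Referencing:
obj 4 0
Type: /Page
Referencing: 3 0 R, 11 0 R, 12 0 R, 13 0 R, 5 0 R
EOF
}

# The same awk program as above, wrapped in a function (name is illustrative).
obj_refs() {
  awk '/^obj/ { objnum = $2 }
       /Referencing: [0-9]/ {
           max = split($0, a)
           # a[1] is "Referencing:"; after that the fields come in
           # (object number, version, "R") triples, so step by 3.
           for (i = 2; i < max; i += 3) print objnum " -> " a[i] ";"
       }'
}

sample_output | obj_refs
```

Object 6 produces nothing because its "Referencing:" line has no digit after the colon, so the second pattern never fires.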
Meh. It's a little scripty, I must confess. Mostly because of the for loop inside of the awk statement to iterate over the references. But it gets the job done, and I really did compose this on the command line rather than in a script file.
Let's see if Tim's plotting involves a trip to Scriptistan as well...
Tim's traveling
While I have been out of the country for a few weeks, I didn't have to visit Scriptistan to get my fu for this week. The PowerShell portion is a bit long, but I wouldn't classify it as a script even though it has a semicolon in it. We do have lots of ForEach-Object cmdlets, Select-String cmdlets, and Regular Expressions. And you know what they say about Regular Expressions: Much like violence, if Regular Expressions aren't working, you aren't using enough of them.
Instead of starting off with some ultraviolent fu, let's build up to that before we wield the energy to destroy medium-large buildings. First, let's find the object number and its references.
PS C:\> C:\Python25\python.exe pdf-parser.py CLKF.pdf |
Select-String -Pattern "(?<=^obj\s)\d+" -Context 0,2
> obj 5 0
Type:
Referencing: 6 0 R
> obj 6 0
Type:
Referencing:
> obj 15 0
Type:
Referencing: 16 0 R
...
> obj 4 0
Type: /Page
Referencing: 3 0 R, 11 0 R, 12 0 R, 13 0 R, 5 0 R
The output of pdf-parser.py is piped into the Select-String cmdlet, which finds lines that start with "obj", followed by a space (\s), then one or more digits (\d+). The Context parameter takes the number of lines to capture before and after each match; "0,2" means no lines before and two lines after, which gives us the "Referencing" line to use later.
You might also notice our regular expression uses a "positive lookbehind", meaning that it needs to see "obj " before the number we want, but "obj " is not part of the match itself. This way we end up with just the object number being selected and not the useless text in front of it. This is demonstrated by the Matches property shown below.
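For readers following along on the Unix side, the same lookbehind can be tried in isolation. This is a rough sketch that assumes GNU grep built with PCRE support (the -P option), which not every system has; the helper name obj_numbers is invented for the example.

```shell
# Hypothetical helper: print only the object numbers from "obj N V" lines.
# Assumes GNU grep with -P (PCRE); -o prints just the matched text.
obj_numbers() {
  # (?<=^obj\s) requires "obj " right before the digits, at the start of
  # the line, but keeps it out of the match.
  grep -oP '(?<=^obj\s)\d+'
}

printf 'obj 5 0\nType:\nReferencing: 6 0 R\nobj 15 0\n' | obj_numbers
```

Only the digits come back, one object number per line, because the lookbehind text is checked but never consumed.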
PS C:\> C:\Python25\python.exe pdf-parser.py CLKF.pdf |
Select-String -Pattern "(?<=^obj\s)[0-9]+" -Context 0,2 | Format-List
IgnoreCase : True
LineNumber : 7
Line : obj 5 0
Filename : InputStream
Path : InputStream
Pattern : (?<=^obj\s)[0-9]+
Context : Microsoft.PowerShell.Commands.MatchInfoContext
Matches : {5}
...
To parse the Referencing line we need to dig into the Context property, so let's see what it holds:
PS C:\> ... | Select-Object -ExpandProperty Context | Get-Member
TypeName: Microsoft.PowerShell.Commands.MatchInfoContext
Name MemberType Definition
---- ---------- ----------
Clone Method System.Object Clone()
Equals Method bool Equals(System.Object obj)
GetHashCode Method int GetHashCode()
GetType Method type GetType()
ToString Method string ToString()
DisplayPostContext Property System.String[] DisplayPostContext {get;set;}
DisplayPreContext Property System.String[] DisplayPreContext {get;set;}
PostContext Property System.String[] PostContext {get;set;}
PreContext Property System.String[] PreContext {get;set;}
The PostContext property contains the two lines that followed our initial match. We can access the second of those lines with an index of 1 (remember, the array is zero-based, so index 1 is the second element).
PS C:\> C:\Python25\python.exe pdf-parser.py CLKF.pdf |
Select-String -Pattern "(?<=^obj\s)[0-9]+" -Context 0,2 |
ForEach-Object { $objnum = $_.matches[0].Value; $_.Context.PostContext[1] }
Referencing: 6 0 R
Referencing:
Referencing: 16 0 R
Referencing:
Referencing: 25 0 R
...
The above command saves the current object number in $objnum and then outputs the second line of the PostContext.
Finally, we need to parse the Referencing line with another Select-String and regular expression:
PS C:\> C:\Python25\python.exe pdf-parser.py CLKF.pdf |
Select-String -Pattern "(?<=^obj\s)[0-9]+" -Context 0,2 |
% { $objnum = $_.matches[0].Value; $_.Context.PostContext[1] |
Select-String "(\d+(?=\s0\sR))" -AllMatches | Select-Object -ExpandProperty matches |
ForEach-Object { "$objnum -> $($_.Value);" } }
5 -> 6;
...
4 -> 3;
4 -> 11;
4 -> 12;
4 -> 13;
4 -> 5;
...
The second line of PostContext, the Referencing line, is piped into the Select-String cmdlet where we use our regular expression to look for a number followed by "<space>0<space>R". The lookahead, (?=\s0\sR), checks for that trailing text without including it in the match. The AllMatches switch is used to find all the objects referenced. We then expand the Matches property so we can work with each match inside our ForEach-Object cmdlet, where we output the original object number and the found reference.
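The lookahead half of the trick can be tried on the Unix side too. Again this is only a sketch that assumes GNU grep with PCRE support (-P), and ref_numbers is a made-up name. Here grep's -o option plays roughly the role of -AllMatches, printing every match on the line rather than just the first.

```shell
# Hypothetical helper: pull every referenced object number off a
# "Referencing:" line. Assumes GNU grep with -P (PCRE).
ref_numbers() {
  # \d+ must be followed by " 0 R", but the lookahead (?=...) keeps
  # that trailing text out of the match.
  grep -oP '\d+(?=\s0\sR)'
}

echo 'Referencing: 3 0 R, 11 0 R, 12 0 R, 13 0 R, 5 0 R' | ref_numbers
```

Like the PowerShell pattern, this one hardcodes a version number of 0, so a reference such as "7 1 R" would be skipped.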