Tuesday, October 18, 2011

Episode #160: Plotting to Take Over the World

Hal's been teaching

Whew! Just got done with another week of teaching, this time at SANS Baltimore. I even got a chance to give my "Return of Command Line Kung Fu" talk, so I got a bunch of shell questions.

One of my students had a very interesting challenge. To help analyze malicious PDF documents, he was trying to parse the output of Didier Stevens' pdf-parser.py and create an input file for GNUplot that would show a graph of the object references in the document. Here's a sample of the kind of output we're dealing with:

$ pdf-parser.py CLKF.pdf
PDF Comment '%PDF-1.3\n'

PDF Comment '%\xc7\xec\x8f\xa2\n'

obj 5 0
Type:
Referencing: 6 0 R
Contains stream
[(1, '\n'), (2, '<<'), (2, '/Length'), (1, ' '), (3, '6'), (1, ' '), (3, '0'), (1, ' '), (3, 'R'), (2, '/Filter'), (1, ' '), (2, '/FlateDecode'), (2, '>>'), (1, '\n')]

<<
/Length 6 0 R
/Filter /FlateDecode
>>


obj 6 0
Type:
Referencing:
[(1, '\n'), (3, '678'), (1, '\n')]
...


obj 4 0
Type: /Page
Referencing: 3 0 R, 11 0 R, 12 0 R, 13 0 R, 5 0 R
...

The lines like "obj 5 0" give the object number and version of a particular object in the PDF. The "Referencing" lines below show the objects referenced. A given object can reference any number of objects from zero to many.

To make the chart with GNUplot, we need to create an input file that shows "obj -> ref;" for all references. So for object #5, we'd have one line of output that shows "5 -> 6;". There would be no output for object #6, since it references zero objects. And we'd get 5 lines of output for object #4, "4 -> 3;", "4 -> 11;", and so on.

This seems like a job for awk. Frankly, I thought about just calling Davide Brini and letting him write this week's Episode, but he's already getting too big for his britches. So here's my poor, fumbling attempt:

$ pdf-parser.py CLKF.pdf |
awk '/^obj/ { objnum = $2 };
/Referencing: [0-9]/ \
{ max = split($0, a);
for (i = 2; i < max; i += 3) { print objnum" -> "a[i]";" }
}'

5 -> 6;
...
4 -> 3;
4 -> 11;
4 -> 12;
4 -> 13;
4 -> 5;
...

The first line of awk matches the "obj" lines and puts the object number into the variable "objnum". The second awk expression matches the "Referencing" lines, but notice that I added a "[0-9]" at the end of the pattern match so that I only bother with lines that actually include referenced objects.

When we hit a line like that, then we do the stuff in the curly braces. split() breaks our input line, aka "$0", on white space and puts the various fields into an array called "a". split() also returns the number of elements in the array, which we put into a variable called "max". Then I have a for loop that goes through the array, starting with the second element-- this is the actual object number that follows "Referencing:". Notice the loop update code is "i += 3", which allows me to just access the object number elements and skip over the other crufty stuff I don't care about. Inside the loop we just print out the object number and current array element with the appropriate punctuation for GNUplot.

Meh. It's a little scripty, I must confess. Mostly because of the for loop inside of the awk statement to iterate over the references. But it gets the job done, and I really did compose this on the command line rather than in a script file.

Let's see if Tim's plotting involves a trip to Scriptistan as well...

Tim's traveling

While I have been out of the country for a few weeks, I didn't have to visit Scriptistan to get my fu for this week. The PowerShell portion is a bit long, but I wouldn't classify it as a script even though it has a semicolon in it. We do have lots of ForEach-Object cmdlets, Select-String cmdlets, and Regular Expressions. And you know what they say about Regular Expressions: Much like violence, if Regular Expressions aren't working, you aren't using enough of it.

Instead of starting off with some ultraviolent fu, let's build up to that before we wield the energy to destroy medium-large buildings. First, let's find the object number and its references.

PS C:\> C:\Python25\python.exe pdf-parser.py CLKF.pdf |
Select-String -Pattern "(?<=^obj\s)\d+" -Context 0,2


> obj 5 0
Type:
Referencing: 6 0 R
> obj 6 0
Type:
Referencing:
> obj 15 0
Type:
Referencing: 16 0 R
...
> obj 4 0
Type: /Page
Referencing: 3 0 R, 11 0 R, 12 0 R, 13 0 R, 5 0 R


The output of pdf-parser.py is piped into the Select-String cmdlet which finds lines that start with "obj", are followed by a space (\s), then one or more digits (\d+). The Context switch is used to get the next two lines so we can later use the "Referencing" portion.

You might also notice our regular expression uses a "positive look behind", meaning that it needs to see "obj " before the number we want. This way we end up with just the object number being selected and not the useless text in front of it. This is demonstrated by Matches object shown below.

PS C:\> C:\Python25\python.exe pdf-parser.py CLKF.pdf |
Select-String -Pattern "(?<=^obj\s)[0-9]+" -Context 0,2 | Format-List


IgnoreCase : True
LineNumber : 7
Line : obj 5 0
Filename : InputStream
Path : InputStream
Pattern : (?<=^obj\s)[0-9]+
Context : Microsoft.PowerShell.Commands.MatchInfoContext
Matches : {5}
...


To parse the Referencing line we need we need to use some more violence regular expressions on the Context object. First, let's see what the Context object looks like. To do this we previous command into the command below to see the available properties.

PS C:\> ... | Select-Object -ExcludeProperty Context | Get-Member

TypeName: Microsoft.PowerShell.Commands.MatchInfoContext

Name MemberType Definition
---- ---------- ----------
Clone Method System.Object Clone()
Equals Method bool Equals(System.Object obj)
GetHashCode Method int GetHashCode()
GetType Method type GetType()
ToString Method string ToString()
DisplayPostContext Property System.String[] DisplayPostContext {get;set;}
DisplayPreContext Property System.String[] DisplayPreContext {get;set;}
PostContext Property System.String[] PostContext {get;set;}
PreContext Property System.String[] PreContext {get;set;}


The PostContext property contains the two lines that followed our initial match. We can access the second line by access the row with an index of 1 (remember, base zero, so 1=2).

PS C:\> C:\Python25\python.exe pdf-parser.py CLKF.pdf |
Select-String -Pattern "(?<=^obj\s)[0-9]+" -Context 0,2 |
ForEach-Object { $objnum = $_.matches[0].Value; $_.Context.PostContext[1] }


Referencing: 6 0 R
Referencing:
Referencing: 16 0 R
Referencing:
Referencing: 25 0 R
...


The above command saves the current object number in $objnum and then outputs the second line of the PostContext.

Finally, we need to parse the Context with ultra violence more regular expressions and display our output.

PS C:\> C:\Python25\python.exe pdf-parser.py CLKF.pdf |
Select-String -Pattern "(?<=^obj\s)[0-9]+" -Context 0,2 |
% { $objnum = $_.matches[0].Value; $_.Context.PostContext[1] |
Select-String "(\d+(?=\s0\sR))" -AllMatches | Select-Object -ExpandProperty matches |
ForEach-Object { "$($objnum)->$($_.Value)" } }


5 -> 6;
...
4 -> 3;
4 -> 11;
4 -> 12;
4 -> 13;
4 -> 5;
...


The second line of PostContect, the Referencing line, is piped into the Select-String cmdlet where we use our regular expression to look for the a number followed by "<space>0<space>R". The AllMatches switch is used to find all the objects referenced. We then Expand the matches property so we can work with each match inside our ForEach-Object cmdlet where we output the original object number and the found reference.

Tuesday, October 4, 2011

Episode #159: Portalogical Exam

Tim finally has an idea

Sadly, we've been away for two weeks due to lack of new, original ideas for posts. BUT! I came up with and idea. Yep, all by myself too. (By the way, if you have an idea for an episode send it in)

During my day job pen testing, I regularly look at nmap results to see what services are available. I like to get a high level look at the open ports. For example, lots of tcp/445 means a bunch of Windows boxes. It is also useful to quickly see the one off ports, and in this line of work, the one offs can be quite important. One unique service may be legacy, special (a.k.a. not patched), or forgotten.

Nmap has a number of output options: XML, grep'able output, and standard nmap output. PowerShell really favors objects, which means that XML will work great. So let's start off by reading the file and parsing it as XML.

PS C:\> [xml](Get-Content nmap.xml)

xml xml-stylesheet #comment nmaprun
--- -------------- -------- -------
version="1.0" href="file:///usr/local/sh... Nmap 5.51 scan initiated ... nmaprun


Get-Content (alias gc, cat, type) is used to read the file, then [xml] parses it and converts it to an XML object. After we have an XML object, we can see all the nodes of the document. To access each node we access it like any property:

PS C:\> ([xml](gc nmap.xml)).nmaprun.host


But each host has a ports property that needs to be expanded, and each ports property has multiple port properties to be expanded. To do this we use a pair of Select-Object cmdlets with the -ExpandProperty switch (-ex for short).

PS C:\> ([xml](gc .nmap.xml)).nmaprun.host | select -expand ports | 
select -ExpandProperty port


protocol portid state service
-------- ------ ----- -------
tcp 22 state service
tcp 23 state service
tcp 80 state service
tcp 443 state service
tcp 80 state service
tcp 443 state service
...


Nmap can have information on closed ports, so I like to make sure that I am only looking at open ports. We use the Where-Object cmdlet (alias ?) to filter for ports that are open. Each port has a state element and a state property, and we'll check if the state of the state (yep, that's right) is open:

PS C> ... | ? { $_.state.state -eq "open" }


The output is the same, just with extra filtering. Now all we need to do is count. To do that, we use the Group-Object cmdlet (alias group).

PS C:\> ([xml](gc nmap.xml)).nmaprun.host | select -expand ports | 
select -ExpandProperty port | ? { $_.state.state -eq "open" } |
group protocol,portid -NoElement


Count Name
----- ----
12 tcp, 80
1 tcp, 25
12 tcp, 443
2 tcp, 53
...


The -NoElement switch tells the cmdlet to discard the individual objects and just give us the group information.

Of course, if we are looking for patterns or one off ports we need use the Sort-Object cmdlet (alias sort) by the Count and use the -Descending switch (-desc for short)..

PS C:\> ([xml](gc nmap.xml)).nmaprun.host | select -expand ports | 
select -ExpandProperty port | ? { $_.state.state -eq "open" } |
group protocol,portid -NoElement | sort count -desc


Count Name
----- ----
12 tcp, 443
12 tcp, 80
2 tcp, 53
2 tcp, 18264
...



Now that's handy, but many times I have multiple scans, like a UDP scan and a TCP scan. If we want to combine multiple scans into one table we can do it relatively easily.

PS C:\> ls *.xml | % { ([xml](gc $_)).nmaprun.host } | select -expand ports |
select -ExpandProperty port | ? { $_.state.state -eq "open" } |
group protocol,portid -NoElement | sort count -desc


Count Name
----- ----
12 tcp, 443
12 tcp, 80
3 udp, 161
2 tcp, 53
2 tcp, 18264
...


The beauty of PowerShell's pipeline is that we can use any method we want to pick the files, then feed them into the next command with a ForEach-Object loop (alias %).

Now that I've checked all my ports, it's time for Hal to get his checked.

Hal examines his options

Tim, when you get to be my age, you'll get all of your ports checked on an annual basis.

Now let's examine this so-called idea of Tim's. Oh sure, XML is all fine and dandy for fancy scripting languages like Powershell. But you'll notice he didn't even attempt to do this in CMD.EXE. Weakling.

While XML is generally a multi-line format and not typically conducive to shell utilities that operate on a "line at a time" basis, for something simple like this we can easily hack together some code. In the XML format, the lines that show open ports have a regular format:

<port protocol="tcp" portid="443"><state state="open" />...</port>

So pardon me while I throw down some sed:

$ sed 's/^<port protocol="\([^"]*\)" portid="\([^"]*\)"><state state="open".*/\2\/\1/; 
t p; d; :p' test.xml | sort | uniq -c | sort -nr -k1 -k2

6 443/tcp
5 80/tcp
3 22/tcp
2 3306/tcp
1 9100/tcp
...

The first part of the sed expression is a substitution that matches the protocol name and port number and replaces the entire line with just "<port>/<protocol>". Now I only want to output the lines where this substitution succeeded, so I use "t p" to branch to the label ":p" whenever the substitution happens. If we don't branch, then we hit the "d" command to drop the pattern space without printing and move onto the next line. Since the ":p" label we jump to on a successful substitution is an empty block, sed just prints the pattern space and moves onto the next line. This is a useful sed idiom for only printing our matching lines.

The rest of the pipeline puts the output lines from sed into sorted order so we can feed them into "uniq -c" to count the occurrences of each line. After that we use sort again to do a descending numeric sort ("-nr") of first the counts ("-k1") and then the port numbers ("-k2"). And that give us the output we want.

I actually find the so-called "grep-able" output format of Nmap kind of a pain to deal with for this kind of thing. That's because Nmap insists on jamming all of the port information together into delimited, but variable length lines like this:

Host: 192.168.1.2 (test.deer-run.com) Ports: 22/open/tcp//ssh///, 
25/open/tcp//smtp///, 53/open/tcp//domain///, 80/open/tcp//http///,
139/open/tcp//netbios-ssn///, 143/open/tcp//imap///, 443/open/tcp//https///,
445/open/tcp//microsoft-ds///, 514/open/tcp//shell///, 587/open/tcp//submission///,
601/open/tcp/////, 902/open/tcp//iss-realsecure-sensor///, 993/open/tcp//imaps///,
1723/open/tcp//pptp///, 8009/open/tcp//ajp13/// Seq Index: 3221019...

So to handle this problem, I'm just going to use tr to convert the spaces to newlines, forcing the output to have a single port entry per line. After that, it's just awk:

$ cat test.gnmap | tr ' ' \\n | awk -F/ '/\/\/\// {print $1 "/" $3}' | 
sort | uniq -c | sort -nr -k1 -k2

6 443/tcp
5 80/tcp
3 22/tcp
2 3306/tcp
1 9100/tcp
...

The port listings all end with "///", so I use awk to match those lines and output the port and protocol fields. Notice the "-F/" option so that awk uses the slash character as the field delimiter instead of whitespace. After that it's the same "sort ... | uniq -c | sort ..." pipeline we used in the last case to format the output.

The easiest case is actually the regular Nmap output:

$ awk '/^[0-9]/ {print $1}' test.nmap | sort | uniq -c | sort -nr -k1 -k2
6 443/tcp
5 80/tcp
3 22/tcp
2 3306/tcp
1 9100/tcp
...

The lines about open ports are the only ones in the output that start with digits. So it's a quick awk expression to match these lines and output the port specifier. After that, we use the same pipeline we used in the previous examples to format the output appropriately.

So, Tim, my shell may not have built-in tools to parse XML but it's apparently three times the shell that yours is. Stick that in your port and smoke it.

Because Davide sed So

Proving he's not just a whiz at awk, Davide Brini wrote in with a more elegant sed idiom for just printing the lines that match our substitution:

$ sed -n 's/^<port protocol="\([^"]*\)" portid="\([^"]*\)"><state state="open".*/\2\/\1/p' test.xml | 
sort | uniq -c | sort -nr -k1 -k2

6 443/tcp
5 80/tcp
3 22/tcp
2 3306/tcp
1 9100/tcp
...

"sed -n ..." suppresses the normal sed output and "s/.../.../p" causes the lines that match our substitution to be printed out. And that's much easier. Thanks, Davide!