Tuesday, April 19, 2011

Episode #143: Unicode to Shellcode

Hal checks in from Bali

Just a quick one this week. After finishing up with my posse of hard-rocking malware-crazed weasels from SANS Bali, I'm due for a little R&R and, hey, did I mention I'm in Bali? But sun, sand, surf, and sky will not prevent me from providing our faithful readers with their weekly dose of Command Line Kung Fu. Besides, teaching Lenny Zeltser's excellent Reverse Engineering Malware class always gives me ideas for new Fu.

This week's challenge is a fairly common one for malware analysts. Many times you have JavaScript code from web pages, Microsoft Office documents, or PDF files that contains shellcode payloads in various obfuscated forms. One encoding type is two-byte, little-endian Unicode represented by strings such as %u8806. Today's sample goes that one better though, by introducing extra whitespace and punctuation like so:

'%u5350%u5', '251%u5756%', 'u9c55%u00', 'e8%u0', '000%u5d00', ...

Lenny came up with some tasty Perl code to extract the necessary bytes and reformat them into hex notation such as "\x06\x88" (and from there they can be converted to an executable for analysis). While I'm a big fan of Perl, it's against the rules of our little blog experiment. So I started wondering what the sed solution would look like.

Wonder no more:

$ sed 's/[^0-9A-Fa-f]//g; 
s/\(..\)\(..\)/\\x\2\\x\1/g' shellcode.uni >shellcode.hex

The first substitution gets rid of anything that isn't a hex digit. Not only does this clean up the whitespace, quotes, and commas, it even eliminates all those "%u"s for us. Next we need to grab each byte-- represented by two hex digits-- and put a "\x" in front of it. The tricky part is that the original unicode was in little-endian format, so we must swap each pair of bytes as we work through the string of characters. That's why you see me grabbing two bytes at a time and outputting them with "...\2...\1..." on the right-hand side of the substitution.
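To see the two substitutions at work, here is a minimal sketch that feeds the visible part of the sample line above through the same sed expression. The function names uni2hex and uni2hex_joined are made up for this illustration, and the trailing "..." of the sample is left out since the rest of the data isn't shown. The second function is a variation, not part of the original one-liner: because sed works one line at a time, a value split across a line break would be paired up wrong, so it uses tr to drop everything that isn't a hex digit (newlines included) before sed sees the data.

```shell
#!/bin/sh
# uni2hex: hypothetical wrapper around the two substitutions shown above.
uni2hex() {
    sed 's/[^0-9A-Fa-f]//g; s/\(..\)\(..\)/\\x\2\\x\1/g'
}

# uni2hex_joined: assumed variation for input spread over several lines.
# tr deletes every non-hex character, newlines included, so sed sees a
# single run of hex digits and the pairs cannot straddle a line break.
uni2hex_joined() {
    tr -cd '0-9A-Fa-f' | sed 's/\(..\)\(..\)/\\x\2\\x\1/g'
}

# Visible part of the sample, on one line.
sample="'%u5350%u5', '251%u5756%', 'u9c55%u00', 'e8%u0', '000%u5d00'"
printf '%s\n' "$sample" | uni2hex
# \x50\x53\x51\x52\x56\x57\x55\x9c\xe8\x00\x00\x00\x00\x5d

# The first two fragments of the sample, split across a line break.
printf '%s\n' "'%u5350%u5'," "'251%u5756%'" | uni2hex_joined
# \x50\x53\x51\x52\x56\x57
```

If your file keeps each value whole on its own line, the plain sed version is all you need; the tr variant only matters when the obfuscation splits values across lines.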

Well that's all from my little corner of the South Pacific. I must go now as I hear something calling me...

Tim checks in from a spot where the @$%#$%^ snow is falling:

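The approach is the same with PowerShell: 1) read the file, 2) remove characters that don't represent hex, and 3) swap the pairs of characters while adding a "\x". The -replace operator is case-insensitive by default, so the character class [^0-9A-F] leaves lowercase digits such as the "c" in %u9c55 alone.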

PS C:\> Get-Content shellcode.uni |
ForEach-Object { $_ -replace '[^0-9A-F]', '' } |
ForEach-Object { $_ -replace '(..)(..)', '\x$2\x$1' }


Notice the command above uses single quotes. That is because PowerShell expands variables inside double-quoted strings before the -replace operator has a chance to do any replacing. This means that PowerShell would try to treat $1 and $2 as variables of its own instead of passing the literal text to -replace. Those variables are normally unset, so they expand to nothing and the hex digits vanish, leaving only "\x\x" in place of each match. This is the double-quoted version to avoid:

PS C:\> Get-Content shellcode.uni |
ForEach-Object { $_ -replace '[^0-9A-F]', '' } |
ForEach-Object { $_ -replace '(..)(..)', "\x$2\x$1" }


A longer explanation can be found toward the end of Episode #126.

Now we have the output we want, but let's shorten up the command by using aliases:

PS C:\> gc shellcode.uni |
% { $_ -replace '[^0-9A-F]', '' } |
% { $_ -replace '(..)(..)', '\x$2\x$1' }

We can also remove the ForEach-Object cmdlets and shorten it further:

PS C:\> ((gc shellcode.uni) -replace '[^0-9A-F]', '') -replace '(..)(..)', '\x$2\x$1'

We have one minor problem: if shellcode.uni contains line breaks, then each line will be read separately and the breaks won't be removed. When there are multiple lines of text in the file, Get-Content returns an array of objects, one per line, instead of a single multiline string.

PS C:\> (gc shellcode.uni).getType()

IsPublic IsSerial Name                                     BaseType
-------- -------- ----                                     --------
True     True     Object[]                                 System.Array

We can fix this by converting it to a string.

PS C:\> ([string](gc shellcode.uni)).getType()

IsPublic IsSerial Name                                     BaseType
-------- -------- ----                                     --------
True     True     String                                   System.Object

Our robust shortened version of the command looks like this:

PS C:\> ([string](gc shellcode.uni) -replace '[^0-9A-F]', '') -replace '(..)(..)', '\x$2\x$1'

The final piece is to output our results, and we can use the Out-File cmdlet to do it. However, Out-File in Windows PowerShell writes Unicode (UTF-16) by default, so every character would be stored as two bytes, which isn't what the tools that consume this hex notation are expecting. We have to tell PowerShell to use ASCII by using the -Encoding parameter.

PS C:\> ... | Out-File shellcode.hex -Encoding ASCII

Our final command looks like this:

PS C:\> ([string](gc shellcode.uni) -replace '[^0-9A-F]', '') -replace '(..)(..)', '\x$2\x$1' |
Out-File shellcode.hex -en ASCII

Well that's all from my frozen corner of Minnesota. I must go now as I hear something calling me...