Tuesday, July 27, 2010

Episode #105: File Triage

Hal answers the mail:

Frank McClain, one of my former SANS For508 students, sent me some email containing a bit of Command-Line Fu that he and his co-worker Mark Hallman had developed. The problem they were trying to solve was to sort out Microsoft Office files carved out of a forensic image using a tool like foremost or scalpel. The file format signatures used by these tools don't distinguish different types of MS Office files, so when I run foremost for example, I end up with a directory of file names with the generic ".ole" extension:

# ls
00003994.ole 00004618.ole 00005146.ole 00005746.ole 00015410.ole
00004162.ole 00004722.ole 00005250.ole 00010994.ole 00051554.ole
00004226.ole 00004826.ole 00005354.ole 00011410.ole 00054226.ole
00004290.ole 00004890.ole 00005418.ole 00012274.ole
00004394.ole 00004954.ole 00005514.ole 00014154.ole
00004498.ole 00005018.ole 00005578.ole 00014858.ole
00004554.ole 00005082.ole 00005642.ole 00015330.ole

Happily, the Linux "file" command is not only able to distinguish which file types you have, but is also able to show you different information in the file meta-data:

# file 00004618.ole
00004618.ole: CDF V2 Document, Little Endian, Os: Windows, Version 5.1, Code page: 1252,
Title: SANS Expense Report Spreadsheet, Subject: SANS Expense Report Spreadsheet,
Author: Jason Fossen, Keywords: SANS spreadsheet expense report, Comments: Version 2.0 --
Updated 2/17/02., Last Saved By: Hal Pomeranz, Name of Creating Application: Microsoft Excel,
Last Printed: Fri Oct 31 01:31:31 2003, Create Time/Date: Sat Sep 29 18:28:43 2001,
Last Saved Time/Date: Sat Aug 15 23:05:29 2009, Security: 0

There's lots of interesting information here, but for our purposes "Name of Creating Application: Microsoft Excel" is the helpful bit. Frank and Mark wanted to recognize the file type in the output of the "file" command and change the extension on the file from ".ole" to the appropriate Windows file extension-- ".xls" in this case. Here's their solution:

file -p *.ole | grep -i excel | awk -F: '{print $1}' | rename 's/\.ole/\.xls/'

"file -p" will dump the information about the files while attempting to preserve the file timestamps. Then we use grep to match the Excel files and awk to pull off the file name-- "-F:" splits on colons instead of whitespace and then we print the first field. The file names get fed into the rename command which changes the file extensions.

This example exercises one of my pet peeves: piping grep into awk. awk has built-in pattern matching, so we could rewrite the command line as:

file -p *.ole | awk -F: '/Excel/ {print $1}' | rename 's/\.ole/\.xls/'
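To see that the two pipelines select the same file names, here is a minimal sketch that runs both against canned text. The `sample` variable is a hypothetical, heavily shortened stand-in for real "file" output, so nothing here depends on having carved files on hand.

```shell
# Hypothetical, shortened stand-ins for lines of `file` output.
sample='00004618.ole: CDF V2 Document, Name of Creating Application: Microsoft Excel
00010994.ole: CDF V2 Document, Name of Creating Application: Microsoft Office PowerPoint'

# grep piped into awk, as in the original pipeline
with_grep=$(printf '%s\n' "$sample" | grep -i excel | awk -F: '{print $1}')

# awk alone: the /Excel/ pattern selects lines, the action prints field 1
awk_only=$(printf '%s\n' "$sample" | awk -F: '/Excel/ {print $1}')

echo "$with_grep"
echo "$awk_only"
```

One difference to keep in mind: "grep -i" matches case-insensitively, while the awk pattern /Excel/ is case-sensitive.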

Another problem in the context of the Command-Line Kung Fu blog is the use of the rename command, which isn't necessarily a built-in command on various Unix operating systems. Also, I'd like a single command that sorts out all the different types of MS Office files at once, rather than running one command for Excel spreadsheets, another for PowerPoint files, and so on.

So here's my solution:

# file -p * | awk -F. '/Excel/ { system("mv "$1".ole "$1".xls") }; 
/PowerPoint/ { system("mv "$1".ole "$1".ppt") }'

# ls
00003994.ole 00004618.xls 00005146.xls 00005746.xls 00015410.ppt
00004162.xls 00004722.xls 00005250.xls 00010994.ppt 00051554.ppt
00004226.xls 00004826.xls 00005354.xls 00011410.ppt 00054226.ppt
00004290.xls 00004890.xls 00005418.xls 00012274.ppt
00004394.xls 00004954.xls 00005514.xls 00014154.ole
00004498.xls 00005018.xls 00005578.xls 00014858.ppt
00004554.xls 00005082.xls 00005642.xls 00015330.ppt

I'm using awk to differentiate between the Excel and PowerPoint files (by default, foremost automatically detects Word files and splits them out into another directory so I don't need to deal with them here). I then call system() which allows me to run a shell command-- in this case a mv command to rename the files with the appropriate extensions. But notice something subtle here: unlike Frank and Mark's command, I'm telling awk to split on period ("-F."). So $1 in this case only contains the "basename" of the file before the ".ole" extension. That makes my mv command a bit simpler, though the crazy quoting rules in awk make the whole thing rather ugly.
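Since system() renames files immediately, it can be worth previewing the commands first. The sketch below is one way to do that, not part of the solution above: it swaps system() for print so awk only emits the mv commands it would run. The `sample` text is again a hypothetical, shortened stand-in for "file" output.

```shell
# Hypothetical, shortened stand-ins for lines of `file` output.
sample='00004618.ole: CDF V2 Document, Name of Creating Application: Microsoft Excel
00010994.ole: CDF V2 Document, Name of Creating Application: Microsoft Office PowerPoint
00003994.ole: CDF V2 Document, corrupt'

# Same -F. trick: $1 is the part of the line before the first period,
# i.e. the file name without its .ole extension.
plan=$(printf '%s\n' "$sample" | awk -F. '
  /Excel/      { print "mv " $1 ".ole " $1 ".xls" }
  /PowerPoint/ { print "mv " $1 ".ole " $1 ".ppt" }')

echo "$plan"
```

If the plan looks right, piping the same awk output into "sh" would carry out the renames. The corrupt file matches neither pattern, so it is left alone.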

Careful readers will notice that there are still a couple of files in the directory with .ole extensions:

# file *.ole
00003994.ole: CDF V2 Document, corrupt: Can't read SSAT
00014154.ole: CDF V2 Document, corrupt: Can't read SSAT

These are files that matched foremost's signature for MS Office files, but which are not really Office docs. They're just random sequences of blocks that happened to match the file signature that foremost uses.

Poor Tim doesn't have anything like the "file" command in Windows. But I'm sure that finding a solution for this week's challenge will be a "character building" experience for him. In fact, I'm expecting him to be positively bursting with character very soon now...

Tim bursts:

Alas, there is no Windows equivalent to the file command, and I don't see it coming. We will have to do this the hard way. Time to build some character.

Before we jump into the episode, let's go over a bit of how the file command works so we can recreate some of its functionality. One of the ways that the command determines the file type is by looking at the 'magic number' of the file. The magic number is a byte sequence towards the beginning of the file and it is typically 16 or fewer bytes in length. Ideally, each file type has its own unique signature.

The reason all the carved files are named *.ole is that the magic number is the same for all Microsoft Office documents for versions 97 through 2003. That means an Excel 97 document has the same magic number as a Word 2003 document. To better classify the document type we have to look elsewhere.
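As a rough illustration of a magic number check, here is a small shell sketch. The eight-byte value D0 CF 11 E0 A1 B1 1A E1 is the signature of the OLE2 compound file format that these Office versions share; it is not quoted in this post. The function names `first8` and `is_ole` are made up for the example, and the test file is a fake that holds only the signature plus filler.

```shell
# OLE2 compound-file signature, written as lowercase hex.
OLE_MAGIC="d0cf11e0a1b11ae1"

# Print the first 8 bytes of a file as one hex string.
first8() {
  od -An -tx1 -N8 "$1" | tr -d ' \n'
}

# Succeed if the file starts with the OLE2 signature.
is_ole() {
  [ "$(first8 "$1")" = "$OLE_MAGIC" ]
}

# Build a fake file: signature bytes (in octal escapes) plus filler.
tmp=$(mktemp)
printf '\320\317\021\340\241\261\032\341rest of file' > "$tmp"

if is_ole "$tmp"; then verdict="ole"; else verdict="other"; fi
echo "$verdict"
rm -f "$tmp"
```

A check like this can only say "some kind of Office document", which is exactly why the carved files all end up with the same extension.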

According to a few websites there is supposed to be a unique identifier between 6 and 8 bytes long at offset 512 (512 bytes into the file, counting from zero). I downloaded a bunch of documents as well as created some myself, and it seems the websites did not have a full list, nor were they consistent. I'll skip the boring details, but I finally worked out the best way to determine the file type. Towards the end of the file there is an indicator of the file type, as shown here.

<0x00>Microsoft Office Word<0x00>
<0x00>Microsoft Office PowerPoint<0x00>
<0x00>Microsoft Office Excel 2003 Worksheet<0x00>

Unfortunately, Excel adds the version number to the end of the name so we aren't just limited to three possibilities. However, if we key off of the null byte, the words Microsoft Office, and the next word, then we can determine our file type.

Since a user can't (normally) type a null byte (0x00) into an Office document, we can be reasonably sure that if we find one of the search strings above it will accurately determine the file type. Here is how we do it.

PS C:\> ls *.ole | % { Select-String -Path $_.FullName
"(?<=`0Microsoft Office )([A-Z]+)" -List } | select Filename, Matches

Filename Matches
-------- -------
00003994.ole {Word}
00004162.ole {PowerPoint}
00004226.ole {Excel}

Cool, we can accurately identify the files, but how does it work? First we list the .ole files and hand each one to Select-String through its -Path parameter. The Select-String cmdlet uses a regular expression with a lookbehind to search for the word after a null byte (`0) + Microsoft Office.
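For readers following along on the Unix side, the same idea can be approximated in shell. This is a sketch under assumptions, not a port of the PowerShell: the function name `office_type` is hypothetical, and the test file is a fake containing only the marker string between null bytes. Turning every null byte into a newline lets sed's "^" anchor stand in for the lookbehind.

```shell
# Print the word that follows "<null>Microsoft Office " in a file.
office_type() {
  # Null bytes become newlines, so the marker starts its own line.
  LC_ALL=C tr '\000' '\n' < "$1" |
    LC_ALL=C sed -n 's/^Microsoft Office \([A-Za-z][A-Za-z]*\).*/\1/p' |
    head -n 1   # keep only the first match
}

# Fake file: filler, then the marker wrapped in null bytes.
tmp=$(mktemp)
printf 'junk\000Microsoft Office Excel 2003 Worksheet\000tail' > "$tmp"

detected=$(office_type "$tmp")
echo "$detected"
rm -f "$tmp"
```

This is slightly looser than the lookbehind: the marker would also match right after a genuine newline byte or at the very start of the file, not only after a null byte.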

Before we get really crazy, let's do some simple file renaming.

PS C:\> ls *.ole | ? { Select-String -Path $_.FullName "`0Microsoft Office Word`0" } |
% { mv $_ ($_.Name -replace ".ole", ".doc") }

PS C:\> ls *.ole | ? { Select-String -Path $_.FullName "`0Microsoft Office PowerPoint`0" } |
% { mv $_ ($_.Name -replace ".ole", ".ppt") }

PS C:\> ls *.ole | ? { Select-String -Path $_.FullName "`0Microsoft Office Excel" } |
% { mv $_ ($_.Name -replace ".ole", ".xls") }

First we get a listing of all the .ole files. The files are then filtered before they are passed down the pipeline. The filter looks inside each file, using Select-String, to find a given string. The files are then renamed using Move-Item (alias mv). The Move-Item cmdlet takes two parameters, the original file and the new file name. The original file is the object passed down the pipeline and is designated by $_. The new name is just the old name ($_.Name) with .ole replaced by the new file extension.

One problem: a file can be read up to three times before it is renamed, which will obviously slow things down. So let's speed this crap up.

PS C:\> ls *.ole | % { $a = $_; Select-String -Path $_.FullName
"(?<=`0Microsoft Office )([A-Z]+)" -List } | % { switch ($_.Matches[0]) {
"Excel" { mv $a ($a.Name -replace ".ole", ".xls") }
"Word" { mv $a ($a.Name -replace ".ole", ".doc") }
"PowerPoint" { mv $a ($a.Name -replace ".ole", ".ppt") } } }

Let's space this out so it is a bit easier to read:

PS C:\> ls *.ole | % {
    $a = $_;
    Select-String -Path $_.FullName "(?<=`0Microsoft Office )([A-Z]+)" -List } | % {
    switch ($_.Matches[0]) {
        "Excel" { mv $a ($a.Name -replace ".ole", ".xls") }
        "Word" { mv $a ($a.Name -replace ".ole", ".doc") }
        "PowerPoint" { mv $a ($a.Name -replace ".ole", ".ppt") }
    }
}

This is similar to the first command line fu. We take all the .ole files and pipe them into the ForEach-Object cmdlet. We use a temporary variable ($a) to store our file object since we will need it further down the command.

Next, the Select-String cmdlet is used to grab the special word (Word/Excel/PowerPoint) following <null byte>Microsoft Office. The List switch is used to stop searching each file after the first match is found. Select-String returns a MatchInfo object that contains our match.

The first match is the 0th item in the Matches collection. We use this as the input into our switch command. The switch is used to pick the correct file extension.
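For comparison with the shell half of this episode, the extension lookup can be sketched as a POSIX shell case statement. The function names `ext_for` and `new_name` are hypothetical, and the sketch only computes the new file name rather than renaming anything.

```shell
# Map the detected type word to a file extension.
ext_for() {
  case "$1" in
    Excel)      echo xls ;;
    Word)       echo doc ;;
    PowerPoint) echo ppt ;;
    *)          return 1 ;;   # unknown type: report failure
  esac
}

# Build the new name: $1 = current file name, $2 = type word.
new_name() {
  ext=$(ext_for "$2") || return 1
  echo "${1%.ole}.$ext"       # strip a trailing .ole, add the new extension
}

renamed=$(new_name 00004226.ole Excel)
echo "$renamed"
```

Unlike the PowerShell switch, which compares strings case-insensitively by default, the shell case patterns here are case-sensitive.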

Ok, that wasn't pretty. That brought the pain. But at least I built a bit of character.

Seth Matheson apparently also needed a bit of character, because he concocted some tasty Mac OS fu to solve this problem using Spotlight. But I still think my awk is prettier than his case statement.