Tuesday, March 23, 2010

Episode #87: Making a Hash of Things

Tim needs some hash:

The idea for this week's episode is brought to you from Latvia, by one of our fantastic followers, Konrads Smelkovs. He writes:

I recently had the need to compare two directory trees based on their hash sum, as I could not rely on modified/created or size attributes.

This is what I came up with in powershell, but I am sure there is more elegant method:

< Konrads had a nice bit of hash fu right here, but it was stopped by those customs' dogs upon entry to the US.>

Hashing and PowerShell. I find that hashing cmdlets are one of the most glaring omissions from PowerShell. Hopefully they will be added in v3, but we will have to wait and see. Since hashing isn't built in, we will have to add it ourselves, and we have a few ways to do it.

1. Use functions

This is exactly what Konrads did. In fact, I took most of his commands for use below. His function was a bit different than mine, but the results are pretty much the same. I'll skip the inner workings of this function since the guts aren't the important part and we have a lot to cover.

PS C:\> function Get-MD5Hash ($file) {
$hasher = [System.Security.Cryptography.MD5]::Create()
$inputStream = New-Object System.IO.StreamReader ($file)
$hashBytes = $hasher.ComputeHash($inputStream.BaseStream)
$inputStream.Close()
$builder = New-Object System.Text.StringBuilder
$hashBytes | Foreach-Object { [void] $builder.Append($_.ToString("X2")) }
$builder.ToString()
}

Oh no, I think we left CommandLineLand and crossed over to Script-istan. Everyone needs to visit the Scripti people once in a while, and if we add this function to our profile we won't have to go back. At least now we can get the MD5 hash of any file:

PS C:\> Get-MD5Hash file.txt

If we wanted to run this recursively on a bunch of files, we could do this:

PS C:\> ls -r | ? { -not $_.PSIsContainer } | % { Get-MD5Hash $_.FullName }

In this command we first get a recursive directory listing using the Get-ChildItem cmdlet (alias ls, dir, gci) with the Recurse parameter (-r for short). Next, we filter out all Containers (directories) using Where-Object (alias ?). Finally, we compute the hash of each item using our function inside the ForEach-Object scriptblock.

I would assume many of you are asking, "Why is there a property called PSIsContainer instead of IsDirectory or IsFolder?" The reason is, PowerShell was designed to be extensible. The Get-ChildItem cmdlet is used on most Providers, and the File System is one such Provider.

Next you ask, "What is a Provider?" Providers provide access to data and components that
would not otherwise be easily accessible at the command line. The data is presented in a consistent format that resembles the file system drive. Examples include the Registry (ls hklm:\) and the certificate store (ls cert:\).

Now that your questions have been answered, let's get back to our hashing.

Our previous output was terrible since we couldn't tell which file each hash came from. We lost all the information about the file. What if we could just add the hash as a property of the file? Well, we can!

We can add properties to any object by using the Add-Member cmdlet:

PS C:\> ls -r | ? { -not $_.PSIsContainer } | % { $_ | Add-Member
-MemberType NoteProperty -Name Hash -Value (Get-MD5Hash $_.FullName) -PassThru } |
Select Name,Hash

Name Hash
---- ----
file1.txt C11D8024A08381ECD959DB22BC2E7784
file2.txt 71DE107BEFF4FC5C34CF07D6687C8D84
file3.txt CB45F2C9DC3334D8D6D9872B1A5C91F6
file4.txt 8B4A8D66EB3E07240EA83E96A7433984
file5.txt 0E3315D930E7829CCDE63B123DD61663

Add-Member is used to add custom properties and methods to an instance of an object. The first parameter we need is the MemberType. We are going to create a NoteProperty, a property with a static value. There are a lot of other MemberType options; see Get-Help Add-Member for the full list. The next two parameters, Name and Value, are pretty self-explanatory. The PassThru parameter passes the new object down the pipeline. Finally, we select the Name and Hash to be displayed.

So we manually added the property to each object. What if we wanted the property to be permanent?

2. Update the File Data Type

This sounds hard, but it is actually pretty easy. All we need to do is create a little XML file to extend the File object (technically the System.IO.FileInfo data type). One thing to note: the XML file you create must be saved with the .ps1xml extension. Here is the ps1xml file based on the function above:

<Types>
  <Type>
    <Name>System.IO.FileInfo</Name>
    <Members>
      <ScriptProperty>
        <Name>MD5</Name>
        <GetScriptBlock>$hasher = [System.Security.Cryptography.MD5]::Create()
          $inputStream = New-Object System.IO.StreamReader ($this.FullName)
          $hashBytes = $hasher.ComputeHash($inputStream.BaseStream)
          $inputStream.Close()
          $builder = New-Object System.Text.StringBuilder
          $hashBytes | Foreach-Object { [void] $builder.Append($_.ToString("X2")) }
          $builder.ToString()
        </GetScriptBlock>
      </ScriptProperty>
    </Members>
  </Type>
</Types>

To extend the Data Type it is as simple as running this command:

PS C:\> Update-TypeData hash.ps1xml

If you add this command to your profile (see episode #83) it will load automatically. Now we can get the MD5 hash of any file.

PS C:\> ls file1.txt | select name, md5

Name MD5
---- ----
file1.txt C11D8024A08381ECD959DB22BC2E7784

One cool bit is that it won't compute the hash unless you access the property. The default output doesn't display the hash, so it won't slow down your system. There is a weird catch, though, and I can't find an explanation for it. Here is what I mean.

PS C:\> ls -r | select name,md5

Name MD5
---- ---

PS C:\> ls -r *.* | select name,md5

Name MD5
---- ---
file1.txt 32F1F5E65258B74A57A4E4C20A87C946
file2.txt EA42DB0F1DCFE6D1519AAC64171D2F37

For some reason, besides accessing the md5 property, you also have to give it a path in order to get output. A little odd, but easy to get around if you know about it.

Now let's take a look at the third option.

3. PowerShell Community Extensions

We could use any add-in, but I find the PowerShell Community Extensions to be the best general purpose add-in. PSCX adds a number of cmdlets that are missing from PowerShell. Today, the cmdlet we care about is Get-Hash, and it is way more powerful than the function we wrote in section 1. Not only does it give us the ability to get MD5 (default), it also provides us with SHA1, SHA256, SHA384, SHA512, and RIPEMD160. But wait, there's more! It can even hash a string, while my function (as written) can't. It even creates a nice little object that contains the Path and the Hash, so we don't have to do all the object creation.

PS C:\temp> ls -r | ? { -not $_.PSIsContainer } | % { Get-Hash $_ } | select path,hashstring

Path HashString
---- ----------
c:\dir1\file1.txt C11D8024A08381ECD959DB22BC2E7784
c:\dir1\file2.txt 71DE107BEFF4FC5C34CF07D6687C8D84
c:\dir1\file3.txt CB45F2C9DC3334D8D6D9872B1A5C91F6
c:\dir1\sub\file4.txt 8B4A8D66EB3E07240EA83E96A7433984

Depending on where and how you are using it, each of these options fits a different niche. But for day-to-day work on a system I recommend installing PSCX.

Comparing directories

For this portion I am going to use the PowerShell Community Extensions since that is my preferred method. We touched on using the Compare-Object cmdlet in episode #73 but we have another cool way to find files that don't have matching hashes.

PS C:\> ls dir1,dir2 -r | ? { -not $_.PSIsContainer} | % { Get-Hash $_.FullName } | 
group HashString | ? { $_.Count -eq 1 } | select -ExpandProperty Group

Path : c:\dir1\sub\file4.txt
HashString : 8B4A8D66EB3E07240EA83E96A7433984

Path : c:\dir2\sub\file4.txt
HashString : 0B39DC79962BC3CEA40EBF14336BFC4D

Let's break this down:

PS C:\> ls dir1,dir2 -r | ? { -not $_.PSIsContainer} | % { Get-Hash $_.FullName }

In section 1 we did something very similar, so we'll skip the detailed explanation. However, there is one cool trick here: the recursive directory listing is given two directories separated by a comma. Now let's take a look at the rest of the command:

... | group HashString | ? { $_.Count -eq 1 } | select -ExpandProperty Group

The Group-Object cmdlet creates one group per unique HashString. We then filter on groups with only one item, which leaves us with files that do not match any other file. The output of Group-Object is a collection of GroupInfo objects, so we use -ExpandProperty to expand the Group property back into our original file objects. We now have our output.

I'm getting mighty hungry after all that hash. Hey dude, your turn man.

Hal has the munchies too:

Synchronicity is an interesting thing. Shortly before Konrads sent us his message, I had to solve exactly this problem during a forensic investigation. Law enforcement had obtained several copies of a suspect's email-- in maildir format, and captured months apart-- and wanted us to extract all of the unique messages from the directories.

Happily, this is a lot easier to do in Unix than it is in Windows. For one thing, there are typically many options for generating file hashes in a Unix environment. I'll go with the md5sum command this time, because it produces output that's easy to deal with for this particular task:

$ md5sum dir1/cur/file1
b026324c6904b2a9cb4b88d6d61c81d1 dir1/cur/file1

Generating checksums across an entire directory is just a matter of applying a bit of find and xargs action:

$ find dir1 -type f | xargs md5sum
6d7fce9fee471194aa8b5b6e47267f03 dir1/cur/file3
b026324c6904b2a9cb4b88d6d61c81d1 dir1/cur/file1
1dcca23355272056f04fe8bf20edfce0 dir1/cur/file5
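
One caveat before copying this pattern: xargs splits its input on whitespace, so file names containing spaces will break the pipeline. Where GNU find and xargs are available, the NUL-delimited variant is safer. Here is a small sketch, with a made-up file name standing in for real data:

```shell
# Build a tiny sample tree (hypothetical names, standing in for dir1)
mkdir -p dir1/cur
echo hello > 'dir1/cur/file one'      # note the space in the name

# NUL-delimited names survive the trip through xargs intact
find dir1 -type f -print0 | xargs -0 md5sum
```

The output format is the same, so everything that follows works unchanged.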

To find the common files between multiple directories, I simply put multiple directory names into my find command and then sorted the md5sum output so that the duplicate files were grouped together:

$ find dir1 dir2 -type f | xargs md5sum | sort
166d77ac1b46a1ec38aa35ab7e628ab5 dir2/new/file11
1dcca23355272056f04fe8bf20edfce0 dir1/cur/file5
1dcca23355272056f04fe8bf20edfce0 dir1/new/file5
1dcca23355272056f04fe8bf20edfce0 dir2/cur/file5

Adding a quick while loop allowed me to pick out the duplicated checksum values and output just a list of the unique file names:

$ find dir1 dir2 -type f | xargs md5sum | sort | 
while read hash file; do [ "X$hash" != "X$oldhash" ] && echo $file; oldhash=$hash; done


In the while loop we're using read to assign the hash and the file name to variables. If the current hash is different from the previous hash, then output the file name. Then assign the current hash value to be the "previous" hash and read in the next line.
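
To see the loop's logic in isolation, here's a self-contained sketch run against canned, pre-sorted md5sum output (reusing the hash values from the examples above; the dir2 path is invented for illustration):

```shell
# Fake, pre-sorted md5sum output: two copies of file5, one file1
printf '%s\n' \
  '1dcca23355272056f04fe8bf20edfce0  dir1/cur/file5' \
  '1dcca23355272056f04fe8bf20edfce0  dir2/cur/file5' \
  'b026324c6904b2a9cb4b88d6d61c81d1  dir1/cur/file1' |
while read hash file; do
  # Only echo the file name when we see a hash for the first time
  [ "X$hash" != "X$oldhash" ] && echo "$file"
  oldhash=$hash
done
```

This prints dir1/cur/file5 and dir1/cur/file1: one representative per unique hash.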

I'd say we're playing a dangerous game of cat and mouse with the Scriptistanian border guards here, but not actually creating an incursion on Scripti soil. I can enter the above code pretty easily on the command-line-- indeed, that's how I generated the example output above. It's certainly much more reasonable than Tim's blatant violation of our "no scripting" rule.

If you wanted to clean up the output a bit, you could add one last sort command at the end:

$ find dir1 dir2 -type f | xargs md5sum | sort | 
while read hash file; do [ "X$hash" != "X$oldhash" ] && echo $file; oldhash=$hash; done | sort


And that's my final answer, Regis.

Or at least it was until I got this little tidbit from loyal reader and official "friend of the blog", Jeff Haemer:

$ find dir1 dir2 -type f | xargs md5sum | sort -u -k 1,1 | awk '{$1=""; print}'

While I had known about the "-u" ("unique") option in sort to eliminate duplicate lines, I had no idea you could combine it with "-k" to force the uniq-ifying to happen on specific columns. That's some sexy fu, Jeff!
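
To see what "-u -k 1,1" does in isolation, here's a quick sketch on made-up data. With GNU sort, -u disables the last-resort full-line comparison, so only one line survives per first-column key:

```shell
# Two lines share the key "aaa"; -u -k 1,1 keeps just one of them
printf '%s\n' 'aaa first' 'aaa second' 'bbb third' | sort -u -k 1,1
```

Applied to md5sum output, that means one file name per unique hash, which is exactly what we want.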

You might be asking yourself why Jeff didn't just do "awk '{print $2}'" there. Remember, that would only work as long as the file names didn't contain any spaces. Jeff's command, while more complicated, is less prone to error.
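
One side effect worth knowing: assigning to $1 makes awk rebuild the line with single spaces, so each output line carries a leading space, and any runs of spaces inside the file name collapse to one. A quick demo with an invented file name:

```shell
# md5sum separates hash and name with two spaces; clearing $1 makes
# awk rejoin the remaining fields with single spaces
printf 'b026324c6904b2a9cb4b88d6d61c81d1  my file.txt\n' |
awk '{$1=""; print}'
```

The output here is " my file.txt" (note the leading space). Harmless for most uses, and the sed alternative that follows edits the line in place, so it avoids the quirk entirely.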

Note that you could also use sed instead of awk at the end:

$ find dir1 dir2 -type f | xargs md5sum | sort -u -k 1,1 | sed -r 's/^[^ ]+[ ]+//'

Either way, we're basically getting rid of the while loop and replacing it with an implicit while loop in the form of an awk or sed command. But it's cool anyway. Thanks, Jeff!