Tuesday, December 1, 2009

Episode #71: Joining Up

Hal fields a question from IRC

Mr. Bucket passed along the following query from the PaulDotCom IRC channel:

What functionality is available to loop through multiple files, and write the output to a single file with some values on the same line? Ex: If one program gives me the hash of a file, and the other program outputs the name/size/etc of a file, can I output to the same file HASH-FileName-Size


I couldn't resist chortling with glee when this question came up, because it's another one of those "easy for Unix, hard for Windows" kinds of tasks. I never can resist sharing these "learning experiences" with my fellow co-authors.

First let's review our inputs. I'm going to use the openssl utility for generating checksums, since it's fairly generic to lots of different flavors of Unix at this point:

$ openssl sha1 *
SHA1(001.jpg)= a088531884ee5eb520e98b3e9e18283f29e13d25
SHA1(002.jpg)= 77febb1498b2926ee6a988c97f3457e38736456d
SHA1(003.jpg)= 922bcb001d025d747c2ee56328811a4270b62079
...

As you can see, it's pretty easy to generate a set of checksums over my directory of image files, but there's a bunch of cruft around the filename that's not really helpful. So let me get rid of that with some quick sed action:

$ openssl sha1 * | sed -r 's/SHA1\((.*)\)= (.*)/\1 \2/'
001.jpg a088531884ee5eb520e98b3e9e18283f29e13d25
002.jpg 77febb1498b2926ee6a988c97f3457e38736456d
003.jpg 922bcb001d025d747c2ee56328811a4270b62079
...

That's better! In the sed expression I'm using the "(.*)" sub-expressions to match the file name and the checksum in each line, and the substitution operator is replacing the original line with just the values of the sub-expressions. Slick.

Now that we've got the checksums, how do we produce the file sizes? I could just use "ls -l" of course. But since the questioner seems to only want "HASH-FileName-Size", I may as well just use "wc -c" to produce simpler output:

$ wc -c *
4227504 001.jpg
4600982 002.jpg
4271719 003.jpg
...

Now that I know what my inputs are going to be, the question is how to stitch them together? Luckily, Unix includes the join command for putting files together on arbitrary fields (we last saw the join command back in Episode #43). Now I could save the checksum output and the file sizes to separate files and then join the contents of the two files, but bash actually gives us a cooler way to handle this:

$ join -1 1 -2 2 <(openssl sha1 * | sed -r 's/SHA1\((.*)\)= (.*)/\1 \2/') <(wc -c *)
001.jpg a088531884ee5eb520e98b3e9e18283f29e13d25 4227504
002.jpg 77febb1498b2926ee6a988c97f3457e38736456d 4600982
003.jpg 922bcb001d025d747c2ee56328811a4270b62079 4271719
...

See the "<(...)" syntax? That's a little bit of bash file descriptor magic that allows us to substitute the output of a command in a place where a program would normally be looking for a file name. In this case it saves us the hassle of having to create intermediate output files to join together. The join command itself is pretty simple. We're telling the program to join the output of the two commands using the file names in the first field of input #1 and the second field of input #2. The only problem is that the join command isn't producing the "HASH-FileName-Size" output that the original questioner wanted. That's because join always outputs the joined field first, followed by the remaining fields from the first input (the checksum in this case), followed by the remaining fields from the second input (the file size). We'll have to use a little awk fu to re-order the fields:

$ join -1 1 -2 2 <(openssl sha1 * | sed -r 's/SHA1\((.*)\)= (.*)/\1 \2/') <(wc -c *) \
| awk '{print $2 " " $1 " " $3}'

a088531884ee5eb520e98b3e9e18283f29e13d25 001.jpg 4227504
77febb1498b2926ee6a988c97f3457e38736456d 002.jpg 4600982
922bcb001d025d747c2ee56328811a4270b62079 003.jpg 4271719
...

Mmmm, that's a tasty little bit of shell magic, isn't it? Let's see what Ed and Tim are cooking up.

Loyal reader Jeff Haemer points out that you don't need awk if you understand how to work join's "-o" option to select your output fields:

$ join -1 1 -2 2 -o 1.2,0,2.1 <(openssl sha1 * | sed -r 's/SHA1\((.*)\)= (.*)/\1 \2/') <(wc -c *)
a088531884ee5eb520e98b3e9e18283f29e13d25 001.jpg 4227504
77febb1498b2926ee6a988c97f3457e38736456d 002.jpg 4600982
922bcb001d025d747c2ee56328811a4270b62079 003.jpg 4271719
...

Yep, join actually lets you select specific fields from each input file and specify the order you want them output in. Nice. Thanks, Jeff!


Ed retorts snidely:
Choosing a topic just because you think it's hard for us Windows guys, huh, Hal? Well, aren't you just a big ball of sunshine, a command-line Scrooge this holiday season? When I first read this one, I though... "Ugh... this is gonna be hard." Perhaps I was psyched out by your juvenile trash talk. Or, maybe I've just been hanging around in cmd.exe too long, and have gotten used to hard problems.

But, this one turned out to be surprisingly straight-forward and even non-ugly (well, beauty is in the eye of the beholder, I suppose). Here's the skinny:
C:\> FOR /f "tokens=1-2" %a in (name-hash.txt) do @for /f "tokens=1,2" %m
in (length-name.txt) do @if %a==%n echo %b %a %m
a088531884ee5eb520e98b3e9e18283f29e13d25 001.jpg 4227504
77febb1498b2926ee6a988c97f3457e38736456d 002.jpg 4600982
922bcb001d025d747c2ee56328811a4270b62079 003.jpg 4271719

I'm assuming that name-hash.txt contains, well, names and hashes, one pair per line. Likewise, length-name.txt contains lengths and names, again one pair per line.

As we know, FOR /F loops can parse through all kinds of crap, including the contents of files. I use a FOR /F loop with two tokens (giving me two variables) of %a (for the file name) and %b (allocated automagically, holding the hash). For each of the files described in name-hash.txt, I then construct the body of my FOR loop. It contains another FOR /F loop, again with two variables (the original question mentioned "etc" for extra stuff there... if you have more stuff, just up the number of tokens and echo the proper variables at the end). My inner FOR /F loop iterates through the length-name.txt file, placing its values in the variables %m (length) and %n (name).

Now, if I just echoed out %a %b %m %n, I'd be making all of the possible combinations of every pair of two lines in the original files. But, we want to pare that down. We only want to generate some output if the name from name-hash.txt (%a) matches the name from length-name.txt (%n). We do this with a little IF operation comparing the two variables. If they match, we then echo out hash (%b), name (%n), and size (%m).

Admittedly, the performance of this little command isn't great, as I have to run through every line of name-hash.txt, comparing the name by running through the entirety of length-name.txt. I don't stop when I've found a match, because, well, there could be another match somewhere. Also, if there is no match of the name between the two files, my command ignores that name, not issuing any output. But, I think that makes sense given what the questioner asks.

So, Tim... does PowerShell have a nifty little built-in or something to make this easier than running through a couple of FOR loops? Inquiring minds what to know.

Tim tags in for Ed:

For loops! We don't need no stinking For loops!

The first thing to do is import the files. Since there is a space between the columes we can use Import-CSV with a delimiter of the space character. Also, there is no header information so we have to specify it.

PS C:\> Import-Csv length.txt,hash.txt -Delimiter " " -Header File,Data
File Data
---- ----
001.jpg 4227504
002.jpg 4600982
003.jpg 4271719
001.jpg a088531884ee5eb520e98b3e9e18283f29e13d25
002.jpg 77febb1498b2926ee6a988c97f3457e38736456d
003.jpg 922bcb001d025d747c2ee56328811a4270b62079
...


We have all the data, so now it can be grouped by the file name using Group-Object (alias group).

PS C:\> Import-Csv length.txt,hash.txt -Delimiter " " -Header File,Data | group file

Count Name Group
----- ---- -----
2 001.jpg {@{File=001.jpg; Data=4227504}, @{File=001.jpg; Data=a088531884ee5eb520e98b3e9e18283f29e13d25}}
2 002.jpg {@{File=002.jpg; Data=4600982}, @{File=002.jpg; Data=77febb1498b2926ee6a988c97f3457e38736456d}}
2 003.jpg {@{File=003.jpg; Data=4271719}, @{File=003.jpg; Data=922bcb001d025d747c2ee56328811a4270b62079}}
...


We have the data grouped like we want, but we still need to massage it a bit so we can get the formate we want.

PS C:\> Import-Csv length.txt,hash.txt -Delimiter " " -Header File,Data |
group file | Select @{Name="Hash";Expression={$_.Group[1].Data}}, Name,
@{Name="Length";Expression={$_.Group[0].Data}}

Hash Name Length
---- ---- ------
a088531884ee5eb520e98b3e9e18283f29e13d25 001.jpg 4227504
77febb1498b2926ee6a988c97f3457e38736456d 002.jpg 4600982
922bcb001d025d747c2ee56328811a4270b62079 003.jpg 4271719
...


The Select-Object (alias select) cmdlet allows for custom expressions which was used to get the hash and the length. The "Group" object contains multiple items and each can be access by its index value, 0 is the length and 1 is the hash.

Fileless PowerShell

The initial task was to get the file name, length, and hash from separate files and combine them in to one. Let's try this again without using files.

This would be very easy if powershell just had a hashing cmdlet, but it doesn't. However, we can do hashing by using the .NET library and some very ugly PowerShell. Maybe in v3 we will get a Get-Hash cmdlet, but it seems as likely as the addition of Get-Unicorn or Get-MillionDollars.

So we need some hash, but not the kind that is illegal in 49 states, we need the hash of a file. Here is how we get it.

PS C:\> PS C:\> gci 001.jpg | % { (New-Object System.Security.Cryptography
.SHA1CryptoServiceProvider).ComputeHash($_.OpenRead()) }


We use the SHA1CryptoServiceProvider .NET class, but it adds another bump since it doesn't take files as input and will only take a stream. It isn't hard to get the stream though, all we need to use is the OpenRead method of our file object. If that wasn't enough, there is another problem, the output.

PS C:\> PS C:\> gci 001.jpg | % { (New-Object System.Security.Cryptography
.SHA1CryptoServiceProvider).ComputeHash($_.OpenRead()) }

160
136
83
24
...


The result is an array of bytes. So we have to convert that to hex and combine it together.

PS C:\> gci 001.jpg | % {$hash=""; (New-Object System.Security.Cryptography
.SHA1CryptoServiceProvider).ComputeHash($_.OpenRead()) | % { $hash += $_.ToString("X2") }; $hash}

a088531884ee5eb520e98b3e9e18283f29e13d25


We use the ToString method with the format string X2 to convert each byte to hex. The X converts it to hex, and the 2 will make sure the output is two characters wide (0A vs A). We then use the variable $hash to stitch our bytes together to get the full hash.

Now let's see the full command.

PS C:\> gci *.* | select @{Name="Hash";Expression={$hash=""; (New-Object
System.Security.Cryptography.SHA1CryptoServiceProvider).ComputeHash($_.OpenRead()) |
% { $hash += $_.ToString("X2") }; $hash}}, name, length

Hash Name Length
---- ---- ------
a088531884ee5eb520e98b3e9e18283f29e13d25 001.jpg 4227504
77febb1498b2926ee6a988c97f3457e38736456d 002.jpg 4600982
922bcb001d025d747c2ee56328811a4270b62079 003.jpg 4271719
...


The first thing we do is get all the files in the currect directory using Get-ChildItem (aliased as gci or dir). That is piped in to Select-Object (aliased as select) to get the hash, filename, and size. The Select-Object cmdlet allows us to get properties of the pipeline object as well as creating a custom expression. In our case we will use the custom expression to calculate the hash.

Our results are in object form and can be piped to a file with Out-File or Out-Csv.

So the task is complete, but let's pretend for a second we had the fictional Get-Hash cmdlet. If we had our leprachaun our command might look something like this:

PS C:\> gci *.* | select @{Name="Hash";Expression={Get-Hash $_ sha1}, name, length


If only getting hash was easier in Windows.