Tuesday, March 2, 2010

Episode #84: Fixing the Filenames

Hal Helps Out

A friend of mine contacted me the other day with an interesting problem. She was trying to recover some files from the backup of an old BBS. In particular, she was trying to get at the attachments for various postings.

The attachment files were in a big directory, but the file names unhelpfully used an internal attachment ID number from the BBS. So we had file names like "attachment.43567". Now my friend also had a text file she extracted from the BBS that mapped attachment IDs to the real file names:

43567 sekrit plans.doc
44211 pizza-costs.xls

So the task was to take the file of "attachment ID to file name mappings" and use that to rename the files in the attachments directory to their correct file names.

I thought about it for a minute, and realized the solution was actually pretty straightforward:

$ while read id file; do mv attachment.$id "$file"; done <id-to-filename.txt

The trickiest part of the exercise was dealing with the file names that had spaces in them, like "sekrit plans.doc". Luckily the format of the input file was "ID filename", which meant that I could treat everything after the first whitespace as the file name. And this is exactly what the builtin "read" command will do for you: in this case it puts the first whitespace-delimited token into the $id variable and then jams whatever is left over into the last variable, $file. Once I got the right information into $file, it was simply a matter of making sure to quote this variable appropriately in the "mv" command inside the loop.
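To see the splitting behavior of "read" in isolation, here is a minimal sketch that feeds it a single sample line through a here-document. The -r flag is an addition that is not in the one-liner above; it stops "read" from treating backslashes as escape characters, which is usually what you want for file names.

```shell
# Sample line in the "ID filename" format used by the mapping file.
line='43567 sekrit plans.doc'

# read puts the first whitespace-delimited token into id and everything
# that is left over (internal spaces intact) into the last variable, file.
# -r (an addition, not in the original one-liner) keeps backslashes literal.
read -r id file <<EOF
$line
EOF

printf 'id=[%s] file=[%s]\n' "$id" "$file"
# prints: id=[43567] file=[sekrit plans.doc]
```

Because "read" treats a run of spaces or tabs as a single delimiter, the same thing happens if the ID and the file name are separated by more than one space.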

So there you go-- a quick one-liner for me, but a real time-saver for my friend. And possibly a real time-sink for Tim and Ed as they try to figure out how to do this in their shells. Let's see, shall we?

Ed Frustrates Hal

Sorry, Hal, but this one just isn't crushing for me. I know that disappoints you, but sometimes (on fairly rare occasions) we don't have to work too hard to coax little cmd.exe to do what we want. It does take two little tricks, but nothing too freakish.

Here's the fu:

C:\> for /f "tokens=1,*" %i in (id-to-filename.txt) do @copy attachment.%i "%j"

I'm using a FOR /F loop to read the contents of id-to-filename.txt, one line at a time. Default delimiters of FOR /F parsing are spaces and tabs, which will work just fine for us here, so there's no need to mess with custom delims. I've specified custom parsing of "tokens=1,*", which will make it assign the first column of the file (the integer in Hal's example) to my first iterator variable (which is %i). Then, the ,* stuff means to assign all of the rest of the line to my second iterator variable, which will be auto-allocated as %j. The ,* stuff is the first trick, which really comes in handy.

Then, in the body of my loop, I turn off display of commands (@) and invoke the copy command to take the contents of attachment.%i and place it into "%j". The second trick, those quotes around %j, is important in allowing us to handle any spaces in the file name. Note that I'm using copy instead of move here, because I don't wanna play Ed-Zilla stomping over the city just in case something goes awry (who's to say that our id-to-filename.txt file will always look like we expect it to?). I guess you could call it the Hipposhellic oath: First do no harm. After we verify that our copy worked like we wanted with a quick dir command, we can always run "del attachment.*".
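For comparison, the same copy-first caution can be applied to the shell loop from Hal's section. What follows is only a sketch under a few assumptions: the function name copy_attachments and the "missing:" message are made up for illustration, and the mapping file is assumed to end with a newline so that "read" sees the last line.

```shell
# copy_attachments MAPFILE
# Copy each attachment.ID to its real name, leaving the originals alone.
# (Hypothetical helper; the name and messages are illustrative only.)
copy_attachments() {
    while read -r id file; do
        # Skip blank or malformed lines rather than guessing at a name.
        [ -n "$id" ] && [ -n "$file" ] || continue
        if [ -e "attachment.$id" ]; then
            # -- guards against file names that begin with a dash.
            cp -- "attachment.$id" "$file"
        else
            # Report IDs that have no matching attachment file.
            printf 'missing: attachment.%s\n' "$id" >&2
        fi
    done <"$1"
}
```

Running "copy_attachments id-to-filename.txt" and checking the results with "ls" leaves the originals in place until you are happy, at which point "rm attachment.*" cleans up.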

Whatcha got, Tim?

Tim Frustrates Most People

Sorry Hal, this isn't too bad in PowerShell either. There are a few ways we can accomplish this task, but I elected to pick the shortest version, which also happens to be the one that brings up something we haven't covered before. Here are the long version and short version of the fu. (The short version is identical but uses built-in aliases.)

PS C:\> Get-Content id-to-filename.txt | ForEach-Object { $id,$file = $_.Split(" ",2); Rename-Item -Path attachment.$id -NewName $file }

PS C:\> gc id-to-filename.txt | % { $id,$file = $_.Split(" ",2); ren attachment.$id $file }

The Get-Content cmdlet is used to read the contents of the file, and its output is piped into ForEach-Object, which runs its script block once per line. Inside the script block is where the line is split. The first parameter passed to the Split method defines the delimiter, and the second defines the maximum number of pieces the line should be split into.

The only wrinkle is that the Split method returns multiple items, which PowerShell displays one per line. Here is an illustration:

PS C:\> gc id-to-filename.txt -TotalCount 1 | % { $_.Split(" ",2); }
43567
sekrit plans.doc

We need both portions of the split to do the rename, so here is where we bring up a new little trick. We can assign the output of Split to several variables at once: the first variable ($id) is assigned the first item, and the last variable ($file) receives the remainder. Once we have the ID and the file name, we can easily rename the files.

If we wanted to be a little safer, we could use Copy-Item (alias cp or cpi) instead of Rename-Item (alias ren or rni). Once we have confirmed the copy was successful, we can delete all the attachment files by using "Remove-Item attachment.*" (alias del, erase, ri, or rm).