Wednesday, April 22, 2009

Episode #26: Renaming Files With Regular Expressions

Hal Says:

I admit it, I'm a fan of the CommandLineFu site (hey, it's a community, not a competition), and like trolling through it occasionally for interesting ideas. This post by vgagliardi shows a cool trick for renaming a bunch of files using regular expressions and the substitution operator in bash. For example, suppose I wanted to convert spaces to underscores in all file names in the directory:

$ for f in *; do mv -- "$f" "${f// /_}"; done

I realize that syntax looks a little crazy. The general form is "${variable/pattern/substitution}", but in this case we have an extra "/" at the front of "pattern", which means "replace all instances" rather than only replacing the first instance.
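To see the difference between the single-slash and double-slash forms without touching any files, here's a minimal sketch that applies both to a plain string. The file name is made up for illustration.

```shell
#!/usr/bin/env bash
# Compare first-match and all-match substitution on a sample name.
f="my summer vacation.txt"   # hypothetical file name, no real file needed

first="${f/ /_}"             # single slash: replace only the first space
all="${f// /_}"              # double slash: replace every space

echo "$first"                # my_summer vacation.txt
echo "$all"                  # my_summer_vacation.txt
```

Trying an expansion with "echo" first like this is a cheap way to check a pattern before handing it to "mv".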

By the way, the pattern here uses the shell's wildcard (glob) syntax, including bracket expressions like "[...]", rather than full Unix regular expressions. For example, here's a loop to remove all characters from file names except for alphanumeric characters, dot, hyphen, and underscore:

$ for f in *; do mv -- "$f" "${f//[^-_.A-Za-z0-9]/}"; done

In this case "[^...]" means match any characters not in the specified set, and we're performing a null substitution.
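Here's a small sketch of what that null substitution does to a single string, so you can preview the result before renaming anything. The sample name is invented for the demo.

```shell
#!/usr/bin/env bash
# Strip every character that is not a hyphen, underscore, dot, letter, or digit.
f='report (final) #2 [v1.0].txt'   # hypothetical messy file name

# "-" comes first inside the brackets so it is taken literally,
# not as part of a range.
clean="${f//[^-_.A-Za-z0-9]/}"

echo "$clean"                      # reportfinal2v1.0.txt
```

Note that two different messy names can collapse to the same clean name, which is worth keeping in mind before running this against a real directory.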

Did you notice that we're also using the "--" argument to the "mv" command, just in case one of our file names happens to start with a "-"? These files are typically a huge pain in Unix. What if we wanted to replace a "-" at the beginning of a file name with an underscore?

$ for f in *; do mv -- "$f" "${f/#-/_}"; done

As you can see, starting the pattern with "#" means "match at the front of the string". Or we can match at the end of the string with "%":

$ find docroot -type f -name \*.htm | \
while IFS= read -r f; do mv -- "$f" "${f/%.htm/.html}"; done

Here we're using "find" to locate all of the *.htm files in our web docroot and then piping the output into a while loop that renames all these files to be *.html files instead.
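A loop that reads names one line at a time will still trip over a path containing a newline. If your "find" supports "-print0" (GNU and BSD versions do), a NUL-delimited variant avoids that. This is a sketch under that assumption, and it builds its own scratch directory (the "docroot" variable and sample files are made up) so it can't touch a real web tree.

```shell
#!/usr/bin/env bash
# Work in a scratch directory so the sketch cannot touch real files.
docroot="$(mktemp -d)"
mkdir -p "$docroot/sub dir"
touch "$docroot/index.htm" "$docroot/sub dir/about us.htm" "$docroot/notes.txt"

# -print0 and read -d '' pass names NUL-delimited, so spaces,
# backslashes, and newlines in paths survive intact.
find "$docroot" -type f -name '*.htm' -print0 |
while IFS= read -r -d '' f; do
    mv -- "$f" "${f/%.htm/.html}"   # "%" anchors the match at the end
done

# Show what the tree looks like afterwards.
find "$docroot" -type f | sort
```

Because "%" anchors the pattern at the end of the string, a ".htm" appearing in a directory name earlier in the path is left alone.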

There are a couple of problems with the method that I'm using here: (1) if multiple files map to the same name, you'll end up clobbering all but the last instance of that file, and (2) you get errors if your substitution doesn't actually modify the file name because the "mv" command refuses to rename a file to itself. We can fix both of these problems with a little extra logic in the loop. Let's return to our first example of converting spaces to underscores:

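$ for f in *; do n="${f// /_}"; [ -e "$n" ] || mv -- "$f" "$n"; done

First we assign the new file name to the variable $n. Then we check whether anything named "$n" already exists; the "mv" command after the "||" is only executed if there is no "$n". This also covers the case where the substitution changes nothing, because then "$n" is the original file, which obviously exists.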
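If you find yourself typing that check a lot, it can be wrapped in a function. The "safe_rename" helper below is a hypothetical name of my own, not a standard command, and the demo runs in a scratch directory with invented file names.

```shell
#!/usr/bin/env bash
# safe_rename OLD NEW: hypothetical helper, not a standard command.
# Skips the rename if the name is unchanged or the target already exists.
safe_rename() {
    local old=$1 new=$2
    if [ "$old" = "$new" ]; then
        return 0                            # nothing to change
    fi
    if [ -e "$new" ]; then                  # target name already taken
        echo "skipping '$old': '$new' exists" >&2
        return 1
    fi
    mv -- "$old" "$new"
}

# Demo in a scratch directory.
demo="$(mktemp -d)"
cd "$demo" || exit 1
touch "a b.txt" "a_b.txt" "c d.txt" "plain.txt"

for f in *; do
    safe_rename "$f" "${f// /_}" || true    # keep going after a skip
done
ls
```

In this demo "a b.txt" is left alone because "a_b.txt" is already there, and the function says so on stderr instead of failing silently.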

I admit that I usually use the Perl rename program for renaming large numbers of files, because (a) the syntax is much more terse, and (b) I love Perl. But this program isn't always available on all the different flavors of Unix that I end up having to work on. So having this functionality built into the shell is a huge win.

Ed Responds:
When I quickly glanced at Hal's challenge initially, I thought... "Yeah, that's pretty easy... findstr supports regex, and I'll use the ren command to rename the files... No prob."

And then, I started to write the command, and it got horribly ugly really quickly. Hal squealed with delight when I told him how ugly it was... and believe me... you ain't seen nothing until you've seen Hal squeal with delight.

Anyway, to keep this article from getting unreasonably long, I'm going to address Hal's original command, which replaced the spaces in file names with underscores. Unfortunately, you see, the parsing, iteration, and recursion capabilities within a single command in cmd.exe are really limited. For parsing strings, we've got FOR /F and a handful of substring operations I covered in Episode 12. For running a command multiple times, we've got FOR /L, as I mentioned in Episode 3. For recursion, well, that's just plain bad news in a single command unless we bend our rules to create a bat file that calls itself.

To start to address Hal's original challenge, we can use the following command to determine if there are any files that have at least one space in their names in our current directory:
C:\> dir /b "* *"

That's pretty straightforward to start, with the /b making dir show only the bare form of output, omitting cruft about volume names and sizes. Note that it will only show files that do not have the hidden attribute set. If you want, you can invoke dir with /a to make it show files regardless of their attributes, hidden or otherwise. Now, let's see what we can do with this building block.

Plan A: Every File Should Have Four Words in Its Name, Right?
My original plan was to wrap that command inside a FOR /F loop, iterating over each file using FOR /F functionality to parse it into its constituent elements. I was thinking something like this:

C:\> for /F "tokens=1-4" %i in ('dir /b "* *"') do ren "%i %j %k %l" "%i_%j_%k_%l"

Well, that's all very nice, but we've got a problem... let me show an example of what this beast creates when I run it in a directory with files named "file 1.txt" and "file 2 is here.txt":

C:\> dir /b
file_1.txt__
file_2_is_here.txt

Ooops... the file_1.txt name has two underscores after it. This option only works if files have exactly four space-separated pieces in their names (that is, exactly three spaces). That's no good.

Plan B: Let's Just Change the First Space into an Underscore
Well, how about this... We could write a one-liner that assumes a file will have only one space in its name, and convert that one space into an underscore. That's not too bad:

C:\> for /f "tokens=1,*" %i in ('dir /b "* *"') do ren "%i %j" "%i_%j"

I'm parsing the output of the dir /b command with "tokens=1,*", which uses the default delimiters (space and tab) to break each line of output into two pieces: everything before the first space goes into %i, and everything after it goes into the next iterator variable, %j.

Let's run that with our same file names as before, yielding:

C:\> dir /b
file_1.txt
file_2 is here.txt

Well, we got it right for file_1.txt, because there is only one space. But, we only fixed the first space in file 2 is here.txt. Hmmmm... How could we move closer to our result?

Hit the up arrow a couple times to go back to our Plan B FOR /F loop, and hit enter again. Now, running our dir, we get:

C:\> dir /b
file_1.txt
file_2_is here.txt

Ahh... we nailed our second space. Hit the up arrow again and re-run... and... well, you get the picture. We can take care of one space at a time. Not so bad.

But, who wants to hit the up arrow again and again and again until we get rid of all the spaces? You'd have to re-run my Plan B command N times, where N is the maximum number of spaces inside a file name in the current directory.

Plan C: Make the Shell Re-Run the Command Instead of Doing it Manually
Well, instead of re-running a command a bunch of times, let's make the shell do our work for us. We'll just wrap the whole thing in a FOR /L loop to count through integers 1 through 10 (1,1,10) and invoke the FOR /F loop at each iteration through our FOR /L loop:

C:\> for /L %a in (1,1,10) do @for /f "tokens=1,*" %i in ('dir /b "* *"') do ren "%i %j" "%i_%j"

That works, provided that none of the files have more than ten spaces in their name. Ummm... but what if they do? We could raise the number 10 to 20... but that's kind of a cheap hack, no?

Plan D: Violate the Rules -- Make a 3-Line Script
OK... if we had a while construct in cmd.exe, we could simply run my FOR /F loop of Plan B while the dir /b "* *" still returned valid output. But, we don't have a while command in cmd.exe. If we want to check a condition like that, we only have IF statements. And, if we want to jump around based on the results of IF statements, we need to use GOTOs. And, if we want to use IFs and GOTOs, we can't dump everything on a single one-line command, but will instead have to create a little bat file.

So, I'm going to have to bend our ground rules for this blog, which require a single command, and instead use a three-line bat file. Here's a bat file I wrote that converts the spaces in the names of all files in the current directory into underscores:

:begin
for /F "tokens=1,*" %%i in ('dir /b "* *"') do ren "%%i %%j" "%%i_%%j"
if exist "* *" goto begin

There you have it.... kind of an ugly little hack, but it works. Note that I had to change my iterator variables in my FOR loop from %i and %j into %%i and %%j. You have to do that to convert command-lines into bat files in Windows. Also, I'm using an IF statement to test for the existence of "* *", which would match any file with a space in its name.
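For readers following along on the Unix side, here's the same control flow sketched in bash, which does have a while loop: replace one space per pass, and keep going as long as any name still has a space. The "has_spaces" helper and the "passes" counter are my own additions for illustration, and the demo runs in a scratch directory using the two sample file names from above.

```shell
#!/usr/bin/env bash
# Mirror of the bat file's logic: fix one space per pass, repeat until done.
demo="$(mktemp -d)"
cd "$demo" || exit 1
touch "file 1.txt" "file 2 is here.txt"

# has_spaces: hypothetical helper, true if some name here contains a space.
has_spaces() {
    local m=( *\ * )          # stays as the literal pattern if nothing matches
    [ -e "${m[0]}" ]
}

passes=0
while has_spaces; do
    for f in *\ *; do
        mv -- "$f" "${f/ /_}" # single slash: first space only, like Plan B
    done
    passes=$((passes + 1))
done

echo "passes: $passes"        # passes: 3
ls
```

The number of passes equals the largest number of spaces in any one name, which is exactly the N from Plan B.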

A small script in cmd.exe can satisfy Hal's original challenge. To start addressing his other feats, which convert other characters in file names, we could give the FOR /F loop a list of every character we want to parse out, using syntax like "tokens=1,* delims=~!@#$%^&*()+=" plus whatever else you wanna take out.

I could drone on and on endlessly here, but I think you get the idea. It ain't pretty, but it is doable.... Now that should be the cmd.exe mantra.

PS: I too am a fan of the CommandLineFu site. It rocks.