Tuesday, August 18, 2009

Episode #56: Find the Missing JPEG

Hal is helpful:

As a way of giving back to the community, I occasionally drop in and answer questions on the Ubuntu forums. One of the users on the forum posed the following question:

I have about 1300 pictures that I am trying to organize.
They are numbered sequentially eg. ( 0001.jpg -> 1300.jpg )
The problem is that I seem to be missing some...

Is there a fast way to be able to scan the directory to see which ones I am missing? Other than to do it manually, which would take a long time.


A fast way to scan the directory? How about some command-line kung fu:

for i in $(seq -w 1 1300); do [ ! -f $i.jpg ] && echo $i.jpg; done

The main idiom that I think is important here is the use of the test operator ("[ ... ]") and the short-circuit "and" operation ("&&") as a quick-and-dirty "if" statement in the middle of the loop. Ed and I have both used this trick in various Episodes, but I don't think we've called it out explicitly. For simple conditional expressions, it sure saves a lot of typing over using a full-blown "if ... then ..." kind of construct.
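To make the equivalence concrete, here is a minimal sketch that runs the short-circuit form and the spelled-out "if" form side by side. The scratch directory and the four sample file names are hypothetical, made up purely for the demonstration.

```shell
#!/usr/bin/env bash
# Hypothetical demo data: a scratch directory where 0002.jpg and 0004.jpg are absent.
dir=$(mktemp -d)
touch "$dir/0001.jpg" "$dir/0003.jpg"

# Short-circuit form: echo runs only when the test succeeds (the file is absent).
short=$(for i in 0001 0002 0003 0004; do [ ! -f "$dir/$i.jpg" ] && echo "$i.jpg"; done)

# The same logic written as a full "if ... then ... fi" statement.
long=$(for i in 0001 0002 0003 0004; do
  if [ ! -f "$dir/$i.jpg" ]; then
    echo "$i.jpg"
  fi
done)

echo "short form reports: $short"
echo "long form reports:  $long"
rm -rf "$dir"
```

Both loops should report the same two missing names. One difference worth knowing: when the test fails, the short-circuit form leaves a nonzero exit status behind, while the "if" form does not.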

I'm also making use of the seq command to produce a sequence of numbers. In particular, I'm using the "-w" option so that the smaller numbers are "padded with zeroes" so that they match the required file names. However, while seq is commonly found on Linux systems, you may not have it on other Unix platforms. Happily, bash includes a printf routine, so I could also write my loop as:

for ((i=1; $i <= 1300; i++)); do file=$(printf "%04d.jpg" $i); \
[ ! -f $file ] && echo $file; done


Update from a Loyal Reader: Jeff Haemer came up with a nifty solution that uses the brace expansion feature in bash 4.0 and later (plus a clever exploitation of standard error output) to solve this problem with much less typing. You can read about it in his blog posting.
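Jeff's exact command isn't reproduced here, so the following is only a guess at the general shape of such an approach, not his actual solution. It assumes bash 4.0 or later (for zero-padded brace expansion) and uses a hypothetical scratch directory with ten files instead of 1300.

```shell
#!/usr/bin/env bash
# Assumption: bash 4.0+, where {0001..0010} keeps its leading zeros.
# The scratch directory and the two deleted files are hypothetical demo data.
dir=$(mktemp -d)
cd "$dir" || exit 1
touch {0001..0010}.jpg
rm 0004.jpg 0008.jpg

# ls prints the files that exist on stdout (thrown away) and complains about
# the ones that don't on stderr, which is the only output we keep.
# ls exits nonzero when any file is missing, so "|| true" ignores that status.
missing=$(ls {0001..0010}.jpg 2>&1 >/dev/null || true)
echo "$missing"

cd - >/dev/null
rm -rf "$dir"
```

The wording of the error messages depends on which ls you have, so treat the output as something to read by eye rather than to parse.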

Now I have to confess one more thing. The other reason I picked this problem to talk about is that I'm pretty sure it's going to be one of those "easy for Unix, hard for Windows problems". Let's see if Ed can solve this problem without using four nested for loops, shall we?

Ed retorts:
Apparently our little rules have changed. Now, it seems that we get to impose constraints on our shell kung fu sparring partners, huh? Hal doesn't want four nested FOR loops. As if I didn't have enough constraints working in cmd.exe, Hal the sadist wants to impose more. Watch out, big guy. Perhaps next time, I'll suggest you solve a challenge without using any letter in the qwerty row of the keyboard. That should spice things up a bit. Of course, you'd probably use perl and just encode everything. But I digress.

One of the big frustrations of cmd.exe is its limitations on formulating output in a completely flexible fashion. Counting is easy, thanks to the FOR /L loop. But prepending zeros to shorter integers... that's not so easy. The Linux printf command, with its % notation, is far more flexible than what we've got in cmd.exe. The most obvious way to do this is that which Hal prohibits, namely four FOR /L loops. But, our dominatrix Hal says no to four nested FOR loops. What can we do?

Well, I've got a most delightful little kludge to create leading zeros, and it only requires one FOR /L loop counter plus a little substring action like I discussed in Episode #48: Parse-a-Palooza. Here is the result:

c:\> cmd.exe /v:on /c "for /l %i in (10001,1,11300) do @set name=%i & set fullname=!name:~1,4!.jpg & dir !fullname! >nul 2>nul || echo !fullname! Missing"
0008.jpg Missing
0907.jpg Missing
1200.jpg Missing

Here, I'm launching a cmd.exe with /v:on to perform delayed variable expansion. That'll let my variables inside my command change as the command runs. Then, I have cmd.exe run the command (/c) of a FOR /L loop. That'll be an incrementing counter. I use %i as the iterator variable, counting from 10001 to 11300 in steps of 1. "But," you might think, "You are ten thousand too high in your counts." "Ah..." I respond, "That extra 10,000 gives me my leading zeros, provided that I shave off the 1 in front." And, that's just what I do. In the body of my FOR loop, I store my current iterator variable value of %i in a variable called "name". Remember, you cannot perform substring operations on iterator variables themselves, so we squirrel away their values elsewhere. I then introduce another variable called fullname, which is the value of name itself (when referring to delay-expanded vars, we use !var! and not %var%), but with a substring operation of (~1,4), which means that I want four characters (in this case digits) starting at an offset of 1. With offset counting starting at 0, we are shaving off that leading 1 from our ten-thousand-too-high iterator. I throw the output of my dir command away (>nul) as well as its standard error (2>nul).

Then, I use the || operator, which I mentioned in Episode #47, to run the echo command only if the dir command fails (i.e., there is no such file). I display the name of the missing file, and the word "Missing".
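For readers more at home on the Unix side, the add-10000-and-shave trick can be sketched in bash as well. This is only an illustration of the idea, not a replacement for the cmd.exe command above; the short range of 1 through 12 and the array name "padded" are hypothetical choices for the demo.

```shell
#!/usr/bin/env bash
# Count from 10001 so every value has exactly five digits, then drop the
# leading 1 to leave four zero-padded digits.
padded=()
for ((i = 10001; i <= 10012; i++)); do
  name=$i
  padded+=("${name:1:4}.jpg")   # offset 1, length 4: the bash analogue of !name:~1,4!
done
echo "${padded[@]}"
```

The first name generated is 0001.jpg and the last is 0012.jpg, the same padding that printf "%04d" would produce.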

There are many other ways to do this as well, such as using IF NOT EXIST [filename]. But, my initial approach used dir with the || to match more closely Hal's use of &&. The IF statement is far more efficient, though, because it doesn't require running the dir command and disposing of its standard output and standard error. So, we get better performance with:
c:\> cmd.exe /v:on /c "for /l %i in (10001,1,11300) do @set name=%i & set fullname=!name:~1,4!.jpg & IF NOT EXIST !fullname! echo !fullname! Missing"
In the end, we have a method for generating leading zeros without using nested loops by instead relying on substring operations, all to make the rather unreasonable Mr. Pomeranz happy. :)