Tuesday, November 26, 2013

Episode #172: Who said bigger is better?

Tim sweats the small stuff

Ted S. writes in:

"I have a number of batch scripts which turn a given input file into a configurable amount of versions, all of which will contain identical data content, but none of which, ideally, contain the same byte content. My problem is, how do I, using *only* XP+ cmd (no other scripting - PowerShell, jsh, wsh, &c), replace the original (optionally backed up) with the smallest of the myriad versions produced by the previous batch runs?"

This is pretty straightforward, but it depends on what we want to do with the files. I assumed the larger files should be deleted since they are redundant, which leaves us with only the smallest file in the directory. Let's start off by listing all the files in the current directory, sorted by size.

C:\> dir /A-D /OS /b
file3.txt
file2.txt
file1.txt
file4.txt

Sorting the files, and only the files, in the current directory by size is pretty easy. The "/A" option filters on an object's attributes, and "-D" excludes directories. Next, "/O" picks the sort order, and the "S" sorts by size, smallest files first. Finally, "/b" gives us the bare format, just the names.
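
If you want to sanity-check the ordering before anything gets deleted, the same sort without the "/b" shows the byte counts next to each name:

C:\> dir /A-D /OS

For the actual deletion, though, we'll stick with the bare listing.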

At this point we have the files in the proper order and in a nice light format. We can now use a For loop to delete everything while skipping the first file.

C:\> for /F "tokens=* skip=1" %i in ('dir /A-D /OS /b') do @del %i
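
A hedged tweak, not in the one-liner above: if any of the file names contain spaces, the unquoted "%i" will confuse "del", so wrap it in quotes. And since Ted is running these from batch scripts, remember that inside a batch file the percent signs must be doubled (%%i):

C:\> for /F "tokens=* skip=1" %i in ('dir /A-D /OS /b') do @del "%i"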

Here is the same functionality in PowerShell:

PS C:\> Get-ChildItem | Where-Object { -not $_.PSIsContainer } | Sort-Object -Property Length | Select-Object -Skip 1 | Remove-Item

This is mostly readable. The only exception is "PSIsContainer". Directories are container objects but files are not, so we filter out the containers (directories). Here is the same command shortened using aliases and positional parameters:

PS C:\> ls | ? { !$_.PSIsContainer } | sort Length | select -skip 1 | rm
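
If you're nervous about unleashing this on real data, tack "-WhatIf" onto the end and Remove-Item will just report what it would have deleted, a free dry run:

PS C:\> ls | ? { !$_.PSIsContainer } | sort Length | select -skip 1 | rm -WhatIf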

There you go, Ted, and in PowerShell even though you didn't want it. Here comes Hal, bringing something even smaller you don't want.

Hal's is smaller than Tim's... but less sweaty

Tim, how many times do I have to tell you, smaller is better when it comes to command lines:

ls -Sr | tail -n +2 | xargs rm

It's actually not that different from Tim's PowerShell solution, except that my "ls" command has sorting by size built in with "-S". Since "-S" lists the largest file first, we add the "-r" flag to reverse the sort, putting the smallest file first so that "tail -n +2" can skip over it.
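
If you'd like to preview the carnage, wedge "echo" in front of the "rm" and xargs will print the command instead of running it:

$ ls -Sr | tail -n +2 | xargs echo rm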

If you're worried about spaces in the file names, we could tart this one up a bit more:

ls -Sr | tail -n +2 | tr \\n \\000 | xargs -0 rm

After I use "tail" to drop the first line, the smallest file, out of the list, I use "tr" to convert the newlines to nulls. That allows me to use the "-0" flag to "xargs" to split the input on nulls, which preserves the spaces in the input file names.
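
File names with embedded newlines would still fool this version, because "ls" is handing us newline-separated output to begin with. For the truly paranoid, here's a sketch that stays null-separated end to end, assuming GNU "find" plus "sort", "sed", and "cut" new enough to have the -z (null-separated record) flags:

$ find . -maxdepth 1 -type f -printf '%s\t%p\0' | sort -zn | sed -z '1d' | cut -zf2- | xargs -0 rm

The "find" prints each file's size, a tab, the name, and a null; "sort -zn" orders those records numerically, "sed -z '1d'" drops the smallest, and "cut -zf2-" strips the size field back off before "xargs -0" hands the remaining names to "rm".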

What may be more interesting about this Episode is the command line I used to create and re-create my files for testing. First I made a text file with lines like this:

1 3
2 4
3 1
4 2

And then I whipped up a little loop action around the "dd" command:

$ while read file size; do 
      dd if=/dev/zero bs=4K count=$size of=file$file; 
  done <../input.txt 
3+0 records in
3+0 records out
12288 bytes (12 kB) copied, 6.1259e-05 s, 201 MB/s
4+0 records in
4+0 records out
16384 bytes (16 kB) copied, 0.000144856 s, 113 MB/s
1+0 records in
1+0 records out
4096 bytes (4.1 kB) copied, 3.4961e-05 s, 117 MB/s
2+0 records in
2+0 records out
8192 bytes (8.2 kB) copied, 4.3726e-05 s, 187 MB/s

Then I just had to re-run the loop whenever I wanted to re-create my test files after deleting them.
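
If you'd rather not keep a separate input file around, the same idea works with the sizes inlined, something like this throwaway variant:

$ i=0; for size in 3 4 1 2; do i=$((i+1)); dd if=/dev/zero bs=4K count=$size of=file$i; done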