Tuesday, December 8, 2009

Episode #72: That Special Time of Year

Tim plays Santa:

A merry listener in the PaulDotCom IRC channel asked:
[Dear Santa] there a way to delete certain characters in a for loop from cmd.exe (such as nul, tab, etc)?

Santa slightly nods and exclaims, "Now, Dasher! Now, Dancer! Now, Prancer, and Vixen! On, Cmd! On, For Loop! On, Donner and PowerShell! To the top of the terminal! to the top of the screen! Now bash away! bash away! bash away all!"

Anyway, enough of that crazy old guy.

The question was how to do it from the standard Windows command line, but indulge me for a minute and let's see what PowerShell can do.

Santa has a text file containing the data below, the bits in brackets represent special characters that we want to remove.

So the goal is to delete these special characters. The question is, "How does PowerShell handle special characters?" The answer is the backtick character (by the Esc and 1 keys), and here is a list of special characters and the associated escape sequence:

`n New line
`r Carriage return
`t Tabulator
`a Alarm
`b Backspace
`' Single quotation mark
`" Double quotation mark
`0 Null
`` Backtick character

These can be used in our regular expression to remove the special characters from our file. Here is how we do it.

PS C:\> Get-Content test.bin | % { $_ -replace " |`t|\*|`'|`"| |\0x06|`0", "" }

Inside our ForEach-Object (alias %) we use the replace operator to find all of our special characters and replace them with nothing (a.k.a. deleted). For those of you not familiar with regular expressions, the pipe character (|) is a logical OR, so any/all the characters will be replaced with an underscore. To represent the ASCII ACK character (0x06) in the regular expression we use \xNN, where NN is the hexadecimal ASCII code.

We removed the special characters from the text that was read from the file, but we didn't actually change the file. Here is how we do that:

PS C:\> (Get-Content test.bin) | % { $_ -replace " |`t|\*|`'|`"| |\0x06|`0",""} | Set-Content test.bin

There is one very importantly subtlety that can be very easily overlooked. Notice the parentheses used in the first portion of the command. This is necessary so that all of the content is loaded in to memory before it is passed down the pipeline. If we don't do that the command will attempt to write to the file which it is currently reading and will throw an error.

PS C:\> Get-Content test.bin | % { $_ -replace " |`t|\*|`'|`"| |\0x06|`0",""} | Set-Content test.bin
Set-Content : The process cannot access the file 'C:\test.bin' because it is being used by another process.

It is pretty easy with PowerShell, now lets take a look at the Windows command line.

Windows command line

We will start off by using the same file as above; and we will use the standard Windows command line parser, the For Loop.

To see how the For Loop and variables handle the special characters, we will do a quick test of the For Loop without using any delimiters.

C:\> for /f "delims=" %a in ('type test.bin') do @echo %a
Grandma Got Runover*By'A"Reindeer[0x06]Last

Oh No! Notice that we lost the last word. This happened because in the Windows command line variables are null terminated, meaning that the NUL character is deemed to be the end of the string so nothing beyond it will be processed. So we can't work with the NUL character, first strike on Santa's list.

Now, lets try to remove those other pesky characters.

C:\> for /F "tokens=1-8 delims=*^' " %a in ('type test.bin') do @echo %a%b%c%d%e%f%g%h
GrandmaGot RunoverByA"Reindeer[0x06]Last

So we can't represent the tab character, the double quote, or the special character either! Usually the caret character can be used to escape special characters, like the single quote. But for some reason it won't work to escape the double quote. Second strike on Santa's list.

However, we do have a work around for the tab character. We can tell cmd to disable file and directory name completion characters so we can type the tab character. All we have to do is tell cmd to F off.

cmd.exe /F:off

Unfortunately, this can't be prepended to our other command and has to be a separate instance. But now we can type the tab character. All we have to do is add it as a delimiter and we are good to go.

C:\> for /F "tokens=1-8 delims=    *^' " %a in ('type test.bin') do @echo %a%b%c%d%e%f%g%h

One more problem, we can only remove so many characters from a line. Why? Because only the variables a-z are available to represent the tokens.

Given a file with this content:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28

To remove the space we would have to use this command:

C:\> for /F "tokens=1-26" %a in ('type test.txt') do @echo %a%b%c%d%e%f%g%h%i%j%k%l%m%n%o%p%q%r%s%t%u%v%w%x%y%z

Uh Oh! We lost 27 and 28 since we don't have a way to represent them. Strike three on Santa's list.

We can preserve the rest of the line, but we can't remove the remaining "space" characters.

C:\> for /F "tokens=1-25*" %a in ('type test.txt') do @echo %a%b%c%d%e%f%g%h%i%j%k%l%m%n%o%p%q%r%s%t%u%v%w%x%y%z
1234567891011121314151617181920212223242526 27 28

In the above command the first 25 tokens are now represented by a-y. The 26th token is the remainder of the line, and is represented by z. Ugh!

Too bad cmd.exe is missing a nice easy way to do this, but there are a lot of things missing from cmd. I wish it had a compass in the stock and a thing which tells time, but if it did I would probably shoot my eye out with it.

Hal says 'Oh Holy... Night':

Um. Wow. After reading Tim's fearsome fu I am ever more thankful this holiday season that I spend the vast majority of my time working with the Unix shell. Oh and by the way, Tim, "Bash away, bash away, bash away all!" is my line.

Let's review some basics. Producing "special" characters in the Unix shell is generally pretty easy using some common conventions. On such convention is the use of backwhacks ("\") to protect or "escape" special characters like quoting characters from being interpolated by the shell. Backwhacks can also be used in special escape sequences such as "\t" and "\n" to represent characters like tab and newline. And failing all of those options you can use "\xnn" or "\nnn" to represent arbitrary ASCII codes in either hex or octal.

All of these conventions can be demonstrated with a simple "echo -e" command to output Tim's example input. The "-e" option is necessary here to force proper interpolation of the various backwhacked sequences:

$ echo -e "Grandma Got\tRunover*By'A\"Reindeer\x06Last\000Night." | cat -A
Grandma Got^IRunover*By'A"Reindeer^FLast^@Night.$

The space is represented by a literal space and the tab by "\t". I've enclosed the expression in double quotes so we don't need to backwhack either the "*" or the single quote "'", but we do need a backwhack in front of the "literal" double quote that we want to output (so that it doesn't terminate the double-quoted expression too early). The remaining special characters are produced using their ASCII values in hex and octal. I've piped the output into "cat -A" so that you can more easily see the special characters in the output.

The general quoting rules and escape sequences can vary slightly with different commands. For example, one way to strip characters from our sample input line is with "tr -d". However, while tr understands octal notation for specifying arbitrary characters, it doesn't handle the "\xnn" hex notation. This is not a huge deal, and we could just write:

$ echo -e "Grandma Got\tRunover*By'A\"Reindeer\x06Last\000Night." | tr -d " \t*'\"\006\000" | cat -A

Again I'm using "cat -A" to confirm that I really did strip out all the characters we wanted to remove.

If you really, really insist on having the "\xnn" hex escape sequence, you could use the special $'...' quoting notation in bash that forces interpolation using the same rules as "echo -e":

$ echo -e "Grandma Got\tRunover*By'A\"Reindeer\x06Last\000Night." | tr -d $' \t*\'"\x06\\000' | cat -A

Notice that I now had to backwhack the single quote in my list of characters, but I was able to drop the backwhack in front of the double quote. Also notice that while you can write "\x06" inside of $'...', you need double backwhacks in front of the octal.

Another way to remove characters from a line is with sed. However, while sed understands the "\xnn" hex notation, it doesn't grok specifying characters with octal:

$ echo -e "Grandma Got\tRunover*By'A\"Reindeer\x06Last\000Night." | sed "s/[ \t*\'\"\x06\\000]//g" | cat -A
$ echo -e "Grandma Got\tRunover*By'A\"Reindeer\x06Last\000Night." | sed "s/[ \t*\'\"\x06\x00]//g" | cat -A

Even with these annoying little inconsistencies, life with bash, tr, and sed is infinitely preferable to the lump of coal my co-authors have to deal with.

So Merry \x-Mas to all, and to all a good night!