Command Line Kung Fu: December 2009

Tuesday, December 29, 2009

Episode #75: Yule Be Wanting an Explanation Then

Hal returns to the scene of the crime

I opened last week's post saying there would be no "explanations or excuses", but apparently that wasn't good enough for some of you. So at the request of our loyal readers, we'd like to revisit last week's episode and explain some of the code. Especially that crazy cmd.exe stuff that Ed was throwing around.

Of course the bash code is completely straightforward:

$ ct=12; while read line; do
[ $ct == 1 ] && echo -n Plus || echo -n $ct;
echo " $line";
((ct--));
done <<EoLines
keyboards drumming
... lines removed ...
command line hist-or-y!
EoLines

First we're initializing the "ct" variable we'll be using to count down the 12 Days of Fu. Then we start a while loop to read lines from the standard input.

Within the body of the loop, we use the quick and dirty "[...] && ... || ..." idiom that I've used in previous Episodes as a shortcut instead of a full-blown "if ... then ... else ..." clause. Basically, if we've counted down to one then we want to output the word "Plus"; otherwise we just output the value of $ct. Notice we use "echo -n ..." so that we don't output a newline. This allows us to output the text we've read from the standard input as the remainder of the line. Finally, we decrement $ct and continue reading lines from stdin.
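For anybody who wants to see the shortcut next to the long-hand version, here is a minimal sketch. The function names prefix_short and prefix_long are made up for illustration and are not part of the original one-liner.

```shell
#!/usr/bin/env bash
# Short form: the "[...] && ... || ..." idiom from the loop above.
prefix_short() {
    [ "$1" -eq 1 ] && printf 'Plus' || printf '%s' "$1"
}

# Long form: the equivalent full if/then/else clause.
prefix_long() {
    if [ "$1" -eq 1 ]; then
        printf 'Plus'
    else
        printf '%s' "$1"
    fi
}

# Both forms should print the same prefix for each count.
for n in 3 2 1; do
    echo "$(prefix_short "$n") / $(prefix_long "$n")"
done
```

One caveat with the shortcut: the "||" branch also runs if the command after "&&" fails, so it only behaves like a real if/else when that middle command is something that reliably succeeds.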

The interesting stuff happens after the loop is over. Yeah, I could have put the text into a file and read the file. But I was looking for a cool command-line way of entering the text. The "<<EoLines" syntax at the end of the loop starts what's called a "here document". Basically, I'm saying that the text I type in on the following lines should be associated with the standard input (of the while loop in this case). The input ends when the string "EoLines" appears by itself on a line. So all I have to do is type in the 12 lines of text for our 12 Days of Fu and then finish it off with "EoLines". After that we get our output. Neat!
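Here is a small sketch of a here document feeding a loop, assuming bash. The helper name number_lines and the variable item are hypothetical. It also shows one detail not covered above: quoting the delimiter stops the shell from expanding variables inside the text.

```shell
#!/usr/bin/env bash
# number_lines: hypothetical helper that counts down from $1 while reading stdin.
number_lines() {
    local ct=$1 line
    while IFS= read -r line; do
        printf '%s %s\n' "$ct" "$line"
        ct=$((ct - 1))
    done
}

item=drumming

# Unquoted delimiter: $item is expanded before the loop sees the text.
number_lines 2 <<EoLines
keyboards $item
admins smiling
EoLines

# Quoted delimiter: the text is passed through literally.
number_lines 1 <<'EoLines'
keyboards $item
EoLines
```

The first call should print "2 keyboards drumming" and "1 admins smiling"; the second should print the literal text "1 keyboards $item".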

Everybody clear? Cool. I now throw it over to Tim to get the lowdown on his PowerShell madness.

Tim goes back in time

Let's unwrap what we did last week.

PS C:\> $ct=12; "keyboards drumming
admins smiling
... lines removed ...
command line hist-or-y!".split("`n") |
% { if ($ct -eq 1) {"Plus $_"} else {"$ct $_"}; $ct-- }

Here is the break down:

First, we initialize the $ct variable to begin our count down.

$ct=12;

That was easy. Next, we take a multi-line string and split it into an array using the newline character (`n) as the delimiter.

"keyboards drumming
... lines removed ...
command line hist-or-y!".split("`n")

We could have just as easily read in content from a file using Get-Content, but I wanted to demonstrate some new Fu that was similar to Hal's Fu. The nice thing is that reading from a file would be an easy drop-in replacement.

Once we have an array of Fu Text, we pipe it into ForEach-Object so we can work with each line individually.

if ($ct -eq 1) {"Plus $_"} else {"$ct $_"}

Inside the ForEach-Object loop we use an IF statement to format the output. If the count is one, then we output "Plus" and the Fu Text, otherwise we output the value of $ct and the Fu Text.

Finally, we decrement the value of $ct.

$ct--

Simple, right? That was pretty straightforward. But sit back and grab some spiked Egg Nog before you proceed further.

Ed Surveys the Madcap Mayhem:

I really liked this episode because it required the use of a few techniques that we haven't yet highlighted in this blog. Let's check them out, first reiterating that command:

c:\> copy con TempLines.txt > nul & cmd.exe /v:on /c "set ct=12& for /f "delims=" %i in (TempLines.txt) do @(if not !ct!==1 (set prefix=!ct!) else (set prefix=Plus)) & echo !prefix! %i & set /a ct=ct-1>nul" & del TempLines.txt
keyboards drumming
----snip----
command line hist-or-y!
^Z (i.e., hit CTRL-Z and then hit Enter)

OK... we start out with the copy command, which of course copies stuff from one file to another. But, we use a reserved name here for our source file: con. That'll pull information in from the console, line by line, dumping the results into a temporary file, very cleverly named TempLines.txt. Of course, there is the little matter of telling "copy con" when we're done entering input. While there are several ways to do that, the most efficient way to do so that has minimal impact on the contents of the file is to hit CTRL-Z and then Enter. Voila... we've got a file with our input. By the way, I throw away the output of "copy con" with a "> nul" because I didn't want the ugly "1 file(s) copied." message to spoil my output. By the way, it kinda stinks that that message is put on Standard Output, doesn't it? Where I come from, that is much more of a Standard Error thing. But, I'm sure we could get into a big philosophical debate about what should go on Std Out and what should go on Std Err. But, let's just cut the argument short and say that many Windows command line tools are all screwed up on this front, regardless of your philosophy. That's because there are no reasonable and consistent rules for directing stuff to Std Out versus Std Err in Windows command line output.

Anyway, back to the point. I then run "cmd.exe /v:on" so I can do delayed variable expansion. That'll let me use variables with values that can change as my command runs. Otherwise, cmd.exe will expand all variables to their values instantly when I hit Enter. I need to let these puppies float. I use the cmd.exe to execute a command, with the /c option, enclosing the command in double quotes.

So far, so good. But, now is where things get interesting.

To map to Hal and Tim's fu, I then set a variable called ct to a value of 12, just a simple little integer. But, what's with that & right after the 12? Well, if you leave it out, you'll see that the output will contain an extra space in "12 keyboards drumming"... it'll look like "12[space][space]keyboards drumming". Where does the extra space come from? Well, check this out:

c:\> cmd.exe /v:on /c "set var=foo & echo !var!bar"
foo bar

c:\> cmd.exe /v:on /c "set var=foo& echo !var!bar"
foobar

Do you see what's happening here? In the first command, the set command puts everything in the variable called var, starting right after the = and going up to the & (which separates the next command), and that includes the space! We have to omit that space by having [value] with the & right after it. For a more extreme example, consider this:

c:\> cmd.exe /v:on /c "set var=stuff      & echo !var!blah"
stuff      blah

I typically like to include a space before and after my & command separators when wielding my fu, for ease of readability. However, sometimes that extra space has meaning, so it has to be taken out, especially when used with the set command to assign a string to a variable, like the ct variable in that big command above.

Wait a second... earlier I referred to ct as an integer, and now I'm calling it a string? What gives? Just hold on a second... I'll come back to that in just a bit.

We have to deal with our FOR loop next. I'm using a routine FOR /F loop to iterate through the contents of the TempLines.txt file. I'm specifying custom delimiters, though, to override the default space and tab delimiters that FOR /F loops like to use. With "delims=", I'm specifying a delimiter of... what exactly? The equals sign? No... that's actually part of the syntax of specifying a delimiter. To make an equals a delimiter, I'd have to use "delims==". So, what is the delimiter I'm setting here? Well, friends, it's nothing. Yeah. I'm turning off parsing, because I want the full line of my input to be set to my iterator variable. In the past, when I was but a young cmd.exe grasshopper, I would turn off such parsing by setting a delim of a crazy character I would never expect in my input file, such as a ^, with the syntax "delims=^". But, I now realize that the most effective way to use FOR /F loops is to simply let them use you. I turn off parsing by making a custom delimiter of the empty set. Do not try and bend the spoon... That's impossible. Instead try to realize the truth.... that there is no spoon.

Anyway, so where was I? Oh yeah, with my delimiterless lines now being assigned one by one to my iterator variable of %i, I'm off and running. In the DO clause of my FOR loop, I turn off the display of commands (@) and jump right into an IF statement, to check the value of my ct variable. I expand ct into its value with a !ct!, because I'm using delayed variable expansion. Without delayed expansion, variables are referred to as %var%. My IF statement checks to see if !ct! is NOT equal (==) to 1. If it's not, I set another variable called prefix to the value of ct.

Then, I get to my ELSE clause. Although I use ELSE a lot in my work, I have to say that this is the first time I've had to use it in one of our challenges on this blog. The important thing to remember with ELSE clauses in single commands (not in batch scripts) is that you have to put everything in parentheses. So, if !ct! is equal to 1, my ELSE clause kicks in and sets the prefix variable to the string "Plus". That way, later on, I can simply print out prefix, which will contain the ct number for most of the days, but the word "Plus" for the last day.

And, here we get back to that string/integer thing I alluded to above. The cmd.exe shell uses loosely typed variables. No, this is not a reference to how hard you hit the keys on your keyboard when typing. Instead, like many interpreted languages, variable types (such as strings and integers) are not hard and fast. In cmd.exe, they are evaluated in real time based on context, and they can even change in a single command. My ct variable behaves like an integer, for the most part. I can add to it, subtract from it, and store its value in another variable. But, when I defined it originally with the set command, if I had used "set ct=12 &...", it would have been a string with the trailing space until I used it in some math, and then that space would disappear. Also, the prefix variable is given the value of ct most of the time, which is just an integer. But, when ct is one, I give the prefix variable a string of "Plus". I'm an old C-language programmer, so this loose type enforcement kinda weirds me out. Still, it's quite flexible.

Then, I echo the prefix (!prefix!) and the line of text (%i). I then subtract one from my ct with the "set /a ct=ct-1", throwing the output of the set command away (>nul). Note that I want to show the prefix and the text on the same line, so I use a single echo command to display both variables on the same line. Most cmd.exe command-line tools actually put their output on standard out with a Carriage Return Line Feed (CRLF) right afterward. Thus, two echo commands, one for each variable, would have broken the prefix and the file content on separate lines, a no-no when trying to reproduce exactly the output of Hal and Tim. When formulating commands that need to display a lot of different items on a single line, I often chunk them into variables and then echo them exactly as I need them on a single line with a single echo statement.

Now, there is one interesting counter-example to the general rule that cmd.exe command-line tools insert a CRLF at the end of their output: the "set /a" command. It does its math, and then displays the output without any extraneous return, as in:

c:\> set /a 8-500 & echo hey
-492hey

I used that little fact in this fun command to spew Matrix-like gibberish on the screen from Episode #58:

C:\> cmd.exe /v:on /c "for /L %i in (1,0,2) do @set /a !random!"

When I first was working on this 12-days challenge, I was thinking about using set /a to display !ct! and then the line from the file. It would all be on the same line because of that special "no-CRLF" property of "set /a". But, I ran into the little problem of the "Plus" for the last line of input, so I instead introduced the prefix variable and played on the loose typing. There are dozens of other ways to do this, but I tried to focus on one that, believe it or not, I thought made the most sense and was simplest.

Oh, and to close things out, I delete the TempLines.txt file. Can't litter our file system with crap, now can we?

So, as you can see, there were a bunch of ideas we haven't used in this blog so far that popped out of cmd.exe in this innocuous-seeming challenge, including empty-set delims, an ELSE clause, weak type enforcement, and variable building for a single line of output. That's a lot of holiday cheer, and it makes me happy.

With that said, all of us at the Command Line Kung Fu blog would like to wish our readers a Happy and Prosperous New Year!

Tuesday, December 22, 2009

Episode #74: Yule Love It!

Hal has indulged in a bit too much holiday cheer:

Presented for your enjoyment with no explanation or excuses:

$ ct=12; while read line; do
[ $ct == 1 ] && echo -n Plus || echo -n $ct;
echo " $line";
((ct--));
done <<EoLines
keyboards drumming
admins smiling
systems thrashing
networks crashing
hosts a-pinging
Windows versions
(billion) Linux distros
Windows loops!
authors coding
shells hacked
types of hosts
command line hist-or-y!
EoLines
12 keyboards drumming
11 admins smiling
10 systems thrashing
9 networks crashing
8 hosts a-pinging
7 Windows versions
6 (billion) Linux distros
5 Windows loops!
4 authors coding
3 shells hacked
2 types of hosts
Plus command line hist-or-y!

Tim got run over by a reindeer:

PS C:\> $ct=12; "keyboards drumming
admins smiling
systems thrashing
networks crashing
hosts a-pinging
Windows versions
(billion) Linux distros
Windows loops!
authors coding
shells hacked
types of hosts
command line hist-or-y!".split("`n") |
% { if ($ct -eq 1) {"Plus $_"} else {"$ct $_"}; $ct-- }
12 keyboards drumming
11 admins smiling
10 systems thrashing
9 networks crashing
8 hosts a-pinging
7 Windows versions
6 (billion) Linux distros
5 Windows loops!
4 authors coding
3 shells hacked
2 types of hosts
Plus command line hist-or-y!

Ed's Nuts Roasting on an Open Fire:


c:\> copy con TempLines.txt > nul & cmd.exe /v:on /c "set ct=12& for /f "delims=" %i in (TempLines.txt) do @(if not !ct!==1 (set prefix=!ct!) else (set prefix=Plus)) & echo !prefix! %i & set /a ct=ct-1>nul" & del TempLines.txt
keyboards drumming
admins smiling
systems thrashing
networks crashing
hosts a-pinging
Windows versions
(billion) Linux distros
Windows loops!
authors coding
shells hacked
types of hosts
command line hist-or-y!
^Z (i.e., hit CTRL-Z and then hit Enter)

12 keyboards drumming
11 admins smiling
10 systems thrashing
9 networks crashing
8 hosts a-pinging
7 Windows versions
6 (billion) Linux distros
5 Windows loops!
4 authors coding
3 shells hacked
2 types of hosts
Plus command line hist-or-y!

Best wishes for a happy holiday season and a joyous and prosperous new year!

Tuesday, December 15, 2009

Episode #73: Getting the perfect Perm(s)

Tim unwraps:

One of the things I find myself doing on a regular basis is creating a new directory structure and setting the permissions. The permissions are different for each folder and are based on who in the organization needs access to it. We could just write a script to create the directories and the permissions, but let's say we want to copy permissions from one directory structure to another. For this example, let's assume we have a project folder structure that looks something like this.

Prjs
+-Project1 (Managers - Full Access, Consultants - Full Access)
  |-Budget (Consultants - Deny, Finance - Full Access)
  |-Data
  +-Docs
    |-ForRelease (AdminStaff - Full Access)
    +-Working

Included above are the appropriate permissions on each folder. All permissions are inherited, so consultants and managers would have access to the Data directory.

We can verify these permissions by using Get-ChildItem (aliases gci, dir, ls) and piping the results into Get-Acl.

PS C:\> ls Prjs -recurse | Get-Acl | fl Path,AccessToString

Path           : Microsoft.PowerShell.Core\FileSystem::C:\Prjs\Project1
AccessToString : WINXP\Consultants Allow  FullControl
               WINXP\Managers Allow  FullControl

Path           : Microsoft.PowerShell.Core\FileSystem::C:\Prjs\Project1\Budget
AccessToString : WINXP\Consultants Deny  DeleteSubdirectoriesAndFiles, Modify,
               ChangePermissions, TakeOwnership
               WINXP\Consultants Allow  FullControl
               WINXP\Finance Allow  FullControl
               WINXP\Managers Allow  FullControl

Path           : Microsoft.PowerShell.Core\FileSystem::C:\Prjs\Project1\Data
AccessToString : WINXP\Consultants Allow  FullControl
               WINXP\Managers Allow  FullControl

Path           : Microsoft.PowerShell.Core\FileSystem::C:\Prjs\Project1\Docs
AccessToString : WINXP\Consultants Allow  FullControl
               WINXP\Managers Allow  FullControl

Path           : Microsoft.PowerShell.Core\FileSystem::C:\Prjs\Project1\Docs\ForRelease
AccessToString : WINXP\AdminStaff Allow  FullControl
               WINXP\Consultants Allow  FullControl
               WINXP\Managers Allow  FullControl

Path           : Microsoft.PowerShell.Core\FileSystem::C:\Prjs\Project1\Docs\Working
AccessToString : WINXP\Consultants Allow  FullControl
               WINXP\Managers Allow  FullControl

So now we want to create a second project, Project2, and we want to make sure we have the same permissions. We could copy just the directories without files, but there may be more subdirectories further down that we don't want. So let's create the folders.

PS C:\> mkdir Prjs\Project2\Budget
PS C:\> mkdir Prjs\Project2\Data
PS C:\> mkdir Prjs\Project2\Docs\ForRelease
PS C:\> mkdir Prjs\Project2\Docs\Working

Note, when the Budget directory is created it also creates the Project2 directory since it doesn't exist.

What are the permissions on the new folder?

PS C:\Prjs> Get-Acl Project2 | fl Path,AccessToString

Path           : Microsoft.PowerShell.Core\FileSystem::C:\Prjs\Project2
AccessToString : BUILTIN\Administrators Allow  FullControl
               NT AUTHORITY\SYSTEM Allow  FullControl
               WINXP\myuser Allow  FullControl
               CREATOR OWNER Allow  268435456
               BUILTIN\Users Allow  ReadAndExecute, Synchronize
               BUILTIN\Users Allow  AppendData
               BUILTIN\Users Allow  CreateFiles

Those are not the permissions we want. The permissions need to be copied from Project1 to Project2, but how? The Get-Acl and Set-Acl commands will do it.

PS C:\Prjs> Get-Acl Project1 | Set-Acl Project2

Let's verify.

PS C:\Prjs> Get-Acl Project2 | fl Path,AccessToString

Path           : Microsoft.PowerShell.Core\FileSystem::C:\Prjs\Project2
AccessToString : BUILTIN\Administrators Allow  FullControl
               WINXP\Managers Allow  FullControl
               WINXP\Consultants Allow  FullControl

Looks good. Now the subfolder permissions need to be copied as well.

PS C:\Prjs> ls Project1 -Recurse | Get-Acl |% {
Set-Acl $_ -Path ($_.Path -replace "Project1","Project2") }

First we do a recursive directory listing and get the Acl on each folder. We then take that Acl and apply it to a different folder. In our case all we need to do is replace Project1 with Project2 in the Path. Let's verify that the permissions match.

PS C:\Prjs> Compare-Object (ls Project1 -Recurse | Get-Acl) (ls Project2 -Recurse | Get-Acl) -Property PSChildName, Access

No output, that's good, it means the permissions are identical. How did that work?

The Compare-Object cmdlet is used to find the differences between the collection of objects returned by these two commands:

ls Project1 -Recurse | Get-Acl
ls Project2 -Recurse | Get-Acl

The Property parameter specified in the original command allows us to select the properties to be checked for differences. PSChildName is the directory name and the Access property contains the permissions on the folder. We can't substitute the Path property for PSChildName since Path is the full path and it would always be different.

Copying permissions is pretty easy. I imagine it will be pretty easy for Hal too, since Unix permissions aren't as granular. Finally, a bit of a leg up on Hal.

Hal just copies everything:

Do I detect a trace of jealousy and bitterness in my colleague's last comments? Better fix up that attitude Tim, or there will be nothing but coal in your stocking this year.

It's interesting that Tim brings up this subject, because it's another case where the differences in philosophy between Windows and Unix are apparent. In Windows, you need to fix up your directory permissions with an external tool after you copy the files. In Unix, it's just a natural part of the file copying operation-- particularly if you're doing the copy as the superuser.

This is also an area where we've seen some historical evolution in Unix-like operating systems. When I first got started with Unix in the 1980's, the "cp" command didn't have a "-p" option to preserve permissions, ownerships, and timestamps. The way you would copy directories when you wanted to preserve directory permissions was with the so-called "tar cp" idiom (actually, real old-timers will remember doing this with cpio):

# cd olddir
# tar cf - . | (cd /path/to/newdir; tar xfp -)

Here we're running the first tar command to create ("c") a new archive from the current working directory (".") and write it to the standard output ("f -"). We pipe that output to a subshell that first changes directories to our target dir and then runs another tar command to unpack the incoming archive on the standard input. The "p" option means preserve permissions, timestamps, and ownerships. Actually "p" is normally the default if you're running the tar command as root, so you can leave it off, but I prefer being explicit.
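As a rough sketch of how you might try the idiom out safely, the following wraps it in a function and runs it against a scratch directory. The function name copy_tree and the directory and file names are made up for illustration. It also swaps the ";" for "&&" inside the subshell, which is a small variation on the command above, not the original form.

```shell
#!/usr/bin/env bash
# copy_tree SRC DST: hypothetical wrapper around the tar pipeline.
copy_tree() {
    local src=$1 dst=$2
    mkdir -p "$dst" || return 1
    # "&&" (rather than ";") keeps each tar from running in the
    # wrong directory if its cd fails.
    (cd "$src" && tar cf - .) | (cd "$dst" && tar xpf -)
}

# Scratch demonstration in a temporary directory.
work=$(mktemp -d)
mkdir -p "$work/olddir/secret"
echo data > "$work/olddir/secret/file.txt"
chmod 750 "$work/olddir/secret"
chmod 640 "$work/olddir/secret/file.txt"

copy_tree "$work/olddir" "$work/newdir"
ls -lR "$work/newdir"
```

The listing for newdir should show the same mode bits that were set on olddir, since the "p" option asks tar to restore them rather than apply the current umask.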

These days, however, there are a couple of simpler options. Obviously, you could just use "cp -p":

# cp -Rp olddir /path/to/newdir

I generally prefer rsync though:

# rsync -aH olddir /path/to/newdir

rsync not only allows you to copy directories within the same system, but also gives you the option of copying directories across the network. Also, if you just want to update the permissions on a directory, the rsync command will do that and not actually copy any file data that has previously been copied. For more information on rsync, see Episode #24.

One issue that Tim brought up was that sometimes you want to copy only part of a directory structure, but exclude certain files and/or subdirectories. This is another place where rsync beats cp. rsync has a couple of different ways of excluding content: the --exclude option for specifying patterns to exclude on the command line, and --exclude-from for reading in a list of patterns to exclude from a file. There's no way of excluding files built into the cp command at all. For those old fogies like me out there who still occasionally use "tar cp", the tar command typically has a switch like -X to exclude files and directories from the archive, and GNU tar has --exclude options very similar to rsync.
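Here is a minimal sketch of the exclusion idea using the --exclude option with the "tar cp" idiom, assuming a tar that supports --exclude (GNU tar does). The function name copy_tree_excluding and the directory names are hypothetical. With rsync, the corresponding form would be something like "rsync -aH --exclude=scratch olddir/ /path/to/newdir/".

```shell
#!/usr/bin/env bash
# copy_tree_excluding SRC DST PATTERN: hypothetical helper; copies SRC into
# DST but skips anything whose name matches PATTERN.
copy_tree_excluding() {
    local src=$1 dst=$2 pattern=$3
    mkdir -p "$dst" || return 1
    (cd "$src" && tar -c -f - --exclude="$pattern" .) | (cd "$dst" && tar -x -p -f -)
}

# Scratch demonstration: one directory to keep, one to leave behind.
work=$(mktemp -d)
mkdir -p "$work/olddir/Docs" "$work/olddir/scratch"
echo keep > "$work/olddir/Docs/report.txt"
echo drop > "$work/olddir/scratch/junk.txt"

copy_tree_excluding "$work/olddir" "$work/newdir" scratch
find "$work/newdir" | sort
```

The find output should list Docs and report.txt under newdir, with no scratch directory.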

One thing you do need to be careful with for all of these copy options, however, is that they may not copy special permissions like extended attributes or file ACLs by default. Both cp and rsync have explicit options you can set to preserve these settings:

# cp -R --preserve=all olddir /path/to/newdir
# rsync -aHAX olddir /path/to/newdir

There's no way to do something similar with the "tar cp" idiom, because the tar archive format doesn't preserve extended attributes and ACLs.

Oh dear. Now it's Ed's turn. I hope Tim and I haven't spoiled his holiday cheer...

Ed Joyously Responds:

Ahhhh…. file permissions. They tend to be an absolute pain in the neck to deal with en masse in cmd.exe. Sure, we can use cacls or icacls to manipulate them on single files or directories just swell. But, synchronizing or applying changes to bunches of files using cacls or icacls is often dangerous and painful. When I first read Tim's challenge, I thought to myself, "This is gonna get ugly on us… as ugly as that feud between Snow Miser and Heat Miser." I immediately began to search my mind for a hook or trick to make this a lot easier, hoping to avoid a trip to visit Mother Nature.

Then, it hit me: we can use our little friend robocopy, the wonderfully named tool in Vista, Windows 7, and Windows 2008! Yeah, it's not built in to XP or Windows 2003, but it'll work for the latest versions of Windows. We talked about robocopy in Episode #24.

To address Tim's challenge, I'm going to assume that the directory structure where we want to replicate our file permissions does not already exist, avoiding the mkdir commands Tim uses. Robocopy will make those for us, dutifully placing the proper access controls on them if we run:

C:\> robocopy [directory1] [directory2] /e /sec /xf *

All of the subdirectories in directory1 will be created in directory2 with the same file permissions. The /e will make it recurse through those subdirectories, copying both directories with stuff in them and empty directories. The /sec tells robocopy to copy the security information (the ACLs) along with everything else, since its default copy flags don't include it. The /xf means I want to exclude a certain set of files, which I've selected as *, meaning to exclude all files -- Only directories will be copied, including all of their luscious permissions.

Well, that's all fine and good, but what about Windows XP and 2003? Well, you can download and install robocopy on either of them, which is a pretty good idea. Alternatively, there is a way to trick Windows into applying the permissions from one directory structure to another, which applies to Windows 2003, Vista, 7, and 2008 Server. For this trick, we'll use the handy /save and /restore feature of icacls. Here, let's follow Tim's lead, and assume that we've got directory1 and directory2 already created, and we want to take the permissions from directory1 and its subdirectories and apply them to the already-existing directory2. Check out the following command:

C:\> icacls [directory1]\* /save aclfile /t /c

This command tells Windows to run icacls against directory1 and all of its contents (*), saving the results (/save) in a file called aclfile, recursing through the directory structure (/t), not stopping when it hits a problem (/c). Now, the resulting aclfile is not regular ASCII, but instead a Unicode format that includes all of the permissions for the directories _and_ files inside of directory1.

Now, if there is a directory2 that already exists and has a similar overall structure to directory1, but perhaps without having any files in it, we can use icacls to restore the aclfile on a different directory! Wherever there is commonality in the directory structure, the permissions from directory1 will be used to overwrite the permissions on the given entity in directory2. The command to use is:

C:\> icacls [directory2] /restore aclfile /t /c

Voila! We've restored the ACLs from directory1 onto directory2! Now, that is a delicious holiday treat.

But, that leaves out poor little Windows XP, an operating system without robocopy and icacls built in. Sad, sad, sad little XP. Looks like it gets a lump of coal in its stocking this year, not only from Santa-Ed, but also from Microsoft, which has announced its impending withdrawal of support of this very near and dear friend.

Tuesday, December 8, 2009

Episode #72: That Special Time of Year

Tim plays Santa:

A merry listener in the PaulDotCom IRC channel asked:
[Dear Santa]...is there a way to delete certain characters in a for loop from cmd.exe (such as nul, tab, etc)?

Santa slightly nods and exclaims, "Now, Dasher! Now, Dancer! Now, Prancer, and Vixen! On, Cmd! On, For Loop! On, Donner and PowerShell! To the top of the terminal! to the top of the screen! Now bash away! bash away! bash away all!"

Anyway, enough of that crazy old guy.

The question was how to do it from the standard Windows command line, but indulge me for a minute and let's see what PowerShell can do.

Santa has a text file containing the data below, the bits in brackets represent special characters that we want to remove.

Grandma[space]Got[tab]Runover[*]By[']A["]Reindeer[0x06]Last[nul]Night.

So the goal is to delete these special characters. The question is, "How does PowerShell handle special characters?" The answer is the backtick character (by the Esc and 1 keys), and here is a list of special characters and the associated escape sequence:


`n  New line
`r  Carriage return
`t  Tabulator
`a  Alarm
`b  Backspace
`'  Single quotation mark
`"  Double quotation mark
`0  Null
``  Backtick character

These can be used in our regular expression to remove the special characters from our file. Here is how we do it.

PS C:\> Get-Content test.bin | % { $_ -replace " |`t|\*|`'|`"| |\x06|`0", "" }
GrandmaGotRunoverByAReindeerLastNight.

Inside our ForEach-Object (alias %) we use the replace operator to find all of our special characters and replace them with nothing (a.k.a. deleted). For those of you not familiar with regular expressions, the pipe character (|) is a logical OR, so any/all of the characters will be replaced with the empty string. To represent the ASCII ACK character (0x06) in the regular expression we use \xNN, where NN is the hexadecimal ASCII code.

We removed the special characters from the text that was read from the file, but we didn't actually change the file. Here is how we do that:

PS C:\> (Get-Content test.bin) | % { $_ -replace " |`t|\*|`'|`"| |\x06|`0",""} | Set-Content test.bin

There is one very important subtlety that can be very easily overlooked. Notice the parentheses used in the first portion of the command. They are necessary so that all of the content is loaded into memory before it is passed down the pipeline. If we don't do that, the command will attempt to write to the file it is currently reading and will throw an error.

PS C:\> Get-Content test.bin | % { $_ -replace " |`t|\*|`'|`"| |\x06|`0",""} | Set-Content test.bin
Set-Content : The process cannot access the file 'C:\test.bin' because it is being used by another process.

It is pretty easy with PowerShell. Now let's take a look at the Windows command line.

Windows command line

We will start off by using the same file as above; and we will use the standard Windows command line parser, the For Loop.

To see how the For Loop and variables handle the special characters, we will do a quick test of the For Loop without using any delimiters.

C:\> for /f "delims=" %a in ('type test.bin') do @echo %a
Grandma Got     Runover*By'A"Reindeer[0x06]Last

Oh No! Notice that we lost the last word. This happened because in the Windows command line variables are null terminated, meaning that the NUL character is deemed to be the end of the string so nothing beyond it will be processed. So we can't work with the NUL character, first strike on Santa's list.

Now, let's try to remove those other pesky characters.

C:\> for /F "tokens=1-8 delims=*^' " %a in ('type test.bin') do @echo %a%b%c%d%e%f%g%h
GrandmaGot      RunoverByA"Reindeer[0x06]Last

So we can't represent the tab character, the double quote, or the special character either! Usually the caret character can be used to escape special characters, like the single quote. But for some reason it won't work to escape the double quote. Second strike on Santa's list.

However, we do have a workaround for the tab character. We can tell cmd to disable file and directory name completion characters so we can type the tab character. All we have to do is tell cmd to F off.

cmd.exe /F:off

Unfortunately, this can't be prepended to our other command and has to be a separate instance. But now we can type the tab character. All we have to do is add it as a delimiter and we are good to go.

C:\> for /F "tokens=1-8 delims=    *^' " %a in ('type test.bin') do @echo %a%b%c%d%e%f%g%h
GrandmaGotRunoverByA"Reindeer[0x06]Last

One more problem: we can only remove so many characters from a line. Why? Because only the variables a-z are available to represent the tokens.

Given a file with this content:

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28

To remove the spaces we would have to use this command:

C:\> for /F "tokens=1-26" %a in ('type test.txt') do @echo %a%b%c%d%e%f%g%h%i%j%k%l%m%n%o%p%q%r%s%t%u%v%w%x%y%z
1234567891011121314151617181920212223242526

Uh Oh! We lost 27 and 28 since we don't have a way to represent them. Strike three on Santa's list.

We can preserve the rest of the line, but we can't remove the remaining "space" characters.

C:\> for /F "tokens=1-25*" %a in ('type test.txt') do @echo %a%b%c%d%e%f%g%h%i%j%k%l%m%n%o%p%q%r%s%t%u%v%w%x%y%z
1234567891011121314151617181920212223242526 27 28

In the above command the first 25 tokens are now represented by a-y. The 26th token is the remainder of the line, and is represented by z. Ugh!

Too bad cmd.exe is missing a nice easy way to do this, but there are a lot of things missing from cmd. I wish it had a compass in the stock and a thing which tells time, but if it did I would probably shoot my eye out with it.

Hal says 'Oh Holy... Night':

Um. Wow. After reading Tim's fearsome fu I am ever more thankful this holiday season that I spend the vast majority of my time working with the Unix shell. Oh and by the way, Tim, "Bash away, bash away, bash away all!" is my line.

Let's review some basics. Producing "special" characters in the Unix shell is generally pretty easy using some common conventions. One such convention is the use of backwhacks ("\") to protect or "escape" special characters like quoting characters from being interpolated by the shell. Backwhacks can also be used in special escape sequences such as "\t" and "\n" to represent characters like tab and newline. And failing all of those options you can use "\xnn" or "\nnn" to represent arbitrary ASCII codes in either hex or octal.

All of these conventions can be demonstrated with a simple "echo -e" command to output Tim's example input. The "-e" option is necessary here to force proper interpolation of the various backwhacked sequences:

$ echo -e "Grandma Got\tRunover*By'A\"Reindeer\x06Last\000Night." | cat -A
Grandma Got^IRunover*By'A"Reindeer^FLast^@Night.$

The space is represented by a literal space and the tab by "\t". I've enclosed the expression in double quotes so we don't need to backwhack either the "*" or the single quote "'", but we do need a backwhack in front of the "literal" double quote that we want to output (so that it doesn't terminate the double-quoted expression too early). The remaining special characters are produced using their ASCII values in hex and octal. I've piped the output into "cat -A" so that you can more easily see the special characters in the output.
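If you'd rather not depend on "echo -e" (its behavior varies between shells), printf is a reasonable alternative for building the same test line. This is only a sketch: the function name sample_line is made up for illustration, and the 0x06 byte is written in octal because hex escapes in printf are less portable.

```shell
# Emit Tim's sample line. printf interprets \t and \NNN octal escapes in its
# format string; 0x06 is written as octal \006 for portability.
sample_line() {
  printf 'Grandma Got\tRunover*By'\''A"Reindeer\006Last\000Night.\n'
}

# Dump the bytes in hex: 09 (tab), 06, and 00 (NUL) should all show up.
sample_line | od -An -tx1
```

Piping into "od -An -tx1" is just another way of eyeballing the special bytes when "cat -A" isn't available.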

The general quoting rules and escape sequences can vary slightly with different commands. For example, one way to strip characters from our sample input line is with "tr -d". However, while tr understands octal notation for specifying arbitrary characters, it doesn't handle the "\xnn" hex notation. This is not a huge deal, and we could just write:

$ echo -e "Grandma Got\tRunover*By'A\"Reindeer\x06Last\000Night." | tr -d " \t*'\"\006\000" | cat -A
GrandmaGotRunoverByAReindeerLastNight.$

Again I'm using "cat -A" to confirm that I really did strip out all the characters we wanted to remove.

If you really, really insist on having the "\xnn" hex escape sequence, you could use the special $'...' quoting notation in bash that forces interpolation using the same rules as "echo -e":

$ echo -e "Grandma Got\tRunover*By'A\"Reindeer\x06Last\000Night." | tr -d $' \t*\'"\x06\\000' | cat -A
GrandmaGotRunoverByAReindeerLastNight.$

Notice that I now had to backwhack the single quote in my list of characters, but I was able to drop the backwhack in front of the double quote. Also notice that while you can write "\x06" inside of $'...', you need double backwhacks in front of the octal.
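The double backwhack matters because bash holds strings C-style, so a NUL decoded inside $'...' cuts the string short before tr ever sees it. A minimal sketch, assuming bash (the variable name is just for illustration):

```shell
# A NUL decoded by $'...' ends the string early; everything after it is dropped.
truncated=$'ab\000cd'
printf '%s' "$truncated" | wc -c    # counts 2 bytes ("ab"), not 5

# Doubling the backslash hands the literal text \000 to tr, which decodes it itself.
printf 'ab\000cd' | tr -d $'\\000' | wc -c    # counts 4 bytes ("abcd")
```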

Another way to remove characters from a line is with sed. However, while sed understands the "\xnn" hex notation, it doesn't grok specifying characters with octal:

$ echo -e "Grandma Got\tRunover*By'A\"Reindeer\x06Last\000Night." | sed "s/[ \t*\'\"\x06\\000]//g" | cat -A
GrandmaGotRunoverByAReindeerLast^@Night.$
$ echo -e "Grandma Got\tRunover*By'A\"Reindeer\x06Last\000Night." | sed "s/[ \t*\'\"\x06\x00]//g" | cat -A
GrandmaGotRunoverByAReindeerLastNight.$

Even with these annoying little inconsistencies, life with bash, tr, and sed is infinitely preferable to the lump of coal my co-authors have to deal with.

So Merry \x-Mas to all, and to all a good night!

Tuesday, December 1, 2009

Episode #71: Joining Up

Hal fields a question from IRC

Mr. Bucket passed along the following query from the PaulDotCom IRC channel:

What functionality is available to loop through multiple files, and write the output to a single file with some values on the same line? Ex: If one program gives me the hash of a file, and the other program outputs the name/size/etc of a file, can I output to the same file HASH-FileName-Size

I couldn't resist chortling with glee when this question came up, because it's another one of those "easy for Unix, hard for Windows" kinds of tasks. I never can resist sharing these "learning experiences" with my fellow co-authors.

First let's review our inputs. I'm going to use the openssl utility for generating checksums, since it's fairly generic to lots of different flavors of Unix at this point:

$ openssl sha1 *
SHA1(001.jpg)= a088531884ee5eb520e98b3e9e18283f29e13d25
SHA1(002.jpg)= 77febb1498b2926ee6a988c97f3457e38736456d
SHA1(003.jpg)= 922bcb001d025d747c2ee56328811a4270b62079
...

As you can see, it's pretty easy to generate a set of checksums over my directory of image files, but there's a bunch of cruft around the filename that's not really helpful. So let me get rid of that with some quick sed action:

$ openssl sha1 * | sed -r 's/SHA1\((.*)\)= (.*)/\1 \2/'
001.jpg a088531884ee5eb520e98b3e9e18283f29e13d25
002.jpg 77febb1498b2926ee6a988c97f3457e38736456d
003.jpg 922bcb001d025d747c2ee56328811a4270b62079
...

That's better! In the sed expression I'm using the "(.*)" sub-expressions to match the file name and the checksum in each line, and the substitution operator is replacing the original line with just the values of the sub-expressions. Slick.
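If you want to poke at the expression without a directory full of images, you can feed it a single canned line. This is a sketch only: strip_sha1_wrapper is a made-up helper name, and "-E" is used in place of "-r" on the assumption that your sed accepts it (GNU and BSD sed both do).

```shell
# Strip the "SHA1(...)= " wrapper from openssl-style lines, leaving "name hash".
# \( and \) match literal parentheses; the bare ( ) pairs are capture groups.
strip_sha1_wrapper() {
  sed -E 's/SHA1\((.*)\)= (.*)/\1 \2/'
}

printf 'SHA1(001.jpg)= a088531884ee5eb520e98b3e9e18283f29e13d25\n' | strip_sha1_wrapper
# prints: 001.jpg a088531884ee5eb520e98b3e9e18283f29e13d25
```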

Now that we've got the checksums, how do we produce the file sizes? I could just use "ls -l" of course. But since the questioner seems to only want "HASH-FileName-Size", I may as well just use "wc -c" to produce simpler output:

$ wc -c *
4227504 001.jpg
4600982 002.jpg
4271719 003.jpg
...

Now that I know what my inputs are going to be, the question is how to stitch them together? Luckily, Unix includes the join command for putting files together on arbitrary fields (we last saw the join command back in Episode #43). Now I could save the checksum output and the file sizes to separate files and then join the contents of the two files, but bash actually gives us a cooler way to handle this:

$ join -1 1 -2 2 <(openssl sha1 * | sed -r 's/SHA1\((.*)\)= (.*)/\1 \2/') <(wc -c *)
001.jpg a088531884ee5eb520e98b3e9e18283f29e13d25 4227504
002.jpg 77febb1498b2926ee6a988c97f3457e38736456d 4600982
003.jpg 922bcb001d025d747c2ee56328811a4270b62079 4271719
...

See the "<(...)" syntax? That's a little bit of bash file descriptor magic that allows us to substitute the output of a command in a place where a program would normally be looking for a file name. In this case it saves us the hassle of having to create intermediate output files to join together. The join command itself is pretty simple. We're telling the program to join the output of the two commands using the file names in the first field of input #1 and the second field of input #2. The only problem is that the join command isn't producing the "HASH-FileName-Size" output that the original questioner wanted. That's because join always outputs the joined field first, followed by the remaining fields from the first input (the checksum in this case), followed by the remaining fields from the second input (the file size). We'll have to use a little awk fu to re-order the fields:

$ join -1 1 -2 2 <(openssl sha1 * | sed -r 's/SHA1\((.*)\)= (.*)/\1 \2/') <(wc -c *) \
     | awk '{print $2 " " $1 " " $3}'
a088531884ee5eb520e98b3e9e18283f29e13d25 001.jpg 4227504
77febb1498b2926ee6a988c97f3457e38736456d 002.jpg 4600982
922bcb001d025d747c2ee56328811a4270b62079 003.jpg 4271719
...

Mmmm, that's a tasty little bit of shell magic, isn't it? Let's see what Ed and Tim are cooking up.

Loyal reader Jeff Haemer points out that you don't need awk if you understand how to work join's "-o" option to select your output fields:

$ join -1 1 -2 2 -o 1.2,0,2.1 <(openssl sha1 * | sed -r 's/SHA1\((.*)\)= (.*)/\1 \2/') <(wc -c *)
a088531884ee5eb520e98b3e9e18283f29e13d25 001.jpg 4227504
77febb1498b2926ee6a988c97f3457e38736456d 002.jpg 4600982
922bcb001d025d747c2ee56328811a4270b62079 003.jpg 4271719
...

Yep, join actually lets you select specific fields from each input file and specify the order you want them output in. Nice. Thanks, Jeff!
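Here's a stripped-down version you can run anywhere to see the join fields and the "-o" ordering in action. It's a sketch under a few assumptions: bash (for the "<(...)" syntax), made-up helper functions and fake hash values standing in for the openssl and wc output, and inputs already sorted on the join field, which join expects.

```shell
# Hypothetical stand-ins for the two command outputs: "name hash" and "size name".
hashes() { printf '001.jpg aaa111\n002.jpg bbb222\n'; }
sizes()  { printf '100 001.jpg\n200 002.jpg\n'; }

# Join on field 1 of the first input and field 2 of the second.
# -o picks the hash (1.2), then the join field (0), then the size (2.1).
join_demo() {
  join -1 1 -2 2 -o 1.2,0,2.1 <(hashes) <(sizes)
}

join_demo
# prints:
# aaa111 001.jpg 100
# bbb222 002.jpg 200
```

Drop the "-o" option and you get the default order instead: join field first, then the rest of input #1, then the rest of input #2.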

Ed retorts snidely:
Choosing a topic just because you think it's hard for us Windows guys, huh, Hal? Well, aren't you just a big ball of sunshine, a command-line Scrooge this holiday season? When I first read this one, I thought... "Ugh... this is gonna be hard." Perhaps I was psyched out by your juvenile trash talk. Or, maybe I've just been hanging around in cmd.exe too long, and have gotten used to hard problems.

But, this one turned out to be surprisingly straight-forward and even non-ugly (well, beauty is in the eye of the beholder, I suppose). Here's the skinny:

C:\> FOR /f "tokens=1-2" %a in (name-hash.txt) do @for /f "tokens=1,2" %m in (length-name.txt) do @if %a==%n echo %b %a %m
a088531884ee5eb520e98b3e9e18283f29e13d25 001.jpg 4227504
77febb1498b2926ee6a988c97f3457e38736456d 002.jpg 4600982
922bcb001d025d747c2ee56328811a4270b62079 003.jpg 4271719

I'm assuming that name-hash.txt contains, well, names and hashes, one pair per line. Likewise, length-name.txt contains lengths and names, again one pair per line.

As we know, FOR /F loops can parse through all kinds of crap, including the contents of files. I use a FOR /F loop with two tokens (giving me two variables) of %a (for the file name) and %b (allocated automagically, holding the hash). For each of the files described in name-hash.txt, I then construct the body of my FOR loop. It contains another FOR /F loop, again with two variables (the original question mentioned "etc" for extra stuff there... if you have more stuff, just up the number of tokens and echo the proper variables at the end). My inner FOR /F loop iterates through the length-name.txt file, placing its values in the variables %m (length) and %n (name).

Now, if I just echoed out %a %b %m %n, I'd be making all of the possible combinations of every pair of two lines in the original files. But, we want to pare that down. We only want to generate some output if the name from name-hash.txt (%a) matches the name from length-name.txt (%n). We do this with a little IF operation comparing the two variables. If they match, we then echo out hash (%b), name (%a, which at that point is the same as %n), and size (%m).

Admittedly, the performance of this little command isn't great, as I have to run through every line of name-hash.txt, comparing the name by running through the entirety of length-name.txt. I don't stop when I've found a match, because, well, there could be another match somewhere. Also, if there is no match of the name between the two files, my command ignores that name, not issuing any output. But, I think that makes sense given what the questioner asks.

So, Tim... does PowerShell have a nifty little built-in or something to make this easier than running through a couple of FOR loops? Inquiring minds want to know.

Tim tags in for Ed:

For loops! We don't need no stinking For loops!

The first thing to do is import the files. Since there is a space between the columns we can use Import-Csv with a delimiter of the space character. Also, there is no header information so we have to specify it.

PS C:\> Import-Csv length.txt,hash.txt -Delimiter " " -Header File,Data
File      Data
----      ----
001.jpg   4227504
002.jpg   4600982
003.jpg   4271719
001.jpg   a088531884ee5eb520e98b3e9e18283f29e13d25
002.jpg   77febb1498b2926ee6a988c97f3457e38736456d
003.jpg   922bcb001d025d747c2ee56328811a4270b62079
...

We have all the data, so now it can be grouped by the file name using Group-Object (alias group).

PS C:\> Import-Csv length.txt,hash.txt -Delimiter " " -Header File,Data | group file

Count Name      Group
----- ----      -----
    2 001.jpg   {@{File=001.jpg; Data=4227504}, @{File=001.jpg; Data=a088531884ee5eb520e98b3e9e18283f29e13d25}}
    2 002.jpg   {@{File=002.jpg; Data=4600982}, @{File=002.jpg; Data=77febb1498b2926ee6a988c97f3457e38736456d}}
    2 003.jpg   {@{File=003.jpg; Data=4271719}, @{File=003.jpg; Data=922bcb001d025d747c2ee56328811a4270b62079}}
    ...

We have the data grouped like we want, but we still need to massage it a bit so we can get the format we want.

PS C:\> Import-Csv length.txt,hash.txt -Delimiter " " -Header File,Data |
  group file | Select @{Name="Hash";Expression={$_.Group[1].Data}}, Name,
  @{Name="Length";Expression={$_.Group[0].Data}}
Hash                                        Name       Length
----                                        ----       ------
a088531884ee5eb520e98b3e9e18283f29e13d25    001.jpg    4227504
77febb1498b2926ee6a988c97f3457e38736456d    002.jpg    4600982
922bcb001d025d747c2ee56328811a4270b62079    003.jpg    4271719
...

The Select-Object (alias select) cmdlet allows for custom expressions, which were used to get the hash and the length. The "Group" object contains multiple items and each can be accessed by its index value: 0 is the length and 1 is the hash.

Fileless PowerShell

The initial task was to get the file name, length, and hash from separate files and combine them into one. Let's try this again without using files.

This would be very easy if PowerShell just had a hashing cmdlet, but it doesn't. However, we can do hashing by using the .NET library and some very ugly PowerShell. Maybe in v3 we will get a Get-Hash cmdlet, but it seems as likely as the addition of Get-Unicorn or Get-MillionDollars.

So we need some hash, but not the kind that is illegal in 49 states, we need the hash of a file. Here is how we get it.

PS C:\> gci 001.jpg | % { (New-Object System.Security.Cryptography.SHA1CryptoServiceProvider).ComputeHash($_.OpenRead()) }

We use the SHA1CryptoServiceProvider .NET class, but it adds another bump since it doesn't take files as input and will only take a stream. It isn't hard to get the stream though, all we need to use is the OpenRead method of our file object. If that wasn't enough, there is another problem, the output.

PS C:\> gci 001.jpg | % { (New-Object System.Security.Cryptography.SHA1CryptoServiceProvider).ComputeHash($_.OpenRead()) }
160
136
83
24
...

The result is an array of bytes. So we have to convert that to hex and combine it together.

PS C:\> gci 001.jpg | % {$hash=""; (New-Object System.Security.Cryptography.SHA1CryptoServiceProvider).ComputeHash($_.OpenRead()) | % { $hash += $_.ToString("X2") }; $hash}
a088531884ee5eb520e98b3e9e18283f29e13d25

We use the ToString method with the format string X2 to convert each byte to hex. The X converts it to hex, and the 2 makes sure the output is two characters wide (0A vs A). Note that an uppercase X produces uppercase hex digits; use a lowercase x2 if you want lowercase output like the hash shown above. We then use the variable $hash to stitch our bytes together to get the full hash.

Now let's see the full command.

PS C:\> gci *.* | select @{Name="Hash";Expression={$hash="";
  (New-Object System.Security.Cryptography.SHA1CryptoServiceProvider).ComputeHash($_.OpenRead()) |
  % { $hash += $_.ToString("X2") }; $hash}}, name, length
Hash                                        Name       Length
----                                        ----       ------
a088531884ee5eb520e98b3e9e18283f29e13d25    001.jpg    4227504
77febb1498b2926ee6a988c97f3457e38736456d    002.jpg    4600982
922bcb001d025d747c2ee56328811a4270b62079    003.jpg    4271719
...

The first thing we do is get all the files in the current directory using Get-ChildItem (aliased as gci or dir). That is piped into Select-Object (aliased as select) to get the hash, filename, and size. The Select-Object cmdlet allows us to get properties of the pipeline object as well as create a custom expression. In our case we use the custom expression to calculate the hash.

Our results are in object form and can be piped to a file with Out-File or Export-Csv.

So the task is complete, but let's pretend for a second we had the fictional Get-Hash cmdlet. If we had our leprechaun our command might look something like this:

PS C:\> gci *.* | select @{Name="Hash";Expression={Get-Hash $_ sha1}}, name, length

If only getting a hash were easier in Windows.
