Tuesday, November 29, 2011

Episode #163: Pilgrim's Progress

Tim checks the metamail:

I hope everyone had a good Thanksgiving. I know I did, and I sure have a lot to be thankful for.

Today we receive mail about mail. Ed writes in about Rob VandenBrink writing in:


Rob VandenBrink sent me a cool idea this morning. It's for printing out a text-based progress indicator in cmd.exe. The idea is that if you have a loop that's doing a bunch of stuff, without any indication to the user, you can just echo a dot on the screen at each iteration of the loop to show that you are still alive and have processed another iteration. The issue in cmd.exe is echoing the dot without a CRLF, so that it goes nicely across the screen. Here's Rob's approach (which uses set /p very cleverly to define a variable, but without using that variable). On a few episodes, I used set /a because of its nice property of doing math without a CRLF. Here, Rob uses set /p to avoid the CRLF.

C:\> for /L %i in (1,1,5) do @timeout 1 >nul & <nul (set /p z=.)
..... <-- Progress Dots

Just replace the timeout command with something useful, and vary the FOR iterator loop to something that makes sense.

Worthy of an episode?

Well Ed, you can tell Rob that it is. Rob, feel free to send more cool suggestions to Ed and Ed can send them to us. I'll pass along what I think is worthy of an episode to Hal, so Rob talk to Ed, Ed talk to Tim, Tim talks to Hal, Hal responds to Tim, Tim responds to Ed, Ed responds to Rob. Unless Ed is unavailable, then we Rob should find Tim who will check for Ed, then...

Ok, we'll work out those details later. We certainly need to keep a strict flow of information, or else it could confusing, and we would hate that.

The trick with this command is using the /P switch with the Set command. The /P switch is used to prompt a user for input. The standard syntax looks like this:

SET /P variable=[promptString]

We are using a dummy variable Z to receive input, and the promptString is our dot. We feed NUL into the Set command so it doesn't hang while waiting for input. Since we didn't provide a carriage return the prompt is not advanced to the next line. That way we can output multiple dots on the same line. To prevent extra spaces between the dots we need to make sure there are no spaces between the dot and the next character, whether it be a closing parenthesis or a input redirection (<).

I typically write it a little differently so it is a little clearer that the NUL is being fed into Set, but the effect is the same.

C:\> (set /P z=.<NUL) & (set /P z=.<NUL)


One of the best practices of PowerShell is to write each command so the output can be used as input to another command. This means that dots would mess up our nice object-type output. That's no skin off our back, as we have a cmdlet to keep track of progress for us, Write-Progress. However, it does require a bit of knowledge as to the number of iterations we will go through. Not usually a big deal though, but it may require that some input be preloaded so this calculation can be preformed. There are all sorts of cool things we can do with this cmdlet. Examples of coolness include: multiple progress bars, displaying time remaining, and displaying extra information on the current operation.

[ooooooooooooooooooooooooooooooooooooo ]

PS C:\> 1..100 | % { sleep 1; Write-Progress -Activity Test -Status Working -PercentComplete $_ }

The Activity and Status parameters are used to provide additional information. The Activity is usually used to provide a high level description of the process and the Status switch is used to describe the current operation, assuming there are multiple. Similar to our the cmd.exe command, replace the sleep 1 with something useful.

The SecondsRemaining parameter can be used to display the estimated time remaining. This time must be calculated by the author of the script or command, and since these calculations are never close to correct, I personally refuse to ever even try to calculate the remaining time. So enough of my rant and back to the task at hand.

Multiple progress bars can be used by using a unique ID for each. The default ID is 0, so we can use 1 for the second progress bar.

[oooooooooooooooooo ]
[oooooooooooooooooooooooooooooooooooooooooooooooooooooo ]

PS C:\> 1..100 | % { Write-Progress -Activity Testing -Status Outer -PercentComplete $_;
1..100 | % { sleep 0.5; Write-Progress -Activity Testing -Status Inner -PercentComplete $_ -ID 1 }

Another bonus is that the progress bar is displayed at the top of the screen so it doesn't interfere with the most recent output. To make it even better, it disappears after the command has completed. We have a progress display and we don't have any messy output to clean up, awesome!

The cmd.exe output is functional, but not great, and the PowerShell version is really pretty. My bet is Hal and his *nix fu is going to snuggle up between these two. Hal, snuggle away.

Hal emerges from a food coma

Don't even get me started about "snuggling" with Tim. Mostly he just rolls over and goes to sleep, leaving me with the aftermath. He never cares about my feelings or what's important to me...

Oh, sorry. I forgot we were talking command line kung fu here. I'm not going to be getting much "snuggling" on that front either, as it turns out. The Linux options pretty much emulate the two choices that Tim presented on the Windows side.

The portable method uses shell built-ins and looks a lot like the CMD.EXE solution. Here's an example using the while loop from last week's Episode:

paste <(awk '{print; print; print; print}' users.txt) passwords.txt |
while read u p; do
mount -t cifs -o domain=mydomain,user=$u,password=$p \
//myshare /mnt >/dev/null 2>&1 \
&& echo $u/$p works && umount /mnt
(( $((++c)) % 100 )) || echo -n . 1>&2
done >working-passwords.txt

The code I've added in bold face is the part that prints dots to show progress. I've got a variable "$c" that gets incremented each time through the loop. We then take the value of that variable modulo 100. Every hundred iterations, that expression will have the value zero, so the echo statement after the "||" will get executed and print a dot. I use "echo -n" so we don't get a newline after the dot.

Notice also the "1>&2" after the echo expression. This causes the dots coming out of the echo command heading for the standard output to go to the standard error instead. That way I'm able to redirect the normal output of the loop-- the usernames and passwords I'm brute-forcing-- into a file using ">working-passwords.txt" at the end of the loop and still see the progress dots on the standard error.

You can slip this code into any loop you care to. And by adjusting the value on the right-hand side of the modulus operator you can cause the dots to be printed more or less frequently, depending on the size of your input. If you're reading a log file that's hundreds of thousands of lines long, you might want to do something like "... % 10000" so your screen doesn't just fill up with dots. On the other hand, you want the dots to appear frequently enough that it looks like something is happening. You just have to play around with the number until you're happy.

While this approach is very portable and easy to use, it only works inside an explicit loop. There are lots of tasks where we're processing data using a series of commands in a pipeline with no loops at all. For example, there's pipelines like the one from Episode #38:

grep 'POST /login/form' ssl_access_log* | 
sed -r 's/.*(MSIE [0-9]\.[0-9]|Firefox\/[0-9]+|Safari|-).*/\1/' |
sort | uniq -c |
awk '{ t = t + $1; print} END { print t " TOTAL" }'

Oh sure, I could force a loop at the front of the pipeline just to get some dots:

while read line; do
echo $line
(( $((++c)) % 10000 )) || echo -n . 1>&2
done <ssl_access_log* | grep 'POST /login/form' | ...

But let's face it, this is gross, inefficient, and silly. What bash is lacking is a built-in construct like PowerShell's Write-Progress cmdlet.

Happily, there's an Open Source utility called "pv" (pipe viewer) that kicks Write-Progress' butt through the flimsy walls of our command line dojo. Unhappily, it's not a built-in utility, so strictly speaking it's not allowed by the rules of our blog. But sometimes it's fun to bring a bazooka to a knife fight.

In it's simplest usage, pv just replaces the silly while loop that I forced onto the beginning of our pipeline:

# pv -c ssl_access_log* | 
grep 'POST /login/form' |
sed -r 's/.*(MSIE [0-9]\.[0-9]|Firefox\/[0-9]+|Safari|-).*/\1/' |
sort | uniq -c | awk '{ t = t + $1; print} END { print t " TOTAL" }' >counts.txt

83.4MB 0:00:05 [16.2MB/s] [=================================>] 100%

pv reads our input files and sends their content to the standard output-- just like the cat command. But it also creates a progress bar on the standard error. The "-c" option tells pv to use "curses" style cursor positioning sequences to update the progress bar more efficiently.

I'm redirecting the actual pipeline output with the browser counts into a file (">counts.txt") so it's easier to focus in on the progress bar. I've captured the output after the loop has completed, so you're seeing the 100% completion bar, but notice that the left-hand side of the bar tracks the total data read and the amount of time taken.

What's really fun, however, is using multiple instances of pv inside a complicated pipeline:

# pv -cN Input ssl_access_log* | 
grep 'POST /login/' | pv -cN grep |
sed -r 's/.*(MSIE [0-9]\.[0-9]|Firefox\/[0-9]+|Safari|-).*/\1/' | pv -cN sed |
sort | uniq -c | awk '{ t = t + $1; print} END { print t " TOTAL" }' >counts.txt

grep: 7.5MB 0:00:04 [1.51MB/s] [ <=> ]
sed: 259kB 0:00:05 [51.8kB/s] [ <=> ]
Input: 83.4MB 0:00:04 [17.1MB/s] [======================>] 100%

You'll notice that I've added two more pv invocations in the middle of our pipeline: one after the grep and one after the sed command. I'm also using the "-N" ("name") flag to assign a unique name to each instance of pv. This name appears in front of each progress bar so you can tell them apart.

What's fun about this mode is that it shows you how much you're reducing the data as it goes through each command. The total "Input" size is 83MB of access logs, which grep winnows down to 7.5MB of matching lines. Then sed removes everything except the browser name and major version number, leaving us with only 260KB of data.

pv is widely available in various Linux distros, though it's not typically part of the base install. There's a BSD Ports version available and it's even in the MacOS HomeBrew system. Solaris folks can find it at Sunfreeware. Everybody else gets to build it from source. But it's a useful tool in your command line toolchest.

Consider this your early Xmas present. And you didn't even have to brave the pepper spray at Wal*mart to get it.

Tuesday, November 15, 2011

Episode #162: Et Tu Bruteforce

Tim is looking for a way in

A few weeks ago I got a call from a Mr 53, of LaNMaSteR53 fame from the pauldotcom blog. Mister, Tim "I have a very cool first name" Tomes was working on a way to brute force passwords. The scenario is hundreds (or more) accounts were created all (presumably) using the same initial password. He noticed all the accounts were created the same day and none of them had ever been logged in to.

To brute force the passwords a subset of a large password dictionary is used tried against each account, but the same password was never used twice. This effectively bypasses the account lockout policy (5 failed attempts) and allows a larger set of passwords to be tested without locking out any accounts.

So instead of this scenario:
user1 - password1, password2, password3, password 4
user2 - password1, password2, password3, password 4
user3 - password1, password2, password3, password 4

We do it this way:
user1 - password1, password2, password3, password4
user2 - password5, password6, password7, password8
user3 - password9, password10, password11, password12

The effectiveness of this method is based on the assumption that each account was created with the same default password. Instead of testing 4 passwords, we can test 4 * # of users. So for 1000 accounts that means 4000 password guesses instead of just 4.

To pull this off we need to read two files, a user list and a password list. We need to take the first user and the first four passwords, then the send user and the next four passwords, and so on. This is the command to output the username and password pairs.

PS C:\> $usercount=0; gc users.txt | 
% {$user = $_; gc passwords.txt -TotalCount (($usercount * 4) + 4) |
select -skip ($usercount++ * 4) } | % { echo "$user $_" }

user1 password1
user1 password2
user1 password3
user1 password4
user2 password5
user2 password6
user2 password7
user2 password8
user3 password9

If we wanted to test the credentials against a domain controller we can do this:

PS C:\> $usercount=0; gc users.txt | % {$user = $_; 
gc passwords.txt -TotalCount (($usercount * 4) + 4) | select -skip ($usercount++ * 4) } |
% { net use \\mydomaincontroller\ipc$ /user:somedomain\$user $_ 2>&1>$null;
if ($?) { echo "This works $user/$_ "; net use \\mydomaincontroller\ipc$ /user:$user /del } }

This works user7/Password30


When Pen Testing you many times get access to CMD.EXE only. The PowerShell interfaces are a bit flaky, and many times the systems that are initially compromised don't have it installed so we need to rely on CMD.EXE.

C:\> cmd /v:on /c "set /a usercount=0 >NUL & for /F %u in (users.txt) do @set
/a passcount=0 >NUL & set /a lpass=!usercount!*4 >NUL & set /a upass=!usercount!*4+4
>NUL & @(for /F %p in (passwords.txt) do @(IF !passcount! GEQ !lpass! (IF !passcount!
LSS !upass! (@echo %u %p))) & set /a passcount=!passcount!+1 >NUL) & set /a
usercount=!usercount!+1 >NUL"

user1 password1
user1 password2
user1 password3
user1 password4
user2 password5
user2 password6
user2 password7
user2 password8
user3 password9

We start off enabling delayed variable expansion as usual. The usercount is initialized to 0 and it will be used to keep track of how many users have been attempted so far. We need this number to determine the proper password range to use. The users.txt file is then read via a For loop. Inside this (outer) For loop the passcount variable is set to 0. The passcount variable is used to keep track of where we are in the password file so we only use the 4 passwords we need. Related to that, the lower bound (lpass) and the upper bound (upass) are set so we know the range of the 4 passwords to be used. Now it is (finally) time to read the password file.

Another, inner, For loop is used to read through the password file. A pair of If statements are used to make sure the current password is in the proper bounds, and if it is, it is output. The passcount variable is then incremented to keep track of our count. After we go through the entire password file we increment the usercount. The process starts all over using the next user read from the file.

All we need to do now is Frankenstein this command with other Tim's command.

C:\> cmd /v:on /c "set /a usercount=0 >NUL & for /F %u in (users.txt) do @set
/a passcount=0 >NUL & set /a lpass=!usercount!*4 >NUL & set /a upass=!usercount!*4+4
>NUL & @(for /F %p in (passwords.txt) do @(IF !passcount! GEQ !lpass! (IF !passcount!
LSS !upass! (@net use \\DC01 /user:mydomain\%u %p 1>NUL 2>&1 && @echo This works
%u/%p && @net use /delete \\DC01\IPC$ > NUL))) & set /a passcount=!passcount!+1 >NUL)
& set /a usercount=!usercount!+1 >NUL"

This works user7/Password30

There you go, brute away.

Hal is looking for a way out

The basic task of generating the username/password list is pretty easy for the Unix folks because we have the "paste" command that lets us join multiple files together in a line-by-line fashion. The only real trick here is repeating each username input four times before moving on to the next username.

The first way that occurred to me to do this is with awk:

$ paste <(awk '{print; print; print; print}' users.txt) passwords.txt 
user1 password1
user1 password2
user1 password3
user1 password4
user2 password5

Here I'm using the bash "<(...)" notation to include the output of our awk command as a file input for the "paste" command. The awk itself just uses multiple print statements to emit each line four times.

Really all the awk is doing for us here, however, is to act as a short-hand for a loop over our user.txt file. We could dispense with the awk an just use shell built-ins:

$ paste <(while read u; do echo -e $u\\n$u\\n$u\\n$u; done <users.txt) passwords.txt 
user1 password1
user1 password2
user1 password3
user1 password4
user2 password5

Aside from using a while loop instead of the awk, I'm also using a single "echo -e" statement to output all four lines, rather than calling echo multiple times. I could have done something similar with a single print statement in the awk verson, but somehow I think the "print; print; print; print" was clearer and more readable.

By the way, some of you may be wondering why I have newlines ("\n", rendered above as "\\n" to protect the backwhack from shell interpolation) after the first three $u's but not after the last one. Remember that echo will automatically output a newline at the end of the output, unless we use "echo -n".

But now that we have our username/password list, what do we do with it? Unfortunately, the SMBFS tools for Unix/Linux don't include a working equivalent for "net use". So we'd have to try mounting a share the old-fashioned way in order to test the username and password combos:

paste <(awk '{print; print; print; print}' users.txt) passwords.txt |
while read u p; do
mount -t cifs -o domain=mydomain,user=$u,password=$p \
//myshare /mnt >/dev/null 2>&1 \
&& echo $u/$p works && umount /mnt

If the mount command succeeds then the echo command will output the username and password. Then we'll call umount to unmount the share before moving on to the next attempt. It's kind of hideous, but it will work.

Oh well, at least it's more readable than that CMD.EXE insanity Tim threw down...

Tuesday, November 8, 2011

Episode #161: Cleaning up the Joint

Hal's got email

Apparently tired of emailing me after we post an Episode, Davide Brini decided to write us with a challenge based on a problem he had to solve recently. Davide had a directory full of software tarballs with names like:


As the packages accumulate in the directory, Davide wanted to be able to get rid of everything but the most recent three tarballs. The trick is that we're only allowed to rely on the version number that's the third component of the file pathname, and not file metadata like the file timestamps. And of course our final solution should work no matter how many packages are in the directory or what their names are, and no matter how many versions of each package currently exist in the directory.

The code I used to create my test cases is actually longer than my final solution. Here's the quickie I tossed off to create a directory of interesting test files:

$ for i in one two three four five; do 
for j in {1..5}; do
touch pkg-$i-$RANDOM.tar.gz;

$ ls
pkg-five-20690.tar.gz pkg-four-6945.tar.gz pkg-three-29078.tar.gz
pkg-five-22215.tar.gz pkg-one-16581.tar.gz pkg-three-31807.tar.gz
pkg-five-24754.tar.gz pkg-one-18962.tar.gz pkg-two-1461.tar.gz
pkg-five-27332.tar.gz pkg-one-25712.tar.gz pkg-two-14713.tar.gz
pkg-five-3200.tar.gz pkg-one-5325.tar.gz pkg-two-23569.tar.gz
pkg-four-12855.tar.gz pkg-one-8421.tar.gz pkg-two-28329.tar.gz
pkg-four-14868.tar.gz pkg-three-11196.tar.gz pkg-two-526.tar.gz
pkg-four-17282.tar.gz pkg-three-15935.tar.gz
pkg-four-19436.tar.gz pkg-three-25092.tar.gz

The outer loop creates the different package names, and the inner loop creates five instances of each package. To get a wide selection of version numbers, I just use $RANDOM to get a random value between 1 and 32K.

The tricky part about this challenge is that tools like "ls" will sort the file names alphabetically rather than numerically. In the output above, for example, you can see that "pkg-two-526.tar.gz" sorts at the very end of the list, even though numerically version number 526 is the earliest version in the "pkg-two" series of files.

We can use "sort" to list the files in numeric order by version number:

$ ls | sort -nr -t- -k3 

Here I'm doing a descending ("reversed") numeric sort ("-nr") on the third hypen-delimited field ("-t- -k3"). All the package names are mixed up, but at least the files are in numeric order.

Now all I have to do is pick out the the fourth and later copies of any particular package name. For this there's awk:

$ ls | sort -nr -t- -k3 | awk -F- '++a[$1,$2] > 3' 

The "-F-" option tells awk to split its input on the hyphens. I'm using "++a[$1,$2]" to count the number of times I've seen a particular package name. When I get to the fourth and later entries for a given package, then my conditional statement will be true. Since I don't specify an action to take, the default assumption is "{print}" and the file name gets printed. Stick that in your awk pipe and smoke it, Davide!

Removing the files instead of just printing their names is easy. Just pipe the output into xargs:

$ ls | sort -nr -t- -k3 | awk -F- '++a[$1,$2] > 3' | xargs rm -f
$ ls
pkg-five-22215.tar.gz pkg-four-19436.tar.gz pkg-three-29078.tar.gz
pkg-five-24754.tar.gz pkg-one-16581.tar.gz pkg-three-31807.tar.gz
pkg-five-27332.tar.gz pkg-one-18962.tar.gz pkg-two-14713.tar.gz
pkg-four-14868.tar.gz pkg-one-25712.tar.gz pkg-two-23569.tar.gz
pkg-four-17282.tar.gz pkg-three-25092.tar.gz pkg-two-28329.tar.gz

I've used the "-f" option here just so that we don't get an error message when we run the command and there end up being no files that need to be removed.

And that's my final answer, Regis... er, Davide! Thanks for a fun challenge! To make things really interesting for Tim, I think we should make him do this one in CMD.EXE, don't you?

Tim thinks Hal is mean

Not only does Hal throw down the gauntlet and request CMD.EXE, but he makes this problem more difficult by making this two challenges in one. Not being one to turn down a challenge (even though I should), we start off with PowerShell by creating the test files:

PS C:\> foreach ($i in "one","two","three","four","five" ) {
foreach ($j in 1..5) {
Set-Content -Path "pkg-$i-$(Get-Random -Minimum 1 -Maximum 32000).tar.gz" -Value ""
} }

PS C:\> ls

Mode LastWriteTime Length Name
---- ------------- ------ ----
-a--- 11/1/2011 1:23 PM 2 pkg-five-19410.tar.gz
-a--- 11/1/2011 1:23 PM 2 pkg-five-21426.tar.gz
-a--- 11/1/2011 1:23 PM 2 pkg-five-26739.tar.gz
-a--- 11/1/2011 1:23 PM 2 pkg-five-27296.tar.gz
-a--- 11/1/2011 1:23 PM 2 pkg-five-6618.tar.gz
-a--- 11/1/2011 1:23 PM 2 pkg-four-18533.tar.gz
-a--- 11/1/2011 1:23 PM 2 pkg-four-25925.tar.gz
-a--- 11/1/2011 1:23 PM 2 pkg-four-31089.tar.gz
-a--- 11/1/2011 1:23 PM 2 pkg-four-511.tar.gz
-a--- 11/1/2011 1:23 PM 2 pkg-four-8343.tar.gz
-a--- 11/1/2011 1:23 PM 2 pkg-one-13225.tar.gz
-a--- 11/1/2011 1:23 PM 2 pkg-one-24343.tar.gz
-a--- 11/1/2011 1:23 PM 2 pkg-one-2835.tar.gz
-a--- 11/1/2011 1:23 PM 2 pkg-one-308.tar.gz
-a--- 11/1/2011 1:23 PM 2 pkg-one-4484.tar.gz
-a--- 11/1/2011 1:23 PM 2 pkg-three-13226.tar.gz
-a--- 11/1/2011 1:23 PM 2 pkg-three-15026.tar.gz
-a--- 11/1/2011 1:23 PM 2 pkg-three-23830.tar.gz
-a--- 11/1/2011 1:23 PM 2 pkg-three-30553.tar.gz
-a--- 11/1/2011 1:23 PM 2 pkg-three-4311.tar.gz
-a--- 11/1/2011 1:23 PM 2 pkg-two-12923.tar.gz
-a--- 11/1/2011 1:23 PM 2 pkg-two-27368.tar.gz
-a--- 11/1/2011 1:23 PM 2 pkg-two-27692.tar.gz
-a--- 11/1/2011 1:23 PM 2 pkg-two-28727.tar.gz
-a--- 11/1/2011 1:23 PM 2 pkg-two-3888.tar.gz

Similar to what Hal did, we use multiple loops to create the files. Set-Content is used to create the file. The filename is a little crazy as we need to use the output of Get-Random in our path. The $() is used to wrap the cmdlet and only return the output.

I feel a big like a ditch digger who is tasked with filling in the ditch he just dug, but that's the challange. We have files and some need to be deleted.

We start off grouping the files based on their package and sorting them by their version.

PS C:\> ls | sort {[int]($_.Name.Split("-.")[2])} -desc |
group {$_.Name.Split("-.")[1]}

Count Name Group
----- ---- -----
5 four {pkg-four-31089.tar.gz, pkg-four-25925.tar.gz, pkg-four-1853...
5 three {pkg-three-30553.tar.gz, pkg-three-23830.tar.gz, pkg-three-1...
5 two {pkg-two-28727.tar.gz, pkg-two-27692.tar.gz, pkg-two-27368.t...
5 five {pkg-five-27296.tar.gz, pkg-five-26739.tar.gz, pkg-five-2142...
5 one {pkg-one-24343.tar.gz, pkg-one-13225.tar.gz, pkg-one-4484.ta...

The package and version number are retrieved by using the Split method using dots and dashes as delimiters. The version is the 3rd item (index 2, remember, base zero) and the package is the 2nd (index 1). The version is used to sort and the package name is used for grouping.

At this point we have groups that contain the files sorted, in descending order, by the version number. Now we need to get all but the first two items.

PS C:\> ls | sort {[int]($_.Name.Split("-.")[2])} -desc |
group {$_.Name.Split("-.")[1]} | % { $_.Group[2..($_.Count)]}

Mode LastWriteTime Length Name
---- ------------- ------ ----
-a--- 11/1/2011 1:23 PM 2 pkg-four-18533.tar.gz
-a--- 11/1/2011 1:23 PM 2 pkg-four-8343.tar.gz
-a--- 11/1/2011 1:23 PM 2 pkg-four-511.tar.gz
-a--- 11/1/2011 1:23 PM 2 pkg-three-15026.tar.gz
-a--- 11/1/2011 1:23 PM 2 pkg-three-13226.tar.gz
-a--- 11/1/2011 1:23 PM 2 pkg-three-4311.tar.gz
-a--- 11/1/2011 1:23 PM 2 pkg-two-27368.tar.gz

The ForEach-Object cmdlet (alias %) is used to operate on each group. As you will remember, the items in the group are sorted in descending order by the version number. We need to select the 3rd through the last item, and this is accomplished by using the Range operator (..) with our collection of objects. The Range of 2..($_.Count) gives us everything but the first two items. Technically, I have an off-by-one issue with the upper bound, but PowerShell is kind enough not to barf on me. I did this to save a few key strokes; although, I am using a lot more key strokes to justify my laziness. Ironic? Yes.

All we have to do now is pipe it into the Remove-Item (alias del, erase, rd, ri, rm, rmdir).

PS C:\> ls | sort {[int]($_.Name.Split("-.")[2])} -desc |
group {$_.Name.Split("-.")[1]} | % { $_.Group[2..($_.Count)]} | rm

PS C:\> ls

Mode LastWriteTime Length Name
---- ------------- ------ ----
-a--- 11/1/2011 1:23 PM 2 pkg-five-26739.tar.gz
-a--- 11/1/2011 1:23 PM 2 pkg-five-27296.tar.gz
-a--- 11/1/2011 1:23 PM 2 pkg-four-25925.tar.gz
-a--- 11/1/2011 1:23 PM 2 pkg-four-31089.tar.gz
-a--- 11/1/2011 1:23 PM 2 pkg-one-13225.tar.gz
-a--- 11/1/2011 1:23 PM 2 pkg-one-24343.tar.gz
-a--- 11/1/2011 1:23 PM 2 pkg-three-23830.tar.gz
-a--- 11/1/2011 1:23 PM 2 pkg-three-30553.tar.gz
-a--- 11/1/2011 1:23 PM 2 pkg-two-27692.tar.gz
-a--- 11/1/2011 1:23 PM 2 pkg-two-28727.tar.gz

Not too bad, but now it is time for the sucky part.


Here is the file creator:

C:\> cmd /v:on /c "for /r %i in (pkg-one pkg-two pkg-three pkg-four pkg-five) do
@for /l %j in (1,1,5) do @echo "" > %i-!random!.tar.gz"

Similar to the previous examples, this uses two loops to write our file.

Now for the beast to nuke the old packages...

C:\> cmd /v:on /c "for /f %a in ('^(for /f "tokens=2 delims=-." %b in ^('dir /b *.*'^) do
@echo %b ^) ^| sort') do @set /a first=0 > NUL & @set /a second=0 > NUL & @(for /f "tokens=1,2,3,*
delims=-." %i in ('dir /b *.* ^| find "%a"') do @set /a v=%k > NUL & IF !v! GTR !first! (del
%i-%j-!second!.tar.gz && set /a second=!first! > NUL && set /a first=!v! > NUL) ELSE (IF !v! GTR
!second! (del %i-%j-!second!.tar.gz && set /a second=!v! > NUL) ELSE (del %i-%j-!v!.tar.gz)))"

C:\> dir
Volume in drive C has no label.
Volume Serial Number is DEAD-BEEF

Directory of C:\

11/01/2011 01:23 PM <DIR> .
11/01/2011 01:23 PM <DIR> ..
11/01/2011 01:23 PM 2 pkg-five-26739.tar.gz
11/01/2011 01:23 PM 2 pkg-five-27296.tar.gz
11/01/2011 01:23 PM 2 pkg-four-18533.tar.gz
11/01/2011 01:23 PM 2 pkg-four-8343.tar.gz
11/01/2011 01:23 PM 2 pkg-one-13225.tar.gz
11/01/2011 01:23 PM 2 pkg-one-24343.tar.gz
11/01/2011 01:23 PM 2 pkg-three-23830.tar.gz
11/01/2011 01:23 PM 2 pkg-three-30553.tar.gz
11/01/2011 01:23 PM 2 pkg-two-27692.tar.gz
11/01/2011 01:23 PM 2 pkg-two-28727.tar.gz
10 File(s) 20 bytes
2 Dir(s) 1,234,567,890 bytes free

As this command is barely decipherable, I'm not going to go through it in great detail, but I will describe it at a high level.

We start off by enabling delayed variable expansion so we can set and immediately use a variable. We then use a trusty For loop (actually, I don't trust the sneaky bastards) to find the package names. We then use another For loop to work with each file that matches the current package by using a directory listing plus the Find command. Now is where it get really hairy...

We need to keep the two files with the highest version number. To do this we use two variables, First and Second, to hold the two highest version numbers. Both variables are initialized to zero. Next we need to do some crazy comparisons.

1. If the version number of the current file for the current package is greater than First, we delete the file related to Second, move First to Second, and set First equal to the current version.

2. If the version number of the current file for the current package is less than First but greater than Second, we delete the file related to Second and set Second equal to the current version.

3. If the version number of the current file for the current package is less than both First and Second then the file is deleted.

Ok, Hal, you have your CMD.EXE. I would say "TAKE THAT", but I'm pretty sure I'm the one that was taken.

Tuesday, October 18, 2011

Episode #160: Plotting to Take Over the World

Hal's been teaching

Whew! Just got done with another week of teaching, this time at SANS Baltimore. I even got a chance to give my "Return of Command Line Kung Fu" talk, so I got a bunch of shell questions.

One of my students had a very interesting challenge. To help analyze malicious PDF documents, he was trying to parse the output of Didier Stevens' pdf-parser.py and create an input file for GNUplot that would show a graph of the object references in the document. Here's a sample of the kind of output we're dealing with:

$ pdf-parser.py CLKF.pdf
PDF Comment '%PDF-1.3\n'

PDF Comment '%\xc7\xec\x8f\xa2\n'

obj 5 0
Referencing: 6 0 R
Contains stream
[(1, '\n'), (2, '<<'), (2, '/Length'), (1, ' '), (3, '6'), (1, ' '), (3, '0'), (1, ' '), (3, 'R'), (2, '/Filter'), (1, ' '), (2, '/FlateDecode'), (2, '>>'), (1, '\n')]

/Length 6 0 R
/Filter /FlateDecode

obj 6 0
[(1, '\n'), (3, '678'), (1, '\n')]

obj 4 0
Type: /Page
Referencing: 3 0 R, 11 0 R, 12 0 R, 13 0 R, 5 0 R

The lines like "obj 5 0" give the object number and version of a particular object in the PDF. The "Referencing" lines below show the objects referenced. A given object can reference any number of objects from zero to many.

To make the chart with GNUplot, we need to create an input file that shows "obj -> ref;" for all references. So for object #5, we'd have one line of output that shows "5 -> 6;". There would be no output for object #6, since it references zero objects. And we'd get 5 lines of output for object #4, "4 -> 3;", "4 -> 11;", and so on.

This seems like a job for awk. Frankly, I thought about just calling Davide Brini and letting him write this week's Episode, but he's already getting too big for his britches. So here's my poor, fumbling attempt:

$ pdf-parser.py CLKF.pdf |
awk '/^obj/ { objnum = $2 };
/Referencing: [0-9]/ \
{ max = split($0, a);
for (i = 2; i < max; i += 3) { print objnum" -> "a[i]";" }

5 -> 6;
4 -> 3;
4 -> 11;
4 -> 12;
4 -> 13;
4 -> 5;

The first line of awk matches the "obj" lines and puts the object number into the variable "objnum". The second awk expression matches the "Referencing" lines, but notice that I added a "[0-9]" at the end of the pattern match so that I only bother with lines that actually include referenced objects.

When we hit a line like that, then we do the stuff in the curly braces. split() breaks our input line, aka "$0", on white space and puts the various fields into an array called "a". split() also returns the number of elements in the array, which we put into a variable called "max". Then I have a for loop that goes through the array, starting with the second element-- this is the actual object number that follows "Referencing:". Notice the loop update code is "i += 3", which allows me to just access the object number elements and skip over the other crufty stuff I don't care about. Inside the loop we just print out the object number and current array element with the appropriate punctuation for GNUplot.

Meh. It's a little scripty, I must confess. Mostly because of the for loop inside of the awk statement to iterate over the references. But it gets the job done, and I really did compose this on the command line rather than in a script file.

Let's see if Tim's plotting involves a trip to Scriptistan as well...

Tim's traveling

While I have been out of the country for a few weeks, I didn't have to visit Scriptistan to get my fu for this week. The PowerShell portion is a bit long, but I wouldn't classify it as a script even though it has a semicolon in it. We do have lots of ForEach-Object cmdlets, Select-String cmdlets, and Regular Expressions. And you know what they say about Regular Expressions: Much like violence, if Regular Expressions aren't working, you aren't using enough of it.

Instead of starting off with some ultraviolent fu, let's build up to that before we wield the energy to destroy medium-large buildings. First, let's find the object number and its references.

PS C:\> C:\Python25\python.exe pdf-parser.py CLKF.pdf |
Select-String -Pattern "(?<=^obj\s)\d+" -Context 0,2

> obj 5 0
Referencing: 6 0 R
> obj 6 0
> obj 15 0
Referencing: 16 0 R
> obj 4 0
Type: /Page
Referencing: 3 0 R, 11 0 R, 12 0 R, 13 0 R, 5 0 R

The output of pdf-parser.py is piped into the Select-String cmdlet which finds lines that start with "obj", are followed by a space (\s), then one or more digits (\d+). The Context switch is used to get the next two lines so we can later use the "Referencing" portion.

You might also notice our regular expression uses a "positive look behind", meaning that it needs to see "obj " before the number we want. This way we end up with just the object number being selected and not the useless text in front of it. This is demonstrated by Matches object shown below.

PS C:\> C:\Python25\python.exe pdf-parser.py CLKF.pdf |
Select-String -Pattern "(?<=^obj\s)[0-9]+" -Context 0,2 | Format-List

IgnoreCase : True
LineNumber : 7
Line : obj 5 0
Filename : InputStream
Path : InputStream
Pattern : (?<=^obj\s)[0-9]+
Context : Microsoft.PowerShell.Commands.MatchInfoContext
Matches : {5}

To parse the Referencing line we need we need to use some more violence regular expressions on the Context object. First, let's see what the Context object looks like. To do this we previous command into the command below to see the available properties.

PS C:\> ... | Select-Object -ExcludeProperty Context | Get-Member

TypeName: Microsoft.PowerShell.Commands.MatchInfoContext

Name MemberType Definition
---- ---------- ----------
Clone Method System.Object Clone()
Equals Method bool Equals(System.Object obj)
GetHashCode Method int GetHashCode()
GetType Method type GetType()
ToString Method string ToString()
DisplayPostContext Property System.String[] DisplayPostContext {get;set;}
DisplayPreContext Property System.String[] DisplayPreContext {get;set;}
PostContext Property System.String[] PostContext {get;set;}
PreContext Property System.String[] PreContext {get;set;}

The PostContext property contains the two lines that followed our initial match. We can access the second line by access the row with an index of 1 (remember, base zero, so 1=2).

PS C:\> C:\Python25\python.exe pdf-parser.py CLKF.pdf |
Select-String -Pattern "(?<=^obj\s)[0-9]+" -Context 0,2 |
ForEach-Object { $objnum = $_.matches[0].Value; $_.Context.PostContext[1] }

Referencing: 6 0 R
Referencing: 16 0 R
Referencing: 25 0 R

The above command saves the current object number in $objnum and then outputs the second line of the PostContext.

Finally, we need to parse the Context with ultra violence more regular expressions and display our output.

PS C:\> C:\Python25\python.exe pdf-parser.py CLKF.pdf |
Select-String -Pattern "(?<=^obj\s)[0-9]+" -Context 0,2 |
% { $objnum = $_.matches[0].Value; $_.Context.PostContext[1] |
Select-String "(\d+(?=\s0\sR))" -AllMatches | Select-Object -ExpandProperty matches |
ForEach-Object { "$($objnum)->$($_.Value)" } }

5 -> 6;
4 -> 3;
4 -> 11;
4 -> 12;
4 -> 13;
4 -> 5;

The second line of PostContect, the Referencing line, is piped into the Select-String cmdlet where we use our regular expression to look for the a number followed by "<space>0<space>R". The AllMatches switch is used to find all the objects referenced. We then Expand the matches property so we can work with each match inside our ForEach-Object cmdlet where we output the original object number and the found reference.

Tuesday, October 4, 2011

Episode #159: Portalogical Exam

Tim finally has an idea

Sadly, we've been away for two weeks due to lack of new, original ideas for posts. BUT! I came up with and idea. Yep, all by myself too. (By the way, if you have an idea for an episode send it in)

During my day job pen testing, I regularly look at nmap results to see what services are available. I like to get a high level look at the open ports. For example, lots of tcp/445 means a bunch of Windows boxes. It is also useful to quickly see the one off ports, and in this line of work, the one offs can be quite important. One unique service may be legacy, special (a.k.a. not patched), or forgotten.

Nmap has a number of output options: XML, grep'able output, and standard nmap output. PowerShell really favors objects, which means that XML will work great. So let's start off by reading the file and parsing it as XML.

PS C:\> [xml](Get-Content nmap.xml)

xml xml-stylesheet #comment nmaprun
--- -------------- -------- -------
version="1.0" href="file:///usr/local/sh... Nmap 5.51 scan initiated ... nmaprun

Get-Content (alias gc, cat, type) is used to read the file, then [xml] parses it and converts it to an XML object. After we have an XML object, we can see all the nodes of the document. To access each node we access it like any property:

PS C:\> ([xml](gc nmap.xml)).nmaprun.host

But each host has a ports property that needs to be expanded, and each ports property has multiple port properties to be expanded. To do this we use a pair of Select-Object cmdlets with the -ExpandProperty switch (-ex for short).

PS C:\> ([xml](gc .nmap.xml)).nmaprun.host | select -expand ports | 
select -ExpandProperty port

protocol portid state service
-------- ------ ----- -------
tcp 22 state service
tcp 23 state service
tcp 80 state service
tcp 443 state service
tcp 80 state service
tcp 443 state service

Nmap can have information on closed ports, so I like to make sure that I am only looking at open ports. We use the Where-Object cmdlet (alias ?) to filter for ports that are open. Each port has a state element and a state property, and we'll check if the state of the state (yep, that's right) is open:

PS C> ... | ? { $_.state.state -eq "open" }

The output is the same, just with extra filtering. Now all we need to do is count. To do that, we use the Group-Object cmdlet (alias group).

PS C:\> ([xml](gc nmap.xml)).nmaprun.host | select -expand ports | 
select -ExpandProperty port | ? { $_.state.state -eq "open" } |
group protocol,portid -NoElement

Count Name
----- ----
12 tcp, 80
1 tcp, 25
12 tcp, 443
2 tcp, 53

The -NoElement switch tells the cmdlet to discard the individual objects and just give us the group information.

Of course, if we are looking for patterns or one off ports we need use the Sort-Object cmdlet (alias sort) by the Count and use the -Descending switch (-desc for short)..

PS C:\> ([xml](gc nmap.xml)).nmaprun.host | select -expand ports | 
select -ExpandProperty port | ? { $_.state.state -eq "open" } |
group protocol,portid -NoElement | sort count -desc

Count Name
----- ----
12 tcp, 443
12 tcp, 80
2 tcp, 53
2 tcp, 18264

Now that's handy, but many times I have multiple scans, like a UDP scan and a TCP scan. If we want to combine multiple scans into one table we can do it relatively easily.

PS C:\> ls *.xml | % { ([xml](gc $_)).nmaprun.host } | select -expand ports |
select -ExpandProperty port | ? { $_.state.state -eq "open" } |
group protocol,portid -NoElement | sort count -desc

Count Name
----- ----
12 tcp, 443
12 tcp, 80
3 udp, 161
2 tcp, 53
2 tcp, 18264

The beauty of PowerShell's pipeline is that we can use any method we want to pick the files, then feed them into the next command with a ForEach-Object loop (alias %).

Now that I've checked all my ports, it's time for Hal to get his checked.

Hal examines his options

Tim, when you get to be my age, you'll get all of your ports checked on an annual basis.

Now let's examine this so-called idea of Tim's. Oh sure, XML is all fine and dandy for fancy scripting languages like Powershell. But you'll notice he didn't even attempt to do this in CMD.EXE. Weakling.

While XML is generally a multi-line format and not typically conducive to shell utilities that operate on a "line at a time" basis, for something simple like this we can easily hack together some code. In the XML format, the lines that show open ports have a regular format:

<port protocol="tcp" portid="443"><state state="open" />...</port>

So pardon me while I throw down some sed:

$ sed 's/^<port protocol="\([^"]*\)" portid="\([^"]*\)"><state state="open".*/\2\/\1/; 
t p; d; :p' test.xml | sort | uniq -c | sort -nr -k1 -k2

6 443/tcp
5 80/tcp
3 22/tcp
2 3306/tcp
1 9100/tcp

The first part of the sed expression is a substitution that matches the protocol name and port number and replaces the entire line with just "<port>/<protocol>". Now I only want to output the lines where this substitution succeeded, so I use "t p" to branch to the label ":p" whenever the substitution happens. If we don't branch, then we hit the "d" command to drop the pattern space without printing and move onto the next line. Since the ":p" label we jump to on a successful substitution is an empty block, sed just prints the pattern space and moves onto the next line. This is a useful sed idiom for only printing our matching lines.

The rest of the pipeline puts the output lines from sed into sorted order so we can feed them into "uniq -c" to count the occurrences of each line. After that we use sort again to do a descending numeric sort ("-nr") of first the counts ("-k1") and then the port numbers ("-k2"). And that give us the output we want.

I actually find the so-called "grep-able" output format of Nmap kind of a pain to deal with for this kind of thing. That's because Nmap insists on jamming all of the port information together into delimited, but variable length lines like this:

Host: (test.deer-run.com) Ports: 22/open/tcp//ssh///, 
25/open/tcp//smtp///, 53/open/tcp//domain///, 80/open/tcp//http///,
139/open/tcp//netbios-ssn///, 143/open/tcp//imap///, 443/open/tcp//https///,
445/open/tcp//microsoft-ds///, 514/open/tcp//shell///, 587/open/tcp//submission///,
601/open/tcp/////, 902/open/tcp//iss-realsecure-sensor///, 993/open/tcp//imaps///,
1723/open/tcp//pptp///, 8009/open/tcp//ajp13/// Seq Index: 3221019...

So to handle this problem, I'm just going to use tr to convert the spaces to newlines, forcing the output to have a single port entry per line. After that, it's just awk:

$ cat test.gnmap | tr ' ' \\n | awk -F/ '/\/\/\// {print $1 "/" $3}' | 
sort | uniq -c | sort -nr -k1 -k2

6 443/tcp
5 80/tcp
3 22/tcp
2 3306/tcp
1 9100/tcp

The port listings all end with "///", so I use awk to match those lines and output the port and protocol fields. Notice the "-F/" option so that awk uses the slash character as the field delimiter instead of whitespace. After that it's the same "sort ... | uniq -c | sort ..." pipeline we used in the last case to format the output.

The easiest case is actually the regular Nmap output:

$ awk '/^[0-9]/ {print $1}' test.nmap | sort | uniq -c | sort -nr -k1 -k2
6 443/tcp
5 80/tcp
3 22/tcp
2 3306/tcp
1 9100/tcp

The lines about open ports are the only ones in the output that start with digits. So it's a quick awk expression to match these lines and output the port specifier. After that, we use the same pipeline we used in the previous examples to format the output appropriately.

So, Tim, my shell may not have built-in tools to parse XML but it's apparently three times the shell that yours is. Stick that in your port and smoke it.

Because Davide sed So

Proving he's not just a whiz at awk, Davide Brini wrote in with a more elegant sed idiom for just printing the lines that match our substitution:

$ sed -n 's/^<port protocol="\([^"]*\)" portid="\([^"]*\)"><state state="open".*/\2\/\1/p' test.xml | 
sort | uniq -c | sort -nr -k1 -k2

6 443/tcp
5 80/tcp
3 22/tcp
2 3306/tcp
1 9100/tcp

"sed -n ..." suppresses the normal sed output and "s/.../.../p" causes the lines that match our substitution to be printed out. And that's much easier. Thanks, Davide!

Tuesday, September 13, 2011

Revisiting Episode #151: Readers' Revenge!

Hal's a football widow

Well it's the start of football season here in the US, and Tim's locked himself in his "Man Cave" to catch all of the action. For our readers outside the US, our version of football is played with a "ball" that isn't at all round and which rarely touches the players' feet. We just called it football to confuse the rest of the world.

Since football really isn't my sport, I figured I'd spend some time this weekend catching up on reader responses to some of our past Episodes. Back in Episode #151 I sort of threw down the gauntlet at the end of my solution when I stated, "I'm sure I could accomplish the same thing with some similar looking awk code, but it was fun trying to do this with just shell built-ins." I figured that mention of an awk solution would bring an email from Davide Brini, and in this I was not disappointed.

Davide throws down

Let's just get straight to the awk, shall we:

echo -n $PATH | awk 'BEGIN { RS = ":" }; 
!a[$0]++ { printf "%s%s", s, $0; s = RS };
END { print "" }'

There's some sneaky clever bits here that bear some explanation:

  • In the BEGIN block, Davide is setting "RS"-- the "record separator" variable-- to colon. That means awk will treat each element of our input path as a separate record, automatically looping over each individual element and evaluating the statement in the middle of the example above.

  • That statement begins with a conditional operator, "!a[$0]", combined with an auto-increment, "++". In the conditional expression, "a" is an associative array that's being indexed with the elements of our $PATH. "$0" is the current "record" in the path that awk is giving us. So "!a[$0]" is true if we don't already have an entry for the current $PATH element in the array "a".

  • True or false, however, the auto-increment operator is going to add one to the value in "a[$0]", ensuring that if we run into a duplicate later in $PATH then the "!a[$0]" condition will return false.

  • If "!a[$0]" is true (it's the first time we've encounted a given directory in $PATH), then we execute the block after the conditional. That prints the value of variable "s" followed by the directory name, "$0". The first time through the loop, "s" will be null and we just print the directory. However, the second statement in the loop sets "s" to be colon (the value of "RS"), so in future iterations we'll print a colon before the directory name, so that everything gets nicely colon-separated.

  • In the end block, we output a null value. But this has the side effect of spitting out a newline at the end of our output, which makes things more readable.

Phew! That's some fancy awk, Davide. Thanks for the code!

Who let the shells out?

What I wasn't expecting was a note from loyal reader Daniel Miller, who took my whole "just shell built-ins" comment quite seriously. I had some sed mixed up in my final solution, but Daniel provided the following shell-only solution:

$ declare -A p
$ for d in ${PATH//:/ }; do [[ ${p[$d]} ]] || u[$((c++))]=$d; p[$d]=1; done
$ IFS=:
$ echo "${u[*]}"
$ unset IFS

I am limp with admiration. Daniel replaces the sed in my loop with the shell variable substitution operator "${var/.../...}" that we've used in previous Episodes. The clever bit, though, is that he's added a new array called "u" to the mix to keep track of the unique directory names, in order, as we progress through the elements of $PATH.

Inside the loop we check our associative array "p" as before to see whether we've encountered a given directory, $d, or not. If this is the first time we've seen $d, then "[[ ${p[$d]} ]]" will be false, and so we'll execute the statement after the "||", which adds the directory name to our array "u". The clever bit is the "$((c++))" in the array index, which uses "c" as an auto-incrementing counter variable to keep extending the "u" array as necessary to add new directory names.

You'll notice, however, that we're not outputting anything inside the loop. After the loop is finished, Daniel uses "echo "${n[*]}"" to output all of the elements of "n" with a single statement. The neat thing about the "${n[*]}" syntax is that it uses the first character of IFS to separate the array elements as they're being printed. So Daniel sets IFS to colon before the echo statement-- and then unsets it afterwards because having IFS set to colon is surely going to mess up later commands! In fact, Daniel suggests putting all of this mess into a shell function where you can declare IFS as a local variable and not mess up other commands.

Anyway, thanks as always to our readers for their efforts to improve our humble shell efforts. I'll see if I can drag Tim out of the Man Cave in time for next week's Episode...

Tuesday, September 6, 2011

Episode #158: The Old Switcheroo

Tim checks the mail

I went to the mailbox and what do you know, more mail! Chris Sakalis writes in:

Dear command line saolins,

The year before, I was given a bash assignment asking to search the Linux kernel code and replace occurrences of "Linus" with another string, while writing the filename, line number and the new line where a change was made in a file. While this has no practical use, it can be easily generalized to any search and replace/log operation. At first I thought sed was the best and fastest choice, but I couldn't manage writing the edited lines in a file. So I used bash built-in commands and made this:


However not only this isn't a one liner but also it's very slow. Actually, running grep and sed twice, one for the log and one for the actual replacement is faster.
Can you think of any way to turn this into a fast one liner?

Well sir, I can do it in one, albeit quite long, line!

PS C:\> Get-ChildItem -Recurse |  Where-Object { -not $_.PSIsContainer -and
(Select-String -Path $_ -Pattern Linus -AllMatches | Tee-Object -Variable lines) } |
ForEach-Object { $lines; $f = $_.FullName; Move-Item $f "$($f).orig";
Get-Content "$($f).orig" | ForEach-Object { $_ -replace "Linus", "Bill" } | Set-Content $f }

file1.txt:15:Some Text Linus Linus Linus Some Text
somedir\file3.txt:13:My Name is Linus
somedir\file3.txt:37:Blah Linus Blah

We start off by getting a recursive directory listing and piping it into the Where-Object cmdlet for filtering. The first portion of our filter looks for objects that aren't containers, so we just get files. Inside the filter we also search the file with Select-String to find all the occurrences of "Linus". The results are piped into Tee-Object which will output the data and save it in the variable $lines so we can display it later. That sounds redundant, but it isn't. Our filter needs to evaluate the The Select-String + Tee-Object as True or False so it can determine if it should pass the objects. Any non-null output will evaluate to True, while no output will evaluate to false. Any such output will be eaten by Where-Object so it won't be displayed. In short, if it finds a string in the file matching "Linus" it will evaluate to True. We are then left with objects that are files and contain "Linus". Now to do the renaming and search and replace.

The ForEach-Object cmdlet will operate on each file that makes it through the filter. We first output the $lines variable to display the lines that contained "Linus". Next, we save the full path of the file in the variable $f. The file is then renamed with an appended ".orig". Next, we use Get-Content to pipe the contents into another ForEach-Object cmdlet so we can operate on each line. Inside this loop we do the search and replace. Finally, the results are piped into Set-Content to write the file.

As usual, we can shorten the command using aliases and positional parameters.

PS C:\> ls -r |  ? { !$_.PSIsContainer -and   (Select-String -Path $_ -Pattern Linus -AllMatches | tee -var lines) } |
% { $lines; $f = $_.FullName; mv $f "$($f).orig"; gc "$($f).orig" | % { $_ -replace "Linus", "Bill" } | sc $f }

file1.txt:15:Some Text Linus Linus Linus Some Text
somedir\file3.txt:13:My Name is Linus
somedir\file3.txt:37:Blah Linus Blah

The output displayed above is the default output of the MatchInfo object. Since it is an object we could display it differently if we like by piping it into Select-Object and picking the properties we would like to see.

... $lines | Select-Object Path, LineNumber, Line ...

Path LineNumber Line
---- ---------- ----
C:\file1.txt 12 Linus
C:\file1.txt 15 Some Text Linus Linus Linus Some Text
C:\somedir\file3.txt 13 My Name is Linus
C:\somedir\file3.txt 37 Blah Linus Blah

Hal, do you have a one liner for us?

Hal checks his briefs

I don't think a one-liner is the way to go here. A little "divide and conquer" will serve us better.

There are really two problems in this challenge. The first is to do our string replacement, and I can handle that with a little find/sed action:

$ find testdir -type f | xargs sed -i.orig 's/Linus/Bill/g'

Here I'm using the "-i" option with sed to do "in place" editing. A copy of the unmodified file will be saved with the extension ".orig" and the original file name will contain the modified version. The only problem is that sed will make a *.orig copy for every file found-- even if it makes no changes. I'd actually like to clean away any *.orig files that are the same as the "new" version of the file, but I can take care of that in the second part of the solution.

We can use diff to find the changed lines in a given file. But the output of diff needs a little massaging to be useful:

$ diff foo.c foo.c.orig
< Bill
< wumpus Bill Bill Bill
> Linus
> wumpus Linus Linus Linus
< Bill wumpus Linux Linux Bill
< Bill
> Linus wumpus Linux Linux Linus
> Linus
< Bill
> Linus

I don't care about anything except the lines that look like "2,3c2,3". Those lines are giving me the changed line numbers ("change lines 2-3 in file #1 to look like lines 2-3 in file #2"). I can use awk to match the lines I want, split them on "c" ("-Fc") and print out the first set of line numbers. Something like this for example:

$ diff foo.c foo.c.orig | awk -Fc '/^[0-9]/ { print $1 }'

Then I can add a bit more action with tr to convert the commas to dashes and the newlines to commas:

$ diff foo.c foo.c.orig | awk -Fc '/^[0-9]/ { print $1 }' | tr ,\\n -,

I've got a trailing comma and no newline, but I've basically got a list of the changed line numbers from a single file. Now all I need to do is wrap the whole thing up in a loop:

find testdir -name \*.orig | while read file; do 
diff=$(diff ${file/%.orig/} $file |
awk -Fc '/^[0-9]/ { print $1 }' | tr ,\\n -,);
[[ "$diff" ]] &&
echo ${file/%.orig/}: $diff ||
rm "$file";
done | sed 's/,$//'

In the first statement of the loop we assign the output of our diff pipeline to a variable called $diff. In the second statement of the loop I'm using the short-circuit logical operators "&&" and "||" as a quick and dirty "if-then-else". Essentially, if we got any output in $diff then we output the file name and the list of changed line numbers. Otherwise we remove the *.orig file because the file was not changed. Finally, I use another sed expression at the end of the loop to strip off the trailing commas from each line of output.

While this two-part solution works fine, Chris and I spent some time trying to figure out how to optimize the solution further (and, yes, laughing about how hard this week's challenge was going to be for Tim). Chris had the crucial insight that running sed on every file-- even if it doesn't include the string we want to replace-- and then having to diff every single file was a huge waste. By being selective at the beginning, we can actually save a lot of time:

# search and replace
find testdir -type f | xargs grep -l Linus | xargs sed -i.orig 's/Linus/Bill/g'

# Output changed lines
find testdir -name \*.orig | while read file; do
echo ${file/%.orig/}: $(diff ${file/%.orig/} $file |
awk -Fc '/^[0-9]/ { print $1 }' | tr ,\\n -,)
done | sed 's/,$//'

Notice that we've introduced an "xargs grep -l Linus" into the first shell pipeline. So the only files that get piped into sed are ones that actually contain the string we're looking to replace. That means in the while loop, any *.orig file we find will actually contain at least one change. So we don't need to have a conditional inside the loop anymore. And in general we have many fewer files to work on, which also saves time. For Chris' sample data, the above solution was twice as fast as our original loop.

So while it seems a little weird to use grep to search for our string before using sed to modify the files, in this case it actually saves us a lot of work. If nearly all of your input files contained the string you were replacing, then the grep would most likely make the solution take longer. But if the replacements are sparse, then pre-checking with grep is the way to go.

So thanks Chris for a fun challenge... and for creating another "character building" opportunity for Tim...

Steven can do that in one line!

Loyal reader Steven Tonge contacted us via Twitter with the following one-liner:

find testdir -type f | xargs grep -n Linus | tee lines-changed.txt | 
cut -f1 -d: | uniq | xargs sed -i.orig 's/Linus/Bill/g'

Bonus points for using tee, Steven!

The first part of the pipeline uses "grep -n" to look for the string we want to change. The "-n" outputs the line number of the match, and grep will automatically include the file name because we're grepping against multiple files. So the output that gets fed into tee looks like this:

testdir/foo.c:3:wumpus Linus Linus Linus
testdir/foo.c:5:Linus wumpus Linux Linux Linus

The tee command makes sure we save a copy of this output into the file lines-changed.txt, so that we have a record of the lines that were changed.

But tee also passes the output from grep along to the next part of the pipeline. Here we use cut to split out the file name, and uniq to make sure we only pass one copy of the file name along to our "xargs sed ..." command.

So Steven stumps the CLKF Masters with a sexy little one-liner. Awesome work!

Tuesday, August 23, 2011

Episode #157: I Ain't No Fortunate One

Hal to the rescue!

We were kicking around ideas for this week's Episode and Tim suggested a little command-line "Russian Roulette". The plan was to come up with some shell fu that would pick a random number between one and six. When the result came up one, you "lost" and the command would randomly delete a file in your home directory.

Holy carp, Tim! This is the kind of thing you do for fun? Why don't we do an Episode about seeing how many files you can delete from your OS before it stops working? It's not like our readers would come looking for us with torches and pitchforks or anything. Geez.

Now I'm as big a fan of rolling the dice as anybody, but let's try something a bit more gentle. What I'm going to do is pick random sayings out of the data files used by the "fortune" program. For those of you who've never looked at these files before, they're just text files with various pithy quotes delimited by "%" markers:

$ head -15 /usr/share/games/fortunes/linux

"How do you pronounce SunOS?" "Just like you hear it, with a big SOS"
-- dedicated to Roland Kaltefleiter
finlandia:~> apropos win
win: nothing appropriate.
C:\> WIN
Bad command or filename

Loading Microsoft Windows ...
Linux ext2fs has been stable for a long time, now it's time to break it
-- Linuxkongre├č '95 in Berlin

In order to pick one of these quotes randomly, I'm going to need to know how many there are in the file:

$ numfortunes=$(grep '^%$' /usr/share/games/fortunes/linux | wc -l)

$ echo $numfortunes

By the way, there's no off-by-one error here because there actually is a trailing "%" as the last line of the file.

OK, now that we know the number of fortunes we can pick from, I can choose which numbered fortune I want with a little modular arithmetic:

$ echo $(( $RANDOM % $numfortunes + 1 ))

$ echo $(( $RANDOM % $numfortunes + 1 ))
$ echo $(( $RANDOM % $numfortunes + 1 ))

I've used $RANDOM a couple of times in past Episodes-- it's simply a special shell variable that produces a random value between 0 and 32K. I'm just using arithmetic here to turn that into a value between 1 and $numfortunes.

But having selected the number of the fortune we want to output, how do we actually pull it out of the file and print it? Sounds like a job for awk:

$ awk "BEGIN { RS = \"%\" }; 

NR == $(( $RANDOM % $numfortunes + 1 ))" /usr/share/games/fortunes/linux

#if _FP_W_TYPE_SIZE < 32
#error "Here's a nickel kid. Go buy yourself a real computer."
-- linux/arch/sparc64/double.h

In awk, the "BEGIN { ... }" block happens before the input file(s) get read or any of the other awk statements get executed. Here I'm setting the "record seperator" (RS) variable to the percent sign. So rather than pulling the file apart line-by-line (awk's default RS value is newline), awk will treat each block of text between percent signs as an individual record.

Once that's happening, selecting the correct record is easy. We use our expression for picking a random fortune number and wait until awk has read that many records. The variable NR tracks the number of records seen, so when NR equals our random value we've reached the record we want to output. Since I don't have an action block after the conditional expression, "{ print }" is assumed and my fortune gets printed.

By the way, I'm sure that some of you are wondering why I'm using $RANDOM rather than the built-in rand() function in awk. Turns out that some versions of awk don't support rand(), so my method above is more portable. If your awk does support rand(), then the command would be:

$ awk "BEGIN { RS = \"%\"; srand(); sel = int(rand()*$numfortunes)+1 }; NR == sel" \


panic("Foooooooood fight!");
-- In the kernel source aha1542.c, after detecting a bad segment list

Frankly, the need to call srand() to reseed the random number generator at the start of the program, makes using the built-in rand() function a lot less attractive than just going with $RANDOM. By the way, our arithmetic is a little different here because rand() produces a floating point number between 0 and 1.

Meh. I like the $RANDOM version better.

So Tim, if you can stop deleting your own files for a second, let's see what you've got this week.

Tim steps into Mambi-pambi-land

We are 157 Episodes in and Hal (and Ed) still aren't up for manly commands; not willing to put it all on the line. Instead, we get fortune cookies. Alright, but once you guys grow some chest hair, let's throw down mano a mano computo a computo.

Let's start with cmd.exe. Similar to what Hal did we first need to figure out how many lines only contain a percent sign.

C:\> findstr /r "^%$" fortunes.txt | find /c "%"


We use FindStr with the /r to use a regular expression to look for the beginning of the line (^), a percent sign, end of line ($). Note, the file has to be saved with the Carriage Return Line Feed (CRLF) that Windows is used to, and not just a Carriage Return (CR) as text files are normally saved in Linux. The results are piped into Find with the /c switch to actually do the counting. But you mask ask, "Why both commands?"

Unfortunately, we can't just use Find, since there is no mechanism to ensure the percent sign is on a line by itself. We also can't just use FindStr as it doesn't count. Now that we have the number, lets cram it into a variable as an integer.

C:\> set /a count="findstr /r "%$" fortunes.txt ^| find /c ^"%^""

Divide by zero error.

I tried all sort of syntax options, different quotes, and escaping (using ^) to fix this error, but no luck. However, if you wrap it in a For loop and use the loop to handle the command output, it works. Why? Who knows. Don't come to cmd.exe if you are looking for things to make sense.

C:\> cmd.exe /v:on /c "for /F %i in ('findstr /r "^%$" fortunes.txt ^| find /c "%"') do @set /a x=%i"


This command uses delayed variable expansion (/v:on) so we can set a variable and use it right away. We then use a For loop that "loops" (only one loop) through the command output.

With a slight modification we can get a random fortune number.

C:\> cmd.exe /v:on /c "for /F %i in ('findstr /r "^%$" fortunes.txt ^| find /c "%"') do @set /a rnd=%random% % %i"

C:\> cmd.exe /v:on /c "for /F %i in ('findstr /r "^%$" fortunes.txt ^| find /c "%"') do @set /a rnd=%random% % %i"
C:\> cmd.exe /v:on /c "for /F %i in ('findstr /r "^%$" fortunes.txt ^| find /c "%"') do @set /a rnd=%random% % %i"
C:\> cmd.exe /v:on /c "for /F %i in ('findstr /r "^%$" fortunes.txt ^| find /c "%"') do @set /a rnd=%random% % %i"

We use the variable %RANDOM% and the modulus operator (%) to select a random number between 0 and 333 by using the method developed in Episode #49.

Now we need to find our relevant line(s) and display them. Of course, we will need another For loop to do this.

C:\> cmd.exe /v:on /c "(for /F %i in ('findstr /r "^%$" fortunes.txt ^| find /c "%"') do @set /a rnd=%random% % %i > NUL) & @set /a itemnum=0 > NUL & for /F "tokens=* delims=" %j in (fortunes.txt) do @(echo %j| findstr /r "^%$" > NUL && set /a itemnum=!itemnum!+1 > NUL || if !itemnum!==!rnd! echo %j)"

Be cheerful while you are alive.
-- Phathotep, 24th Century B.C.

Before our second For loop we initialize the itemnum counter, which will be used to keep track of the current comment number. We use base 0 for counting as that is what the modulus output gives us.

The options used with the For loop sets the tokens and delims options so we get the whole line (tokens=*) including leading spaces (delimes=<nothing>). Next we use Echo and FindStr to check if the current line contains only a percent sign. If the command has output it is successful, and with our short circuit logical And (&&) we increment the itemnum counter. If the line is not a percent sign, then the logical Or (||) will execute our If statement.

If the our itemnum counter matches the random number, then we output the current line. As the itemnum counter does not increment until the next time is sees a percent sign, it can output multiple lines of text.

To be honest, this command was a big pain. More than once I wished my `puter had been shot by that Russian bullet. At least the PowerShell version is much easier.


PowerShell is great with objects, so let's turn each fortune into an object.

PS C:\> $f = ((gc fortunes.txt) -join "`n") -split '^%$', 0, "multiline"

This command gives us an array of fortunes. We read in the file with Get-Content (alias gc). Get-Content will return an array of rows, but this isn't what we want. We then recombine all the lines using the New Line characters (`n) between each element. We then recut the string using the Split operator and some fancy options.

We give the split operator three parameters. The first is the regular expression to use in splitting. The second is the number of maximum number of substrings, where 0 means return everything. The third parameter is used to enable the MultiLine option so the split operator will handle multiple lines.

Now we have a list of fortunes and we can count how many.

PS C:\> $f.length


Wait, 335? What is going on? Let's check the last fortune. Remember, we are working with base 0, so the last item is 334.

PS C:\> $f[334]


This happens because the last item is a % and we have characters after it, a Carriage Return Line Feed. As long as we know this we can work around it. Now to output a random line number.

PS C:\> $f = ((gc fortunes.txt) -join "`n") -split '^%$', 0, "multiline"

PS C:\> $f[(Get-Random -Maximum $f.length) - 1]

Questionable day.

Ask somebody something.

PS C:\> $f[(Get-Random -Maximum $f.length) - 1]

Don't look back, the lemmings are gaining on you.

PS C:\> $f[(Get-Random -Maximum $f.length) - 1]

You need no longer worry about the future. This time tomorrow you'll be dead.

This may be my last week as I was just informed that "You will be traveling and coming into a fortune." YIPEE! I'm off to Tahiti! (Hopefully)

Tuesday, August 16, 2011

Episode #156: Row, Row, Row... You're Columns!

Hal receives stroking via email

I recently received an email from my old friend Frank McClain:

It is with much humility that I kneel before the masters and ask this request, which I am certain is but a simple task for such honored figures.

Well a little sucking up never hurts, Frank. Let's see what your issue is:

Tab-delimited text file containing multiple email addresses per row. The first such field is sender, and that's fine. The following fields are recipients. The first recipient can stay where it is, but the following for that row need to be moved individually into column-format below the first recipient, in new rows. If there is only one recipient in a row, nothing more needs to be done with that row.


7/27/2011    15:40:00    steve.jobes@place.com    jmarcus@someplace.com

ronsmith@someplace.com pgonzalez@someplace.com
6/17/2011 15:19:00 ssummers@someplace.com kevin.smart@provider.com
Pamla.Barras@store.com pamlabs@webmail.com
5/14/2011 12:35:00 amartelli@someplace.com apiska@business.com
jmilch@provider.net pampwanla@webmail.com

What I need to end up with is:

7/27/2011    15:40:00    steve.jobes@place.com    jmarcus@someplace.com

7/27/2011 15:40:00 steve.jobes@place.com ronsmith@someplace.com
7/27/2011 15:40:00 steve.jobes@place.com pgonzalez@someplace.com
6/17/2011 15:19:00 ssummers@someplace.com kevin.smart@provider.com
6/17/2011 15:19:00 ssummers@someplace.com Pamla.Barras@store.com
6/17/2011 15:19:00 ssummers@someplace.com pamlabs@webmail.com
5/14/2011 12:35:00 amartelli@someplace.com apiska@business.com
5/14/2011 12:35:00 amartelli@someplace.com jmilch@provider.net
5/14/2011 12:35:00 amartelli@someplace.com pampwanla@webmail.com

No worries, Frank. I got this one.

It's pretty clear to me that two nested loops are going to be required. We'll need one loop to read each line, and then another loop to output a series of lines listing each recipient individually:

$ while read date time from recips; do 

for r in $recips; do
echo -e "$date\t$time\t$from\t$r";
done <input-file

7/27/2011 15:40:00 steve.jobes@place.com jmarcus@someplace.com
7/27/2011 15:40:00 steve.jobes@place.com ronsmith@someplace.com
7/27/2011 15:40:00 steve.jobes@place.com pgonzalez@someplace.com
6/17/2011 15:19:00 ssummers@someplace.com kevin.smart@provider.com
6/17/2011 15:19:00 ssummers@someplace.com Pamla.Barras@store.com
6/17/2011 15:19:00 ssummers@someplace.com pamlabs@webmail.com
5/14/2011 12:35:00 amartelli@someplace.com apiska@business.com
5/14/2011 12:35:00 amartelli@someplace.com jmilch@provider.net
5/14/2011 12:35:00 amartelli@someplace.com pampwanla@webmail.com

So the outer "while read ..." loop is what we're using to read the input file-- notice the "<input-file" hiding at the end of the loop construct. Since read will automatically split up fields on whitespace for us, we can quickly pull out the date, time, and from address. We then have one more variable, recips, that gobbles up everything else on the line-- i.e., all of the recipient addresses.

But the recipient addresses are themselves whitespace delimited, so we can just whack $recips down into our for loop and iterate over each email address in the list. For each one of those recipients we output a tab-delimited line of output containing $date, $time, $from, and the current recipient, $r. We need to use "echo -e" here so that the "\t"s get expanded as tabs.

Nothing could be easier. In fact, I bet Tim could even handle this one in CMD.EXE. But Frank was so moved by our solution that he replied:

Your meaningless servant is like unto a worm to be crushed beneath the might of your foot, nay, even but a toe. The mere fact that the Master has deemed to write an honored response to this insignificant gnat has caused tears of joy to stream in a veritable rain from my eyes, too blind to look upon the shining radiance of the Master.

Not much we can add to that.

Tim crushes worms

Because Frank asked so nicely (and because Hal threw me under the bus) I'll do some ugly cmd, first.

C:\> cmd.exe /v:on /c "for /f "tokens=1-25" %a in (input.txt) do @(

echo %a %b %c %d &&
echo %e | find "@" > NUL && echo %a %b %c %d %e &&
echo %f | find "@" > NUL && echo %a %b %c %d %f &&
echo %g | find "@" > NUL && echo %a %b %c %d %g &&
echo %y | find "@" > NUL && echo %a %b %c %d %y)"

In this command, we start off by reading our input file. The default delimiters of tab and space will work fine for us because 1) the only space we have is between the date and time and 2) using just the tab as a delimiter is a pain. We can do it, but we have to start a new shell with tab completion disabled, and I like tab completion.

Once we read the file we output the date (%a), time (%b), sender (%c), and the first recipient (%d). Next, we output the second recipient and see if it contains an "@". If it doesn't then our short circuit Logical And (&&) will stop the rest of the line from executing. If it does then we output the second recipient (%e). We do the same for the third (%f) through 22nd (%y) recipient (Frank said 22 was the max).

It isn't a brief command, but I do think it is quite elegant in its form and function. Building such a big command with just basic building blocks is like building fire with sticks. Any many times I feel that with cmd all I have is sticks.

Now for PowerShell...

The PowerShell version is pretty similar to what Hal did but with his Foreach loop replaced with a For loop and a little extra math.

PS C:\> gc input.txt | % {$s = $_.split("`t");

for ($i=2; $i -lt $s.length; $i++) { write-host $s[0] $s[1] $s[$i] } }

7/27/2011 15:40:00 steve.jobes@place.com jmarcus@someplace.com
7/27/2011 15:40:00 steve.jobes@place.com ronsmith@someplace.com
7/27/2011 15:40:00 steve.jobes@place.com pgonzalez@someplace.com
6/17/2011 15:19:00 ssummers@someplace.com kevin.smart@provider.com

We use Get-Content (alias gc) to read in our file. We then use the ForEach-Object cmdlet (alias %) to operate on each line. Each line is split, using tab as delimeter, and held in the array $s. We then use a for loop to output the 0th element (date), the 1st element (sender), and repent held in the Nth element (Ok, so technically the Ith element). This gives us output, but of course with PowerShell the right way to do it is with objects.

PS C:\> $r = "" | Select Date, Sender, Recipient

PS C:\> gc input.txt | % {$s = $_.split("`t"); $r.Date = (Get-Date $s[0]); $r.Sender = $s[1];
for ($i=2; $i -lt $s.length; $i++) {$r.Recipient = $s[$i]; $r}}

Date Sender Recipient
---- ------ ---------
7/27/2011 3:40:00 PM steve.jobes@place.com jmarcus@someplace.com
7/27/2011 3:40:00 PM steve.jobes@place.com ronsmith@someplace.com
7/27/2011 3:40:00 PM steve.jobes@place.com pgonzalez@someplace.com
6/17/2011 3:19:00 PM ssummers@someplace.com kevin.smart@provider.com

The approach is very similar to our original, the notable difference is the use of our custom object $r. To create this basic object we pipe nothing ("") into the Select-Object cmdlet (alias select) and select our new property names. This gives us our object with the properties we need. The shell of our object exists, but with no values.

Next, we use our same Get-Content cmdlet with our ForEach-Object loop. Instead of outputting the results, we set the relevant property in our object. In addition, the Date string is converted to a Date object so we could later use PowerShell's date comparisons and operators. Finally, we output the object.

Now, back to enjoying the groveling.

Tuesday, August 9, 2011

Episode #155: Copying Somebody Else's Work

Hal finds more gold in the mailbag

Just last Episode I was saying how much I like getting to critique other people's command lines, and lo and behold Philipp-- one of our intrepid readers-- sends me this little bit of fu to pick on:

In our company we were just recently talking about finding files according to the user that owns the files and copying/backing them up with the same structure of subdirectories to another directory. We as Linux guys came up with a solution pretty soon:

find . -user myuser -exec cp -a \{\} /path/to/directory \;

I'm not going to pick on this solution too much, since it solves Philipp's problem, but I will note a couple of issues here:

  1. As find traverses the directory structure, it's going to call "cp -a" on each file and directory. That means a lot of re-copying of the same files and directories over and over again as find descends through various levels in the directory tree.

  2. It sounds like Philipp only wants to copy files owned by a particular user. But the above solution will also copy files owned by other users if they live under a directory that's owned by the target user

Essentially Philipp's task is to find all files and directories owned by a particular user and replicate that structure in some other directory. And when I hear a task that's "find stuff that matches some set of criteria and copy it someplace else" I think of my little friend cpio:

find . -user myuser -depth | cpio -pd /path/to/directory

This will copy only the files owned by the given user with no extra copying, and the "-d" option to cpio will create directories as needed. So this seems like the most correct, straightforward approach to Philipp's conundrum.

At least for Unix folks, that is. I'll note that Philipp went on to "throw down the gauntlet" at the Windows half of our little team:

But the Windows guys got screwed a bit... So now I wanted to ask you if you know a [Windows] solution and if you want to hare it with me and/or the rest of the world in the blog.

How about it, Tim?

Tim is an original

Sorry to disappoint Hal, but this ain't too hard (even though it may be a bit more verbose).

PS C:\> Get-ChildItem -Recurse | Where-Object { (Get-Acl -Path $_).Owner -eq "mydomain\myuser" } |

Copy-Item -Destination "\SomeDir" -Recurse

We use a recursive directory listing and pipe it into our filter. In the filter the Owner property of the output from the Get-Acl cmdlet is compared against our target user. Any objects (files or directories) that match will be passed down the pipeline. From there the Copy-Item cmdlet does the heavy lifting; it accepts the input object and recursively copys it to the destination.

It should be noted that the same problems explained by Hal occur here as well. I would explain it again here, but I'm not a copy cat.

And for an additional trick, here is the same cmdlet, but shortened.

PS C:\> ls -r | ? { (Get-Acl -Path $_).Owner -eq "mydomain\myuser" } | cp -dest "\SomeDir" -r

So...how `bout that, Hal?

Tuesday, August 2, 2011

Episode #154: Line up alphabetically according to your size

Tim has been out and about

Hal and I have been busy the past weeks with SANS FIRE and then recuperating from said event. Oddly enough, that is the first time I ever met Hal. I would say something about how I hope it is the last, but I hear he reads this blog and I don't want to insult him publicly.

While we were away, one of our fantastic readers (at least I think he is fantastic) wrote in:

I've been reading the column for a while and when my boss asked me how to list all the directories in a path by size on a Linux system, I strung a bunch of stuff together quickly and thought I'd send it in to see what you thought:

$ SEARCHPATH=/home/username/; find $SEARCHPATH -type d -print0 |

xargs -0 du -s 2> /dev/null | sort -nr | sed 's|^.*'$SEARCHPATH'|'$SEARCHPATH'|' |
xargs du -sh 2> /dev/null

I'm sure you don't need an explanation but this finds all the directories in the given path, gets the size of each, sorts them numerically (largest first) and then removes the size from the front and prints the sizes again in a nice, human readable format.

Keep up the good work

Thank you! It is always great to hear from the readers, and we are always looking for new ideas that we can attempt in Windows (PowerShell and possibly cmd.exe) and in *nix-land. Keep sending ideas. On to the show...

The first portion of our command needs to gets the directories and their size. I wish I could say this command is simple in Windows, but it isn't. To get the size of a directory we need to sum the size (File Length property) of every object underneath the directory. Here is how we get the size of one directory:

PS C:\> Get-ChildItem -Recurse C:\Users\tim | Measure-Object -property Length -Sum

Count : 195
Average :
Sum : 4126436463
Maximum :
Minimum :
Property : Length

This command simple takes a recursive directory listing and sums the Lengths the objects. As files are the only objects with non-null Lengths, we get the combined size of all the files.

Take note, this command will take a while on directories with lots of files. When I tested it on the Windows directory it took nearly a minute. Also, the output isn't pretty. Unfortunately, displaying the size (4126436463) in human readable form is not super easy, but we'll come back to that later. First, let's display the directory name its Size.

PS C:\> Get-ChildItem C:\Users\tim | Where-Object { $_.PSIsContainer } | Select-Object FullName,

@{Name="Size";Expression={(Get-ChildItem -Recurse $_ | Measure-Object -property Length -Sum).Sum }}

FullName Size
-------- ----
C:\Users\tm\Desktop 330888989
C:\Users\tm\Documents 11407805
C:\Users\tm\Downloads 987225654

It works, but we would ideally like to keep the other properties of the directories objects, as that is the PowerShell way. To do this we use the Add-Member cmdlet, which we discuss in Episode #87. By adding a property to an existing object we can later use the properties further down the pipeline. We don't need the other objects down the pipeline for this example, but humor me. Here is what the full command using Add-Member looks like:

PS C:\> Get-ChildItem C:\Users\tim | Where-Object { $_.PSIsContainer } | ForEach-Object {

Add-Member -InputObject $_ -MemberType NoteProperty -PassThru -Name Length
-Value (Get-ChildItem -Recurse $_ | Measure-Object -property Length -Sum).Sum }

Directory: C:\Users\tm

Mode LastWriteTime Length Name
---- ------------- ------ ----
d-r-- 7/29/2011 2:50 PM 330889063 Desktop
d-r-- 7/25/2011 10:29 PM 11407805 Documents
d-r-- 7/29/2011 10:32 AM 987225654 Downloads

To sort, it is as simple as piping the previous command into Sort-Object (alias sort). Here is the shortened version of the command using aliases and shortened parameter names.

PS C:\> ls ~ | ? { $_.PSIsContainer } | % {

Add-Member -In $_ -N Length -Val (ls -r $_ | measure -p Length -Sum).Sum -MemberType NoteProperty -PassThru } |
sort -Property Length -Desc

Directory: C:\Users\tm

Mode LastWriteTime Length Name
---- ------------- ------ ----
d-r-- 7/29/2011 10:32 AM 987225654 Downloads
d-r-- 7/29/2011 2:50 PM 330889744 Desktop
d-r-- 7/25/2011 10:29 PM 11407805 Documents

The original *nix version of the command had to do some gymnastics to prepend the size, sort, remove the size, then add the human readable size to the end of each line. We don't have to worry about the back flips of moving the size around because we have objects and not just text. However, PowerShell does not easily do the human readable format (i.e. 10.4KB, 830MB, 4.2GB), but we can do something similar to Episode #79.

We can use Select-Object to display the Length property in different formats:

 PS C:\> <Previous Long Command> | format-table -auto Mode, LastWriteTime, Length,

@{Name="KB"; Expression={"{0:N2}" -f ($_.Length/1KB) + "KB" }},
@{Name="MB"; Expression={"{0:N2}" -f ($_.Length/1MB) + "MB" }},
@{Name="GB"; Expression={"{0:N2}" -f ($_.Length/1GB) + "GB" }},

Mode LastWriteTime Length KB MB GB Name
---- ------------- ------ -- -- -- ----
d-r-- 7/29/2011 10:32:57 AM 987225654 964,087.55KB 941.49MB 0.92GB Downloads
d-r-- 7/29/2011 2:50:38 PM 330890515 323,135.27KB 315.56MB 0.31GB Desktop
d-r-- 7/25/2011 10:29:53 PM 11407805 11,140.43KB 10.88MB 0.01GB Documents

We could add a few nested If Statements to pick between the KB, MB, and GB, but that is a script, and that's illegal here.

Let's see if Hal is more human readable.

Edit: Marc van Orsouw wrote in with another, shorter options using the filesystemobject and using the switch statement to display the size

PS C:\> (New-Object -ComObject scripting.filesystemobject).GetFolder('c:\mowtemp').SubFolders | 

sort size | ft name ,{switch ($_.size) {{$_.size -lt 1mb} {"{0:N2}" -f ($_.Size/1KB) + "KB" };
{$_.size -gt 1gb} {"{0:N2}" -f ($_.Size/1GB) + "GB" };default {"{0:N2}" -f ($_.Size/1MB) + "MB" }}}

Hal is about out

All I know is that the first night of SANSFIRE I had dinner with somebody who claimed to be Tim, but then I didn't see him for the rest of the week. What's the matter Tim? Did you only have enough money to hire that actor for one night?

The thing I found interesting about this week's challenge is that it clearly demonstrates the trade-off between programmer efficiency and program efficiency. There's no question that running du on the same directories twice is inefficient. But it accomplishes the mission with the minimum amount of programmer effort (unlike, say, Tim's Powershell solution-- holy moley, Tim!). This is often the right trade-off: if you were really worried about the answer coming back as quickly as possible, you probably wouldn't have tackled the problem with the bash command line in the first place.

But now I get to come along behind our illustrious reader and critique his command line. That'll make a nice change from having my humble efforts picked apart by the rest of you reading this blog (yes, I'm looking at you, Haemer!).

If you look at our reader's submission, everything before the "sort -nr" is designed to get a list of directories and their total size. But in fact our reader is just re-implementing the default behavior of du using find, xargs, and "du -s". "du $SEARCHPATH | sort -nr" will accomplish the exact same thing with much less effort.

In the second half of the pipeline, we take the directory names (now sorted by size) and strip off the sizes so we can push the directory list through "du -sh" to get human-readable sizes instead of byte counts. What I found interesting was that our reader was careful to use "find ... -print0 | xargs -0 ..." in the first part of the pipeline, but then apparently gives up on protecting against whitespace in the pathnames later in the command line.

But protecting against whitespace is probably a good idea, so let's change up the latter part of the command-line as well:

$ du testing | sort -nr | sed 's/^[0-9]*\t//' | tr \\n \\000 | xargs -0 du -sh

176M testing
83M testing/base64
46M testing/coreutils-8.7
24M testing/coreutils-8.7/po
8.1M testing/refpolicy
7.9M testing/webscarab
7.5M testing/ejabberd-2.1.2
6.2M testing/selenium
6.0M testing/refpolicy/policy
5.9M testing/refpolicy/policy/modules

I was able to simplify the sed expression by simply matching "some digits at the beginning of each line followed by a tab" ("^[0-9]*\t") and just throwing that stuff away by replacing it with the empty string. Then I use tr to convert the newline to a null so that we can use the now null-terminated path names as input to "xargs -0 ...".

So, yeah, I just ran du twice on every directory. But I accomplished the task with the minimum amount of effort on my part. And that's really what's important, isn't it?