Tuesday, August 2, 2011

Episode #154: Line up alphabetically according to your size

Tim has been out and about

Hal and I have been busy the past few weeks with SANS FIRE and then recuperating from said event. Oddly enough, that was the first time I had ever met Hal. I would say something about how I hope it is the last, but I hear he reads this blog and I don't want to insult him publicly.

While we were away, one of our fantastic readers (at least I think he is fantastic) wrote in:


I've been reading the column for a while and when my boss asked me how to list all the directories in a path by size on a Linux system, I strung a bunch of stuff together quickly and thought I'd send it in to see what you thought:

$ SEARCHPATH=/home/username/; find $SEARCHPATH -type d -print0 |
xargs -0 du -s 2> /dev/null | sort -nr | sed 's|^.*'$SEARCHPATH'|'$SEARCHPATH'|' |
xargs du -sh 2> /dev/null


I'm sure you don't need an explanation but this finds all the directories in the given path, gets the size of each, sorts them numerically (largest first) and then removes the size from the front and prints the sizes again in a nice, human readable format.

Keep up the good work


Thank you! It is always great to hear from the readers, and we are always looking for new ideas that we can attempt in Windows (PowerShell and possibly cmd.exe) and in *nix-land. Keep sending ideas. On to the show...

The first portion of our command needs to get the directories and their sizes. I wish I could say this command is simple in Windows, but it isn't. To get the size of a directory we need to sum the size (the File Length property) of every object underneath the directory. Here is how we get the size of one directory:

PS C:\> Get-ChildItem -Recurse C:\Users\tim | Measure-Object -property Length -Sum


Count    : 195
Average  :
Sum      : 4126436463
Maximum  :
Minimum  :
Property : Length


This command simply takes a recursive directory listing and sums the Lengths of the objects. As files are the only objects with non-null Lengths, we get the combined size of all the files.
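
By the way, if you want to see just how long that recursive listing and sum takes on your own system, the built-in Measure-Command cmdlet will time any script block for you (the Windows directory here is only an example):

PS C:\> Measure-Command { Get-ChildItem -Recurse C:\Windows | Measure-Object -Property Length -Sum }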

Take note, this command will take a while on directories with lots of files. When I tested it on the Windows directory it took nearly a minute. Also, the output isn't pretty. Unfortunately, displaying the size (4126436463) in human-readable form is not super easy, but we'll come back to that later. First, let's display the directory name and its Size.

PS C:\> Get-ChildItem C:\Users\tim | Where-Object { $_.PSIsContainer } | Select-Object FullName,
@{Name="Size";Expression={(Get-ChildItem -Recurse $_ | Measure-Object -property Length -Sum).Sum }}


FullName                   Size
--------                   ----
C:\Users\tm\Desktop   330888989
C:\Users\tm\Documents  11407805
C:\Users\tm\Downloads 987225654
...


It works, but ideally we would like to keep the other properties of the directory objects, as that is the PowerShell way. To do this we use the Add-Member cmdlet, which we discussed in Episode #87. By adding a property to an existing object we can use that property further down the pipeline. We don't actually need the extra properties further down the pipeline for this example, but humor me. Here is what the full command using Add-Member looks like:

PS C:\> Get-ChildItem C:\Users\tim | Where-Object { $_.PSIsContainer } | ForEach-Object {
Add-Member -InputObject $_ -MemberType NoteProperty -PassThru -Name Length
-Value (Get-ChildItem -Recurse $_ | Measure-Object -property Length -Sum).Sum }


Directory: C:\Users\tm

Mode       LastWriteTime    Length Name
----       -------------    ------ ----
d-r-- 7/29/2011  2:50 PM 330889063 Desktop
d-r-- 7/25/2011 10:29 PM  11407805 Documents
d-r-- 7/29/2011 10:32 AM 987225654 Downloads
...
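
Since Length is now a real property hanging off each directory object, anything later in the pipeline can use it. Purely as an illustration (this isn't part of the final command, and <Previous Command> is just a placeholder for the Add-Member pipeline above), you could keep only the directories over 100MB:

PS C:\> <Previous Command> | Where-Object { $_.Length -gt 100MB }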


Sorting is as simple as piping the Add-Member command above into Sort-Object (alias sort). Here is the shortened version of the command, using aliases and shortened parameter names.

PS C:\> ls ~ | ? { $_.PSIsContainer } | % {
Add-Member -In $_ -N Length -Val (ls -r $_ | measure -p Length -Sum).Sum -MemberType NoteProperty -PassThru } |
sort -Property Length -Desc


Directory: C:\Users\tm

Mode       LastWriteTime    Length Name
----       -------------    ------ ----
d-r-- 7/29/2011 10:32 AM 987225654 Downloads
d-r-- 7/29/2011  2:50 PM 330889744 Desktop
d-r-- 7/25/2011 10:29 PM  11407805 Documents
...


The original *nix version of the command had to do some gymnastics to prepend the size, sort, strip the size back off, and then add the human-readable size to the front of each line. We don't have to worry about the back flips of moving the size around, because we have objects and not just text. However, PowerShell does not easily produce the human-readable format (e.g. 10.4KB, 830MB, 4.2GB), but we can do something similar to what we did in Episode #79.

We can use Format-Table with a few calculated properties to display the Length in different formats:

PS C:\> <Previous Long Command> | format-table -auto Mode, LastWriteTime, Length,
@{Name="KB"; Expression={"{0:N2}" -f ($_.Length/1KB) + "KB" }},
@{Name="MB"; Expression={"{0:N2}" -f ($_.Length/1MB) + "MB" }},
@{Name="GB"; Expression={"{0:N2}" -f ($_.Length/1GB) + "GB" }},
Name


Mode  LastWriteTime            Length KB           MB       GB     Name
----  -------------            ------ --           --       --     ----
d-r-- 7/29/2011 10:32:57 AM 987225654 964,087.55KB 941.49MB 0.92GB Downloads
d-r-- 7/29/2011 2:50:38 PM  330890515 323,135.27KB 315.56MB 0.31GB Desktop
d-r-- 7/25/2011 10:29:53 PM  11407805 11,140.43KB  10.88MB  0.01GB Documents
...


We could add a few nested If Statements to pick between the KB, MB, and GB, but that is a script, and that's illegal here.
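
That said, if you are willing to risk a citation from the script police, a single calculated property with a chained If would do the picking for us. This is strictly a sketch of the idea, not part of the original one-liner:

PS C:\> <Previous Long Command> | format-table -auto Mode, LastWriteTime, Name,
@{Name="Size"; Expression={ if ($_.Length -ge 1GB) { "{0:N2}" -f ($_.Length/1GB) + "GB" }
elseif ($_.Length -ge 1MB) { "{0:N2}" -f ($_.Length/1MB) + "MB" }
else { "{0:N2}" -f ($_.Length/1KB) + "KB" } }}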

Let's see if Hal is more human readable.

Edit: Marc van Orsouw wrote in with another, shorter option that uses the Scripting.FileSystemObject COM object and a switch statement to display the size:

PS C:\> (New-Object -ComObject scripting.filesystemobject).GetFolder('c:\mowtemp').SubFolders |
sort size | ft name, {switch ($_.size) {{$_.size -lt 1mb} {"{0:N2}" -f ($_.Size/1KB) + "KB" };
{$_.size -gt 1gb} {"{0:N2}" -f ($_.Size/1GB) + "GB" };default {"{0:N2}" -f ($_.Size/1MB) + "MB" }}}


Hal is about out

All I know is that the first night of SANSFIRE I had dinner with somebody who claimed to be Tim, but then I didn't see him for the rest of the week. What's the matter Tim? Did you only have enough money to hire that actor for one night?

The thing I found interesting about this week's challenge is that it clearly demonstrates the trade-off between programmer efficiency and program efficiency. There's no question that running du on the same directories twice is inefficient. But it accomplishes the mission with the minimum amount of programmer effort (unlike, say, Tim's PowerShell solution-- holy moley, Tim!). This is often the right trade-off: if you were really worried about the answer coming back as quickly as possible, you probably wouldn't have tackled the problem with the bash command line in the first place.

But now I get to come along behind our illustrious reader and critique his command line. That'll make a nice change from having my humble efforts picked apart by the rest of you reading this blog (yes, I'm looking at you, Haemer!).

If you look at our reader's submission, everything before the "sort -nr" is designed to get a list of directories and their total size. But in fact our reader is just re-implementing the default behavior of du using find, xargs, and "du -s". "du $SEARCHPATH | sort -nr" will accomplish the exact same thing with much less effort.

In the second half of the pipeline, we take the directory names (now sorted by size) and strip off the sizes so we can push the directory list through "du -sh" to get human-readable sizes instead of byte counts. What I found interesting was that our reader was careful to use "find ... -print0 | xargs -0 ..." in the first part of the pipeline, but then apparently gave up on protecting against whitespace in the pathnames later in the command line.

But protecting against whitespace is probably a good idea, so let's change up the latter part of the command-line as well:

$ du testing | sort -nr | sed 's/^[0-9]*\t//' | tr \\n \\000 | xargs -0 du -sh

176M testing
83M testing/base64
46M testing/coreutils-8.7
24M testing/coreutils-8.7/po
8.1M testing/refpolicy
7.9M testing/webscarab
7.5M testing/ejabberd-2.1.2
6.2M testing/selenium
6.0M testing/refpolicy/policy
5.9M testing/refpolicy/policy/modules
...

I was able to simplify the sed expression by simply matching "some digits at the beginning of each line followed by a tab" ("^[0-9]*\t") and just throwing that stuff away by replacing it with the empty string. Then I use tr to convert the newline to a null so that we can use the now null-terminated path names as input to "xargs -0 ...".

So, yeah, I just ran du twice on every directory. But I accomplished the task with the minimum amount of effort on my part. And that's really what's important, isn't it?