Tuesday, May 31, 2011

Episode #147: Lines o' Code

Hal asks, "Dude, Where's My Blog?"

Wow, the last couple of weeks have been a blur, huh? I can't remember much, but now I have a lifetime supply of pudding in the kitchen, a new track suit, and a strange tattoo on my back. Plus there are a couple of weird Swedish-sounding dudes hanging around.

But none of that shall deter us from getting back on our Command Line Kung Fu horse. Today's challenge is counting lines of code. Now I'm not a big fan of "lines of code" as a metric, but sometimes it's useful to know. However, I'm generally not interested in blank lines (which for me also includes lines that are entirely whitespace) and lines which only contain comments. So we'll want to filter those out before running the remaining lines through "wc -l".

sed is probably the easiest way to get rid of the stuff we don't want:

$ sed '/^[ \t]*$/d; /^[ \t]*#/d' .bashrc | wc -l
33

The first part of the sed expression-- "/^[ \t]*$/d"-- matches the beginning of line ("^") followed by zero or more instances of space or tab ("[ \t]*") followed by end of line ("$"). If we match, then the "d" after the regex means drop the current line without printing and move on to the next.

The second regex is similar. It matches any amount of whitespace from the start of the line ("^[ \t]*") followed by a comment character ("#"). Obviously, the comment character you match will vary by programming language, but here we're working with shell code ("#" also works fine for Perl and many other scripting languages).
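As a minimal sketch, the two delete expressions can be wrapped in a small shell function. The name count_code_lines and the sample file are made up for illustration, and the POSIX class [[:space:]] is swapped in for "[ \t]" on the assumption that some non-GNU versions of sed do not treat "\t" inside brackets as a tab.

```shell
# count_code_lines FILE: hypothetical helper, not part of the original post.
# Prints the number of lines in FILE that are neither blank nor comment-only.
count_code_lines() {
    # tr strips the padding that some versions of wc put in front of the count
    sed -e '/^[[:space:]]*$/d' -e '/^[[:space:]]*#/d' "$1" | wc -l | tr -d ' '
}

# Throwaway sample file: six lines, only two of which are real code.
sample=$(mktemp)
printf '# comment\n\n   \t\necho hi\n  # indented comment\nls -l  # trailing comment\n' > "$sample"
count_code_lines "$sample"
rm -f "$sample"
```

Note that a line of code with a trailing comment still counts, since only lines that start with a comment character (after optional whitespace) are dropped.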

But what if you had an entire directory structure of source code and you wanted to count the lines in all of the individual files? The problem with our command above is that the file name doesn't end up being included in the output. But we can dust off the technique we used in Episode #46:

$ find /usr/lib/perl5/5.8.8/ -name \*.pm | while read file; do 
echo -n "$file ";
sed '/^[ \t]*$/d; /^[ \t]*#/d' "$file" | wc -l;
done | awk '{t = t + $2; print $2 "\t" $1} END {print t "\tTOTAL"}'

...

1800 /usr/lib/perl5/5.8.8/Tie/File.pm
91 /usr/lib/perl5/5.8.8/Tie/Scalar.pm
206 /usr/lib/perl5/5.8.8/Tie/Array.pm
179 /usr/lib/perl5/5.8.8/Shell.pm
496 /usr/lib/perl5/5.8.8/diagnostics.pm
244 /usr/lib/perl5/5.8.8/AutoLoader.pm
66 /usr/lib/perl5/5.8.8/DirHandle.pm
118 /usr/lib/perl5/5.8.8/autouse.pm
129671 TOTAL

First we use find to locate all of the source code files we're interested in and pipe the list of file names into a while loop that reads them one by one. Within the loop we spit out the file name plus a space using "echo -n" to suppress the newline usually emitted by echo. That way, our "wc -l" command output ends up on the same line as the file name. Finally, just like in Episode #46 we pipe all the output into an awk expression that neatens up the file name and line count output (reversing the two fields and adding a tab) and even prints out a nice total at the end.
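One weakness of the loop above is that file names containing spaces throw off awk's $1 and $2. Here is a hedged sketch of a variant that prints the count first and separates the fields with a tab. The function name count_tree and its two arguments are hypothetical, and file names containing newlines would still confuse it.

```shell
# count_tree DIR PATTERN: hypothetical helper for illustration only.
# Prints "count<TAB>file" for each matching file, then a TOTAL line.
count_tree() {
    find "$1" -type f -name "$2" | while IFS= read -r file; do
        n=$(sed -e '/^[[:space:]]*$/d' -e '/^[[:space:]]*#/d' "$file" | wc -l | tr -d ' ')
        printf '%s\t%s\n' "$n" "$file"
    done | awk -F'\t' '{t += $1; print} END {printf "%d\tTOTAL\n", t}'
}

# Example invocation (path taken from the post; adjust for your system):
# count_tree /usr/lib/perl5/5.8.8 '*.pm'
```

Because the count is already the first field, awk no longer needs to swap the columns; it only keeps the running total.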

So, Tim, what have you got up your sleeve besides that "Sweet" new tattoo?

Tim's been stuck in a Chinese drive thru for the past few weeks. NO AND THEN!

Hal and his silly short commands. He needs to man up and go for the man-sized commands, like this:

PS C:\> Select-String -NotMatch -Pattern "^\s*(#|$)" -Path file.txt | Measure-Object -Line

Lines Words Characters Property
----- ----- ---------- --------
33


Of course, we can shorten it up a bit with aliases, shortened parameter names, and positional parameters:

PS C:\> Select-String -n "^\s*(#|$)" file.txt | measure -l


The command looks for lines that don't match our regular expression. The regular expression looks for the following, in order:
  • "^" beginning of line

  • "\s*" any number, including zero ("*"), of whitespace characters ("\s")

  • "(#|$)" Either the "#" character OR ("|") the end of line ("$")
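For comparison on the Unix side, the same single regular expression can be handed to grep. This is only a sketch: the helper name loc_grep is hypothetical, -v plays the role of -NotMatch, and -c does the counting that Measure-Object does above.

```shell
# loc_grep FILE: hypothetical helper name.
# -E enables the (#|$) alternation, -v inverts the match, -c counts lines.
loc_grep() {
    grep -Evc '^[[:space:]]*(#|$)' "$1"
}

# Throwaway sample file: five lines, two of which are real code.
sample=$(mktemp)
printf '#!/bin/sh\n\n  # setup\nx=1\n\techo "$x"\n' > "$sample"
loc_grep "$sample"
rm -f "$sample"
```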


We use Select-Object (alias select) to just output the line count.

PS C:\> Select-String -n "^\s*(#|$)" file.txt | measure -l | select lines

Lines
-----
33


The recursive directory listing is a bit more complicated, and there are a few ways to do it. The most PowerShell-y way to do it is by outputting objects, and we'll use Group-Object (alias group) to do it.

PS C:\> Get-ChildItem -Recurse -Include *.pm | Select-String -NotMatch -Pattern "^\s*(#|$)" | Group-Object -Property Path -NoElement

Count Name
----- ----
15 C:\perl\otherdir\i.pm
17 C:\perl\hate.pm
5 C:\perl\perl.pm


We start with a recursive directory listing where we only look for .pm files. The files are piped into the Select-String cmdlet, where we pick out the lines we want using the aforementioned regular expression. The results are piped into Group-Object to give us the number of selected lines per file.

The entire command can be shortened using the regular techniques.

PS C:\> ls -r -i *.pm | Select-String -n "^\s*(#|$)" | group Path -n


Unfortunately, there isn't a nice way to get the total line count. We can do it using a variable to store our results. I'll keep it as multiple lines for easier reading.

PS C:\> $a = ls -r -i *.pm | Select-String -n "^\s*(#|$)" | group Path -n
PS C:\> $a

Count Name
----- ----
15 C:\perl\otherdir\i.pm
17 C:\perl\hate.pm
5 C:\perl\perl.pm

PS C:\> $a | Measure-Object -Sum -Property Count | Select-Object sum

Sum
---
37


"Dude!" What does mine say?

Tuesday, May 10, 2011

Episode #146: Hard CIDR

Tim is the mailman:

Most of our episode ideas come from the mailbag, and we can never have too many. Usually we have too few, so keep sending in ideas. This episode was inspired by the modern art masterpiece sent in by Matt Graeber. His "piece" reads a file, expands a list of CIDR formatted IP addresses, and does a reverse DNS lookup against each IP address.

C:\> cmd.exe /v:on /c "FOR /F "tokens=1-5 delims=./" %G IN (ips.txt) DO
@(set network=%G.%H.%I.& set /a fourth="%J+1" 1^>nul & set mask=%K&
set /a end="(2 ^<^< (32-!mask!-1))-2+!fourth!" 1>nul &
echo !network!!fourth!-!end!& echo ****************** &
FOR /L %O IN (!fourth!,1,!end!) DO @(for /F "tokens=2" %P in
('nslookup !network!%O server 2^>nul ^| find "Name:"') do
@echo !network!%O %P & ping -n 3 127.0.0.1>nul))" 2>nul


That is some serious cmd madness. Since it is such beautiful art, we will leave it to the reader to behold and interpret for himself, but we will go over a simplified version.

It should be noted that all the commands presented today only work with 24+ bit masks. Using built-in tools to expand the CIDR format for a mask less than 24 is significantly more work because we have to manipulate the third (and possibly second or first) octet(s) as well. We'll use the file below as our sample input file:

192.168.1.32/27
10.10.10.128/26
172.16.0.0/24


We'll simplify Matt's command so it only expands and outputs the IP addresses. From there, you can replace the echo command with the command of your choice. This CIDR expansion is very handy for wrapping commands that only take a single IP as input.

The simplified version of the CIDR expander is:

C:\> cmd.exe /v:on /c "FOR /F "tokens=1-5 delims=./" %G IN (ips.txt) DO
@(set network=%G.%H.%I.& set /a fourth="%J" 1^>nul & set mask=%K&
set /a end="(2 ^<^< (32-!mask!-1))-1+!fourth!" 1>nul &
FOR /L %O IN (!fourth!,1,!end!) DO @echo !network!%O)" 2>nul


192.168.1.32
192.168.1.33
...
192.168.1.62
192.168.1.63
10.10.10.128
10.10.10.129
...
10.10.10.190
10.10.10.191
172.16.0.0
172.16.0.1
...
172.16.0.254
172.16.0.255


The command begins by enabling delayed variable expansion (explained at the end of Episode #46). The For Loop then reads our file and splits the text on each period and forward slash, so we have variables that hold each octet (%G-%J) as well as the number of bits in the subnet mask (%K). Next, a bit of cmd math is done to calculate the last IP address in the range. Finally, another For loop is used to output all of the IP addresses. Of course, we could do something else with the IP addresses instead of just displaying them, like ping or nslookup.

The math used to calculate the upper bound of our network uses a Logical Shift Left (<<). This operator shifts each bit to the left by the given number of positions. So 2 (00000010) left shifted by 1 is 4 (00000100), and 2 left shifted by 2 is 8 (00001000). Since 2 left shifted by n-1 equals 2^n, the shift stands in for the exponent operator that cmd.exe lacks.
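The same shift trick can be checked in any shell with integer arithmetic. A small sketch follows; range_size is a hypothetical helper name, and like everything else in this episode it assumes masks of 24 bits or more.

```shell
# range_size MASK: hypothetical helper.
# 2 << (n - 1) equals 2 to the power n, so this prints the number of
# addresses covered by a /MASK network.
range_size() {
    echo $(( 2 << (32 - $1 - 1) ))
}

for mask in 24 26 27; do
    echo "/$mask covers $(range_size "$mask") addresses"
done
```

For the sample file, those sizes line up with the 32, 64, and 256 addresses in the output above.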

PowerShell:

The PowerShell command to expand the address ranges is:

PS C:\> gc ips.txt | % {
$a=$_.split('./');
$a[3]..($a[3]-1+[Math]::Pow(2,32-$a[4])) | % {
echo ($a[0],$a[1],$a[2],$_ -join ".")
}
}


We start by reading our file using the Get-Content cmdlet (aliases gc and cat). Each line is piped into a ForEach-Object loop to be parsed and manipulated.

Inside our loop we split each string into an array of substrings, using period and forward slash as delimiters. We then end up with $a[0] through $a[3] holding the first through fourth octets, and $a[4] holding the mask.

The range operator (..) is used to generate a list of numbers between our lower bound (last octet) and the upper bound (end of network range). Again, we use a little math to determine the size of the range. In short, the size of the range is 2^(32-masklen). There isn't a native exponent operator, so we use the built-in .NET Math class to do our power function.

The output of the range is fed into another ForEach-Object cmdlet. Inside this cmdlet's scriptblock is where the IP address is reassembled and output.

Now that we have the output, we can easily feed it to another command. For example, say we wanted to ping each of the IP addresses:

 PS C:\> ... | % { ping $_ }


Now off to go find some hard cider to make this pain go away...

Hal is the eggman

Tim and I usually work up our pieces of the blog independently and then jam them together using a few insults as connective tissue. It's interesting that in this case my thought process developed along a parallel course with Tim's transition from CMD.EXE to PowerShell.

My initial attempt looked a lot like Matt and Tim's CMD.EXE solutions:

while read line; do 
exp=${line/*\//};
base=${line/\/*/};
lastoct=${base/*./};
net=${base/%.$lastoct/};
for ((i=0; $i < $(( 2 ** (32-$exp) )); i++)); do
oct=$(($lastoct + $i));
echo $net.$oct;
done;
done <input

The outer while loop reads the input file line-by-line. Then I exploit the bash variable substitution operator ("${var/pattern/sub}") to pull the address and exponent apart, and to strip the last octet from the address. After that it's just a for loop to create the list of addresses. That little "echo" statement in the middle of the inner loop is the whole purpose of the exercise-- the rest is just grisly parsing and setup.

So I wasn't really happy with my solution. But then it occurred to me that I don't have to do the parsing. The shell will do it for me!

IFS=./
while read o1 o2 o3 o4 exp; do
for ((i=o4; $i < $(( $o4 + 2 ** (32-$exp) )); i++)); do
echo $o1.$o2.$o3.$i;
done;
done <input

The trick is to set IFS so that the shell automatically tokenizes each line as we "read" it in the while loop (essentially we're accomplishing the same thing that Tim does with his PowerShell split() method). If IFS is set to "./" then each octet and the exponent is broken out into a different variable and I don't have to do any parsing myself. This is a very useful technique when you're dealing with any "strongly delimited" input-- like /etc/passwd, /etc/shadow, and so on.
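A cautious variation on this, sketched below, puts the IFS assignment on the read command itself so that the changed IFS applies only to that read rather than to the rest of the shell session. The function name expand_cidr is hypothetical, and the sketch keeps the episode's assumption of masks of 24 bits or more.

```shell
# expand_cidr: hypothetical helper; reads a.b.c.d/mask lines on stdin.
expand_cidr() {
    # IFS=./ is scoped to this one read, so the caller's IFS is untouched.
    while IFS=./ read -r o1 o2 o3 o4 mask; do
        [ -n "$mask" ] || continue          # skip blank or malformed lines
        i=$o4
        end=$(( o4 + (1 << (32 - mask)) ))  # one past the last address
        while [ "$i" -lt "$end" ]; do
            echo "$o1.$o2.$o3.$i"
            i=$(( i + 1 ))
        done
    done
}

# A /30 covers four addresses, starting at the given last octet.
printf '10.10.10.128/30\n' | expand_cidr
```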

You'll also notice that I cleaned up my loop control a little bit so that I save myself a "$((...))" operation inside the inner loop. Rather than having the loop control variable run from 0..2**(32-exp) and then adding the value to the last octet inside the loop, it's easier to start the loop with the last octet value and increment from there.

By the way, some of you may be wondering why I didn't use the built-in range operator in bash and make my for loop be something like:

for i in {$o4..$(($o4 + 2**(32-$exp) - 1))} ...

I actually tried this approach first. But it turns out that bash only lets you use literal values-- integers or characters-- in a "{a..z}" operation, because brace expansion happens before variables and arithmetic are expanded. This actually strikes me as something of a bug.

In any event, that's why I ended up opting for the C-style for loop instead. I could have used something like "$( seq $o4 $(($o4 + 2**(32-$exp) - 1)) )", but why call an external program when you don't have to?
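A quick way to see the behavior described above, sketched for bash specifically (the variable names lo and hi are arbitrary):

```shell
#!/usr/bin/env bash
lo=3; hi=5

# Brace expansion runs before $lo and $hi are substituted, so bash finds no
# valid {x..y} sequence and passes the braces through untouched.
echo {$lo..$hi}        # prints {3..5}

# With literal endpoints the range expands as expected.
echo {3..5}            # prints 3 4 5

# The C-style loop accepts variables and needs no external program.
for ((i = lo; i <= hi; i++)); do
    printf '%s ' "$i"
done
echo                   # finishes the line started by the loop: 3 4 5
```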

So goo goo g'joob and have a great week everybody!

Tuesday, May 3, 2011

Episode #145: A Date to Copy

Tim checks the mail (again):

One of our readers writes in asking how to generate "a list of locations of doc files which are created or modified after month and year 01 /2011." He then wants to copy all those files to another directory. He specifically stated he wanted it to work on XP and without PowerShell.

Well, let me get on my soap box for a second. PowerShell v1 and v2 are available for XP. I can't recommend enough that people install PowerShell. It will make your Windows administration and command line experience easier and much less painful. Case in point, it took me a decent amount of time to write the cmd.exe portion of this episode, which includes all the trial and error. However, the PowerShell portion is so straightforward I wrote that portion off the top of my head. So if you are on a Windows box without PowerShell, go install it before continuing. Ok, enough ranting, back to our regularly scheduled programming...

One nice thing is that we are only looking for files based on year, so that makes our time parsing much easier in cmd.exe. If we had to parse based on month and year it would be exponentially harder with this antiquated shell.

We can use a number of the techniques taken from Episode #29. Specifically, the For loop with the ~f (full file path). We can use ~t with our variable to get the modified timestamp, but there isn't a modifier to give us the creation time, so we'll have to take a slightly different approach.

First, we start off with a For loop that will find all the .doc files on the system.

C:\> for /r %i in (*.doc) do @echo %~fi

C:\Doc\file1.doc
C:\Doc\file2.doc
...


We have all the .doc files, so now we take a giant leap to our final command:

C:\> for /r %i in (*.doc) do (@dir /tw "%~fi" | find "/2011" > NUL ||
@dir /tc "%~fi" | find "/2011" > NUL) && copy "%~fi" C:\DocsFrom2011


We use two Dir commands with different time (/t) options. The first, /tw, displays the Last Write Time. The second, /tc, displays the Creation Time. Each of these Dir commands is piped into a Find command that looks for the year portion of the date "/2011". All that is wrapped in a bit of magic logic that is used to determine if the file should be copied or not. The magic explained...

We use a Logical OR (||) and a Logical And (&&) to do the magic. The cmd.exe shell uses short circuit logical operators, but what does that mean? In short, if we wanted to evaluate A && B the second expression (B) is only evaluated when the result is not fully determined by A. For example, given A && B, when A is false there is no reason to evaluate B since the Logical And will then be false regardless of the value of B. Likewise, given A || B, when A is true there is no reason to evaluate B. Enough of the Math/Logic lesson, back to our command...

The logic in our command reduces to: (Last Write Time Matches OR Creation Time Matches) AND copy. Given our short circuit logical operators, if neither the Creation nor the Last Write Time matches, we don't do the copy. But if either matches, copy.
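The same (A OR B) AND C pattern can be tried out in a Unix shell, whose || and && short circuit in the same way. In this sketch the function names and the sample date strings are invented for illustration, and grep -q stands in for find with its output sent to NUL.

```shell
# has_2011 STRING: hypothetical helper; succeeds if STRING contains "/2011".
has_2011() {
    printf '%s\n' "$1" | grep -q '/2011'
}

# should_copy WRITE_DATE CREATE_DATE: hypothetical helper.
# The second has_2011 runs only if the first fails; echo runs only if
# one of the two checks succeeded.
should_copy() {
    { has_2011 "$1" || has_2011 "$2"; } && echo "copy"
}

should_copy '03/15/2011' '12/01/2010'                  # write time matches
should_copy '06/30/2009' '02/02/2011'                  # creation time matches
should_copy '06/30/2009' '12/01/2010' || echo "skip"   # neither matches
```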

Now on to the much easier PowerShell.

PowerShell

Not only is the PowerShell version much easier, it is also much more robust, as it can actually compare dates. That means we could just as easily look for files created after any arbitrary date, not just January 1.

PS C:\> Get-ChildItem -Recurse -Include *.doc | Where-Object {
$_.LastWriteTime -ge "1/1/2011" -or $_.CreationTime -ge "1/1/2011" } |
Copy-Item -Destination C:\DocsFrom2011


The command can be shortened using built in aliases and shortened parameter names:

PS C:\> ls -r -i *.doc | ? { $_.LastWriteTime -ge "1/1/2011" -or $_.CreationTime -ge "1/1/2011" } |
cp -d C:\DocsFrom2011


This command does a recursive directory listing and pipes the results into a Where-Object filter. The filter only passes objects where the relevant timestamps are greater than, or equal to, Jan 1, 2011. All the objects that make it through the filter are sent to Copy-Item to be copied to our target folder.

The PowerShell version is included, even though this portion wasn't requested by our reader. I guess that makes me only half wanted. Since Hal's shell isn't supported on XP I guess he isn't wanted either, but here he is anyway.

Hal checks in

Not wanted? Don't go projecting your insecurities onto me now. Just repeat this daily affirmation, "CMD.EXE is good enough, and gosh darn it people just love XP!" Besides, you can always install Cygwin to help you over the rough spots.

This one really is similar to Episode #29-- we just have to do something with the file names once we pick them out. For that I think I'll bust out the cpio magic as I did back in Episode #115:

# touch -t 201101010000 /tmp/marker
# cd /some/source/dir
# find . -depth \( -name \*.doc -o -name \*.docx \) -newer /tmp/marker |
cpio -pd /path/to/target/dir

5854 blocks

First I use touch to create a file with the earliest date for files which we're interested in. I use this with "find ... -newer" to find all files that have been modified since this timestamp. Unfortunately traditional Unix file systems still don't track creation time on files (unless you've upgraded to EXT4), so last modified time is all we have to go on.
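To check what the find half of the pipeline selects before handing anything to cpio, a throwaway test like the following may help. The function name newer_docs and the file names are hypothetical; only touch -t and find with -newer come from the commands above.

```shell
# newer_docs DIR MARKER: hypothetical helper.
# Lists .doc/.docx files under DIR modified more recently than MARKER.
newer_docs() {
    ( cd "$1" && find . -depth \( -name '*.doc' -o -name '*.docx' \) -newer "$2" )
}

work=$(mktemp -d)
touch -t 201101010000 "$work/marker"
touch -t 201012310000 "$work/old.doc"     # last modified before the marker
touch -t 201102010000 "$work/new.docx"    # last modified after the marker
newer_docs "$work" "$work/marker"
rm -rf "$work"
```

Keep in mind that -newer is a strict comparison, so a file stamped at exactly the marker's time is not selected.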

Our loyal reader specified that they were only interested in doc files, but I decided to make my life harder by looking for both the old style *.doc files and the newer *.docx format. Notice that the Windows guy in our little CommandLineKungFu partnership didn't think to look for both file extensions.

Notice that I'm also using the "-depth" option with find to work around any possible issues with read-only directory permissions when cpio is creating the parallel directory structure in the target directory. See the discussion in Episode #115 for more details.

So there you go, a solution to a new problem created by putting two earlier solutions together. And that's pretty much the Unix command-line religion anyway. And that's why Unix is cool. And, gosh darn it, people like it!