Tuesday, May 31, 2011

Episode #147: Lines o' Code

Hal asks, "Dude, Where's My Blog?"

Wow, the last couple of weeks have been a blur, huh? I can't remember much, but now I have a lifetime supply of pudding in the kitchen, a new track suit, and a strange tattoo on my back. Plus there are a couple of weird Swedish sounding dudes hanging around.

But none of that shall deter us from getting back on our Command Line Kung Fu horse. Today's challenge is counting lines of code. Now I'm not a big fan of "lines of code" as a metric, but sometimes it's useful to know. However, I'm generally not interested in blank lines (which for me also includes lines that are entirely whitespace) and lines which only contain comments. So we'll want to filter those out before running the remaining lines through "wc -l".

sed is probably the easiest way to get rid of the stuff we don't want:

$ sed '/^[ \t]*$/d; /^[ \t]*#/d' .bashrc | wc -l

The first part of the sed expression-- "/^[ \t]*$/d"-- matches the beginning of line ("^") followed by zero or more instances of space or tab ("[ \t]*") followed by end of line ("$"). If we match, then the "d" after the regex means drop the current line without printing and move on to the next.

The second regex is similar. It matches any amount of whitespace from the start of the line ("^[ \t]*") followed by a comment character ("#"). Obviously, the comment character you match will vary by programming language, but here we're working with shell code ("#" also works fine for Perl and many other scripting languages).
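To see both expressions at work, here's a minimal sketch run against a throwaway sample file. The file contents and the "count_loc" function name are made up for illustration; the sed expression is the same one shown above. (GNU sed understands "\t" inside brackets; other versions of sed may want a literal tab there instead.)

```shell
# count_loc is a hypothetical helper name, not something from the original command.
count_loc() {
    # Drop blank/whitespace-only lines and comment-only lines, then count the rest.
    sed '/^[ \t]*$/d; /^[ \t]*#/d' "$1" | wc -l
}

# Sample file: 2 lines of code, 2 comment-only lines, 2 blank or whitespace-only lines.
sample=$(mktemp)
printf '%s\n' \
    '# a comment' \
    'echo hello' \
    '' \
    '   ' \
    '   # indented comment' \
    'ls -l  # trailing comment' > "$sample"

loc=$(count_loc "$sample")
echo "$loc"
rm -f "$sample"
```

Only "echo hello" and the "ls -l" line survive the filter. Note that a line with a trailing comment still counts as code, since the second regex only matches when "#" is the first non-whitespace character.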

But what if you had an entire directory structure of source code and you wanted to count the lines in all of the individual files? The problem with our command above is that the file name doesn't end up being included in the output. But we can dust off the technique we used in Episode #46:

$ find /usr/lib/perl5/5.8.8/ -name \*.pm | while read -r file; do
echo -n "$file ";
sed '/^[ \t]*$/d; /^[ \t]*#/d' "$file" | wc -l;
done | awk '{t = t + $2; print $2 "\t" $1} END {print t "\tTOTAL"}'


1800 /usr/lib/perl5/5.8.8/Tie/File.pm
91 /usr/lib/perl5/5.8.8/Tie/Scalar.pm
206 /usr/lib/perl5/5.8.8/Tie/Array.pm
179 /usr/lib/perl5/5.8.8/Shell.pm
496 /usr/lib/perl5/5.8.8/diagnostics.pm
244 /usr/lib/perl5/5.8.8/AutoLoader.pm
66 /usr/lib/perl5/5.8.8/DirHandle.pm
118 /usr/lib/perl5/5.8.8/autouse.pm
129671 TOTAL

First we use find to locate all of the source code files we're interested in and pipe the list of file names into a while loop that reads them one by one. Within the loop we spit out the file name plus a space using "echo -n" to suppress the newline usually emitted by echo. That way, our "wc -l" command output ends up on the same line as the file name. Finally, just like in Episode #46 we pipe all the output into an awk expression that neatens up the file name and line count output (reversing the two fields and adding a tab) and even prints out a nice total at the end.
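One caveat: the awk step splits on whitespace, so a file name containing a space would throw off the fields. Here's a hedged sketch of a variant that prints the count first and separates it from the name with a tab, which sidesteps that problem. The "count_tree" function name and its two arguments are assumptions for illustration, not part of the original command.

```shell
# count_tree is a hypothetical helper: $1 = directory to search, $2 = file name pattern.
count_tree() {
    find "$1" -type f -name "$2" | while IFS= read -r file; do
        # Same filter as before: drop blank and comment-only lines, count the rest.
        n=$(sed '/^[ \t]*$/d; /^[ \t]*#/d' "$file" | wc -l)
        # Count first, then a tab, then the (possibly space-containing) file name.
        printf '%d\t%s\n' "$n" "$file"
    done | awk -F'\t' '{t += $1; print} END {printf "%d\tTOTAL\n", t}'
}
```

Something like count_tree /usr/lib/perl5/5.8.8 '*.pm' should produce the same kind of listing as above. File names containing newlines would still confuse it, but those are rare enough that I'm not going to lose sleep over them.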

So, Tim, what have you got up your sleeve besides that "Sweet" new tattoo?

Tim's been stuck in a Chinese drive thru for the past few weeks. NO AND THEN!

Hal and his silly short commands. He needs to man up and go for the man-sized commands, like this:

PS C:\> Select-String -NotMatch -Pattern "^\s*(#|$)" -Path file.txt | Measure-Object -Line

Lines Words Characters Property
----- ----- ---------- --------

Of course, we can shorten it up a bit with aliases, shortened parameter names, and positional parameters:

PS C:\> Select-String -n "^\s*(#|$)" file.txt | measure -l

The command looks for lines that don't match our regular expression. The regular expression looks for the following, in order:
  • "^" beginning of line

  • "\s*" any number, including zero ("*"), of whitespace characters ("\s")

  • "(#|$)" Either the "#" character OR ("|") the end of line ("$")

To output just the line count, we can pipe the results into Select-Object (alias select).

PS C:\> Select-String -n "^\s*(#|$)" file.txt | measure -l | select lines


The recursive directory listing is a bit more complicated, and there are a few ways to do it. The most PowerShell-y way to do it is by outputting objects, and we'll use Group-Object (alias group) to do it.

PS C:\> Get-ChildItem -Recurse -Include *.pm | Select-String -NotMatch -Pattern "^\s*(#|$)" | Group-Object -Property Path -NoElement

Count Name
----- ----
15 C:\perl\otherdir\i.pm
17 C:\perl\hate.pm
5 C:\perl\perl.pm

We start with a recursive directory listing where we only look for .pm files. The files are piped into the Select-String cmdlet, where we pick out the lines we want (the ones that don't match the aforementioned regular expression). The results are piped into Group-Object to give us the number of such lines per file.

The entire command can be shortened using the regular techniques.

PS C:\> ls -r -i *.pm | Select-String -n "^\s*(#|$)" | group Path -n

Unfortunately, there isn't a nice way to get the total line count. We can do it using a variable to store our results. I'll keep it as multiple lines for easier reading.

PS C:\> $a = ls -r -i *.pm | Select-String -n "^\s*(#|$)" | group Path -n
PS C:\> $a

Count Name
----- ----
15 C:\perl\otherdir\i.pm
17 C:\perl\hate.pm
5 C:\perl\perl.pm

PS C:\> $a | Measure-Object -Sum -Property Count | Select-Object sum


"Dude!" What does mine say?