Tuesday, June 28, 2011

Episode #151: Pathological PATHs

Hal gets some email

Spring is now officially over, but apparently it's never too late to think about Spring Cleaning... of your PATH that is. Jeff Haemer writes in to say that he often adds new directories to his search path with the "PATH+=:/some/new/dir" idiom (aka "PATH=$PATH:/some/new/dir"). But the problem is that if you do this frequently and indiscriminately, you can end up with lots of redundancy in your search path:

$ echo $PATH
/bin:/usr/bin:/usr/X11R6/bin:/usr/local/bin:/sbin:/usr/sbin:/usr/local/sbin:/usr/local/sbin:
/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/home/hal/bin:/sbin:/usr/sbin:/bin:
/usr/bin:/sbin:/usr/sbin:/bin:/usr/bin:/sbin:...
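One way to avoid this kind of buildup in the first place is to append a directory only when it isn't already on the path. A little guard function along these lines does the trick ("pathadd" is just a name I made up for this sketch):

$ pathadd() { [[ ":$PATH:" == *":$1:"* ]] || PATH+=":$1"; }    # hypothetical helper: append only if missing
$ pathadd /some/new/dir    # only appended if it's not already in $PATH

But suppose, like me, the damage is already done.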

For bash, the redundancy here isn't really a huge factor since executable locations are automatically cached by the shell, specifically to avoid having to traverse the entire search path every time you run a program. Still, all those duplicates do make it difficult to see if a specific directory you're looking for is already included in $PATH or not.
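(Incidentally, if you're curious about that cache, the bash "hash" builtin will show you what's been memorized so far, and "hash -r" wipes the slate clean-- handy if you've moved a binary and bash keeps finding the old copy:)

$ hash      # list the commands bash has cached so far, with hit counts
$ hash -r   # forget everything and start caching from scratch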

So Jeff's email got me thinking about ways to reduce $PATH to be just the unique directory entries. My first idea ran along these lines:

$ echo $PATH | tr : \\n | sort -u | tr \\n : | sed 's/:$/\n/'
/bin:/home/hal/bin:/sbin:/usr/bin:/usr/games:/usr/local/bin:/usr/local/sbin:/usr/sbin:/usr/X11R6/bin

I first use "tr" to change the colon separators to newlines, essentially forcing each element of $PATH onto its own line. Then it's just a simple matter of using "sort -u" to reduce the list to only the unique elements. I then use "tr" again to turn the newlines back into colons. The only problem is that the very last newline also ends up becoming a colon, which isn't really what we want. So I added one last "sed" statement at the end to take care of that problem.

This definitely gives us only the unique path elements, but unfortunately it reorders them as well. Since directory order can be very significant when it comes to your search path, it seems like a different solution is warranted. So I decided to take matters into my own hands:

$ declare -A p
$ for d in $(echo $PATH | sed 's/:/ /g'); do
[[ ${p[$d]} ]] || echo -n $d:;
p[$d]=1;
done | sed 's/:$/\n/'

/bin:/usr/bin:/usr/X11R6/bin:/usr/local/bin:/sbin:/usr/sbin:/usr/local/sbin:/usr/games:/home/hal/bin

First I use "declare -A" to initialize a new associative array-- that is, an array indexed with strings rather than numbers. I'll use the array "p" to track the directories I've already seen.
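If associative arrays are new to you, here's the idea in miniature (made-up keys, obviously):

$ declare -A seen
$ seen[/usr/bin]=1
$ [[ ${seen[/usr/bin]} ]] && echo "got it already"
got it already
$ [[ ${seen[/sbin]} ]] || echo "not seen yet"
not seen yet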

At the top of my for loop, I'm using sed to convert the colons in my path to spaces so that the loop will iterate over each separate directory element in $PATH. Inside the loop, I check to see if I've already got an entry in my array "p" for the given path element. If I don't then I output the new directory followed by a colon, but I make sure to use "echo -n" so I don't output a newline. I also make sure to update "p" to indicate that I've already seen and output the directory name.

Like my last example, however, this is going to give me final output that's terminated by a colon, but no newline. So I use the same "sed" fixup I did before so that the output looks nice.

It's a little scripty, but it gets the job done. I'm sure I could accomplish the same thing with some similar looking awk code, but it was fun trying to do this with just shell built-ins.
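For the curious, a rough awk equivalent might look something like this-- the same "remember what we've already printed" trick, just using awk's own associative arrays (a sketch, which should produce the same de-duplicated list as the loop above):

$ echo -n $PATH | awk -v RS=: '!seen[$0]++ {printf "%s%s", sep, $0; sep=":"} END {print ""}'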

Tim, how's your late Spring Cleaning going?

Tim forgot to clean

Silly Hal and his cleanliness. Doesn't he know that we nerds don't like to be clean? Of course, we also tend to be a bit anal retentive, so we will definitely need to clean up our path before picking up all the pizza boxes off the floor. Let's see what my path looks like:

PS C:\> $env:path
%SystemRoot%\system32\WindowsPowerShell\v1.0\;C:\Windows\system32;C:\Windows;
C:\Windows\system32;C:\Windows;C:\Windows\System32\Wbem;C:\Windows\system32;
C:\Windows\System32\WindowsPowerShell\v1.0\;C:\Windows\system32;
C:\Program Files (x86)\QuickTime\QTSystem\


Uh oh, it's a little ugly, but we can clean up the redundant bits. The easiest cleaning method is similar to what Hal did: split, sort, remove duplicates, rejoin.

PS C:\> ($env:path.split(';') | sort -Unique) -join ";"
%SystemRoot%\system32\WindowsPowerShell\v1.0\;
C:\Program Files (x86)\QuickTime\QTSystem\;C:\Windows;C:\Windows\system32;
C:\Windows\System32\Wbem;C:\Windows\System32\WindowsPowerShell\v1.0\


We take the path (a string object) and use the split method to split on the semicolons. The results are piped into Sort-Object (alias sort) where the Unique switch is used to remove duplicates. Finally, the array of objects is passed to the Join operator to combine the items adding a semicolon between each item. Of course, we end up with the same problem that Hal had, a path that is out of order.

Fortunately, we can use a little trick with the Group-Object cmdlet (alias group) to find and remove duplicates.

PS C:\> $env:path.split(';') | group

Count Name
----- ----
1 %SystemRoot%\system32\WindowsPowerShell\v1.0\
4 C:\Windows\system32
3 C:\Windows
1 C:\Windows\System32\Wbem
1 C:\Windows\System32\WindowsPowerShell\v1.0\
1 C:\Program Files (x86)\QuickTime\QTSystem\


Notice that the items stay in order, so all we have to do is output the Name property and recombine the items.

PS C:\> ($env:path.split(';') | group | % {$_.Name}) -join ';'


The ForEach-Object cmdlet (alias %) is used to iterate through each group and output the Name property. The resulting array of strings is again joined via the Join operator, and we have our fixed path.
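If you want to actually use the cleaned-up version, you can assign it straight back to $env:path. Note that this only changes the path for the current PowerShell session, not the persistent machine or user setting:

PS C:\> $env:path = ($env:path.split(';') | group | % {$_.Name}) -join ';'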

Yay, all clean. Now to figure out what to do with all these pizza boxes.

Tuesday, June 21, 2011

Episode #150: Long Line of Evil

Tim goes long

While trying to track down some rogue php shells I needed a way to find legitimate php files injected with bad code. It happened that the injected lines of code were usually quite long, so I needed a way to find php files with long lines, where long is defined as more than 150 characters. How would we find these files in Windows?

First off, let's try our old grumpy friend, cmd.exe.

C:\> for /R %i in (*.php) do @type %i | findstr "..........<140 more periods>" > NUL && @echo %i

C:\webroot\my1.php
C:\webroot\subdir\my4.php


This command does a recursive directory listing looking for .php files. When it finds one, it outputs the contents of the file (via the type command) and uses FindStr to look for lines containing at least 150 characters. When a match is found, the second half of our short-circuit And statement runs, which outputs the file name (%i).

It will output the same file more than once, but we don't have much of a choice. It won't output the matching line number either, but it is functional. If we use PowerShell we can get better-targeted results, like this:

PS C:\> Get-ChildItem -Recurse -Include *.php | Select-String -Pattern '.{150,}' | Select-Object Path, LineNumber, Line

Path LineNumber Line
---- ---------- ----
C:\webroot\subdir\my4.php 3 This is a really really really...
C:\webroot\my1.php 9 This is a really really really...


This command does a recursive directory listing using Get-ChildItem with the -Recurse option. The -Include parameter is given to make sure we only check .php files. The resulting files are piped into Select-String where we find lines with at least 150 characters. We then output the matching file name, the matching line number, and the line itself.

Per usual, we can trim the command using aliases, shortened parameter names, and positional parameters.

PS C:\> ls -r -i *.php | Select-String '.{150,}' | Select-Object Path, LineNumber, Line
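And if all you really need is the list of affected files (one hit per file is plenty), Select-String's -List switch stops after the first match in each file, which saves a little work on big files:

PS C:\> ls -r -i *.php | Select-String '.{150,}' -List | select Path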


Oddly enough, when I actually did this I couldn't use Windows (shudder!). I know what I used isn't as efficient as what Hal is about to do, so I won't bore you with my scripty Linux solution.

Hal goes longer

Actually, what's interesting to me about this week's challenge is that it's surprisingly difficult for such a relatively simple problem. Sort of like my relationship with Tim.

The issue here is that there's no built-in Unix primitive to get the length of the longest line in a file. So we'll write our own:

$ max=''; \
while read line; do [[ ${#line} -gt ${#max} ]] && max="$line"; done </etc/profile; \
echo ${#max}: $max

70: # /etc/profile: system-wide .profile file for the Bourne shell (sh(1))

The trick here is using the bash built-in "${#variable}", which returns the number of characters in the string in $variable (the related "${#array[@]}" form gives you the number of elements in an array). So first I create an empty variable called "max" that I'll use to track my longest line. Then my while loop reads through my target file and compares the length of the current line to the length of the string currently in "max". If the new line is longer, I set max to be the newly crowned longest line. At the end of the loop, "max" will be the longest line in the file (technically it will be the first line of that longest length, but close enough), so I print out the length of our "max" line followed by the line itself.
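If you haven't run into these expansions before, here's a quick illustration (made-up values, obviously):

$ line="some reasonably long line"
$ echo ${#line}
25
$ dirs=(/bin /sbin /usr/bin)
$ echo ${#dirs[@]}
3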

So that will give us the longest line of a single file, but Tim's challenge is actually to find all files which contain a line that's longer than a certain fixed length. In some ways this makes our task easier, since we can stop reading the file as soon as we find any line that exceeds our length limit. But we have to add an extra find command to give us a list of files:

# find /etc -type f -exec /bin/bash -c \
'while read line; do [[ ${#line} -gt 150 ]] && echo {} && break; done < {}' \;

/etc/apt/trusted.gpg~
/etc/apt/trusted.gpg
/etc/apt/apt.conf.d/20dbus
/etc/apt/apt.conf.d/99update-notifier
...

Our while loop now becomes the argument of the "-exec /bin/bash -c ..." action at the end of the find command. And you'll notice that inside the while loop we're just looking for any line that's longer than 150 characters. When we hit this condition we print out the file name and simply call "break" to terminate the loop and stop reading the file.

If you really want to see all the long lines from each file along with the file names, it actually makes our while loop a little simpler:

# find /etc -type f -exec /bin/bash -c \
'while read line; do [[ ${#line} -gt 150 ]] && echo {}:$line; done < {}' \;

...


So the final solution is a little complicated, but it stays well short of the borders of Scriptistan, I'd say. And it's less than 150 characters long...

Davide for the touchdown!

Loyal reader Davide Brini writes in to note that the GNU version of wc actually has a "-L" switch that will output the longest line of a file. So on Linux systems, or any box that has GNU coreutils installed, we could use this option to find files with long lines:

find /etc -type f | xargs wc -L | awk '$1 > 150 {print $2}'

"wc -L" gives us the length of the longest line in the file, followed by the file name. So we use awk to see if the longest line is more than 150 characters, and if so we print out the file name.

But as long as we're using awk, Davide points out that we could just:

find /etc -type f -exec awk 'length > 150 {print FILENAME; exit}' {} \;

Here we're using "find ... -exec awk ..." to call awk on each file in turn. awk will call length() on every line of the file and if we hit a line that's longer than 150 characters, we'll spit out the FILENAME variable which awk helpfully sets for us and terminate awk so we go on to the next file.

And again, if you're on a system with all the GNU utilities installed, then you can do this even more efficiently:

find /etc -type f -exec awk 'length > 150 {print FILENAME; nextfile}' {} +

With GNU find, "-exec ... +" functions like "find ... | xargs ...", calling the awk program as little as possible, but using large groups of matching file names as arguments. The nice thing about the GNU version of awk is that you have the "nextfile" operator to stop reading from the current file and move on to the next one as soon as we encounter a long line.

Thanks, Davide, as always for your insight!

Tuesday, June 14, 2011

Episode #149: Tiiiiime is on my file...

Tim checks the clock

This week we had a tough time coming up with episode ideas. We combed the internet and finally came across this one. And wouldn't you know, it just so happened that I actually needed to use this bit of fu this week.

What I wanted to know is how many seconds (or minutes) had passed since any file in a given directory was last modified. Unfortunately, the cmd.exe version of this command is a big ol' script and looks something like this, so we'll have to stick with the PowerShell version.

First, let's find the most recently modified file:

PS C:\> ls | ? { !$_.PSIsContainer } | sort LastWriteTime -desc | select -f 1

Directory: C:\

Mode LastWriteTime Length Name
---- ------------- ------ ----
-a--- 4/7/2011 1:43 PM 2880 myfile.txt


We start off by getting a directory listing by using Get-ChildItem. The results are piped into the Where-Object cmdlet to filter for objects that aren't containers, which leaves files. We then sort the objects by the LastWriteTime in descending order. Finally, we select the first object.

To get the date difference, all we need to do is take the LastWriteTime property of our object and subtract it from the current time. The verbose version of the command looks like this:

PS C:\> (Get-Date) - (Get-ChildItem | Where-Object { -not $_.PSIsContainer } |
Sort-Object -Property LastWriteTime -Descending | Select-Object -First 1).LastWriteTime


Days : 66
Hours : 10
Minutes : 26
Seconds : 34
Milliseconds : 623
Ticks : 57399946235000
TotalDays : 66.4351229571759
TotalHours : 1594.44295097222
TotalMinutes : 95666.5770583333
TotalSeconds : 5739994.6235
TotalMilliseconds : 5739994623.5


This is very similar to our last command, except for the subtraction and the output of the difference.

The command is a bit long, but we can shorten it using aliases, positional parameters, and shortened parameter names.

PS C:\> (Get-Date) - (ls | ? { !$_.PSIsContainer } | sort LastWriteTime -desc | select -f 1).LastWriteTime


If we just want the total number of seconds, we can get that too.

PS C:\> ((Get-Date) - (ls | ? { !$_.PSIsContainer } | sort LastWriteTime -desc | select -f 1).LastWriteTime).TotalSeconds

5739994.6235


So we have the most recent time, but what if we want the file name too? We can get the name of the file using a slightly different approach by adding a new property to the object:

PS C:\> (ls | ? { !$_.PSIsContainer } | sort LastWriteTime -desc | select -f 1) |
select name, @{Name='SecondsSinceMod';Expression={((Get-Date) - $_.LastWriteTime).TotalSeconds}}


Name SecondsSinceMod
---- ---------------
myfile.txt 5740999.9526015


The Select-Object cmdlet is used to select the Name and to create a new property, SecondsSinceMod. A hashtable defines the calculated property: we give it a Name and an Expression (the value). Here the Name is SecondsSinceMod and the Expression is the same date math we did earlier.
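The same calculated property works across the whole directory, too. For example, to list the five most recently modified files along with their age in seconds (just an illustration of the same trick):

PS C:\> ls | ? { !$_.PSIsContainer } | select Name, @{Name='SecondsSinceMod';Expression={((Get-Date) - $_.LastWriteTime).TotalSeconds}} |
sort SecondsSinceMod | select -f 5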

This episode is quick, and easy. Oh, time, time, time is on my side...

Hal checks out

Well, personally, I can't get no satisfaction this week. If I restrict myself to the rules we set ourselves for the blog, there doesn't actually seem to be a general solution for this problem that works across all Unix platforms without resorting to a higher-level scripting language like Perl.

We've already covered an idiom for figuring out the most recently modified file in a directory, namely "ls -t | head -1". But the problem is getting ls to report out the timestamp on the file in seconds. The GNU version of ls actually has a (very non-standard) "--time-style" option for specifying a time output format. We can leverage this to solve the problem:

$ echo $(( $(date +%s) - $(ls -lt --time-style=+%s | awk 'NR == 2 {print $6}') ))
48106

To understand what's going on here, it helps to break things down into pieces. First we grab the output of "date +%s", which gives us the current time in Unix "epoch time" format (seconds since Jan 1, 1970). Then we use "ls -lt ..." to output the contents of the current directory sorted by last modified time, using "--time-style" to output the timestamp in epoch time format. This gets piped into awk, which prints out the time stamp field from the second line of output (the first line being a header from "ls -l"). All of that is wrapped up in an "echo $(( ... - ... ))" expression, which prints out the difference between the current time and the timestamp on the most recently modified file-- i.e., the number of seconds since that file was changed.

Another way of getting the mtime for a file in epoch time format is with the stat command-- at least if you're on Linux or BSD (Solaris doesn't seem to have implemented a stat command, for example). However, the option for specifying epoch time format differs depending on whether you're using the Linux or BSD version of the command. The Linux version is "stat -c %Y <filename>", while BSD is "stat -f %m <filename>".
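On a Linux box, the stat-based version of the same idea would look something like this (a sketch that just plugs the Linux option syntax into the structure below):

$ echo $(( $(date +%s) - $(stat -c %Y $(ls -t | head -1)) ))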

So a BSD solution for our challenge would be:

$ echo $(( $(date +%s) - $(stat -f %m $(ls -t | head -1)) ))
3423825

In this case we use our "ls -t | head -1" idiom to get the file name of the most recently modified file. That file name then becomes the argument to our stat command, which gives us the epoch time stamp for that file. We then use this value to compute the difference from the current time, just as we did in the last example.

In summary then: this challenge is solvable on Linux or any system that happens to have the GNU ls command installed. And for BSD, there's an option using the stat command. That's pretty decent coverage, but not a completely portable solution. And that's really sort of annoying.

I guess Tim gets one of his rare "wins" this week. Well played, sir. Well played!

Tuesday, June 7, 2011

Episode #148: Draggin' the Line

Hal is making a blog the old, hard way:

Almost 150 Episodes into this little blog experiment, and it's no secret that new ideas are getting hard to find. But like Tommy says, you've got to be "taking and giving... day by day." This week, I'm taking a couple of ideas from commandlinefu for deleting specific lines from a file.

First up, deleting a specific line by line number. DaveQB suggested a fairly tortured bit of shell code for removing a specific line from a file, mentioning that he used it to remove SSH keys from his known_hosts file when the remote host's key had changed. While another poster pointed out that you can just use "ssh-keygen -R <hostname>", this of course only works for known_hosts files. What if we wanted a solution that works more generally?

The canonical answer (which many folks on the thread pointed out) is to use sed:

sed -i.bak 3d ~/.ssh/known_hosts

The main part of the command here is the "3d". The leading number is treated as an "address", which is really just sed's fancy way of saying "line number". When you get to the right line number, the command that comes after the number is executed. In this case, that's the "d" command to simply delete the line.

Now normally sed would just emit the modified file to the standard output, but the "-i" flag means "in place" editing. In other words, the original file is replaced with the modified version. I'm always a little skittish with in place editing and make sure I always add an argument after the "-i". This causes a copy of the original file to be saved with the extension you specify. In this case I'll end up with a known_hosts and a known_hosts.bak file in my ~/.ssh directory.
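Addresses can also be ranges, by the way. So if several consecutive entries had gone stale, you could knock out, say, lines 3 through 5 in one shot (the same in-place caveats apply):

sed -i.bak '3,5d' ~/.ssh/known_hosts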

But it's fairly rare that you actually know the number of the line you want to delete from the file. More likely there are lines matching a specific pattern that you want to delete. For example, I could purge all [t]csh users from my password file:

# grep -v 'csh$' /etc/passwd >/etc/passwd.new
# cp /etc/passwd /etc/passwd.bak
# mv /etc/passwd.new /etc/passwd

Or we could do that in one command with sed and some more in place editing action:

# sed -i.bak /csh$/d /etc/passwd

Here instead of matching a specific line number we're matching a regular expression and deleting every line that matches.

But what if you didn't want to delete every single matching line? User flatcap suggested a sed-based approach for removing only the first matching line:

# sed -i.bak '1,/csh$/{/csh$/d}' /etc/passwd

The "1,/pattern/ {...}" syntax means "do this block from the first line up through and including the line that matches the given pattern". The only downside here is needing to repeat the pattern match inside the block so you don't purge all of the lines before the first match.

But this thread got me thinking about the more general problem of removing the Nth matching line from a given file. Unfortunately, sed isn't ideal for this because the only "variable" you really have to store things in is sed's hold space. But awk makes short work of our challenge:

# awk '/csh$/ {if (++c == 3) next}; {print}' /etc/passwd >/etc/passwd.new

We're keeping a count of the number of matching lines in the variable "c". In this case, when c is equal to 3, we skip over the matching line and it doesn't get printed out. This effectively deletes the third matching line from the file.

Since awk has compound boolean operators, you can even delete ranges of lines:

# awk '/csh$/ {if (++c > 3 && c < 10) next}; {print}' /etc/passwd >/etc/passwd.new

The above command will remove the 4th-9th matching lines from the file. I'm not exactly sure why this would ever be useful, but it's a good shell trivia question anyway.

The only downside to awk is that it doesn't have sed's in place editing mode. I could do something very similar with Perl-- which has the "-i" option for in place editing just like sed-- but our blog rules don't let me use Perl.
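The usual workaround is the same shuffle we did with grep earlier-- write to a new file, save a backup, and move the new file into place:

# awk '/csh$/ {if (++c == 3) next}; {print}' /etc/passwd >/etc/passwd.new
# cp /etc/passwd /etc/passwd.bak
# mv /etc/passwd.new /etc/passwd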

Now if Tim can get his dog Sam to stop eating those purple flowers, we might see some Windows action on this blog...

Tim is taking his time gettin' the good sign

Hal started off by deleting lines from a file that matched a certain string. This is the same as keeping lines that don't match the string. It's like a double negative, and that is exactly what Hal did and what we'll do here:

PS C:\> (gc file.txt) -notmatch "somestring" | sc newfile.txt


We use Get-Content (alias gc) to output the file. The -NotMatch operator will pass any line that doesn't match "somestring". Finally, we use Set-Content (alias sc) to write our file. The benefit of Set-Content over something like Out-File or a redirect (>) is that Set-Content will keep the same encoding, while the others will default to Unicode. Set-Content makes more sense since we just want to remove lines, not change anything else about the file.

Unfortunately, this command will load the entire file into memory first, so it may be a problem with big files. To do the same thing without loading the whole file into memory requires a command that is slightly less terse.

PS C:\> gc file.txt | ? { $_ -notmatch "somestring" } | sc newfile.txt


All we do is pipe Get-Content into Where-Object (alias ?) and then into Set-Content. Not a big change, but it is a few characters longer.

We can accomplish a similar task with cmd.exe:

C:\> type file.txt | find /v "somestring" > newfile.txt


We use the Type command to output our file and pipe it into Find. Find's /v option displays lines that don't contain our string.

Now Hal wants to make it complicated and only remove the first instance of "somestring". We have to do a little "enhanced fu" (not to be confused with a script).

PS C:\> $found = 0; gc file.txt | ? { $_ -notmatch "somestring" -or $found++ } | sc newfile.txt


This is a little confusing at first glance. Even after I came up with it, it still confused me for a few minutes. We start off by setting $found to 0 (false). We then read our file with Get-Content and pipe it into Where-Object (alias ?). Now here is the trick.

In the Where-Object script block we first check whether the line doesn't match "somestring". Due to the short-circuit nature of PowerShell's logical operators, if NotMatch is true we don't need to check whether $found is true (non-zero). When we come to a line that does contain "somestring", the first portion is false and we need to check the value of $found. The first time $found is evaluated it is 0 (false), and since the entire script block evaluates to false the line is not passed down the pipeline. However, when the value of $found is checked, the increment operator (++) adds one to $found, so any future evaluation will be true. Seem confusing? Here is a sample file and what happens as we work through it:

Line 1         # notmatch true;  $found not evaluated; object passed
Line 2         # notmatch true;  $found not evaluated; object passed
asomestringa   # notmatch false; $found false (0); object NOT passed; $found incremented to 1
Line 3         # notmatch true;  $found not evaluated; object passed
asomestringa   # notmatch false; $found true (1); object passed; $found incremented to 2
Line 4         # notmatch true;  $found not evaluated; object passed


So far so good. But we have a little problem: neither PowerShell nor cmd.exe will automatically create a backup. We have to create a copy ourselves and then do our modification. I won't bore you with how to copy a file.

Next, Hal removed the 3rd line of our file. Here is how we do it in PowerShell:

PS C:\> gc file.txt | ? {$_.ReadCount -ne 3} | sc newfile.txt


We can use a similar technique to remove a range of lines (4th through 9th):

PS C:\> gc file.txt | ? { (4..9) -notcontains $_.ReadCount }


We got action, not as much as *nix. We ain't got much but what we got's ours.