Tuesday, June 7, 2011

Episode #148: Draggin' the Line

Hal is making a blog the old, hard way:

Almost 150 Episodes into this little blog experiment, and it's no secret that new ideas are getting hard to find. But like Tommy says, you've got to be "taking and giving... day by day." This week, I'm taking a couple of ideas from commandlinefu for deleting specific lines from a file.

First up, deleting a specific line by line number. DaveQB suggested a fairly tortured bit of shell code for removing a specific line from a file, mentioning that he used it to remove SSH keys from his known_hosts file when the remote host's key had changed. While another poster pointed out that you can just use "ssh-keygen -R <hostname>", this of course only works for known_hosts files. What if we wanted a solution that works more generally?

The canonical answer (which many folks on the thread pointed out) is to use sed:

sed -i.bak 3d ~/.ssh/known_hosts

The main part of the command here is the "3d". The leading number is treated as an "address", which is really just sed's fancy way of saying "line number". When you get to the right line number, the command that comes after the number is executed. In this case, that's the "d" command to simply delete the line.

Now normally sed would just emit the modified file to the standard output, but the "-i" flag means "in place" editing. In other words, the original file is replaced with the modified version. I'm always a little skittish with in place editing and make sure I always add an argument after the "-i". This causes a copy of the original file to be saved with the extension you specify. In this case I'll end up with a known_hosts and a known_hosts.bak file in my ~/.ssh directory.
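
If you'd rather try this out before pointing it at your real known_hosts, here's a minimal sketch that runs the same command against a scratch file. The temporary directory and the file contents are made up for the demo; only the sed invocation comes from the text above.

```shell
# Scratch directory and sample file are assumptions for this demo
tmp=$(mktemp -d)
printf '%s\n' 'host1 key1' 'host2 key2' 'host3 key3' 'host4 key4' > "$tmp/known_hosts"

# Delete line 3 in place, keeping the original as known_hosts.bak
sed -i.bak 3d "$tmp/known_hosts"

edited=$(cat "$tmp/known_hosts")      # the host3 line is gone
backup=$(cat "$tmp/known_hosts.bak")  # untouched four-line original
printf 'edited:\n%s\n' "$edited"
rm -rf "$tmp"
```

Writing the suffix directly against the "-i" (no space) is the form that both GNU and BSD sed should accept.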

But it's fairly rare that you actually know the number of the line you want to delete from the file. More likely there are lines matching a specific pattern that you want to delete. For example, I could purge all [t]csh users from my password file:

# grep -v 'csh$' /etc/passwd >/etc/passwd.new
# cp /etc/passwd /etc/passwd.bak
# mv /etc/passwd.new /etc/passwd

Or we could do that in one command with sed and some more in place editing action:

# sed -i.bak '/csh$/d' /etc/passwd

Here instead of matching a specific line number we're matching a regular expression and deleting every line that matches.
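
As a sanity check, here's a small sketch showing that the grep approach and the sed approach keep exactly the same lines. The passwd-style entries are invented for illustration, and nothing here touches the real /etc/passwd.

```shell
# Invented passwd-style sample data (not real accounts)
tmp=$(mktemp -d)
cat > "$tmp/passwd" <<'EOF'
root:x:0:0:root:/root:/bin/bash
alice:x:1000:1000::/home/alice:/bin/csh
bob:x:1001:1001::/home/bob:/bin/tcsh
carol:x:1002:1002::/home/carol:/bin/zsh
EOF

# grep keeps the non-matching lines; sed deletes the matching ones.
# Quoting the sed expression keeps the shell away from the "$".
via_grep=$(grep -v 'csh$' "$tmp/passwd")
via_sed=$(sed '/csh$/d' "$tmp/passwd")

[ "$via_grep" = "$via_sed" ] && echo "same result"
printf '%s\n' "$via_sed"
rm -rf "$tmp"
```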

But what if you didn't want to delete every single matching line? User flatcap suggested a sed-based approach for removing only the first matching line:

# sed -i.bak '1,/csh$/{/csh$/d}' /etc/passwd

The "1,/pattern/ {...}" syntax means "do this block from the first line up through and including the line that matches the given pattern". The only downside here is needing to repeat the pattern match inside the block so you don't purge all of the lines before the first match.
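
Here's a quick sketch of that range trick on invented sample data, writing to standard output rather than editing in place. It also shows one corner case worth knowing about: if the very first line of the file matches, the search for the end of the range only starts at line 2, so a second matching line gets deleted as well. (GNU sed accepts "0,/pattern/" as a starting address to sidestep this, but that's a GNU extension and isn't used below.)

```shell
# Invented sample data; the usernames are placeholders
sample='alice:/bin/bash
bob:/bin/csh
carol:/bin/tcsh
dave:/bin/zsh'

# Only the first csh user (bob) is removed; carol survives.
# The ";" before "}" is there for sed versions that insist on it.
first_only=$(printf '%s\n' "$sample" | sed '1,/csh$/{/csh$/d;}')
printf '%s\n' "$first_only"

# Corner case: a match on line 1 means the range runs on to the NEXT match,
# so both eve and bob are deleted here
edge=$(printf 'eve:/bin/csh\n%s\n' "$sample" | sed '1,/csh$/{/csh$/d;}')
printf '%s\n' "$edge"
```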

But this thread got me thinking about the more general problem of removing the Nth matching line from a given file. Unfortunately, sed isn't ideal for this because the only "variable" you really have to store things in is sed's hold space. But awk makes short work of our challenge:

# awk '/csh$/ {if (++c == 3) next}; {print}' /etc/passwd >/etc/passwd.new

We're keeping a count of the number of matching lines in the variable "c". In this case, when c is equal to 3, we skip over the matching line and it doesn't get printed out. This effectively deletes the third matching line from the file.
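
To make the "N" adjustable, one option is to pass it in with awk's -v flag instead of hard-coding the 3. The sketch below does that on invented sample data; the variable name "n" is just a choice made for this example.

```shell
# Invented sample data: four csh-family users and one bash user
sample='u1:/bin/csh
u2:/bin/bash
u3:/bin/tcsh
u4:/bin/csh
u5:/bin/csh'

# Skip only the n-th matching line; every other line is printed
result=$(printf '%s\n' "$sample" |
    awk -v n=3 '/csh$/ {if (++c == n) next}; {print}')
printf '%s\n' "$result"
```

With n=3 the matches are u1, u3 and u4 in that order, so u4 is the line that goes missing.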

Since awk has compound boolean operators, you can even delete ranges of lines:

# awk '/csh$/ {if (++c > 3 && c < 10) next}; {print}' /etc/passwd >/etc/passwd.new

The above command will remove the 4th-9th matching lines from the file. I'm not exactly sure why this would ever be useful, but it's a good shell trivia question anyway.

The only downside to awk is that it doesn't have sed's in place editing mode. I could do something very similar with Perl-- which has the "-i" option for in place editing just like sed-- but our blog rules don't let me use Perl.
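
Lacking an in place mode, the usual workaround is the same temp-file shuffle used with grep earlier. Here's a rough sketch wrapped in a function. Note that "delete_nth_match" is a hypothetical helper name, the pattern is handed to awk as a dynamic regex via -v (so any backslashes in it would need doubling), and the demo runs on a scratch file rather than /etc/passwd.

```shell
# Hypothetical helper: delete the n-th line matching a regex from a file,
# keeping the original as FILE.bak (roughly what "sed -i.bak" does)
delete_nth_match() {
    pattern=$1 n=$2 file=$3
    awk -v pat="$pattern" -v n="$n" \
        '$0 ~ pat {if (++c == n) next}; {print}' "$file" > "$file.new" &&
        cp "$file" "$file.bak" &&
        mv "$file.new" "$file"
}

# Demo on a scratch file with invented contents
tmp=$(mktemp -d)
printf '%s\n' 'a:/bin/csh' 'b:/bin/sh' 'c:/bin/tcsh' 'd:/bin/csh' > "$tmp/passwd"
delete_nth_match 'csh$' 2 "$tmp/passwd"

after=$(cat "$tmp/passwd")        # c (the 2nd match) is gone
before=$(cat "$tmp/passwd.bak")   # backup still has all four lines
printf '%s\n' "$after"
rm -rf "$tmp"
```

One thing to watch with this approach: the replacement file is a brand new file, so it gets whatever ownership and permissions the shell gives new files, not those of the original. For something like a password file you'd want to check that afterwards.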

Now if Tim can get his dog Sam to stop eating those purple flowers, we might see some Windows action on this blog...

Tim is taking his time gettin' the good sign

Hal started off by deleting lines from a file that matched a certain string. This is the same as keeping lines that don't match the string. It's like a double negative, and that is exactly what Hal did and what we'll do here:

PS C:\> (gc file.txt) -notmatch "somestring" | sc newfile.txt

We use Get-Content (alias gc) to output the file. The -NotMatch operator will pass any line that doesn't match "somestring". Finally, we use Set-Content (alias sc) to write our file. The benefit of Set-Content over something like Out-File or a redirect (>) is that Set-Content will keep the same encoding and the others will default to Unicode. Set-Content makes more sense as we just want to remove lines, not change anything else about the file.

Unfortunately, this command will load the entire file into memory first, so it may be a problem with big files. To do the same thing without loading the file into memory requires a command that is slightly less terse.

PS C:\> gc file.txt | ? { $_ -notmatch "somestring" } | sc newfile.txt

All we do is pipe Get-Content into Where-Object (alias ?) and then into Set-Content. Not a big change, but it is a few characters longer.

We can accomplish a similar task with cmd.exe:

C:\> type file.txt | find /v "somestring" > newfile.txt

We use the Type command to output our file and pipe it into Find. Find's /v option displays lines that don't contain our string.

Now Hal wants to make it complicated, and only remove the first instance of "somestring". We have to do a little "enhanced fu" (not to be confused with a script).

PS C:\> $found = 0; gc file.txt | ? { $_ -notmatch "somestring" -or $found++ } | sc newfile.txt

This is a little confusing at first glance. Even after I came up with it, it still confused me for a few minutes. We start off by setting $found to 0 (false). We then read our file with Get-Content and pipe it into Where-Object (alias ?). Now here is the trick.

In the Where-Object script block we first check if the line doesn't match "somestring". Due to the short circuit nature of PowerShell's logical operators, if NotMatch is true we don't need to check if $found is true (non-zero). When we come to a line that does contain "somestring", the first portion is false and we need to check the value of $found. The first time $found is evaluated it is 0 (false), and since the entire script block evaluates to false the line is not passed down the pipeline. However, when the value of $found is checked the increment operator (++) adds one to $found, so any future checks will evaluate to true. Seem confusing? Here is a sample file and what happens as we work through the file:

Line 1        # notmatch true;  $found not evaluated; object passed
Line 2        # notmatch true;  $found not evaluated; object passed
asomestringa  # notmatch false; $found false; object NOT passed; $found incremented
Line 3        # notmatch true;  $found not evaluated; object passed
asomestringa  # notmatch false; $found true;  object passed; $found incremented
Line 4        # notmatch true;  $found not evaluated; object passed
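
For readers following along on the Unix side, the same short circuit trick translates almost word for word into awk, where a bare pattern prints the line whenever it evaluates to true. This is only a sketch of the analogy, not something from Hal's half of the post; it runs the sample file from the walkthrough above through that pattern.

```shell
# The sample file from the walkthrough above
sample='Line 1
Line 2
asomestringa
Line 3
asomestringa
Line 4'

# Print a line if it does NOT match, or if "found" was already non-zero.
# found++ is only evaluated on matching lines, and yields the old value.
result=$(printf '%s\n' "$sample" | awk '!/somestring/ || found++')
printf '%s\n' "$result"
```

Just as in the PowerShell version, only the first "asomestringa" line is dropped and the second one makes it through.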

So far so good. But we have a little problem. Neither PowerShell nor cmd.exe will automatically create a backup. We have to create a copy ourselves and then do our modification. I won't bore you with how to copy a file.

Next, Hal removed the 3rd line of our file. Here is how we do it in PowerShell:

PS C:\> gc file.txt | ? {$_.ReadCount -ne 3} | sc newfile.txt

We can use a similar technique to remove a range of lines (4th through 9th):

PS C:\> gc file.txt | ? { (4..9) -notcontains $_.ReadCount } | sc newfile.txt

We got action, not as much as *nix. We ain't got much but what we got's ours.