Tuesday, September 6, 2011

Episode #158: The Old Switcheroo

Tim checks the mail

I went to the mailbox and what do you know, more mail! Chris Sakalis writes in:

Dear command line shaolins,

Last year I was given a bash assignment: search the Linux kernel code and replace occurrences of "Linus" with another string, while writing the filename, line number, and new line for each change to a log file. While this has no practical use, it can easily be generalized to any search-and-replace/log operation. At first I thought sed was the best and fastest choice, but I couldn't manage to write the edited lines to a file. So I used bash built-in commands and made this:


However, not only is this not a one-liner, it is also very slow. Actually, running grep and sed twice, once for the log and once for the actual replacement, is faster.
Can you think of any way to turn this into a fast one liner?

Well sir, I can do it in one, albeit quite long, line!

PS C:\> Get-ChildItem -Recurse |  Where-Object { -not $_.PSIsContainer -and
(Select-String -Path $_ -Pattern Linus -AllMatches | Tee-Object -Variable lines) } |
ForEach-Object { $lines; $f = $_.FullName; Move-Item $f "$($f).orig";
Get-Content "$($f).orig" | ForEach-Object { $_ -replace "Linus", "Bill" } | Set-Content $f }

file1.txt:15:Some Text Linus Linus Linus Some Text
somedir\file3.txt:13:My Name is Linus
somedir\file3.txt:37:Blah Linus Blah

We start off by getting a recursive directory listing and piping it into the Where-Object cmdlet for filtering. The first portion of our filter looks for objects that aren't containers, so we just get files. Inside the filter we also search the file with Select-String to find all the occurrences of "Linus". The results are piped into Tee-Object, which will output the data and save it in the variable $lines so we can display it later. That sounds redundant, but it isn't. Our filter needs to evaluate the Select-String + Tee-Object pipeline as True or False so it can determine whether to pass the object along. Any non-null output will evaluate to True, while no output will evaluate to False. Any such output will be eaten by Where-Object, so it won't be displayed. In short, if it finds a string in the file matching "Linus" it will evaluate to True. We are then left with objects that are files and contain "Linus". Now to do the renaming and the search and replace.

The ForEach-Object cmdlet will operate on each file that makes it through the filter. We first output the $lines variable to display the lines that contained "Linus". Next, we save the full path of the file in the variable $f. The file is then renamed with an appended ".orig". Next, we use Get-Content to pipe the contents into another ForEach-Object cmdlet so we can operate on each line. Inside this loop we do the search and replace. Finally, the results are piped into Set-Content to write the file.

As usual, we can shorten the command using aliases and positional parameters.

PS C:\> ls -r |  ? { !$_.PSIsContainer -and   (Select-String -Path $_ -Pattern Linus -AllMatches | tee -var lines) } |
% { $lines; $f = $_.FullName; mv $f "$($f).orig"; gc "$($f).orig" | % { $_ -replace "Linus", "Bill" } | sc $f }

file1.txt:15:Some Text Linus Linus Linus Some Text
somedir\file3.txt:13:My Name is Linus
somedir\file3.txt:37:Blah Linus Blah

The output displayed above is the default output of the MatchInfo object. Since it is an object, we could display it differently by piping it into Select-Object and picking the properties we would like to see.

... $lines | Select-Object Path, LineNumber, Line ...

Path                  LineNumber Line
----                  ---------- ----
C:\file1.txt                  12 Linus
C:\file1.txt                  15 Some Text Linus Linus Linus Some Text
C:\somedir\file3.txt          13 My Name is Linus
C:\somedir\file3.txt          37 Blah Linus Blah

Hal, do you have a one liner for us?

Hal checks his briefs

I don't think a one-liner is the way to go here. A little "divide and conquer" will serve us better.

There are really two problems in this challenge. The first is to do our string replacement, and I can handle that with a little find/sed action:

$ find testdir -type f | xargs sed -i.orig 's/Linus/Bill/g'

Here I'm using the "-i" option with sed to do "in place" editing. A copy of the unmodified file will be saved with the extension ".orig" and the original file name will contain the modified version. The only problem is that sed will make a *.orig copy for every file found-- even if it makes no changes. I'd actually like to clean away any *.orig files that are the same as the "new" version of the file, but I can take care of that in the second part of the solution.
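
If you want to see the "backup for every file" behavior for yourself, here is a minimal sketch. It assumes a throwaway directory from mktemp, and the names demo, has.txt, and hasnt.txt are made up for the illustration:

```shell
# Scratch directory: one file containing the string, one without it.
# (demo, has.txt and hasnt.txt are hypothetical names for this sketch.)
demo=$(mktemp -d)
printf 'Linus\nwumpus\n' > "$demo/has.txt"
printf 'wumpus\n' > "$demo/hasnt.txt"

# Same find/sed command as above, pointed at the scratch directory.
find "$demo" -type f | xargs sed -i.orig 's/Linus/Bill/g'

# Count the backups: sed made one per input file, changed or not.
backups=$(find "$demo" -name '*.orig' | wc -l | tr -d ' ')
echo "backups: $backups"    # prints "backups: 2"

# cmp -s is silent and exits 0 when two files are byte-for-byte identical.
cmp -s "$demo/hasnt.txt" "$demo/hasnt.txt.orig" && unchanged=yes || unchanged=no
echo "hasnt.txt unchanged: $unchanged"    # prints "hasnt.txt unchanged: yes"
```

The second check is the interesting one: hasnt.txt.orig is pure clutter, which is why a cleanup step is needed later.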

We can use diff to find the changed lines in a given file. But the output of diff needs a little massaging to be useful:

$ diff foo.c foo.c.orig
< Bill
< wumpus Bill Bill Bill
> Linus
> wumpus Linus Linus Linus
< Bill wumpus Linux Linux Bill
< Bill
> Linus wumpus Linux Linux Linus
> Linus
< Bill
> Linus

I don't care about anything except the change-command lines that diff prints ahead of each group of changed lines, the ones that look like "2,3c2,3". Those lines are giving me the changed line numbers ("change lines 2-3 in file #1 to look like lines 2-3 in file #2"). I can use awk to match the lines I want, split them on "c" ("-Fc") and print out the first set of line numbers. Something like this for example:

$ diff foo.c foo.c.orig | awk -Fc '/^[0-9]/ { print $1 }'

Then I can add a bit more action with tr to convert the commas to dashes and the newlines to commas:

$ diff foo.c foo.c.orig | awk -Fc '/^[0-9]/ { print $1 }' | tr ,\\n -,

I've got a trailing comma and no newline, but I've basically got a list of the changed line numbers from a single file. Now all I need to do is wrap the whole thing up in a loop:
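
To make the awk and tr steps concrete, here is a small sketch you can paste into a shell. It builds its own six-line sample file in a temp directory (the file contents and the work variable are assumptions for the demo, not the foo.c shown above), so the line numbers differ from the earlier output:

```shell
# Build a six-line "original" and a substituted copy in a scratch directory.
# (work and the file contents are made up for this illustration.)
work=$(mktemp -d)
printf 'one\nLinus\nwumpus Linus Linus Linus\nfour\nLinus wumpus\nsix\n' > "$work/foo.c.orig"
sed 's/Linus/Bill/g' "$work/foo.c.orig" > "$work/foo.c"

# Step 1: keep only the hunk headers (lines starting with a digit) and
# print what is left of the "c". diff exits 1 when the files differ,
# which is the expected case here, so that status is deliberately ignored.
headers=$( { diff "$work/foo.c" "$work/foo.c.orig" || true; } |
    awk -Fc '/^[0-9]/ { print $1 }')
printf '%s\n' "$headers"    # prints "2,3" then "5" on separate lines

# Step 2: commas become dashes, newlines become commas.
ranges=$(printf '%s\n' "$headers" | tr ',\n' '-,')
printf '%s\n' "$ranges"     # prints "2-3,5,"
```

Lines 2-3 and line 5 are the ones containing "Linus" in the sample, so the final list matches what we changed, trailing comma and all.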

find testdir -name \*.orig | while IFS= read -r file; do
    diff=$(diff "${file/%.orig/}" "$file" |
        awk -Fc '/^[0-9]/ { print $1 }' | tr ,\\n -,);
    [[ "$diff" ]] &&
        echo "${file/%.orig/}: $diff" ||
        rm "$file";
done | sed 's/,$//'

In the first statement of the loop we assign the output of our diff pipeline to a variable called $diff. In the second statement of the loop I'm using the short-circuit logical operators "&&" and "||" as a quick and dirty "if-then-else". Essentially, if we got any output in $diff then we output the file name and the list of changed line numbers. Otherwise we remove the *.orig file because the file was not changed. Finally, I use another sed expression at the end of the loop to strip off the trailing commas from each line of output.
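
One caveat with the "&&"/"||" shortcut: "A && B || C" also runs C whenever B fails, so it is not a strict if-then-else. Here it is harmless, since echo rarely fails, but a spelled-out version might look like the sketch below. The function name report_or_clean and the sample files are hypothetical, and the trailing comma is stripped with a parameter expansion instead of the final sed:

```shell
# report_or_clean ORIGFILE
# Hypothetical helper: print "file: changed-lines" if ORIGFILE differs from
# its edited sibling, otherwise delete the pointless backup.
report_or_clean() {
    local orig new changes
    orig=$1
    new=${orig%.orig}                 # strip the .orig suffix
    # diff exits 1 when files differ; that status is ignored on purpose.
    changes=$( { diff "$new" "$orig" || true; } |
        awk -Fc '/^[0-9]/ { print $1 }' | tr ',\n' '-,')
    if [ -n "$changes" ]; then
        printf '%s: %s\n' "$new" "${changes%,}"   # drop the trailing comma
    else
        rm -- "$orig"
    fi
}

# Demo data: a.txt was changed, b.txt was not.
dir=$(mktemp -d)
printf 'Linus\n' > "$dir/a.txt.orig"; printf 'Bill\n' > "$dir/a.txt"
printf 'wumpus\n' > "$dir/b.txt.orig"; printf 'wumpus\n' > "$dir/b.txt"

report_a=$(report_or_clean "$dir/a.txt.orig")
echo "$report_a"                      # prints the path of a.txt followed by ": 1"
report_or_clean "$dir/b.txt.orig"     # prints nothing, removes b.txt.orig
```

It is longer than the one-statement version, which is exactly why the short-circuit trick is tempting on the command line.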

While this two-part solution works fine, Chris and I spent some time trying to figure out how to optimize the solution further (and, yes, laughing about how hard this week's challenge was going to be for Tim). Chris had the crucial insight that running sed on every file-- even if it doesn't include the string we want to replace-- and then having to diff every single file was a huge waste. By being selective at the beginning, we can actually save a lot of time:

# search and replace
find testdir -type f | xargs grep -l Linus | xargs sed -i.orig 's/Linus/Bill/g'

# Output changed lines
find testdir -name \*.orig | while IFS= read -r file; do
    echo "${file/%.orig/}: $(diff "${file/%.orig/}" "$file" |
        awk -Fc '/^[0-9]/ { print $1 }' | tr ,\\n -,)"
done | sed 's/,$//'

Notice that we've introduced an "xargs grep -l Linus" into the first shell pipeline. So the only files that get piped into sed are ones that actually contain the string we're looking to replace. That means in the while loop, any *.orig file we find will actually contain at least one change. So we don't need to have a conditional inside the loop anymore. And in general we have many fewer files to work on, which also saves time. For Chris' sample data, the above solution was twice as fast as our original loop.
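
A quick way to convince yourself that the grep pre-filter keeps sed away from uninteresting files is the sketch below. The tree directory and the hit.c and miss.c names are assumptions invented for the demo:

```shell
# Scratch tree with one matching and one non-matching file.
# (tree, hit.c and miss.c are hypothetical names for this sketch.)
tree=$(mktemp -d)
printf 'Linus\n' > "$tree/hit.c"
printf 'wumpus\n' > "$tree/miss.c"

# grep -l prints only the names of files that contain a match,
# so sed never sees miss.c.
find "$tree" -type f | xargs grep -l Linus | xargs sed -i.orig 's/Linus/Bill/g'

# Only the file that actually changed has a backup.
backed_up=$(find "$tree" -name '*.orig')
echo "$backed_up"    # prints the path of hit.c.orig and nothing else
```

Every *.orig file left behind now marks a real change, which is what lets the reporting loop drop its conditional.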

So while it seems a little weird to use grep to search for our string before using sed to modify the files, in this case it actually saves us a lot of work. If nearly all of your input files contained the string you were replacing, then the grep would most likely make the solution take longer. But if the replacements are sparse, then pre-checking with grep is the way to go.

So thanks Chris for a fun challenge... and for creating another "character building" opportunity for Tim...

Steven can do that in one line!

Loyal reader Steven Tonge contacted us via Twitter with the following one-liner:

find testdir -type f | xargs grep -n Linus | tee lines-changed.txt | 
cut -f1 -d: | uniq | xargs sed -i.orig 's/Linus/Bill/g'

Bonus points for using tee, Steven!

The first part of the pipeline uses "grep -n" to look for the string we want to change. The "-n" outputs the line number of the match, and grep will automatically include the file name because we're grepping against multiple files. So the output that gets fed into tee looks like this:

testdir/foo.c:3:wumpus Linus Linus Linus
testdir/foo.c:5:Linus wumpus Linux Linux Linus

The tee command makes sure we save a copy of this output into the file lines-changed.txt, so that we have a record of the lines that were changed.

But tee also passes the output from grep along to the next part of the pipeline. Here we use cut to split out the file name, and uniq to make sure we only pass one copy of the file name along to our "xargs sed ..." command.
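
Here is the cut/uniq stage in isolation, fed with the two sample grep lines shown above. The variable names are just for the sketch, and note one assumption baked into the approach: splitting on ":" only works if the file names themselves contain no colons.

```shell
# The grep -n output from the example above, held in a variable.
matches='testdir/foo.c:3:wumpus Linus Linus Linus
testdir/foo.c:5:Linus wumpus Linux Linux Linus'

# Field 1 (before the first colon) is the file name; grep emits all of a
# file's matches together, so uniq is enough to collapse the repeats.
files=$(printf '%s\n' "$matches" | cut -f1 -d: | uniq)
echo "$files"    # prints "testdir/foo.c"
```

Two matching lines go in, one file name comes out, and that is all sed needs.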

So Steven stumps the CLKF Masters with a sexy little one-liner. Awesome work!