Tuesday, September 13, 2011

Revisiting Episode #151: Readers' Revenge!

Hal's a football widow

Well, it's the start of football season here in the US, and Tim's locked himself in his "Man Cave" to catch all of the action. For our readers outside the US, our version of football is played with a "ball" that isn't at all round and which rarely touches the players' feet. We just called it football to confuse the rest of the world.

Since football really isn't my sport, I figured I'd spend some time this weekend catching up on reader responses to some of our past Episodes. Back in Episode #151 I sort of threw down the gauntlet at the end of my solution when I stated, "I'm sure I could accomplish the same thing with some similar looking awk code, but it was fun trying to do this with just shell built-ins." I figured that mention of an awk solution would bring an email from Davide Brini, and in this I was not disappointed.

Davide throws down

Let's just get straight to the awk, shall we:

echo -n $PATH | awk 'BEGIN { RS = ":" }; 
!a[$0]++ { printf "%s%s", s, $0; s = RS };
END { print "" }'

There's some sneaky clever bits here that bear some explanation:


  • In the BEGIN block, Davide is setting "RS"-- the "record separator" variable-- to colon. That means awk will treat each element of our input path as a separate record, automatically looping over each individual element and evaluating the statement in the middle of the example above.


  • That statement begins with a conditional operator, "!a[$0]", combined with an auto-increment, "++". In the conditional expression, "a" is an associative array that's being indexed with the elements of our $PATH. "$0" is the current "record" in the path that awk is giving us. So "!a[$0]" is true if we don't already have an entry for the current $PATH element in the array "a".


  • True or false, however, the auto-increment operator is going to add one to the value in "a[$0]", ensuring that if we run into a duplicate later in $PATH then the "!a[$0]" condition will return false.


  • If "!a[$0]" is true (it's the first time we've encountered a given directory in $PATH), then we execute the block after the conditional. That prints the value of variable "s" followed by the directory name, "$0". The first time through the loop, "s" will be null and we just print the directory. However, the second statement in the block sets "s" to be colon (the value of "RS"), so in future iterations we'll print a colon before the directory name, so that everything gets nicely colon-separated.


  • In the END block, we print an empty string. This has the side effect of spitting out a newline at the end of our output, which makes things more readable.
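
To watch those steps play out, here is a minimal sketch that runs the same awk program against a made-up path instead of the real $PATH. The "sample" and "deduped" variables are assumptions for illustration, and printf stands in for "echo -n", which not every echo implementation supports.

```shell
# Hypothetical stand-in for $PATH, chosen so that it contains duplicates
sample='/bin:/usr/bin:/bin:/usr/local/bin:/usr/bin'

# Same awk program as above: one record per colon-separated element,
# printed only the first time it is seen
deduped=$(printf '%s' "$sample" | awk 'BEGIN { RS = ":" }
!a[$0]++ { printf "%s%s", s, $0; s = RS }
END { print "" }')

echo "$deduped"
```

The repeated "/bin" and "/usr/bin" entries are dropped, and the surviving directories keep their original order.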


Phew! That's some fancy awk, Davide. Thanks for the code!

Who let the shells out?

What I wasn't expecting was a note from loyal reader Daniel Miller, who took my whole "just shell built-ins" comment quite seriously. I had some sed mixed up in my final solution, but Daniel provided the following shell-only solution:

$ declare -A p
$ for d in ${PATH//:/ }; do [[ ${p[$d]} ]] || u[$((c++))]=$d; p[$d]=1; done
$ IFS=:
$ echo "${u[*]}"
/bin:/usr/bin:/usr/X11R6/bin:/usr/local/bin:/sbin:/usr/sbin:/usr/local/sbin:/usr/games:/home/hal/bin
$ unset IFS

I am limp with admiration. Daniel replaces the sed in my loop with the shell variable substitution operator "${var/.../...}" that we've used in previous Episodes. The clever bit, though, is that he's added a new array called "u" to the mix to keep track of the unique directory names, in order, as we progress through the elements of $PATH.

Inside the loop we check our associative array "p" as before to see whether we've encountered a given directory, $d, or not. If this is the first time we've seen $d, then "[[ ${p[$d]} ]]" will be false, and so we'll execute the statement after the "||", which adds the directory name to our array "u". The clever bit is the "$((c++))" in the array index, which uses "c" as an auto-incrementing counter variable to keep extending the "u" array as necessary to add new directory names.

You'll notice, however, that we're not outputting anything inside the loop. After the loop is finished, Daniel uses "echo "${u[*]}"" to output all of the elements of "u" with a single statement. The neat thing about the "${u[*]}" syntax is that it uses the first character of IFS to separate the array elements as they're being printed. So Daniel sets IFS to colon before the echo statement-- and then unsets it afterwards because having IFS set to colon is surely going to mess up later commands! In fact, Daniel suggests putting all of this mess into a shell function where you can declare IFS as a local variable and not mess up other commands.
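
Daniel's function idea might look something like the sketch below. The name "dedupe_path" and the variable names are made up for illustration, and it needs bash 4 or later for the associative array. One deliberate difference from the code above: because IFS is already set to colon inside the function, the sketch lets word splitting break up the path instead of using "${PATH//:/ }". It also skips empty path elements, which is an assumption of this sketch rather than anything Daniel specified.

```shell
#!/usr/bin/env bash
# dedupe_path is a hypothetical name; the post only suggests "a shell function".
dedupe_path() {
    local IFS=:        # scoped to the function, so the caller's IFS is untouched
    local -A seen=()   # associative array of directories already seen (bash 4+)
    local -a out=()    # unique directories, in first-seen order
    local d
    for d in $1; do    # unquoted on purpose: word splitting on ":" does the work
        # Assumption of this sketch: empty elements are skipped
        if [[ -n $d && -z ${seen[$d]} ]]; then
            seen[$d]=1
            out+=("$d")
        fi
    done
    printf '%s\n' "${out[*]}"   # "*" joins elements with the first character of IFS
}

dedupe_path "/bin:/usr/bin:/bin:/sbin:/usr/bin"
```

To apply it to the live path you could run something like PATH=$(dedupe_path "$PATH"). Since IFS is declared local, there's no need to unset it afterwards.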

Anyway, thanks as always to our readers for their efforts to improve our humble shell efforts. I'll see if I can drag Tim out of the Man Cave in time for next week's Episode...

Tuesday, September 6, 2011

Episode #158: The Old Switcheroo

Tim checks the mail

I went to the mailbox and what do you know, more mail! Chris Sakalis writes in:

Dear command line saolins,

The year before, I was given a bash assignment asking to search the Linux kernel code and replace occurrences of "Linus" with another string, while writing the filename, line number and the new line where a change was made in a file. While this has no practical use, it can be easily generalized to any search and replace/log operation. At first I thought sed was the best and fastest choice, but I couldn't manage writing the edited lines in a file. So I used bash built-in commands and made this:

<script>

However not only this isn't a one liner but also it's very slow. Actually, running grep and sed twice, one for the log and one for the actual replacement is faster.
Can you think of any way to turn this into a fast one liner?


Well sir, I can do it in one, albeit quite long, line!

PS C:\> Get-ChildItem -Recurse |  Where-Object { -not $_.PSIsContainer -and
(Select-String -Path $_ -Pattern Linus -AllMatches | Tee-Object -Variable lines) } |
ForEach-Object { $lines; $f = $_.FullName; Move-Item $f "$($f).orig";
Get-Content "$($f).orig" | ForEach-Object { $_ -replace "Linus", "Bill" } | Set-Content $f }


file1.txt:12:Linus
file1.txt:15:Some Text Linus Linus Linus Some Text
somedir\file3.txt:13:My Name is Linus
somedir\file3.txt:37:Blah Linus Blah


We start off by getting a recursive directory listing and piping it into the Where-Object cmdlet for filtering. The first portion of our filter looks for objects that aren't containers, so we just get files. Inside the filter we also search the file with Select-String to find all the occurrences of "Linus". The results are piped into Tee-Object which will output the data and save it in the variable $lines so we can display it later. That sounds redundant, but it isn't. Our filter needs to evaluate the Select-String + Tee-Object pipeline as True or False so it can determine if it should pass the objects. Any non-null output will evaluate to True, while no output will evaluate to False. Any such output will be eaten by Where-Object so it won't be displayed. In short, if it finds a string in the file matching "Linus" it will evaluate to True. We are then left with objects that are files and contain "Linus". Now to do the renaming and search and replace.

The ForEach-Object cmdlet will operate on each file that makes it through the filter. We first output the $lines variable to display the lines that contained "Linus". Next, we save the full path of the file in the variable $f. The file is then renamed with an appended ".orig". Next, we use Get-Content to pipe the contents into another ForEach-Object cmdlet so we can operate on each line. Inside this loop we do the search and replace. Finally, the results are piped into Set-Content to write the file.

As usual, we can shorten the command using aliases and positional parameters.

PS C:\> ls -r |  ? { !$_.PSIsContainer -and   (Select-String -Path $_ -Pattern Linus -AllMatches | tee -var lines) } |
% { $lines; $f = $_.FullName; mv $f "$($f).orig"; gc "$($f).orig" | % { $_ -replace "Linus", "Bill" } | sc $f }


file1.txt:12:Linus
file1.txt:15:Some Text Linus Linus Linus Some Text
somedir\file3.txt:13:My Name is Linus
somedir\file3.txt:37:Blah Linus Blah


The output displayed above is the default output of the MatchInfo object. Since it is an object, we could display it differently if we like by piping it into Select-Object and picking the properties we would like to see.

... $lines | Select-Object Path, LineNumber, Line ...

Path LineNumber Line
---- ---------- ----
C:\file1.txt 12 Linus
C:\file1.txt 15 Some Text Linus Linus Linus Some Text
C:\somedir\file3.txt 13 My Name is Linus
C:\somedir\file3.txt 37 Blah Linus Blah


Hal, do you have a one liner for us?

Hal checks his briefs

I don't think a one-liner is the way to go here. A little "divide and conquer" will serve us better.

There are really two problems in this challenge. The first is to do our string replacement, and I can handle that with a little find/sed action:

$ find testdir -type f | xargs sed -i.orig 's/Linus/Bill/g'

Here I'm using the "-i" option with sed to do "in place" editing. A copy of the unmodified file will be saved with the extension ".orig" and the original file name will contain the modified version. The only problem is that sed will make a *.orig copy for every file found-- even if it makes no changes. I'd actually like to clean away any *.orig files that are the same as the "new" version of the file, but I can take care of that in the second part of the solution.
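
To see that side effect in a throwaway directory, here is a small sketch. The file names and contents are made up, and it assumes a sed that accepts "-i" with an attached suffix (GNU sed does).

```shell
tmp=$(mktemp -d)                        # scratch directory with hypothetical test data
printf 'Linus\n' > "$tmp/changed.c"
printf 'nothing to see\n' > "$tmp/untouched.c"

find "$tmp" -type f | xargs sed -i.orig 's/Linus/Bill/g'

# cmp -s is silent and succeeds only when the two files are identical
for f in "$tmp"/*.orig; do
    if cmp -s "$f" "${f%.orig}"; then
        echo "unchanged: ${f##*/}"
    else
        echo "changed:   ${f##*/}"
    fi
done
```

Both files end up with a *.orig twin, but only one of the twins actually differs from its "new" version.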

We can use diff to find the changed lines in a given file. But the output of diff needs a little massaging to be useful:

$ diff foo.c foo.c.orig
2,3c2,3
< Bill
< wumpus Bill Bill Bill
---
> Linus
> wumpus Linus Linus Linus
5,6c5,6
< Bill wumpus Linux Linux Bill
< Bill
---
> Linus wumpus Linux Linux Linus
> Linus
8c8
< Bill
---
> Linus


I don't care about anything except the lines that look like "2,3c2,3". Those lines are giving me the changed line numbers ("change lines 2-3 in file #1 to look like lines 2-3 in file #2"). I can use awk to match the lines I want, split them on "c" ("-Fc") and print out the first set of line numbers. Something like this for example:

$ diff foo.c foo.c.orig | awk -Fc '/^[0-9]/ { print $1 }'
2,3
5,6
8

Then I can add a bit more action with tr to convert the commas to dashes and the newlines to commas:

$ diff foo.c foo.c.orig | awk -Fc '/^[0-9]/ { print $1 }' | tr ,\\n -,
2-3,5-6,8,

I've got a trailing comma and no newline, but I've basically got a list of the changed line numbers from a single file. Now all I need to do is wrap the whole thing up in a loop:

find testdir -name \*.orig | while read file; do 
diff=$(diff ${file/%.orig/} $file |
awk -Fc '/^[0-9]/ { print $1 }' | tr ,\\n -,);
[[ "$diff" ]] &&
echo ${file/%.orig/}: $diff ||
rm "$file";
done | sed 's/,$//'

In the first statement of the loop we assign the output of our diff pipeline to a variable called $diff. In the second statement of the loop I'm using the short-circuit logical operators "&&" and "||" as a quick and dirty "if-then-else". Essentially, if we got any output in $diff then we output the file name and the list of changed line numbers. Otherwise we remove the *.orig file because the file was not changed. Finally, I use another sed expression at the end of the loop to strip off the trailing commas from each line of output.
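
Here is a rough end-to-end run of that loop against scratch data, in case you want to try it without touching real files. The test files and the "report" and "changes" variables are hypothetical. The sketch also swaps in the POSIX "${file%.orig}" form and a plain if/else in place of "${file/%.orig/}" and the "&&"/"||" idiom, which should behave the same way here.

```shell
tmp=$(mktemp -d)                                       # scratch directory
printf 'a\nLinus\nb\nLinus\nLinus\n' > "$tmp/foo.c.orig"
printf 'a\nBill\nb\nBill\nBill\n'    > "$tmp/foo.c"    # lines 2, 4 and 5 differ
printf 'same\n' > "$tmp/bar.c"
cp "$tmp/bar.c" "$tmp/bar.c.orig"                      # identical pair: no changes

report=$(find "$tmp" -name '*.orig' | while read -r file; do
    # Keep only diff's "2,3c2,3" style lines and reshape them into "2-3,"
    changes=$(diff "${file%.orig}" "$file" |
        awk -Fc '/^[0-9]/ { print $1 }' | tr ',\n' '-,')
    if [ -n "$changes" ]; then
        echo "${file%.orig}: $changes"
    else
        rm "$file"                                     # nothing changed, drop the copy
    fi
done | sed 's/,$//')

echo "$report"
```

Only foo.c shows up in the report, and the bar.c.orig copy is cleaned away because diff had nothing to say about it.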

While this two-part solution works fine, Chris and I spent some time trying to figure out how to optimize the solution further (and, yes, laughing about how hard this week's challenge was going to be for Tim). Chris had the crucial insight that running sed on every file-- even if it doesn't include the string we want to replace-- and then having to diff every single file was a huge waste. By being selective at the beginning, we can actually save a lot of time:

# search and replace
find testdir -type f | xargs grep -l Linus | xargs sed -i.orig 's/Linus/Bill/g'

# Output changed lines
find testdir -name \*.orig | while read file; do
echo ${file/%.orig/}: $(diff ${file/%.orig/} $file |
awk -Fc '/^[0-9]/ { print $1 }' | tr ,\\n -,)
done | sed 's/,$//'

Notice that we've introduced an "xargs grep -l Linus" into the first shell pipeline. So the only files that get piped into sed are ones that actually contain the string we're looking to replace. That means in the while loop, any *.orig file we find will actually contain at least one change. So we don't need to have a conditional inside the loop anymore. And in general we have many fewer files to work on, which also saves time. For Chris' sample data, the above solution was twice as fast as our original loop.

So while it seems a little weird to use grep to search for our string before using sed to modify the files, in this case it actually saves us a lot of work. If nearly all of your input files contained the string you were replacing, then the grep would most likely make the solution take longer. But if the replacements are sparse, then pre-checking with grep is the way to go.
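
A quick way to convince yourself that the grep pre-check keeps sed away from uninteresting files is a sketch like this one. The file names are made up, and a sed with "-i.orig" support is assumed.

```shell
tmp=$(mktemp -d)                        # scratch directory with hypothetical test data
printf 'Linus\n'    > "$tmp/hit.c"
printf 'no match\n' > "$tmp/miss.c"

# grep -l prints only the names of files that contain a match
find "$tmp" -type f | xargs grep -l Linus | xargs sed -i.orig 's/Linus/Bill/g'

ls "$tmp"
```

Only hit.c gets a *.orig copy. One thing to watch for: if grep matches nothing at all, some xargs implementations will still run sed once with no file arguments. GNU xargs has a "-r" option to suppress that.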

So thanks Chris for a fun challenge... and for creating another "character building" opportunity for Tim...

Steven can do that in one line!

Loyal reader Steven Tonge contacted us via Twitter with the following one-liner:

find testdir -type f | xargs grep -n Linus | tee lines-changed.txt | 
cut -f1 -d: | uniq | xargs sed -i.orig 's/Linus/Bill/g'

Bonus points for using tee, Steven!

The first part of the pipeline uses "grep -n" to look for the string we want to change. The "-n" outputs the line number of the match, and grep will automatically include the file name because we're grepping against multiple files. So the output that gets fed into tee looks like this:

testdir/foo.c:2:Linus
testdir/foo.c:3:wumpus Linus Linus Linus
testdir/foo.c:5:Linus wumpus Linux Linux Linus
testdir/foo.c:6:Linus
testdir/foo.c:8:Linus
testdir/bar.c:2:Linus
...

The tee command makes sure we save a copy of this output into the file lines-changed.txt, so that we have a record of the lines that were changed.

But tee also passes the output from grep along to the next part of the pipeline. Here we use cut to split out the file name, and uniq to make sure we only pass one copy of the file name along to our "xargs sed ..." command.
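
The cut and uniq stage can be tried on its own with a few lines of canned grep output. The "grep_output" and "files" variables are hypothetical, as is the sample text.

```shell
# Canned "grep -n" output in file:line:text form (made-up sample)
grep_output='testdir/foo.c:2:Linus
testdir/foo.c:3:wumpus Linus Linus Linus
testdir/bar.c:2:Linus'

# Field 1 (before the first colon) is the file name; uniq collapses adjacent repeats
files=$(printf '%s\n' "$grep_output" | cut -f1 -d: | uniq)

echo "$files"
```

Plain uniq is enough here because grep emits all of the matches for one file together. Note that a file name containing a colon would confuse the cut step.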

So Steven stumps the CLKF Masters with a sexy little one-liner. Awesome work!