Tuesday, December 14, 2010

Episode #125: Find Yourself

Tim takes credit for someone else's work:

One of our faithful readers, John, wrote in. Well, we presume he is faithful to us, but we've heard he cheats on us with other blogs, and that's the worst kind of cheating. Since we are short of other ideas I guess we'll have to use his email.

Seriously though, John Ahearne has a nice bit of fu. On one particular assignment, John had carved over 1,200,000 files, where there were over 1,000 per directory. The files were named based on a particular file header in a proprietary file format. The client asked him to look for several files and gave him a text file with the file names. He started with this command to search for his files:

C:\> findstr /s /g:filestofind.txt *

He used the /s option to do a recursive search and the /g option to load the search strings from a file. But there was a problem: it was slow. The reason is that this command searches inside every file, and we just want to search for the file name. He then tried another command to see if that would work more quickly.

C:\> dir /b /s | findstr /g:filestofind.txt

This is much quicker, and it searches what we actually want! How would we do the same thing in PowerShell?

PS C:\> ls -r -i (cat .\filestofind.txt)

Directory: C:\Windows\System32

Mode LastWriteTime Length Name
---- ------------- ------ ----
-a--- 7/13/2009 8:14 PM 301568 cmd.exe

Directory: C:\Windows\winsxs\x86_microsoft-windows-commandpro...

Mode LastWriteTime Length Name
---- ------------- ------ ----
-a--- 7/13/2009 8:14 PM 301568 cmd.exe

We use Get-ChildItem (alias ls) with the Recurse parameter (r for short). We also use the Include parameter (i for short) to find items whose names match our search strings, which are read from the file via Get-Content (alias cat). One other thing: notice a difference between the output of the two commands?

The PowerShell command shows only files named exactly "cmd.exe", while the cmd.exe command returns any file whose path contains "cmd.exe". The difference is due to the way each command expects the search strings to be presented. Here is a little chart describing how to get similar searches from each command:

Search Type                PowerShell    cmd.exe
Name is exactly cmd.exe    cmd.exe       ^cmd.exe$
Name contains cmd.exe      *cmd.exe*     cmd.exe
Name ends with cmd.exe     *cmd.exe      cmd.exe$

Note, in the second case our cmd command will return any file with cmd.exe in the path, so one of the other options might be a better choice.

We can get the same results with each command. Obviously, when searching 1,200,000 files we want to use the faster command. Let's do a little test to see which is faster. We'll use search strings that return identical results. More specifically, we'll use the search string that exactly matches a file named cmd.exe. Before each search I modified the file filestofind.txt accordingly. Now how do we measure the duration of each command?

PowerShell has the measure-command cmdlet, but cmd.exe does not have a way to measure time. However, Ed used a cool method in episode #49 that I'll borrow.

PS C:\> measure-command { ls -r C:\Windows -Include (cat .\filestofind.txt) } | Select TotalSeconds


C:\> cmd.exe /v:on /c "echo !time! & (dir C:\Windows /s /b | findstr /g:filestofind.txt > NUL) & echo !time!"

Cmd.exe took just 16.36 seconds, which is 3.4 times faster than PowerShell's 55.7 seconds. Wow! The cmd.exe command is obviously the command we are going to use.

After John found the files, he needed a way to copy the files to a location of his choosing. Here is the cool little command he came up with:

C:\> dir /b /s | findstr /g:filestofind.txt > d:\foundthem.txt &
FOR /F %i in (d:\foundthem.txt) do copy %i d:\neededfiles\

This takes the output from our search and dumps it into foundthem.txt. We then use a For loop to read the contents of the file and copy each file to the neededfiles directory.

Well done John.

I have to say thanks to John, since he came up with the idea and wrote the commands, making my life much easier. I wonder if Hal has found anyone to write his episode for him?

Hal stands alone

Geez, Tim, since John did all your work for you I was sort of hoping that you'd write the Unix bit this week. Some friend and co-author you are!

The Unix equivalent of what John's trying to do would be something like this:

# find /etc -type f | grep -f filestofind.txt

Here I'm using find to output a list of all regular files ("-type f") under /etc. Then I pipe that output into a grep command and use the "-f" option to tell grep to read a list of patterns from a text file. In this case my patterns were things like "passwd", "shadow", "group", and so on, which actually match a surprisingly large number of files under /etc.
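One caveat with this pipeline: grep sees the whole path, so a pattern like "passwd" also matches anything sitting under a directory whose name contains "passwd". If you only want exact file names, one option is to compare just the last path component. Here's a minimal sketch using awk. The helper name find_by_name and the little demo tree are made up for illustration and are not part of the original commands.

```shell
# Sketch: match on the file name only, not on the whole path.
# find_by_name is a hypothetical helper, not from the original post.
find_by_name() {
    # $1 = directory to search, $2 = file with one exact file name per line
    find "$1" -type f | awk -F/ '
        NR == FNR { want[$0]; next }   # first input: load the wanted names
        $NF in want                    # second input: print paths whose last component is wanted
    ' "$2" -
}

# Small demo tree (all names here are invented for the example)
demo=$(mktemp -d)
mkdir -p "$demo/etc/passwd.d"
touch "$demo/etc/passwd" "$demo/etc/passwd.d/extra" "$demo/etc/group"
printf 'passwd\ngroup\n' > "$demo/names.txt"

# Lists the paths ending in /group and /passwd, but not passwd.d/extra
find_by_name "$demo/etc" "$demo/names.txt" | sort
```

The trade-off is that this only does exact name matches. If you need substrings or regular expressions, the grep version is still the way to go.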

Since we're talking about performance improvements here, it's worth noting that if you're searching for fixed strings rather than regular expressions, then using fgrep is going to be faster:

# time find /etc -type f | grep -f filestofind.txt >/dev/null

real 0m0.052s
user 0m0.030s
sys 0m0.020s
# time find /etc -type f | fgrep -f filestofind.txt >/dev/null

real 0m0.026s
user 0m0.010s
sys 0m0.010s

Now /etc is a fairly small directory-- we'd probably get more meaningful numbers if we tried running this on a larger portion of the file system. And we should probably run multiple trials and average the results. But at least in this case you can see that fgrep is twice as fast as grep.
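For the multiple-trials idea, something like the following rough sketch might do. It assumes bash (it relies on the "time" keyword and the TIMEFORMAT variable), it spells fgrep as "grep -F", which is the equivalent form, and the helper names time_runs, search_grep, and search_fixed are invented for this example. The patterns are the same sort of strings mentioned earlier.

```shell
# Assumes bash: "time" is used as a shell keyword and TIMEFORMAT is bash-specific.
patterns=$(mktemp)
printf 'passwd\nshadow\ngroup\n' > "$patterns"

# Hypothetical wrappers so each pipeline can be passed around as one command
search_grep()  { find /etc -type f 2>/dev/null | grep    -f "$patterns"; }
search_fixed() { find /etc -type f 2>/dev/null | grep -F -f "$patterns"; }

time_runs() {
    # $1 = number of runs, remaining arguments = command to repeat
    local n=$1 i=0
    shift
    local TIMEFORMAT="$n runs of $1: %R seconds elapsed"
    time while [ "$i" -lt "$n" ]; do
        "$@" >/dev/null
        i=$((i + 1))
    done
}

time_runs 20 search_grep
time_runs 20 search_fixed
```

Dividing each total by the number of runs gives a per-run average that is less sensitive to one-off noise such as a cold disk cache on the first pass.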

John's actual challenge is to copy the matching files into another directory. We can use the cpio trick from Episode #115 to actually copy the files:

# find /etc -type f | fgrep -f filestofind.txt | cpio -pd /root/saved
39 blocks

"cpio -p" is the "pass through" option that reads file/directory names from the standard input and copies them from their current location to the directory named as the last argument (/root/saved here). The "-d" option tells cpio to create directories as needed, so you don't even have to create the target directory-- if it doesn't exist, cpio will create it for you.

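If cpio isn't available on a given box, a rough stand-in can be put together from mkdir and cp. This is only a sketch of the same idea (recreate each file's directory path under the target, then copy the file), not a full cpio replacement. The function name copy_preserving_paths and the demo tree are made up for illustration, and it assumes file names without embedded newlines.

```shell
# copy_preserving_paths is a hypothetical helper, not from the original post.
copy_preserving_paths() {
    # $1 = destination directory; file names are read from stdin, one per line
    dest=$1
    while IFS= read -r f; do
        # Recreate the file's parent directory under the destination
        mkdir -p "$dest/$(dirname "$f")" || return 1
        # -p keeps timestamps and permissions where possible
        cp -p "$f" "$dest/$f" || return 1
    done
}

# Small demo tree (all names here are invented for the example)
demo=$(mktemp -d)
mkdir -p "$demo/src/a" "$demo/src/b"
echo one   > "$demo/src/a/passwd"
echo two   > "$demo/src/b/group"
echo three > "$demo/src/b/other"
printf 'passwd\ngroup\n' > "$demo/names.txt"

# Same shape as the cpio pipeline: find, filter, copy
(cd "$demo" && find src -type f | grep -F -f names.txt | copy_preserving_paths saved)
```

After this runs, the matching files end up under saved/src/a and saved/src/b inside the demo directory, while src/b/other is left alone.
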
So this one really wasn't that difficult. Tim may need our readers to help him, but us Unix folks can get it done on our own.