Friday, April 10, 2009

Episode #21: Finding & Locating Files

Paul Writes In:

I'm one of those messy desktop people. There, I said it: I keep a messy desktop with tons of files all over the place (partly because when you do <4> in OS X to take a screen grab, it puts the file on the desktop). So it should come as no surprise that I often need help finding files. I don't know how many of you have actually run the find command in OS X (or even Linux), but it can be slow:

# time find / -name msfconsole
real 14m3.648s
user 0m17.783s
sys 2m29.870s

I actually stopped it at around 15 minutes because I couldn't wait that long. There are many factors in the performance equation of the above command, such as the overall speed of the system, how busy the system is when you execute the command, and some even say that find is slower if it's checking across different file system types ("/" would also include mounted USB drives). A quicker way to find files is to use the locate command:

$ locate msfconsole | grep -v .svn

This command reads from a database (which is generated on a regular basis) that consists of a listing of files on the system. It's MUCH faster:

$ time locate msfconsole
real 0m1.205s
user 0m0.298s
sys 0m0.050s
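
Going back to the slow find for a moment: if part of the pain really is find wandering onto mounted USB drives and other file systems, the -xdev option tells it to stay on the file system it started on. Here is a minimal sketch of that idea. It runs against a throwaway directory rather than "/", and the pentest/framework3 path is made up purely for illustration:

```shell
#!/bin/sh
# Build a small scratch tree so the example is safe to run anywhere.
# The directory layout below is hypothetical, not a real install path.
top=$(mktemp -d)
mkdir -p "$top/pentest/framework3"
touch "$top/pentest/framework3/msfconsole"

# -xdev: do not descend into directories that live on another file system.
# Against "/" this would skip mounted USB drives, network shares, and so on.
found=$(find "$top" -xdev -name msfconsole)
echo "$found"

# Clean up the scratch tree.
rm -rf "$top"
```

On a real system the equivalent would be "find / -xdev -name msfconsole". I have not timed it, so treat any speedup as something to measure on your own box.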

I'm wondering what Ed's going to do on Windows, unless he's come up with a way to get an animated ASCII search companion dog. :)

Hal Says:

One thing I will note about the locate command is that it does substring matching against the full path, whereas "find ... -name ..." does an exact match against the file name. To see the difference, check out the following two commands:

# find / -name vmware
# locate vmware
[... 5000+ addtl lines of output not shown ...]
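
If you want to see the two matching styles side by side without wading through thousands of lines, here is a small sketch that uses find for both. It assumes that "-path '*vmware*'" is a fair stand-in for locate's match-anywhere-in-the-path behavior, and the directory names are invented for the demo:

```shell
#!/bin/sh
# Scratch tree with one directory named exactly "vmware" and several
# paths that merely contain the string. All names are hypothetical.
tree=$(mktemp -d)
mkdir -p "$tree/etc/vmware" "$tree/usr/lib/vmware-tools/bin"
touch "$tree/usr/lib/vmware-tools/bin/guestd"

# -name compares the pattern to the last path component only.
exact=$(find "$tree" -name vmware | wc -l | tr -d ' ')

# -path compares the pattern to the whole path, so anything under a
# matching directory is reported too, much like locate's output.
anywhere=$(find "$tree" -path '*vmware*' | wc -l | tr -d ' ')

echo "exact name matches: $exact"
echo "matches anywhere in the path: $anywhere"

rm -rf "$tree"
```

The first count covers only etc/vmware, while the second also picks up vmware-tools and everything beneath it, which is why the locate output above runs so long.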

Also, as Paul notes above, the database used by the locate command is updated regularly via cron. The program that builds the database is updatedb, and you can run this by hand if you want to index and search your current file system image, not the image from last night.

I was curious whether doing a find from the root was faster than running updatedb followed by locate. Note that before running the timing tests below, I did a "find / -name vmware" to force everything into the file cache on my machine. Then I ran:

# time find / -name vmware >/dev/null
real 0m1.223s
user 0m0.512s
sys 0m0.684s

# time updatedb
real 0m0.263s
user 0m0.128s
sys 0m0.132s

# time locate vmware >/dev/null
real 0m0.314s
user 0m0.292s
sys 0m0.016s

It's interesting to me that updatedb+locate is about twice as fast as doing the find (roughly 0.58 seconds combined versus 1.22 seconds). I guess this shouldn't really be that surprising, since find is going to end up calling stat(2) on every file whereas updatedb just has to collect file names.
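
To make the "build an index once, search it many times" idea concrete, here is a toy version of the updatedb/locate split built from find and grep. This is only a sketch of the concept and not how the real updatedb stores its database; the tree and file names are made up:

```shell
#!/bin/sh
# Scratch tree to index. All names are hypothetical.
root=$(mktemp -d)
mkdir -p "$root/etc/vmware" "$root/usr/lib/vmware-tools"
touch "$root/usr/lib/vmware-tools/README"

# "updatedb" step: walk the tree once and save every path to a flat file.
index=$(mktemp)
find "$root" -print > "$index"

# "locate" step: search the saved list instead of walking the disk again.
matches=$(grep -c vmware "$index")
echo "paths containing vmware: $matches"

rm -rf "$root" "$index"
```

Every later search only has to read one file, which is the same trade locate makes: fast lookups in exchange for results that are only as fresh as the last index run.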

Ed Kicks in Some Windows Stuff:

In Windows, the dir command is often used to search for files with a given name. There are a variety of ways to do this. One of the most obvious but less efficient ways involves running dir recursively (/s) and scraping through its results with the find or findstr command to look for what we want. I'll use the findstr command here, because it gives us more extensibility if we want to match on regex:

C:\> dir /b /s c:\ | findstr /i vmware

There are a couple of things here that may not be intuitive. First off, what's with the /b? This indicates that we want the bare form of output, which will omit the extra stuff dir adds to a directory listing, including the volume name, number of files in a directory, free bytes, etc. But, when used with the /s option to recurse subdirectories, /b takes on an additional meaning. It tells dir to show full paths to files, which is what we really want to see to know the file's location. Try running the command without /b, and you'll see that the full paths are gone. Oh, and the /i makes findstr case insensitive.

But, you know, dumping all of the directory and file names on standard out and then scraping through them with findstr is incredibly inefficient. There is a better way, more analogous to the "find / -name" feature Paul and Hal use above:

C:\> dir /b /s c:\*vmware*

This command seems to imply that it will simply look inside of the c:\ directory itself for vmware, doesn't it? But, it will actually recurse that directory looking for matching names because of the /s. And, when it finds one, it will then display its full path because of the /b. I put *vmware* here to make this look for any file that has the string vmware in its name so that its functionality matches what we had earlier. If you omit the *'s, you'll only see files and directories whose name exactly matches vmware. This approach is significantly faster than piping things through the findstr command. Also note that it is automatically case insensitive, because, well, that's the way that dir rolls.

How much faster? I'm going to use Cygwin so I can get the time command for comparison. The $ prompt you see below is from Cygwin running on my XP box:

$ time cmd.exe /c "dir /s /b C:\ | findstr /i vmware > nul"

real 0m10.672s
user 0m0.015s
sys 0m0.015s

Now, let's try the other approach:

$ time cmd.exe /c "dir /s /b C:\*vmware* > nul"
real 0m6.484s
user 0m0.015s
sys 0m0.031s

It takes about 60% of the time (roughly 6.5 seconds versus 10.7) doing it this more efficient way. Oh, and note how I'm using the Cygwin time command here. I use time to invoke a cmd.exe with the /c option, which will make cmd.exe run a command for me and then go away when the command is done. Cygwin's time command will then show me how long the command took. I use time to invoke a cmd.exe /c rather than directly invoking a dir so that I can rely on the dir command built into cmd.exe instead of running the dir command included in Cygwin.

OK... so we have a more efficient way of finding files than simply scraping through standard output of dir. But, what about an analogous construct to the locate command that Hal and Paul talk about above? Well, Windows 2000 and later include the Indexing Service, designed to make searching for files more efficient by creating an index. You can start this service at the command line by running:

C:\> sc start cisvc

Windows will then dutifully index your hard drive, making searches faster. What kind of searches? Well, let's see what it does for our searches using dir:

$ time cmd.exe /c "dir /s /b C:\*vmware* > nul"
real 0m6.312s
user 0m0.015s
sys 0m0.046s

Uh-oh... The Windows indexing service doesn't help the dir command, whether used this way or in combination with the find command. Sorry, but dir doesn't consult the index, and instead just looks through the complete file system directory every time. But, the indexing service does improve the performance of the Start-->Search GUI-based search. You can control which directories are included in the index via a GUI tool that can be accessed by running:

C:\> ciadv.msc

Also, in that GUI, if you select System-->Query the Catalog, you get a nice GUI form for entering a query that relies on the indexing service. I haven't found a built-in cmd.exe feature for searching directories faster using the indexing service, but there is an API for writing your own tools in VBS or other languages for querying the index. Microsoft describes that API and the indexing service in more detail here.