Friday, March 13, 2009

Episode #10 - Finding Names of Files Matching a String

Hal Says:

This is one of my favorite questions to ask job interviewees, so pay attention!

Question: How would you list the names of all files under a given directory that match a particular string? For example, list all files under /usr/include that reference the sockaddr_in structure.

Most interviewees' first approximations look like this:
$ find /usr/include -type f -exec grep sockaddr_in {} \;

The only problem is that this gives you the matching lines, but not the file names. So part of the trick is either (a) asking me if it's OK to look at the grep manual page or help text (which is really the response I'm looking for), or (b) just happening to know that "grep -l" lists the file names and not the matching lines:
$ find /usr/include -type f -exec grep -l sockaddr_in {} \;

The folks who really interest me, however, are the ones who also strike up a conversation about using xargs to be more efficient:
$ find /usr/include -type f | xargs grep -l sockaddr_in

How much faster is the xargs approach? Let's use the shell's built-in benchmarker and see:
$ time find /usr/include -type f -exec grep -l sockaddr_in {} \; >/dev/null

real 0m12.734s
user 0m2.097s
sys 0m10.713s
$ time find /usr/include -type f | xargs grep -l sockaddr_in >/dev/null

real 0m0.410s
user 0m0.108s
sys 0m0.344s

You really, really want to use "find ... | xargs ..." instead of "find ... -exec ..."
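
One caveat worth raising in the same conversation: a plain "find ... | xargs ..." splits its input on whitespace, so file names containing spaces get broken apart. The sketch below shows two common workarounds on a throwaway directory. The scratch directory and file names are made up for illustration, and -print0 / -0 are assumed to be available (they are in GNU and BSD find and xargs). The second form, "-exec ... {} +", batches many names into each grep call, much as xargs does.

```shell
# Build a small scratch tree (hypothetical names) with an awkward file name.
demo=$(mktemp -d)
printf 'struct sockaddr_in sin;\n' > "$demo/has space.h"
printf 'int unrelated;\n' > "$demo/other.h"

# NUL-delimited hand-off: safe for spaces and newlines in file names.
# Assumes a find/xargs pair that supports -print0 and -0.
found_xargs=$(find "$demo" -type f -print0 | xargs -0 grep -l sockaddr_in)

# Alternative: "-exec ... {} +" passes many names to each grep invocation.
found_exec=$(find "$demo" -type f -exec grep -l sockaddr_in {} +)

printf '%s\n' "$found_xargs" "$found_exec"
rm -rf "$demo"
```

Both commands should report the same single file, spaces and all.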

Paul Says:

That's an awesome tip! I immediately put this to good use with Metasploit. One of the requests we most often get from students using Metasploit is a way to find the exploit for a particular vulnerability. Metasploit has a built-in search feature, but grep is far more powerful and comprehensive. Since all of the modules and exploits within Metasploit are just Ruby files, you can use the method above to seek out functionality in Metasploit:

$ find ./modules/ -type f | xargs grep -li 'ms08' | grep -v ".svn"

The above command will find all modules that contain references to "ms08", indicating an exploit for a vulnerability covered by a 2008 Microsoft security bulletin.
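
A possible refinement, sketched here on a scratch tree rather than a real Metasploit checkout: instead of filtering the .svn paths out of the results afterwards, find can be told not to descend into those directories in the first place with -prune. The directory layout and file contents below are invented for the demonstration.

```shell
# Scratch tree standing in for a Metasploit checkout (hypothetical file names).
tree=$(mktemp -d)
mkdir -p "$tree/modules/exploits" "$tree/modules/.svn"
printf "'Name' => 'MS08_067 demo'\n" > "$tree/modules/exploits/demo.rb"
printf "'Name' => 'MS08_067 demo'\n" > "$tree/modules/.svn/demo.rb.svn-base"
printf "'Name' => 'unrelated'\n" > "$tree/modules/exploits/other.rb"

# -prune stops find from descending into .svn, so grep never reads those copies.
matches=$(cd "$tree" && find ./modules -name .svn -prune -o -type f -print | xargs grep -li 'ms08')

echo "$matches"
rm -rf "$tree"
```

Only the module under exploits/ should be listed; the copy under .svn is never handed to grep.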

Ed throws in his two cents:

On Windows, we have two string search tools: find and findstr. The latter has many more options (including the ability to do regex). We can use it to answer Hal's interview question with the /m option to print only the file name. Why /m? I guess because "Name" has an "m" in it, and /n was already taken to tell findstr to print line numbers.

So, the result is:
C:\> findstr /d:[directory] /m [string] [files]

The [files] lets you specify what kind of files you want to look in, such as *.ini or *.txt. To look in any kind of file, just specify a *. Also, to make it recurse the directory you specify, add the /s option.

How about an example? Suppose you want to look in C:\windows and its subdirectories for all files that contain the string "mp3". You could run:

C:\> findstr /s /d:c:\windows /m mp3 *

Another useful feature of findstr is its ability to find files that contain only printable ASCII characters using the /p flag. That is, any file with unprintable, high-end ASCII sequences will be omitted from the output, letting you focus on things like txt, inf, and related simple text files often associated with configuration:
C:\> findstr /s /p /d:c:\windows /m mp3 *

Be careful with the /p, however. You may be telling findstr to leave out a file that is important to you simply because it has one high-end ASCII sequence somewhere in the file.
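
For readers on the Unix side, a rough analogue of /p is grep's -I option, which makes grep treat files it considers binary as non-matching. This is only a sketch under assumptions: -I is available in GNU and BSD grep, its notion of "binary" is not identical to findstr's, and the scratch files below are invented for the demonstration.

```shell
# Scratch directory with one text file and one file containing NUL bytes.
scratch=$(mktemp -d)
printf 'playlist=song.mp3\n' > "$scratch/settings.ini"
printf 'mp3\000\001\002binary\n' > "$scratch/blob.bin"

# -I skips files grep judges to be binary; -l lists names only.
text_only=$(find "$scratch" -type f -exec grep -lI mp3 {} +)

# Without -I, the binary file is reported as well.
everything=$(find "$scratch" -type f -exec grep -l mp3 {} + | sort)

printf 'text only:\n%s\nall matches:\n%s\n' "$text_only" "$everything"
rm -rf "$scratch"
```

The same warning applies here: a single stray NUL byte is enough for grep to skip a file you may care about.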

Also, thanks, Hal. I was already lusting after xargs, -exec, ``, head, tail, awk, sed, and watch, and now I really want a real "time" command in Windows. And, no, I'm not talking about the goofy built-in Windows time command that shows you the time of day. I'm talking about seeing how long it took another command to run. Thank goodness for Cygwin!