Tuesday, February 9, 2010

Episode #81: From the Mailbag

Hal checks out the mail

We love getting email from readers of the blog. And we love getting cool shell hacks from readers even more. Recently, loyal reader Rahul Sen sent along this tasty little bit of shell fu:

How to search for certain text string in a directory and all its subdirectories, but only in files of type text, ascii, script etc:

$ grep 9898 `find /usr/local/tripwire -type f -print | xargs file |
egrep -i 'script|ascii|text' | awk -F":" '{print $1}'`

/usr/local/tripwire/te/agent/data/config/agent.properties:tw.local.port=9898
/usr/local/tripwire/te/agent/data/config/agent.properties:tw.server.port=9898


That's totally cool, Rahul!

Honestly, when I first looked at this I thought, "There's got to be a shorter way to do this." But the tricky part is the "only in files of type text, ascii, script, etc" requirement. This basically forces you to do a pass through the entire directory first in order to locate the relevant file types. Thus the complicated pipeline to pass everything through the file command and egrep.

A few minor improvements I might suggest:

  1. I'm worried that for a large directory you'll end up returning enough file names that you exceed the built-in argument list limits in the shell. So it might be better to use xargs again rather than backticks.

  2. I probably would have chosen to use sed at the end of the pipeline rather than awk, just to be more terse.

  3. You don't actually need "-print" on modern versions of find-- it's now the default. Only old-timers like me and Rahul end up doing "-print" all the time because we were trained to do so by old versions of find.


So my revised version would look like:

$ find /usr/local/tripwire -type f | xargs file |
egrep -i 'script|ascii|text' | sed 's/:.*//' | xargs grep 9898

Mmmm-hmmm! That's some tasty fu! Let's see what my Windows brethren have cooking...

Ed Unfurls:
This is a helpful little technique. Now, unfortunately at the Windows command line (man, if I only had a dime ever time I said that phrase), we do not have the "file" command to discern the type of file. But, fear not! We do have a couple of alternative methods.

For a first option, we could use a nifty feature of the findstr command to ignore files that have non-printable characters. When run with the /p option, findstr will ignore any files that contain high-end ASCII sequences, letting us skip over EXEs, DLLs, and other stuff. It's not as fine a grained scalpel as scraping the output of the the Linux file command for script, ascii, and text, but it'll serve us well as follows:

C:\> findstr /s /p 9898 *

Here, I'm using findstr to recurse the file system (/s) from wherever my current working directory is, skipping files with non-printable characters (/p), looking for the string 9898 in any file (*). If you want to get even closer to the original, we can specify a directory where we want to start the search using the /d: option as follows:

C:\> findstr /d:C:\windows /s /p 9898 *

Now, for our second option, there is another way to refine our search besides the /p option of findstr, getting us a little closer to the file types Rahul specified in Linux using the find command. It turns out that Microsoft actually put an indication of each file's type in the name of the file itself. You see, by convention, Windows file names have a dot followed by three letters that indicate the file type. Who knew?!?! :)

To map the desired functionality to Windows, we'll rely on file name suffixes to look inside of .bat, .cmd, .vbs, and .ps1 files (various scripts), .ini files (which often contain config info), and .txt files (which, uh... you know). What's more, many commands associated with searching files allow us to specify multiple file names, with wild cards, such as the dir command in this example:

C:\> dir *.bat *.cmd *.vbs *.ps1 *.ini *.txt

And, happy to say, dir isn't the only one that lets us look for multiple file names with wildcards. For my second solution to this challenge, I'm going to use a FOR /R loop. These loops recurse through a directory structure (/R, doncha know) setting an iterator variable to the name of each file that is encountered. Thus, we can use the following command as a rough equivalent to Rahul's Linux fu:

C:\> FOR /R C:\ %i in (*.bat *.cmd *.vbs *.ps1 *.ini *.txt) do @findstr 9898 "%i" && echo %i

Here, I'm running through all files found under C:\ and its subdirectories, looking inside of any file that has a suffix of .bat, .cmd, etc, running findstr on each file (whose name is stored in %i, which has to be surrounded in quotes for those cases when the value has one or more spaces in the file name) looking for 9898. And, if I successfully find a match, I echo out the file's name. Now, this output looks a little weird, because the file's name comes after each line that contains the string. But, that is a more efficient way to do the search. Otherwise, I'd have to introduce unnecessary complexity by using a variable and parsing to store the line of the file and print its name first, then print the contents. I'd certainly do that for prettiness in a script. But, at the command line by itself, I'd eschew the complexity and just go with what I've shown above to get the job done.

Now, there's a gazillion other ways to do this as well. For a third possibility, we could take the first option above (findstr) and use the multiple file suffix specification of option 2 (*.bat *.cmd *.vbs *.ps1 *.ini *.txt) to come up with:

C:\> findstr /d:C:\windows /s 9898 *.bat *.cmd *.vbs *.ps1 *.ini *.txt

I actually like this third approach best, because it's relatively easy to type, makes a bunch of sense, has nicer-looking output than the FOR /R option, and has better performance.

Fun, fun, fun! Thanks for the great suggestion, Rahul.

Whatcha got for us, Tim?


Tim delivers:

Sadly, PowerShell is missing the same "file" command as the standard Windows command line. Also, there isn't a PowerShell cmdlet similar to "findstr /p" either, but of course we could use FindStr since all the Windows commands are available in PowerShell. Ed already covered FindStr so we will just use just PowerShell cmdlets.

If we know that the files in question are in a specific directory, not subdirectories, there is a pretty simple command to find the files using the Select-String cmdlet.

PS C:\> Select-String 9898 -Path *.bat,*.cmd,*.vbs,*.ps1,*.ini,*.txt
-List | Select Path


Path
----
C:\temp\a.txt


According to the documentation "the Select-String cmdlet searches for text and text patterns in input strings and files. You can use it like Grep in UNIX and Findstr in Windows." Whoah, big difference there! Grep has much more robust regular expression support when compared to FindStr, and yes, PowerShell does give us rich regular expressions.

Back to the task at hand. We only care if the file contains the text in question, not how many times the text is found in the file. The List parameter is used as a time saver, since it will stop searching after it finds the first match in a file.

For each match, the default console output displays the file name, line number, and all text in the line containing the match. Of course the output is an object and we just want the file's path, so the results are piped into Select-Object (alias select) in order to get the full file path.

But we want to search though subdirectories too. To do that we have to use Get-ChildItem (alias dir, gci, ls).

PS C:\> Get-ChildItem -Include *.bat,*.cmd,*.vbs,*.ps1,*.ini,*.txt
-Recurse | Select-String 9898 -List | Select-Object path


Path
----
C:\temp\subfolder\b.txt
C:\temp\a.txt


The Recurse parameter specifies that the search should recursively search through subdirectories. The Include parameter retrieves only the files that match our filter. We could use the Exclude parameter if we wanted to search all files that aren't exe's or dll's.

PS C:\> Get-ChildItem -Exclude *.exe,*.dll -Recurse |
Select-String 9898 -List | Select-Object path


The command can be shortened even further since the full parameter name doesn't have to be used. As long the shortened parameter name isn't ambiguous a short version of the parameter name can be used. Since there is no other parameters that start with R we or I this command will work as well:

PS C:\> ls -i *.bat,*.cmd,*.vbs,*.ps1,*.ini,*.txt -r |
select 9898 -L | select path


There is a catch if you are using version 1 of PowerShell, the Select-String cmdlet natively doesn't take the input from Get-ChildItem and use it to specify the path. We have to use a For-EachObject loop (alias %) in order to accomplish the same task.

PS C:\> Get-ChildItem -Include *.bat,*.cmd,*.vbs,*.ps1,*.ini,*.txt
-Recurse | % { Select-String 9898 -List -Path $_.FullName } |
Select-Object path


A side note:
As you probably already know, Windows XP, Vista, and 2003 don't come with Powershell and require a separate install, but for the love of Pete, install it. Version 2 has been available for quite a while and there are many enhancements over v1 (and even more when compared to cmd). Windows 2008 R2 and Windows 7 come with v2 by default.

PowerShell v1 in Windows 2008 (R1) is an optional feature that needs to be enabled. It can be installed using the Windows command line by running this command:

C:\> ServerManagerCmd.exe -install PowerShell


This is probably the most useful Windows command available (sorry Ed).

Seth Matheson (Our Brand-Spankin'-New Mac OS X Go-To Guy) Interjects:

While it's great to have 3 or 4 commands to string together to get the output you want (heck, this is part of the joy of the power of 'nix, the freedom to do almost anything because you're not restricted to a specific command), sometimes it's nice to have one command that will do it for you, out of the box.

Enter Spotlight on the Mac. Spotlight is Apple's built in OS-level search that functions on meta tags and even though everyone rags on them for the GUI, Apple usually puts in a command-line function or two for our trouble.

$ mdfind evil

The above command will search file names and file contents for the string "evil". Simple enough right?

$ mdfind evil -onlyin /Users/ed

Still pretty simple, this will limit the search for "evil" in Ed's home directory. I'm sure we won't find anything, right Ed?

So back to the original challenge, what good would this be unless it can search for file type? Guess what, you can!

$ mdfind "evil kind:text" -onlyin /Users/ed

This will find all text files in Ed's home directory with the string "evil". Wonder what that'll come up with...

You can use kind for other things too, Applications, Contacts, Folders, etc.

This is only a very simple Spotlight search. You can go much further. Everything in OS X (10.4 and above) has meta tags that detail all sorts of interesting things about the files. Run the following to find out just what kind of data that Spotlight sees:

$ mdls "/Users/ed/Desktop/March CLK Fu.txt"

kMDItemContentCreationDate = 2010-02-09 17:54:00 -0500
kMDItemContentModificationDate = 2010-02-09 17:54:00 -0500
kMDItemContentType = "public.plain-text"
kMDItemContentTypeTree = (
"public.plain-text",
"public.text",
"public.data",
"public.item",
"public.content"
)
kMDItemDisplayName = "March CLK Fu.txt"
kMDItemFSContentChangeDate = 2010-02-09 17:54:00 -0500
kMDItemFSCreationDate = 2010-02-09 17:54:00 -0500
kMDItemFSCreatorCode = ""
kMDItemFSFinderFlags = 0
kMDItemFSHasCustomIcon = 0
kMDItemFSInvisible = 0
kMDItemFSIsExtensionHidden = 0
kMDItemFSIsStationery = 0
kMDItemFSLabel = 0
kMDItemFSName = "March CLK Fu.txt"
kMDItemFSNodeCount = 0
kMDItemFSOwnerGroupID = 501
kMDItemFSOwnerUserID = 501
kMDItemFSSize = 3
kMDItemFSTypeCode = ""
kMDItemKind = "Plain text"
kMDItemLastUsedDate = 2010-02-09 17:54:00 -0500
kMDItemUsedDates = (
2010-02-09 00:00:00 -0500
)

All fun things you can search by! For example:

$ mdfind "kMDItemFSOwnerGroupID == '501'"

Will get everything owned by UID 501 (usually the first created user on the system).

Note: It should be mentioned that most of the 'nix command line fu on this site will work on a Mac, it is after all BSD under the hood. That being said, sometimes we can save some time with the built in utilities, like Spotlight.