Tuesday, February 22, 2011

Episode #135: His name is my name too

Tim takes it easy this week

This week Cory Williams writes in:

I would like to have a script that will find files with the same file name prefix but different extension. For instance, let's say within \Windows\System32 there are two files, FileName1.EXE and FileName1.DLL. One given is that it will always be an .EXE and a .DLL in combination. However, I will not know the prefix names. Is there a script that can find such files?

One of our basic rules of this site is no scripts. We have pushed the line a few times, but if we do it again the universe as we know it may implode. And yes, this blog is that important to the universe. Now back to the very important task at hand.

This task is quite simple in PowerShell. File and Directory objects both contain a BaseName property that contains the name minus the extension.

PS C:\> ls myfile.txt | select name,basename | ft -a

Name BaseName
---- --------
myfile.txt myfile


We can use this property with the Group-Object cmdlet to find files with matching basenames. The results can be piped into the Where-Object cmdlet (alias ?) to filter for groups with more than one object.

PS C:\Windows\system32> ls | group basename | ? { $_.Count -gt 1 }

Count Name Group
----- ---- -----
2 Boot {Boot, boot.sdi}
2 config {config, config.nt}
2 Dism {Dism, Dism.exe}
2 DRVSTORE {DRVSTORE, drvstore.dll}
2 ias {ias, ias.dll}
2 migwiz {migwiz, migwiz.lnk}
2 Msdtc {Msdtc, msdtc.exe}
...


If you don't want to include directories in the mix then we can filter them out before the grouping.

PS C:\Windows\system32> ls | ? { -not $_.PSIsContainer } |
group basename | ? { $_.Count -gt 1 }


Count Name Group
----- ---- -----
2 activeds {activeds.dll, activeds.tlb}
2 certmgr {certmgr.dll, certmgr.msc}
3 cliconfg {cliconfg.dll, cliconfg.exe, cliconfg.rll}
2 DELS3ci {DELS3ci.dll, DELS3ci.exe}
2 DELS3L3 {DELS3L3.DLL, DELS3L3.SMT}
2 diskcopy {diskcopy.com, diskcopy.dll}
2 dssec {dssec.dat, dssec.dll}
...


Pretty simple. Now let's see what Hal's got in store for us.

Hal takes it where he can get it:

Of course Tim's solution makes me immediately think of the Unix basename command, and ultimately how frustrating it is. Frustrating because, unlike most Unix commands, basename won't read input from stdin. You can only call basename on individual strings fed in on the command line. And even then you can only specify a single file name extension to use when reducing the string-- e.g. "basename hosts.deny .deny".
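
To make that limitation concrete, here is a rough sketch of what it takes to push a list of names from stdin through basename: a loop that calls it once per line. The helper name strip_suffix is made up for illustration and is not a standard command.

```shell
# strip_suffix SUFFIX: read names on stdin, print each with SUFFIX removed.
# (strip_suffix is a hypothetical helper name, not a standard tool.)
strip_suffix() {
    while IFS= read -r name; do
        # basename only strips the one suffix it is given
        basename "$name" "$1"
    done
}

# Only the name ending in .deny is shortened:
printf '%s\n' hosts.deny hosts.allow | strip_suffix .deny
# prints:
#   hosts
#   hosts.allow
```

One process per file name, and still only one extension handled, which is why it is not much help for this problem.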

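But instead of moaning about what we don't have, let's look at what we do have. sed will allow us to read in a list of file names on stdin and chop off the extension. Note that the expression is unanchored, so for a name with several dots it removes the first dotted chunk rather than the last: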

$ ls | sed 's/\.[^.]*//'
a2ps
a2ps-site
acpi
adjtime
alchemist
aliases
aliases
...
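
If you want to see exactly what that expression does to names with more than one dot, a quick comparison helps. This is just a sketch on made-up input lines; the second form, anchored with "$", is a variation and not the command used in this episode.

```shell
# Unanchored: removes the first ".something" chunk in each name.
printf '%s\n' hosts.deny ntp.conf.rpmnew | sed 's/\.[^.]*//'
# prints:
#   hosts
#   ntp.rpmnew

# Anchored with $: removes only the final extension.
printf '%s\n' hosts.deny ntp.conf.rpmnew | sed 's/\.[^.]*$//'
# prints:
#   hosts
#   ntp.conf
```

For names with a single dot the two forms give the same result.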

Now all we have to do is list the prefixes that are duplicated. A little bit of tweaking to our sed expression, some uniq action, and command substitution get us where we need to be:

$ ls -d $(ls | sed 's/\.[^.]*/.*/' | uniq -d)
ant.conf cron.weekly issue.rpmnew prelink.conf
ant.d csh.cshrc logrotate.conf prelink.conf.d
auto.master csh.login logrotate.d rc.d
auto.misc dnsmasq.conf modprobe.conf rc.local
auto.net dnsmasq.d modprobe.d rc.news
auto.smb dovecot.conf ntp.conf rc.sysinit
cron.d dovecot.conf-dist ntp.conf-dist xinetd.conf
cron.daily dovecot.conf.rpmnew ntp.conf.rpmnew xinetd.d
cron.deny hosts.allow php.d
cron.hourly hosts.deny php.ini
cron.monthly issue.net prelink.cache

In the pipeline inside the "$(...)" construct I'm using sed to convert the output of ls from "<filename>.<ext>" to "<filename>.*". The "uniq -d" command gives me only the base <filename>.* patterns that are listed more than once. Then we use these wildcards as command-line arguments in the outer ls command, and we get the list of matching files.
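
Here is the same idea as a small sketch you can run safely in a scratch directory. The function name dup_prefix_patterns and the sample file names are invented for the demo; the pipeline inside it is the one from the command above.

```shell
# Print one "<prefix>.*" glob for every prefix that shows up more than once
# in the current directory (dup_prefix_patterns is a made-up helper name).
dup_prefix_patterns() {
    ls | sed 's/\.[^.]*/.*/' | uniq -d
}

# Try it in a throwaway directory so nothing real is touched.
demo=$(mktemp -d)
(
    cd "$demo" || exit 1
    touch cron.allow cron.deny hosts.allow motd
    dup_prefix_patterns              # prints: cron.*
    ls -d $(dup_prefix_patterns)     # lists cron.allow and cron.deny
)
rm -rf "$demo"
```

The command substitution is left unquoted on purpose: the shell has to expand "cron.*" as a wildcard before the outer ls sees it.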

You'll notice that I used "ls -d ..." here, because some of the returned values are going to match directories. If you want to filter out directories as Tim does above, then things get a lot more complicated:

$ ls -d $(find . -maxdepth 1 ! -type d | sed 's/.\/\(.*\.\)[^.]*/\1*/' | sort | uniq -d)
auto.master dovecot.conf-dist ld.so.conf prelink.conf
auto.misc dovecot.conf.rpmnew ld.so.conf.d prelink.conf.d
auto.net hosts.allow ld.so.conf.rpmnew rc.d
auto.smb hosts.deny ntp.conf rc.local
csh.cshrc issue.net ntp.conf-dist rc.news
csh.login issue.rpmnew ntp.conf.rpmnew rc.sysinit
dovecot.conf ld.so.cache prelink.cache

The find command locates the non-directory objects in the current directory. But the output of the find command gives us "./<filename>.<ext>". So we have a more complicated sed expression to convert this to "<filename>.*". Also, the output of the find command need not be in sorted order like the output of ls, so we have to "sort" before we feed the whole mess into "uniq -d".
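
To see why the sort matters, here is a sketch that feeds the sed expression some hand-written lines in the "./name.ext" form that find prints, deliberately out of order. The sample names are made up; no real find is run.

```shell
# Four sample lines, unsorted, as find might emit them.
sample() {
    printf '%s\n' ./rc.local ./hosts.deny ./rc.news ./hosts.allow
}

# With sort: the duplicate patterns end up adjacent, so uniq -d sees them.
sample | sed 's/.\/\(.*\.\)[^.]*/\1*/' | sort | uniq -d
# prints:
#   hosts.*
#   rc.*

# Without sort: no two identical lines are adjacent, so nothing is printed.
sample | sed 's/.\/\(.*\.\)[^.]*/\1*/' | uniq -d
```

uniq only compares neighboring lines, which is why the unsorted version comes up empty.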

The problem now is that our output still contains some directories. You'll notice, for example, that "rc.*" matches not only rc.local, rc.news, and rc.sysinit, but also rc.d, which is a directory. So we actually have to post-process the output and get rid of directories:

$ for file in $(ls -d $(find . -maxdepth 1 ! -type d | sed 's/.\/\(.*\.\)[^.]*/\1*/' | 
sort | uniq -d)); do
[ -d "$file" ] || echo "$file";
done

...
prelink.cache
prelink.conf
rc.local
rc.news
rc.sysinit

Man, that's pretty fugly! But I just don't see a clean way to do this. Let me know if you can think of a better one.

When all you have is a Haemer...

It was nice to hear from "friend of the blog" Jeff Haemer again. He sent us this alternative approach to our problem:

$ for f in *.*; do echo ${f%.*}.*; done | uniq -d
...
prelink.cache prelink.conf prelink.conf.d
rc.d rc.local rc.news rc.sysinit
xinetd.conf xinetd.d

Jeff's basically bypassing external tools here and using the shell built-ins to do almost everything. The outer loop matches any file with a dot in the name, but inside the loop is where the magic happens.

The construct "${var%pattern}" strips off anything matching "pattern" from the right-hand side of "var". In this case, Jeff is getting rid of the final dot and any extension from the file name. So, for example, "xinetd.conf" and "xinetd.d" are both reduced to "xinetd".
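
A quick sketch of how this family of expansions behaves on a name with two dots may help. The variable name and value are just examples; the "%%" and "##" forms are standard shell relatives of "%" that are not used in Jeff's loop.

```shell
f=ntp.conf.rpmnew

echo "${f%.*}"     # shortest match removed from the right: ntp.conf
echo "${f%%.*}"    # longest match removed from the right:  ntp
echo "${f##*.}"    # longest match removed from the left:   rpmnew
```

So the single "%" only ever drops the last extension, no matter how many dots the name has.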

Then Jeff simply does "echo <result>.*"-- e.g., "echo xinetd.*", continuing with the previous example. In the case of files that have the same prefix, that will yield multiple lines of duplicate output, which Jeff reduces to a single copy by piping the loop's output through "uniq -d".

The only difficulty is that Jeff's solution matches directories as well as files. You'll have to do some post-processing to filter out the directories if that's important to you. However, you'll need to be careful with lines like our "xinetd" example. Once you get rid of "xinetd.d", you'll be left with just "xinetd.conf", which will also need to be filtered out because there are no other files in the directory with this prefix.
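
For what it's worth, here is one possible sketch of that post-processing, staying with shell built-ins plus a final sort. It is not Jeff's code: the function name same_prefix_files is invented, it keeps his "prefix.*" wildcard matching, and it assumes file names without spaces or newlines.

```shell
# same_prefix_files: for each "prefix.*" group in the current directory,
# print the regular files in it, but only if at least two remain once
# directories are dropped. (Hypothetical helper; body runs in a subshell
# so its variables do not leak.)
same_prefix_files() (
    for f in *.*; do
        line=""
        n=0
        for g in "${f%.*}".*; do
            if [ -f "$g" ]; then        # regular files only, no directories
                line="$line $g"
                n=$((n + 1))
            fi
        done
        # a lone survivor (like xinetd.conf) is not a match, so skip it
        if [ "$n" -gt 1 ]; then
            echo "${line# }"
        fi
    done | sort -u                       # each group is found once per member
)
```

Every member of a group produces the same line, so "sort -u" is used instead of "uniq -d" to collapse them, and it also copes with groups that are not adjacent in the listing.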

Anyway, thanks for writing in again, Jeff! Always good to hear from you.