Tuesday, May 11, 2010

Episode #94: A Date With Death

Hal checks into the mailbag

We received a note recently from a new reader, Ray Kano, who had a question for the blog:
Is there any way using WMIC to write a taskkill command that will kill [processes by name and] based on a date-time stamp?

Now obviously Ray is looking for a Windows solution, and I'll let Tim clean up on that side of the house since Ed is still on vacation. But the question got me wondering whether there was an analogous command on Unix for killing processes by name and by date. This turns out to be a lot harder in Unix than I thought it would be, but I learned a lot in the process of figuring out the solution.

My first thought was to do something clever with /proc. I had just assumed that the date-time stamps on the /proc/<pid> directories corresponded with the date the process was spawned. Nothing could be further from the truth:

# uptime
14:55:26 up 23:51, 6 users, load average: 0.43, 0.26, 0.19
# date
Sun May 2 14:55:28 PDT 2010
# stat /proc/1
File: `/proc/1'
Size: 0 Blocks: 0 IO Block: 1024 directory
Device: 3h/3d Inode: 533233 Links: 7
Access: (0555/dr-xr-xr-x) Uid: ( 0/ root) Gid: ( 0/ root)
Access: 2010-05-02 14:55:31.256904804 -0700
Modify: 2010-05-02 14:55:31.256904804 -0700
Change: 2010-05-02 14:55:31.256904804 -0700

Note that the system has been up just under a day, but the MAC times on the /proc/1 directory belonging to the init process are all set to the moment I ran the stat command to retrieve the data. Now I did this test on my Linux system and haven't checked other Unix platforms, but clearly relying on /proc isn't going to be a portable solution.

My next thought was to check and see if the killall or pkill commands had options for selecting processes based on date and time. It turns out pkill has the "-o" and "-n" options for killing the oldest or newest processes that match your search criteria, but nothing more selective than that. killall is no help at all.

If you've been reading this blog for a while, you can probably guess where I went next: my "little friend" lsof. But guess what? As far as I can tell, lsof has no capability to even output the starting date and time of a process, much less select processes based on that information.

This was starting to really interest me now. I was sure that the kernel keeps track of the starting date and time of each process, but there didn't seem to be any simple way of getting at this data. In desperation, I started reading the ps manual page and discovered that you can get ps to output a couple of different time values: the "start_time" and the "etime", which is short for "elapsed time". Let's check out "start_time" first with the ps output from one of my Linux servers that's been up for a while:

# ps -eo pid,comm,start_time
PID COMMAND START
1 init 2009
...
1009 sendmail Jan04
1020 sendmail Jan04
...
29399 sshd 12:11
29401 sshd 12:11
...

The "-o" option allows me to specify a list of fields to output. Note that the names of the various fields and the list of available fields can vary from OS to OS, but the ones I'm using here are pretty standard across many Unix variants.

But.. *yuck*! The output format here is not helpful at all. Processes that were started today show up with HH:MM format. But processes started yesterday or earlier just show up as MonDD, and processes started before Jan 1 of the current year show up as YYYY. I can't do anything useful with this stuff.

Keeping my fingers crossed, I tried using "etime" instead of "start_time":

# ps -eo pid,comm,etime
PID COMMAND ELAPSED
1 init 369-07:14:45
...
1009 sendmail 118-01:28:11
1020 sendmail 118-01:28:10
...
29399 sshd 03:47:33
29401 sshd 03:47:31
...
29777 ps 00:00
...

OK, I can work with this. The elapsed time format is [[days-]HH:]MM:SS, which is still kind of a pain but not impossible. I can easily break each line up into a number of tokens. But the problem is that sometimes minutes and seconds will be the third and fourth tokens, sometimes the fourth and fifth tokens, and sometimes even the fifth and sixth tokens. Life would be better if we could reverse the time format so that it was SS:MM[:HH[-days]], which would make everything nice and regular.

I can handle the necessary field reversal with a little awk fu:

# ps -eo pid,comm,etime | tail -n +2 | sed 's/[-:]/ /g' | \
awk '{print $1, $2, $6, $5, $4, $3}'

1 init 59 23 07 369
...
1009 sendmail 25 37 01 118
1020 sendmail 24 37 01 118
...
29399 sshd 47 56 03
29401 sshd 45 56 03
...
29803 ps 00 00
...

Here I'm using the tail command to drop the initial header line and then using sed to turn the dash and colons in the time format to spaces. From there it's a matter of using awk to selectively reverse the last four fields of output. awk doesn't complain if some of the fields don't exist; it simply outputs an empty string in their place.

With the fields now in a canonical order, all I need to do is convert the time value into a format that's useful for comparisons-- like say total elapsed seconds:

# ps -eo pid,comm,etime | tail -n +2 | sed 's/[-:]/ /g' | \
awk '{print $1, $2, $6, $5, $4, $3}' | \
awk '{print $1, $2, ($3 + $4 * 60 + $5 * 3600 + $6 * 86400)}'

1 init 31908525
...
1009 sendmail 10201331
1020 sendmail 10201330
...
29399 sshd 14493
29401 sshd 14491
...
29809 ps 0
...

That's more like it! So I've demonstrated that I can get to a list of PIDs, process names, and total seconds that the process has been running. I'm sure that if I thought about it some more, I could come up with a single awk statement to do what I'm doing with two statements above, but I think the above code is clearer and it wasn't really that hard to type.
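For the curious, here is one way the single-awk version might look. This is only a sketch: etime_to_secs is a made-up helper name, and it assumes the elapsed time always arrives in the [[days-]HH:]MM:SS format described above. Rather than reversing fields, it splits the value on dashes and colons and counts back from the last piece.

```shell
# Convert one ps "etime" value ([[days-]HH:]MM:SS) to total seconds.
# etime_to_secs is a hypothetical helper name, not something ps provides.
etime_to_secs() {
    echo "$1" | awk '{
        n = split($1, t, "[-:]")       # pieces: [days] [HH] MM SS
        secs = t[n] + t[n-1] * 60      # SS and MM are always present
        if (n >= 3) secs += t[n-2] * 3600   # hours, if present
        if (n == 4) secs += t[1] * 86400    # days, if present
        print secs
    }'
}

etime_to_secs 369-07:14:45   # prints 31907685
etime_to_secs 03:47:33       # prints 13653
etime_to_secs 00:00          # prints 0
```

Counting from the end of the split array does the same job as the field reversal: seconds and minutes always land in the same place no matter how many leading pieces there are.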

But remember the original request was for a command to kill processes by name and date-time stamp, and not just output data for all processes. So our second awk statement is going to change anyway. Let's suppose that we wanted to kill all sshd processes that had been around for longer than 10 days. We could output the PIDs of the matching processes as follows:

# ps -eo pid,comm,etime | tail -n +2 | sed 's/[-:]/ /g' | \
awk '{print $1, $2, $6, $5, $4, $3}' | \
awk '($2 == "sshd") && (($3 + $4 * 60 + $5 * 3600 + $6 * 86400) > 864000) {print $1}'

5725

Other queries would be simpler. For example, let's output the PIDs of all sshd processes that have been active less than one day:

# ps -eo pid,comm,etime | tail -n +2 | sed 's/[-:]/ /g' | \
awk '{print $1, $2, $6, $5, $4, $3}' | \
awk '($2 == "sshd") && ($6 == "") {print $1}'

4727
4729
4805
4807
29399
29401

Here all we're doing is confirming that the sixth field is unset, which must mean that the process has been running less than one day. We don't need to do any math at all.

Anyway, now that we can select and output PIDs at will, the final solution is just putting the whole command in backticks and using it as an argument to the kill command:

# kill -9 `ps -eo pid,comm,etime | ...`
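Since kill -9 is unforgiving, it may be worth wrapping the selection step in a function so the PID list can be inspected before anything dies. The following is a sketch under a few assumptions: old_pids is a hypothetical name, the input is "pid comm etime" lines on stdin, and command names contain no spaces.

```shell
# old_pids NAME MIN_SECONDS
# Read "pid comm etime" lines on stdin (a header line is harmless) and
# print the PIDs of NAME processes that have run longer than MIN_SECONDS.
# old_pids is a hypothetical helper name used only for this sketch.
old_pids() {
    awk -v name="$1" -v min="$2" '
        $2 == name {
            n = split($3, t, "[-:]")            # [days] [HH] MM SS
            secs = t[n] + t[n-1] * 60
            if (n >= 3) secs += t[n-2] * 3600
            if (n == 4) secs += t[1] * 86400
            if (secs > min) print $1
        }'
}

# Dry run against sample lines copied from the etime listing above:
printf '%s\n' \
    '1 init 369-07:14:45' \
    '1009 sendmail 118-01:28:11' \
    '1020 sendmail 118-01:28:10' \
    '29399 sshd 03:47:33' | old_pids sendmail 864000
# prints 1009 and 1020, one per line

# Real use, once the list looks right (left commented out on purpose):
# kill $(ps -eo pid,comm,etime | old_pids sshd 864000)
```

Feeding the function from stdin rather than calling ps inside it keeps the selection logic easy to test against canned input.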

Whoosh! That sure was a lot of work for a simple request! I'm sort of shocked that Unix makes this so difficult. Could this be an opportunity for Tim to show me up with some Windows magic?

Tim opens Ed's mail

If "glory is fleeting, but obscurity is forever" (Napoleon) then that "fu" is going to live longer than either of us. Too bad Ed isn't here to bask in the glory of how easy this is in Windows. Of course, he is basking in the sun on vacation this week.

While Ed is gone, I like to take a peek through his mail. Bills. Junk. More bills. Victoria's Secret catalog. A shipment of peanut butter, a stuffed water buffalo and some latex? Uh... Anyway, I did steal some of it too. Not the "other" stuff, but this easy episode.

Way back in episode 22, Ed killed processes with wmic. This topic has been revisited a few times, including my favorite episode, Advanced Process Whack-a-Mole. If "wmic process" were a dead horse, we would have severely beaten it. We do have the new twist of searching based on the creation date, and it is pretty easy.

C:\> wmic process where (name="cmd.exe" AND creationdate ^< "20100511060000.000000-300") delete

The date format is yyyymmddhhmmss.mmmmmm-TTT, where TTT is the timezone offset from GMT in minutes. The offset is required in the query; if you remove it you will get an Invalid Query error. In my case, -300 means my clock is 300 minutes (five hours) behind GMT.
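The trailing offset field is easy to get wrong, so here is a small sketch of the arithmetic, written in Unix shell to match the first half of this episode. It assumes the offset is a sign followed by a three-digit count of minutes from GMT; wmi_offset is a hypothetical helper name, and its input is the +HHMM or -HHMM form that `date +%z` prints.

```shell
# wmi_offset +HHMM|-HHMM
# Convert a UTC offset in the form printed by `date +%z` into a sign plus
# three-digit minute count, as used at the end of a WMI timestamp.
# wmi_offset is a hypothetical helper name used only for this sketch.
wmi_offset() {
    echo "$1" | awk '{
        sign = substr($1, 1, 1)     # "+" or "-"
        hh   = substr($1, 2, 2)     # hours part
        mm   = substr($1, 4, 2)     # minutes part
        printf "%s%03d\n", sign, hh * 60 + mm
    }'
}

wmi_offset -0500   # prints -300
wmi_offset +0530   # prints +330
```

So a five-hour offset behind GMT becomes -300, which is the value that appears in the query above.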

Also, we have to escape any greater than or less than signs. The shell uses them for redirection, so we put a caret (^) in front of each one to escape it. I don't know how to make it sound more confusing, like Hal's section.

Tim opens his mail

This task is even easier in PowerShell, and it is pretty self-explanatory, too.

C:\> Get-Process cmd | ? { $_.StartTime -lt "2010/5/11 6:00" } | Stop-Process

We can even try to find processes that have been running for longer than an hour.

C:\> Get-Process cmd | ? { $_.StartTime -lt (Get-Date).AddHours(-1) } | Stop-Process

In both cases, we use Get-Process to find processes named cmd. The next step is to filter based on the start time. Finally, we kill whatever is left.

Sorry Hal, for not making this portion totally unreadable and for not making this way more complicated than it should be. Got a bit of shell envy this week?

Signed, sealed, delivered.