Tuesday, May 25, 2010

Episode #96: Hardware Death Watch

Hal's Laptop Is Having Issues

This is pretty much a SANS instructor's worst nightmare. I'm headed out to teach Forensics 508 in VA Beach, and I fire up my laptop to get some work done on the plane. The CPU fan makes a choking sound, the laptop beeps, and the screen flashes "Fan error". Thankfully, a little gentle coercion rendered the system bootable, but I'm clearly looking at a complete fan failure in the near future. So I want to keep an eye on my hardware so I can prevent an incident that involves the magic smoke.

There are a number of different ways of getting information about your hardware on Linux. The simplest is probably lshw:

# lshw
elk
description: Notebook
product: 7668CTO
vendor: LENOVO
version: ThinkPad X61s
serial: LVA9486
width: 64 bits
capabilities: smbios-2.4 dmi-2.4 vsyscall64 vsyscall32
configuration: administrator_password=disabled boot=normal chassis=notebook...

lshw provides a ton of other info on your BIOS, CPU(s), memory, disk drives, display and so on-- almost 400 lines of output on my laptop! Note that there's also the report-hw command which reports similar information, but was designed to help with debugging hardware auto-detection and so has lots of extra output that makes things less readable overall.

While lshw is good for getting an overview of the hardware configuration of your system, it doesn't probe any of the internal hardware sensors in your computer. To talk to the sensors in your CPU(s) and disk drives, you'll need a couple of other packages that are standard with most Linux distros these days: lm-sensors and smartmontools. lm-sensors interacts with the CPU sensors and smartmontools lets you get information from your disk drives, assuming they're modern enough to support the SMART device interface.

To get started with the lm-sensors package, you'll need to load the appropriate kernel modules for your device. Happily, the package includes a tool called sensors-detect that will auto-detect the kernel modules you need, and even offer to update your configuration so that the appropriate modules will be automatically loaded whenever your system boots. Here's an excerpt from the output of this program:

# sensors-detect
# sensors-detect revision 5249 (2008-05-11 22:56:25 +0200)

This program will help you determine which kernel modules you need
to load to use lm_sensors most effectively. It is generally safe
and recommended to accept the default answers to all questions,
unless you know what you're doing.

We can start with probing for (PCI) I2C or SMBus adapters.
Do you want to probe now? (YES/no): yes
[...]

Now follows a summary of the probes I have just done.
Just press ENTER to continue:

Driver `coretemp' (should be inserted):
Detects correctly:
* Chip `Intel Core family thermal sensor' (confidence: 9)

I will now generate the commands needed to load the required modules.
Just press ENTER to continue:

To load everything that is needed, add this to /etc/modules:

#----cut here----
# Chip drivers
coretemp
#----cut here----

Do you want to add these lines automatically? (yes/NO) yes
# cat /etc/modules
# /etc/modules: kernel modules to load at boot time.
#
# This file contains the names of kernel modules that should be loaded
# at boot time, one per line. Lines beginning with "#" are ignored.

loop
lp
rtc

# Generated by sensors-detect on Sun May 23 10:59:47 2010
# Chip drivers
coretemp

Once the appropriate drivers are loaded, you can just run the sensors command-- and you don't even have to be root:

$ sensors
acpitz-virtual-0
Adapter: Virtual device
temp1: +39.0°C (crit = +127.0°C)
temp2: +39.0°C (crit = +100.0°C)

thinkpad-isa-0000
Adapter: ISA adapter
fan1: 3872 RPM
fan2: 0 RPM
temp1: +39.0°C
temp2: +46.0°C
temp3: +46.0°C
temp4: +37.0°C
ERROR: Can't get value of subfeature temp5_input: Can't read
temp5: +0.0°C
ERROR: Can't get value of subfeature temp6_input: Can't read
temp6: +0.0°C
ERROR: Can't get value of subfeature temp7_input: Can't read
temp7: +0.0°C
ERROR: Can't get value of subfeature temp8_input: Can't read
temp8: +0.0°C
temp9: +42.0°C
temp10: +38.0°C
ERROR: Can't get value of subfeature temp11_input: Can't read
temp11: +0.0°C
ERROR: Can't get value of subfeature temp12_input: Can't read
temp12: +0.0°C
ERROR: Can't get value of subfeature temp13_input: Can't read
temp13: +0.0°C
ERROR: Can't get value of subfeature temp14_input: Can't read
temp14: +0.0°C
ERROR: Can't get value of subfeature temp15_input: Can't read
temp15: +0.0°C
ERROR: Can't get value of subfeature temp16_input: Can't read
temp16: +0.0°C

coretemp-isa-0000
Adapter: ISA adapter
Core 0: +39.0°C (high = +100.0°C, crit = +100.0°C)

coretemp-isa-0001
Adapter: ISA adapter
Core 1: +39.0°C (high = +100.0°C, crit = +100.0°C)

Clearly, not all temperature sensors are supported on all CPU architectures. But at least this allows me to keep up my morbid death watch on my fans and my CPU temp.
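If you'd rather not eyeball the full sensors dump every few minutes, the interesting numbers can be pulled out with a little awk. What follows is only a minimal sketch: the helper names fan_speeds and max_temp are made up for illustration, and the parsing assumes the line layout shown in the output above (labels such as "fan1:", "temp1:" and "Core 0:"), which may differ with other chips or lm-sensors versions.

```shell
#!/bin/sh
# Hypothetical helpers (not part of lm-sensors): both read `sensors` output on stdin.

# Print "label rpm" for every fan line, e.g. "fan1 3872".
fan_speeds() {
    awk '/^fan[0-9]+:/ { sub(":", "", $1); print $1, $2 }'
}

# Print the highest temperature reading in degrees Celsius.
# Unreadable sensors show up as +0.0 and so never win the comparison.
max_temp() {
    awk -F: '/^(temp[0-9]+|Core [0-9]+):/ {
                 v = $2 + 0          # numeric prefix of " +39.0°C ..." is 39
                 if (v > max) max = v
             }
             END { printf "%.1f\n", max }'
}

# Typical use:
#   sensors 2>/dev/null | fan_speeds
#   sensors 2>/dev/null | max_temp
```

For a hands-off death watch, something like "watch -n 5 sensors" will simply rerun the command every five seconds.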

The smartmontools package includes the smartctl command for probing your disk drives. The easiest way to get started is to just use the "-a" option to dump all available info about your drive:

# smartctl -a /dev/sda
smartctl version 5.38 [x86_64-unknown-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF INFORMATION SECTION ===
Device Model: ST9500420AS
Serial Number: 5VJ09ARF
Firmware Version: 0002SDM1
User Capacity: 500,107,862,016 bytes
Device is: Not in smartctl database [for details use: -P showall]
ATA Version is: 8
ATA Standard is: ATA-8-ACS revision 4
Local Time is: Sun May 23 11:11:33 2010 PDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

[...]

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000f 118 099 006 Pre-fail Always - 188548774
3 Spin_Up_Time 0x0003 100 098 085 Pre-fail Always - 0
4 Start_Stop_Count 0x0032 100 100 020 Old_age Always - 269
5 Reallocated_Sector_Ct 0x0033 100 100 036 Pre-fail Always - 0
7 Seek_Error_Rate 0x000f 074 060 030 Pre-fail Always - 27946375
9 Power_On_Hours 0x0032 098 098 000 Old_age Always - 2501
10 Spin_Retry_Count 0x0013 100 100 097 Pre-fail Always - 0
12 Power_Cycle_Count 0x0032 100 037 020 Old_age Always - 202
184 Unknown_Attribute 0x0032 100 100 099 Old_age Always - 0
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
188 Unknown_Attribute 0x0032 100 099 000 Old_age Always - 121
189 High_Fly_Writes 0x003a 100 100 000 Old_age Always - 0
190 Airflow_Temperature_Cel 0x0022 061 051 045 Old_age Always - 39 (Lifetime Min/Max 28/39)
191 G-Sense_Error_Rate 0x0032 100 100 000 Old_age Always - 0
192 Power-Off_Retract_Count 0x0032 100 100 000 Old_age Always - 8
193 Load_Cycle_Count 0x0032 098 098 000 Old_age Always - 5470
194 Temperature_Celsius 0x0022 039 049 000 Old_age Always - 39 (0 11 0 0)
195 Hardware_ECC_Recovered 0x001a 047 043 000 Old_age Always - 188548774
197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 0
240 Head_Flying_Hours 0x0000 100 253 000 Old_age Offline - 115611929676228
241 Unknown_Attribute 0x0000 100 253 000 Old_age Offline - 978381419
242 Unknown_Attribute 0x0000 100 253 000 Old_age Offline - 1156631671
254 Unknown_Attribute 0x0032 100 100 000 Old_age Always - 0
[...]

Again, there's a ton of other output from this command, which I'm not showing in the interests of space. There are smartctl options to just dump out specific pieces of the above info, and they're all documented in the manual page.

Frankly, I think it's pretty cool that I can retrieve the disk model and serial number without having to crack the case. I can also read the temp of the drive itself and at the airflow output, which is of interest to me right now. But as you can see, I can also get information about the number of hours on the drive and so on. This could be used to help alert you to drives that may be needing replacement before they actually fail.
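To turn that attribute table into an alert, the columns can be picked apart with awk. This is a rough sketch rather than a polished monitor: smart_raw and smart_failing are hypothetical helper names, and the field positions assume the ten-column layout shown above (the "-A" option to smartctl prints just this table). The failure test follows the usual SMART convention that an attribute has failed when its normalized VALUE drops to or below a non-zero THRESH.

```shell
#!/bin/sh
# Hypothetical helpers: both read `smartctl -A` style output on stdin.

# Print the RAW_VALUE column for the named attribute.
smart_raw() {
    awk -v name="$1" '$2 == name { print $10 }'
}

# Print the name of every attribute whose normalized VALUE ($4)
# is at or below a non-zero THRESH ($6).
smart_failing() {
    awk '$1 ~ /^[0-9]+$/ && $3 ~ /^0x/ && ($6 + 0) > 0 && ($4 + 0) <= ($6 + 0) { print $2 }'
}

# Typical use (as root):
#   smartctl -A /dev/sda | smart_raw Power_On_Hours
#   smartctl -A /dev/sda | smart_failing
```

An empty result from smart_failing is the good case; any attribute name it prints deserves a closer look at the full smartctl output.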

So there's plenty of information available for me to keep an eye on things this week as I'm teaching my class. Keep your fingers crossed for me. In the meantime, let's see what Ed and Tim have up their sleeves.

Ed responds a little nervously:
I get shivers thinking about system failure as a presenter at conferences. I used to travel with two laptops to keep my mind at ease, but in the past year, I started carrying just one as my back started to hurt. Rumor has it that there are backup laptops that will materialize in an instant at a conference, but you never know. USB tokens with backup presentations are a good idea.

As for commands to check hardware information on Windows, our little friend WMIC comes in handy. For details about the motherboard, we could run:
C:\> wmic baseboard list full

The output here will show us Manufacturer and SerialNumber, among other things.

For CPU information, we can run:
C:\> wmic cpu list full

This will show us a description of the CPU, its manufacturer, and speed.

But, in that glut of output, there are also a couple of useful items that may indicate trouble on our system. Let's zoom in on them:
C:\> wmic cpu get currentclockspeed,maxclockspeed

If you see a big difference in these numbers, it could be due to a couple of reasons. First off, your system may be running under a low power condition, so it slows down the processor to save power, making currentclockspeed lower than maxclockspeed. That's nothing to worry about. The other condition, however, is that your system has gotten kinda hot, so it's slowing itself down. That's something to worry about.

To get a feel for the temperature of your system, you could run:
C:\> wmic /namespace:\\root\wmi PATH MSAcpi_ThermalZoneTemperature get CurrentTemperature
CurrentTemperature
3172

Now, it should be noted that pulling this temperature data isn't supported on all hardware, and on some hardware, the value never changes after boot time. Still, on many modern non-virtual systems, it'll tell you your temperature in tenths of a kelvin. I just went to Google and did a search for "317.2 degrees kelvin to" and before I finished typing, the predictive search responded with:
317.2 kelvin = 44.05 degrees Celsius

Cool, Google. A little creepy, but cool. "Google: A little creepy, but cool" should be Google's new motto, supplanting "Don't be Evil."

Of course, then, I type "44.05 degrees Celsius to f" and it pops up and tells me my system is running at 111.29 degrees Fahrenheit. Toasty.
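The same arithmetic works without a search engine. As a small sketch for anyone following along from a Unix shell, the function below (decikelvin_to_c_f is a made-up name) takes the raw tenths-of-a-kelvin reading, subtracts 273.15 to get Celsius, and then applies the usual 9/5 + 32 step for Fahrenheit:

```shell
#!/bin/sh
# Hypothetical helper: convert a reading in tenths of a kelvin
# to Celsius and Fahrenheit.
decikelvin_to_c_f() {
    awk -v dk="$1" 'BEGIN {
        c = dk / 10 - 273.15     # kelvin to Celsius
        f = c * 9 / 5 + 32       # Celsius to Fahrenheit
        printf "%.2f C / %.2f F\n", c, f
    }'
}

# decikelvin_to_c_f 3172
# prints: 44.05 C / 111.29 F
```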

The ScriptInternals guys have put together a list of the items you can read using this command besides the CurrentTemperature. You can pull all of this data with:

C:\> wmic /namespace:\\root\wmi PATH MSAcpi_ThermalZoneTemperature get *

While all this temperature stuff is nice, what about a prediction of whether our hard drive is hosed? We can pull that information with:

C:\> wmic /namespace:\\root\wmi PATH MSStorageDriver_FailurePredictStatus get predictfailure
PredictFailure
FALSE

Whew, that's a relief. If this output says TRUE, your drive is ready to give up the ghost soon, so you should back up immediately! You don't want to fall into the "Hal Pomeranz conference laptop deathwatch trap".

Tim sometimes wishes a presenter's laptop would die:

We've all been there: a presentation where the presenter is just reading every word on every slide with no extra content or commentary. That presenter's laptop needs to die, to take one for the team so the rest of us can live.

I've heard Ed and Hal present. Both are great speakers, so their laptops are not required to become martyrs. Let's give them a bit of a checkup.

Checking a laptop's status in PowerShell is very similar to what Ed did. Here are the PowerShell versions of Ed's commands.

Motherboard - Manufacturer and Serial Number:
PS C:\> gwmi win32_baseboard


CPU information - Description, Manufacturer, and Speed:
PS C:\> gwmi win32_processor


Temperature:
PS C:\> Get-WmiObject -class MSAcpi_ThermalZoneTemperature -Namespace root\WMI


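We have the same problem as Ed: Kelvin. Let's convert to Fahrenheit. My undergraduate degree was in Engineering, and I had to take a Thermodynamics class. One thing I remember is that 0 Celsius is 273.15 Kelvin. I also remember how to convert Celsius to Fahrenheit: add 40, multiply by 9, divide by 5, and finally subtract 40. The reading comes back in tenths of a kelvin, so we divide by 10 first. Subtracting 273.15 and then adding 40 is the same as subtracting 233.15, which is where that constant comes from. Here it is in one line.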

PS C:\> (((Get-WmiObject -class "MSAcpi_ThermalZoneTemperature" -Namespace "root\WMI").CurrentTemperature / 10 - 233.15) * 9 / 5) - 40

124.79


Let's check the drive status:
PS C:\> Get-WmiObject -class MSStorageDriver_FailurePredictStatus -Namespace root\WMI | Select Active, PredictFailure

Active PredictFailure
------ --------------
True False


Good news, the drive is alive, and not predicted to die!

One other thing I like to check is the battery:

PS C:\> gwmi Win32_Battery | select est*

EstimatedChargeRemaining EstimatedRunTime
------------------------ ----------------
97 231


I can run for almost 4 hours. That's a long presentation, and a lot of slides to read.