Command Line Kung Fu: Episode #96: Hardware Death Watch

Hal's Laptop Is Having Issues

This is pretty much a SANS instructor's worst nightmare. I'm headed out to teach Forensics 508 in VA Beach, and I fire up my laptop to get some work done on the plane. The CPU fan makes a choking sound, the laptop beeps, and the screen flashes "Fan error". Thankfully, a little gentle coercion rendered the system bootable, but I'm clearly looking at a complete fan failure in the near future. So I want to keep an eye on my hardware so I can prevent an incident that involves the magic smoke.

There are a number of different ways of getting information about your hardware on Linux. The simplest is probably lshw:

# lshw
elk                     
  description: Notebook
  product: 7668CTO
  vendor: LENOVO
  version: ThinkPad X61s
  serial: LVA9486
  width: 64 bits
  capabilities: smbios-2.4 dmi-2.4 vsyscall64 vsyscall32
  configuration: administrator_password=disabled boot=normal chassis=notebook...

lshw provides a ton of other info on your BIOS, CPU(s), memory, disk drives, display and so on-- almost 400 lines of output on my laptop! Note that there's also the report-hw command which reports similar information, but was designed to help with debugging hardware auto-detection and so has lots of extra output that makes things less readable overall.

While lshw is good for getting an overview of the hardware configuration of your system, it doesn't probe any of the internal hardware sensors in your computer. To talk to the sensors in your CPU(s) and disk drives, you'll need a couple of other packages that are standard with most Linux distros these days: lm-sensors and smartmontools. lm-sensors interacts with the CPU sensors and smartmontools lets you get information from your disk drives, assuming they're modern enough to support the SMART device interface.

To get started with the lm-sensors package, you'll need to load the appropriate kernel modules for your device. Happily, the package includes a tool called sensors-detect that will auto-detect the kernel modules you need, and even offer to update your configuration so that the appropriate modules will be automatically loaded whenever your system boots. Here's an excerpt from the output of this program:

# sensors-detect
# sensors-detect revision 5249 (2008-05-11 22:56:25 +0200)

This program will help you determine which kernel modules you need
to load to use lm_sensors most effectively. It is generally safe
and recommended to accept the default answers to all questions,
unless you know what you're doing.

We can start with probing for (PCI) I2C or SMBus adapters.
Do you want to probe now? (YES/no): yes
[...]

Now follows a summary of the probes I have just done.
Just press ENTER to continue:

Driver `coretemp' (should be inserted):
Detects correctly:
* Chip `Intel Core family thermal sensor' (confidence: 9)

I will now generate the commands needed to load the required modules.
Just press ENTER to continue:

To load everything that is needed, add this to /etc/modules:

#----cut here----
# Chip drivers
coretemp
#----cut here----

Do you want to add these lines automatically? (yes/NO) yes
# cat /etc/modules
# /etc/modules: kernel modules to load at boot time.
#
# This file contains the names of kernel modules that should be loaded
# at boot time, one per line. Lines beginning with "#" are ignored.

loop
lp
rtc

# Generated by sensors-detect on Sun May 23 10:59:47 2010
# Chip drivers
coretemp

Once the appropriate drivers are loaded (reboot, or run "modprobe coretemp" as root to load the module right away), you can just run the sensors command-- and you don't even have to be root:

$ sensors
acpitz-virtual-0
Adapter: Virtual device
temp1:       +39.0°C  (crit = +127.0°C)                
temp2:       +39.0°C  (crit = +100.0°C)                

thinkpad-isa-0000
Adapter: ISA adapter
fan1:       3872 RPM
fan2:          0 RPM
temp1:       +39.0°C                                  
temp2:       +46.0°C                                  
temp3:       +46.0°C                                  
temp4:       +37.0°C                                  
ERROR: Can't get value of subfeature temp5_input: Can't read
temp5:        +0.0°C                                  
ERROR: Can't get value of subfeature temp6_input: Can't read
temp6:        +0.0°C                                  
ERROR: Can't get value of subfeature temp7_input: Can't read
temp7:        +0.0°C                                  
ERROR: Can't get value of subfeature temp8_input: Can't read
temp8:        +0.0°C                                  
temp9:       +42.0°C                                  
temp10:      +38.0°C                                  
ERROR: Can't get value of subfeature temp11_input: Can't read
temp11:       +0.0°C                                  
ERROR: Can't get value of subfeature temp12_input: Can't read
temp12:       +0.0°C                                  
ERROR: Can't get value of subfeature temp13_input: Can't read
temp13:       +0.0°C                                  
ERROR: Can't get value of subfeature temp14_input: Can't read
temp14:       +0.0°C                                  
ERROR: Can't get value of subfeature temp15_input: Can't read
temp15:       +0.0°C                                  
ERROR: Can't get value of subfeature temp16_input: Can't read
temp16:       +0.0°C                                  

coretemp-isa-0000
Adapter: ISA adapter
Core 0:      +39.0°C  (high = +100.0°C, crit = +100.0°C)

coretemp-isa-0001
Adapter: ISA adapter
Core 1:      +39.0°C  (high = +100.0°C, crit = +100.0°C)

Clearly, not all temperature sensors are supported on all CPU architectures. But at least this allows me to keep up my morbid death watch on my fans and my CPU temp.
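
If you'd rather not eyeball that output every few minutes, a little awk can pick out the interesting lines. What follows is only a sketch: check_sensors is a made-up helper name (it's not part of lm-sensors), the 45 degree limit is arbitrary, and it assumes the fan you care about is labeled fan1, as it is in the output above. Labels and layout vary by hardware, so treat it as a starting point.

```sh
#!/bin/sh
# check_sensors LIMIT [FAN]
# Hypothetical helper (not part of lm-sensors). Reads `sensors` output on
# stdin and prints one line per temperature at or above LIMIT (degrees C),
# plus a warning if FAN (default: fan1) reports 0 RPM.
check_sensors() {
    awk -F: -v limit="$1" -v fan="${2:-fan1}" '
        /^ERROR/ { next }                      # unreadable sensor, skip it
        $1 == fan && $2 ~ /RPM/ && $2 + 0 == 0 {
            print "FAN: " $1 " reports 0 RPM"
        }
        $2 ~ /^ *[+-][0-9]/ && $2 + 0 >= limit {
            printf "HOT: %s is %.1f C\n", $1, $2 + 0
        }
    '
}

# Demo on a few lines copied from the output above, with a 45 C limit.
check_sensors 45 <<'EOF'
fan1:       3872 RPM
fan2:          0 RPM
temp1:       +39.0°C
temp2:       +46.0°C
temp5:        +0.0°C
Core 0:      +39.0°C  (high = +100.0°C, crit = +100.0°C)
EOF
# prints: HOT: temp2 is 46.0 C
```

On a live system you would pipe in the real thing, for example "sensors | check_sensors 80", perhaps from a loop with a sleep in it.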

The smartmontools package includes the smartctl command for probing your disk drives. The easiest way to get started is to just use the "-a" option to dump all available info about your drive:

# smartctl -a /dev/sda
smartctl version 5.38 [x86_64-unknown-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF INFORMATION SECTION ===
Device Model:     ST9500420AS
Serial Number:    5VJ09ARF
Firmware Version: 0002SDM1
User Capacity:    500,107,862,016 bytes
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   8
ATA Standard is:  ATA-8-ACS revision 4
Local Time is:    Sun May 23 11:11:33 2010 PDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

[...]

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate     0x000f   118   099   006    Pre-fail  Always       -       188548774
3 Spin_Up_Time            0x0003   100   098   085    Pre-fail  Always       -       0
4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       269
5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
7 Seek_Error_Rate         0x000f   074   060   030    Pre-fail  Always       -       27946375
9 Power_On_Hours          0x0032   098   098   000    Old_age   Always       -       2501
10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
12 Power_Cycle_Count       0x0032   100   037   020    Old_age   Always       -       202
184 Unknown_Attribute       0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Unknown_Attribute       0x0032   100   099   000    Old_age   Always       -       121
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   061   051   045    Old_age   Always       -       39 (Lifetime Min/Max 28/39)
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       8
193 Load_Cycle_Count        0x0032   098   098   000    Old_age   Always       -       5470
194 Temperature_Celsius     0x0022   039   049   000    Old_age   Always       -       39 (0 11 0 0)
195 Hardware_ECC_Recovered  0x001a   047   043   000    Old_age   Always       -       188548774
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       115611929676228
241 Unknown_Attribute       0x0000   100   253   000    Old_age   Offline      -       978381419
242 Unknown_Attribute       0x0000   100   253   000    Old_age   Offline      -       1156631671
254 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       0
[...]

Again, there's a ton of other output from this command, which I'm not showing in the interests of space. There are smartctl options to just dump out specific pieces of the above info, and they're all documented in the manual page.

Frankly, I think it's pretty cool that I can retrieve the disk model and serial number without having to crack the case. I can also read the temp of the drive itself and at the airflow output, which is of interest to me right now. But as you can see, I can also get information about the number of hours on the drive and so on. This could be used to help alert you to drives that may be needing replacement before they actually fail.
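
To boil that attribute table down to the handful of values worth watching, something like the following might do. It's a sketch under a few assumptions: smart_watch is a hypothetical helper name (not part of smartmontools), the list of "interesting" attributes is just my pick from the table above, and the column positions match the smartctl output shown here. The failure test follows my reading of the smartctl manual page, which says an attribute has failed when its normalized VALUE is at or below THRESH.

```sh
#!/bin/sh
# smart_watch: hypothetical helper (not part of smartmontools). Reads the
# attribute table printed by `smartctl -A` on stdin.
smart_watch() {
    awk '
        # Attribute rows start with a numeric ID and have at least 10 columns:
        # ID NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
        $1 !~ /^[0-9]+$/ || NF < 10 { next }
        # Normalized VALUE at or below THRESH: report the attribute as failing.
        $4 + 0 <= $6 + 0 { print "FAILING: " $2 }
        # Print the raw value of a few attributes picked for illustration.
        $2 ~ /^(Reallocated_Sector_Ct|Power_On_Hours|Temperature_Celsius|Current_Pending_Sector)$/ {
            print $2 "=" $10
        }
    '
}

# Demo on three rows copied from the table above.
smart_watch <<'EOF'
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
9 Power_On_Hours          0x0032   098   098   000    Old_age   Always       -       2501
194 Temperature_Celsius     0x0022   039   049   000    Old_age   Always       -       39 (0 11 0 0)
EOF
# prints:
# Reallocated_Sector_Ct=0
# Power_On_Hours=2501
# Temperature_Celsius=39
```

Against a real drive you'd run it as root with "smartctl -A /dev/sda | smart_watch".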

So there's plenty of information available for me to keep an eye on things this week as I'm teaching my class. Keep your fingers crossed for me. In the meantime, let's see what Ed and Tim have up their sleeves.

Ed responds a little nervously:
I get shivers thinking about system failure as a presenter at conferences. I used to travel with two laptops to keep my mind at ease, but in the past year, I started carrying just one as my back started to hurt. Rumor has it that there are backup laptops that will materialize in an instant at a conference, but you never know. USB tokens with backup presentations are a good idea.

As for commands to check hardware information on Windows, our little friend WMIC comes in handy. For details about the motherboard, we could run:

C:\> wmic baseboard list full

The output here will show us Manufacturer and SerialNumber, among other things.

For CPU information, we can run:

C:\> wmic cpu list full

This will show us a description of the CPU, its manufacturer, and speed.

But, in that glut of output, there are also a couple of useful items that may indicate trouble on our system. Let's zoom in on them:

C:\> wmic cpu get currentclockspeed,maxclockspeed

If you see a big difference in these numbers, it could be due to a couple of reasons. First off, your system may be running under a low power condition, so it slows down the processor to save power, making currentclockspeed lower than maxclockspeed. That's nothing to worry about. The other condition, however, is that your system has gotten kinda hot, so it's slowing itself down. That's something to worry about.

To get a feel for the temperature of your system, you could run:

C:\> wmic /namespace:\\root\wmi PATH MSAcpi_ThermalZoneTemperature get CurrentTemperature
CurrentTemperature
3172

Now, it should be noted that pulling this temperature data isn't supported on all hardware, and on some hardware, it never changes beyond boot time. Still, on many modern non-virtual systems, it'll tell you your temperature in tenths of a kelvin. I just went to Google and did a search for "317.2 degrees kelvin to" and before I finished typing, the predictive search responded with:
317.2 kelvin = 44.05 degrees Celsius

Cool, Google. A little creepy, but cool. "Google: A little creepy, but cool" should be Google's new motto, supplanting "Don't be Evil."

Of course, then, I type "44.05 degrees Celsius to f" and it pops up and tells me my system is running at 111.29 degrees Fahrenheit. Toasty.

The ScriptInternals guys have put together a list of the items you can read using this command besides the CurrentTemperature. You can pull all of this data with:

C:\> wmic /namespace:\\root\wmi PATH MSAcpi_ThermalZoneTemperature get *

While all this temperature stuff is nice, what about a prediction of whether our hard drive is hosed? We can pull that information with:

C:\> wmic /namespace:\\root\wmi PATH MSStorageDriver_FailurePredictStatus get predictfailure
PredictFailure
FALSE

Whew, that's a relief. If this output says TRUE, your drive is ready to give up the ghost soon, so you should back up immediately! You don't want to fall into the "Hal Pomeranz conference laptop deathwatch trap".

Tim sometimes wishes a presenter's laptop would die:

We've all been there: a presentation where the presenter is just reading every word on every slide with no extra content or commentary. That presenter's laptop needs to die, to take one for the team so the rest of us can live.

I've heard Ed and Hal present, both are great speakers, so their laptops are not required to become martyrs. Let's give them a bit of a check up.

Checking a laptop's status in PowerShell is very similar to what Ed did. Here are the PowerShell versions of Ed's commands.

Motherboard - Manufacturer and Serial Number:

PS C:\> gwmi win32_baseboard

CPU information - Description, Manufacturer, and Speed:

PS C:\> gwmi win32_processor

Temperature:

PS C:\> Get-WmiObject -class MSAcpi_ThermalZoneTemperature -Namespace root\WMI

We have the same problem as Ed: Kelvin. Let's convert to Fahrenheit. My undergraduate degree was in Engineering, and I had to take a Thermodynamics class. One thing I remember is that 0 Kelvin is -273.15 Celsius. I also remember how to convert Celsius to Fahrenheit: add 40, multiply by 9, divide by 5, and finally subtract 40. Folding the first two steps together means subtracting 273.15 and then adding 40, hence the 233.15 in the command. Here it is in one line:

PS C:\> (((Get-WmiObject -class "MSAcpi_ThermalZoneTemperature" -Namespace "root\WMI").CurrentTemperature / 10 - 233.15) * 9 / 5) - 40
124.79

Let's check the drive status:

PS C:\> Get-WmiObject -class MSStorageDriver_FailurePredictStatus -Namespace root\WMI | Select Active, PredictFailure

Active PredictFailure
------ --------------
  True          False

Good news, the drive is alive, and not predicted to die!

One other thing I like to check is the battery:

PS C:\> gwmi Win32_Battery | select est*

EstimatedChargeRemaining EstimatedRunTime
------------------------ ----------------
                      97              231

I can run for almost 4 hours. That's a long presentation, and a lot of slides to read.

Tuesday, May 25, 2010