Troubleshooting fault LEDs on a FAS system

Introduction

Modern FAS systems include a number of amber fault LEDs strategically located to help the operator identify Field Replaceable Units (FRUs) that need attention.

Most FRUs are contained within another FRU. For example:

  • Controller, IOXM, power supply, fan and disk drive FRUs are contained within a chassis FRU
  • PCI card FRUs are contained in both controller and IOXM FRUs
  • DIMM and boot device FRUs are contained in controller FRUs

When a FRU needs operator attention, its fault LED is illuminated. If that FRU is inside another FRU, the outer FRU’s fault LED is also illuminated. This process repeats for each containing FRU until the outermost FRU is reached, resulting in a path of amber fault LEDs that can be followed to find the innermost FRU that needs attention.

While not all products have FRUs of all types, the hierarchy of FRUs is the same on all FAS products. Starting with the outermost FRU, the hierarchy appears similar to the following:

Chassis
\
+- Power Supply
+- Fan
+- Disk Drive
+- Controller
|   \
|    + PCI Card
|    + DIMM
|    + NV-DIMM
|    + Boot Device
|    + Coin Cell Battery
|    + NVMEM Battery
+- IOXM
    \
     + PCI Card

For example, if a DIMM in the controller requires attention, its fault LED will be lit along with the controller’s fault LED and the chassis fault LED.
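
The path of lit LEDs can be illustrated with a small Python sketch. This is only a model of the hierarchy shown above, not anything Data ONTAP exposes: the PARENT table, the "Controller PCI Card" and "IOXM PCI Card" labels, and the lit_led_path function are hypothetical names introduced for illustration.

```python
# Hypothetical model of the FRU containment hierarchy shown above.
# Each key is a FRU; each value is the FRU that contains it (None = outermost).
PARENT = {
    "Chassis": None,
    "Power Supply": "Chassis",
    "Fan": "Chassis",
    "Disk Drive": "Chassis",
    "Controller": "Chassis",
    "IOXM": "Chassis",
    "Controller PCI Card": "Controller",
    "DIMM": "Controller",
    "NV-DIMM": "Controller",
    "Boot Device": "Controller",
    "Coin Cell Battery": "Controller",
    "NVMEM Battery": "Controller",
    "IOXM PCI Card": "IOXM",
}


def lit_led_path(faulty_fru):
    """Return the FRUs whose fault LEDs are lit, outermost first."""
    if faulty_fru not in PARENT:
        raise KeyError(f"unknown FRU: {faulty_fru}")
    path = []
    fru = faulty_fru
    # Walk outward from the faulty FRU until the outermost FRU is reached.
    while fru is not None:
        path.append(fru)
        fru = PARENT[fru]
    return list(reversed(path))


print(lit_led_path("DIMM"))
```

Reading the returned list from left to right mirrors how an operator follows the LEDs from the chassis inward to the FRU that needs attention.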

FRU fault LEDs that are not visible from the outside of the system remain on when the containing FRU (usually a controller or IOXM) is removed from the chassis. This allows the FRU requiring attention to be easily located. Current versions of Data ONTAP do not detect when FRUs have been serviced and therefore do not turn off the FRU fault LEDs after the faulty FRU has been replaced. As a result, even after replacing a faulty FRU that is not visible from outside, the path of fault LEDs will remain on until Data ONTAP is explicitly commanded to turn them off. Typically, the fault LEDs can be turned off either by running the disruptive halt -s command or, non-disruptively, by running the diag-privilege command fru_led off all.

Current versions of Data ONTAP do not maintain a database of faults that have occurred. Instead, when a fault occurs, a notification message is logged and the hierarchy of fault LEDs is lit to create a path of LEDs to the faulty FRU. As such, determining the cause of the fault requires some investigation.

The following sections describe various fault LED conditions and how to troubleshoot them:

  1. Section H1 – Start here
  2. Section D1 – Disk fault
  3. Section H2 – Try using the fru_led command
  4. Section H3 – IOXM internal FRU fault
  5. Section H4 – fru_led command not available
  6. Section H5 – Internal FRU fault
  7. Section H6 – Clear internal FRU fault LEDs
  8. Section E1 – Adverse environmental condition reported by Data ONTAP
  9. Section E2 – Adverse environmental condition reported by SP
  10. Section C1 – Check for Data ONTAP configuration errors
  11. Section B1 – Potential bug: open a case

Cause

Conditions that light the fault LEDs fall loosely into the following three categories:

  • (H) Hardware errors caused by faults in individual FRUs
  • (E) Environmental conditions that are outside of the range specified by the hardware design
  • (C) Configuration errors in Data ONTAP

In general, remediation of category (H) conditions requires physical interaction with the identified FRU. This might involve reseating a DIMM, swapping DIMMs that were placed in the wrong slots, or replacing a FRU that is defective.

Remediation of category (E) conditions usually requires modification of the environment, such as in the case of over-temperature.

Remediation of category (C) conditions does not require attending to the hardware. This category typically requires some action by the administrator at the user interface, such as performing a giveback, rebooting from maintenance mode or re-enabling HA failover. In Data ONTAP versions 8.1.3RC1 and 8.2RC1 or later, most of the misconfiguration conditions that light the fault LEDs were removed as they tended to cause more confusion than clarity.

Before diagnosing why the fault LED is on, ensure that Data ONTAP is running normally; that is, not in maintenance mode or takeover mode. When Data ONTAP is not running normally, the controller and chassis fault LEDs will be on.

Solution

Perform the following steps to investigate why the fault LED is on:

Section H1 – Start here

  1. Determine which version of Data ONTAP is being run on the system
  2. Physically inspect the system. Determine if:
    • The chassis fault LED is on
    • The controller fault LED is on
    • The IOXM fault LED is on (for systems with an IOXM)
    • One or more embedded disks have a fault LED on (for systems with disks in the controller’s chassis)
  3. Using the following table, find the FIRST line that matches the state of the LEDs inspected in Step 2. Continue diagnosis at the section listed in the last column of that row. Note that a dash (-) in the following table means that the LED’s state is not important for that particular row and can be in either the on or off state:

Embedded                             Continue
Disk      Chassis  Controller  IOXM  diagnosis at
--------  -------  ----------  ----  ------------
On        On       -           -     Section D1
Off       On       Off         Off   Section E1
Off       On       On          -     Section C1
Off       On       -           On    Section H3
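
The first-match rule for this table can be sketched in Python as follows. The ROWS list and the next_section function are hypothetical names used only for illustration; None stands in for the dash (-), meaning the LED’s state is ignored for that row.

```python
# Each row: (embedded disk, chassis, controller, IOXM) LED states -> section.
# True = on, False = off, None = dash (state is not important for that row).
ROWS = [
    ((True, True, None, None), "D1"),
    ((False, True, False, False), "E1"),
    ((False, True, True, None), "C1"),
    ((False, True, None, True), "H3"),
]


def next_section(disk, chassis, controller, ioxm):
    """Return the section of the FIRST row matching the observed LED states."""
    observed = (disk, chassis, controller, ioxm)
    for pattern, section in ROWS:
        if all(want is None or want == seen
               for want, seen in zip(pattern, observed)):
            return section
    return None  # no row matches, for example when no fault LED is on


# Controller and IOXM LEDs both on: the C1 row is listed first, so it wins.
print(next_section(disk=False, chassis=True, controller=True, ioxm=True))
```

Because rows are tried in order, a system with both the controller and IOXM fault LEDs on is sent to Section C1 rather than Section H3.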

Section D1 – Disk fault

  • The chassis fault LED and disk fault LED(s) are on because of a disk fault
  • Remedy the disk fault(s)
  • If any fault LEDs remain on, start over from the top of Section H1.

Section H2 – Try using the fru_led command

  1. From the Data ONTAP nodeshell CLI prompt, run the following commands:
    node> priv set diag
    node*> fru_led status
  2. If Data ONTAP reports that the fru_led command is not found, go to Section H4.
    Otherwise, fru_led status reports the state of the controller and IOXM fault LEDs and the state of any internal FRU fault LEDs.
  3. If any of the internal FRU fault LEDs (DIMM, DIMM-NV, PCI slot, NV battery, Boot device or Coin cell) are on, go to Section H5. Otherwise, continue with this section.
  4. If fru_led status reports that the controller LED is ‘lit by SP’, run the following command from the SP CLI prompt:
    SP node> system sensors
  5. Look for abnormal readings for sensors, specifically NV Battery, RTC Battery, Temperature, Voltage and Current.
  6. If no cause is identified, go to Section E1. Otherwise, if any fault LEDs remain on, start over from the top of Section H1.

Section H3 – IOXM internal FRU fault

The IOXM and chassis fault LEDs are on because a fault was detected on a PCI plug-in card that is located inside the IOXM.

  1. If you are running Data ONTAP 7-Mode, run the following Data ONTAP command:
    node> savecore -l
    If you are running clustered Data ONTAP, run the following Data ONTAP command:
    cluster1::> system node coredump show -instance
  2. Look for a core with a panic string referring to a PCI error on the IO Expansion.
    You might also find details of the fault stored in /etc/log/ssram/ssram.log, although not all FAS products generate this file.
  3. If you are running Data ONTAP 8.1.3RC1, 8.2RC1, or later, you can also use the following Data ONTAP commands to view which PCI slot FRU fault LEDs are currently on:
    node> priv set diag
    node*> fru_led status
  4. If a cause cannot be identified, go to Section B1. Otherwise, continue with this section.
  5. After servicing the FRU in question, you must manually turn off the FRU fault LEDs. Go to Section H6.

Section H4 – fru_led command not available

Note: Versions of Data ONTAP prior to 8.1.3RC1 and 8.2RC1 do not have the fru_led command.
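
As a rough aid, the version rule in the note above can be expressed as a Python check. This is a sketch under stated assumptions: has_fru_led is a hypothetical helper, it compares only the numeric part of the version string, and it assumes that 8.1.3RC1 and 8.2RC1 are the earliest releases in their respective families.

```python
import re


def has_fru_led(version):
    """Return True if this Data ONTAP version should have the fru_led command.

    Assumption: only the numeric fields matter, so any 8.1.3 release and any
    release numbered 8.2 or higher qualifies, whatever its suffix (RC1, P4...).
    """
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    # A missing third field (as in "8.2RC1") is treated as 0.
    major, minor, patch = (int(part) if part else 0 for part in match.groups())
    return (major, minor, patch) >= (8, 1, 3)


for v in ("8.1.2P4", "8.1.3RC1", "8.2RC1", "8.2.1"):
    print(v, has_fru_led(v))
```

If the check returns False, the fru_led steps in the other sections do not apply and the SP-based steps below are the starting point.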

  1. At the SP CLI prompt, run the following command:
    SP node> priv set diag; system fru led show 0
  2. If FRU LED ID 0 is reported as on, go to Section E2. Otherwise, continue with this section.
  3. Follow Steps 1 and 2 in Section E1.
  4. If no cause has been identified, continue with this section. Otherwise, if any fault LEDs remain on, start over from the top of Section H1.
    Note: The next step requires the removal of the controller, which is potentially a disruptive operation.
  5. During a convenient service window, perform a takeover from the HA partner (if required) and then remove the controller from the chassis.
  6. Inspect the motherboard for any LEDs that remain on. If any LEDs remain on:
    1. Note the FRUs nearest to the LEDs that remain on
    2. Reinstall the controller in the chassis and wait for Data ONTAP to boot normally; a giveback from the HA partner might be required
    3. Go to Section H5
  7. If no cause is identified, go to Section B1

Section H5 – Internal FRU fault

  1. If a DIMM-NV FRU fault light is on, first identify if the product’s NVRAM sub-system uses a portion of the main system memory or has dedicated memory. To perform this check, run the following Data ONTAP nodeshell command:
    node> sysconfig -a 0

    If the System Board section contains a sub-section named NVMEM Size, your product uses a portion of system memory; go to Step 2. Otherwise, your product has dedicated memory; continue with this step. The DIMM-NV FRU fault light is on due to one of the following:

    a. Detection of an unsupported or incorrectly sized DIMM
    b. Detection of an uncorrectable ECC error
      In case (a), Data ONTAP will emit nvram.hw.initFail EMS messages during bootup followed by an nv.none EMS message, after which Data ONTAP will halt and stop at the LOADER prompt.
      In case (b), Data ONTAP will panic, reporting an uncorrectable memory error on NVRAM in slot 0.
  2. If a DIMM (or DIMM-NV, for products that use a portion of shared memory) FRU fault light is on, the cause is one of the following:
    a. Detection of a DIMM installed in the wrong slot
    b. Detection of a missing or unresponsive DIMM
    c. Detection of an uncorrectable ECC error
      In cases (a) and (b), Data ONTAP will halt at the LOADER prompt after reporting the error on the serial console. If you do not have a serial console attached, you can run the SP CLI system log command to review serial console history.
      In case (c), Data ONTAP will have panicked to protect data integrity. Run the following Data ONTAP command:
      (7-Mode) node> savecore -l 
      (cDOT) cluster::> system node coredump show -instance 
      Look for a core with a panic string referring to an ECC error. You might also find details of the fault stored in /etc/log/ssram/ssram.log, although not all FAS products generate this file.
  3. If a PCI slot fault LED is on, a fault was detected on the PCI plug-in card in that slot. Data ONTAP will have panicked to protect data integrity. Run the following Data ONTAP command:
    (7-Mode) node> savecore -l
    (cDOT) cluster::> system node coredump show -instance
    Look for a core with a panic string referring to a PCI error on the Controller.
    You might also find details of the fault stored in /etc/log/ssram/ssram.log, although not all FAS products generate this file.
  4. If the Boot device fault LED is on, the cause is one of the following:
    a. The Boot device is absent
    b. The installed Boot device is not approved for use in FAS products
    c. Data ONTAP has detected a fault with the installed Boot device
      In cases (a) and (b), the Ignoring unsupported device boot_i message will be emitted to the console. Run the following Data ONTAP nodeshell command:
      node> sysconfig 0
      Look for a device named boot0 on slot 0. The absence of such a device indicates that it is either missing (case a) or has been disabled (case b).
      In case (c), run the following Data ONTAP nodeshell command:
      node> environ chassis list-sensors
      Look for the sensor named Usbmon Status. A WARN status indicates that Data ONTAP detected a fault on the device and illuminated the Boot device fault LED. You will also find the following EMS message has been logged: usbmon.boot.device.pfa: Failure predicted for boot device boot_i
  5. If the NV battery fault LED is on, run the following Data ONTAP nodeshell command:
    node> environ chassis list-sensors
    Look for sensors named NVRAM Batt Rem. Capacity and NVRAM Batt Voltage. A critlow status indicates an issue with the NVRAM battery. Also, the following EMS messages are generated for NVRAM battery faults:

    • monitor.shutdown.nvramLow.Battery.pending
    • nvmem.battery.capacity.low
    • nvmem.battery.voltage.high
    • monitor.shutdown.emergency
  6. If the Coin Cell fault LED is on, the SP has detected one of the following:
    • The coin cell battery is missing
    • The coin cell’s voltage is out of the expected range
      When Data ONTAP detects the coin cell’s sensor going out of range, it will emit the following EMS message:
      [monitor.chassisPower.degraded] Bat 3.0V is critical high (3500).
      Replace the CMOS coin cell battery.
      The sensor can also be directly inspected by running the following Data ONTAP nodeshell command:
      node> environ chassis list-sensors
      and checking the status of the Bat 3.0V sensor.
  7. Go to Section H6

Section H6 – Clear internal FRU fault LEDs

Perform the following steps:

  1. Run the following Data ONTAP nodeshell commands to manually clear the internal FRU fault LEDs:
    node> priv set diag
    node*> fru_led off all
  2. If Data ONTAP reports that the fru_led command is not found, run the following Data ONTAP nodeshell command during a convenient service window. Your HA partner will perform a takeover:
    node> halt -s
  3. Boot Data ONTAP normally, performing a giveback from the HA partner, if required.
  4. If any fault LEDs remain on, start over from the top of Section H1.

Section E1 – Adverse environmental condition reported by Data ONTAP

  1. Run the following Data ONTAP nodeshell command:
    node> environ chassis list-sensors
  2. In the output of this command, look for sensors that are marked as failed instead of normal.
    If any sensors are marked as failed, the fault LED is on because of the failed sensor readings.
  3. If no cause has been identified, go to Section B1. Otherwise, if any fault LEDs remain on, start over from the top of Section H1
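
The check in Step 2 amounts to filtering sensor readings by status. The following Python sketch assumes the readings have already been copied into a dictionary of sensor name to status; it does not parse the real output of environ chassis list-sensors, whose exact layout is not reproduced here. The abnormal_sensors function is a hypothetical name.

```python
def abnormal_sensors(readings):
    """Return the names of sensors whose status is not 'normal', sorted."""
    return sorted(
        name for name, status in readings.items()
        if status.strip().lower() != "normal"
    )


# Hand-entered example readings (illustrative values only).
readings = {
    "Bat 3.0V": "failed",
    "Usbmon Status": "normal",
    "NVRAM Batt Voltage": "critlow",
}
print(abnormal_sensors(readings))
```

Treating every status other than normal as suspect is a deliberate simplification, so that statuses such as WARN or critlow are also flagged for review.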

Section E2 – Adverse environmental condition reported by SP

  1. At the SP CLI prompt, run the following command:
    SP node> system sensors
  2. Check the Status column for failed sensors.
  3. If any are found, the controller and chassis fault LEDs are on because of the failed sensors.
  4. If no cause has been identified, go to Section B1. Otherwise, if any fault LEDs remain on, start over from the top of Section H1.

Section C1 – Check for Data ONTAP configuration errors

  1. If your system is running Data ONTAP 8.1.3RC1, 8.2RC1, or later, go to Section H2. Otherwise, continue in this section.
  2. In Data ONTAP versions prior to 8.1.3RC1 and 8.2RC1, the controller and chassis fault LEDs are lit when one of the following three configuration conditions exists:
    1. NFS is licensed, but not enabled
    2. CIFS is licensed, but not enabled
    3. CF is licensed, but not enabled
  3. At the Data ONTAP nodeshell CLI prompt, run the following commands:
    node> license
    node> nfs status
    node> cifs status
    node> cf status
  4. If any of nfs, cifs or cf is listed as licensed (by the license command) but listed as disabled by the corresponding status command, the controller and chassis fault LEDs will be lit. Either delete the license or enable the feature to clear the fault condition.
  5. If any fault LEDs remain on, go to Section H2
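
The licensed-but-not-enabled test in Step 4 can be sketched in Python. The sketch assumes the results of the license and status commands have been recorded by hand as two dictionaries; misconfigured_features and FEATURES are hypothetical names, and no command output is parsed.

```python
# The three features that light the fault LEDs on older Data ONTAP versions.
FEATURES = ("nfs", "cifs", "cf")


def misconfigured_features(licensed, enabled):
    """Return the features that are licensed but not enabled, in FEATURES order."""
    return [
        feature for feature in FEATURES
        if licensed.get(feature, False) and not enabled.get(feature, False)
    ]


# Example: CIFS is licensed but disabled, so it would light the fault LEDs.
licensed = {"nfs": True, "cifs": True, "cf": False}
enabled = {"nfs": True, "cifs": False, "cf": False}
print(misconfigured_features(licensed, enabled))
```

An empty result means none of the three conditions applies, and diagnosis continues at Section H2 if any fault LEDs remain on.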

Section B1 – Potential bug: open a case

  1. Check the public reports of BUG 686701 and BUG 745010
  2. If no cause has been identified, you might have encountered a new bug in Data ONTAP. Collect the following information from your system and open a new technical case with NetApp:
    1. From Data ONTAP nodeshell, run the following commands:
      node> priv set diag
      node*> options autosupport.doit <case#>
      node*> sysconfig -a
      node*> fru_led status (if the command is not found or not supported on the platform, skip this step)
      node*> show_faults
      node*> environ chassis list-sensors
      node*> rdfile /etc/log/ssram/ssram.log
      node*> savecore -l
    2. From the SP CLI:
      SP node> priv set diag
      SP node*> system sensors
      SP node*> events all
      SP node*> sp status -d
  3. If any fault LEDs remain lit (on), perform the following steps to clear them:
    1. Reboot the SP:
      sp reboot
    2. If the SP reboot does not clear them, run the following Data ONTAP nodeshell command during a convenient service window. Your HA partner will perform a takeover:
      node> halt -s
    3. Boot Data ONTAP normally, performing a giveback from the HA partner, if required.
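
The information-gathering commands listed in this section can be kept as a simple checklist. The Python sketch below only assembles the command strings given above so they can be pasted into a session; it does not connect to a system. The collection_commands function and the case number used in the example are hypothetical.

```python
def collection_commands(case_number, fru_led_available=True):
    """Return the nodeshell and SP commands to run when opening a case."""
    nodeshell = [
        "priv set diag",
        f"options autosupport.doit {case_number}",
        "sysconfig -a",
        "fru_led status",
        "show_faults",
        "environ chassis list-sensors",
        "rdfile /etc/log/ssram/ssram.log",
        "savecore -l",
    ]
    if not fru_led_available:
        # Skip this step where fru_led is not found or not supported.
        nodeshell.remove("fru_led status")
    sp = ["priv set diag", "system sensors", "events all", "sp status -d"]
    return {"nodeshell": nodeshell, "sp": sp}


# Hypothetical case number, for illustration only.
for shell, commands in collection_commands("12345").items():
    for command in commands:
        print(f"[{shell}] {command}")
```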
Arco
