Solaris Troubleshooting : Interpreting SCSI Errors from /var/adm/messages

While handling storage-related issues, a Unix administrator often has to decide whether a disk drive needs to be replaced. Interpreting SCSI message blocks and understanding their cause is an important part of that decision. This article will help you understand and interpret the various SCSI errors that appear in /var/adm/messages. For more information on SCSI, refer to the post “SCSI quick Reference for Unix Administrators”.

Several SCSI terms commonly appear in syslog:

  • Initiator — this is the SCSI standard name for what we commonly call  the host bus adapter (HBA).   This is the device which initiates commands, for example, read a disk or rewind a  tape.
  • Target — this is the SCSI standard name for the control port on a device such as a disk, tape, or library.   The term target is sometimes used (incorrectly) to refer to the logical unit, especially when the target supports only a single logical unit as is often the case for a disk or tape.   The distinction is clearer for RAID array controllers where a single target port may present several logical units.
  • Logical Unit — this is the SCSI standard name for a device such as a disk or tape which actually stores data.  It may also refer to non-storage devices such as tape libraries (media changer) or communications devices.
  • Logical Unit Number (LUN) — this is the address of a specific logical unit in the context of a specific initiator addressing a specific target to which the logical unit is mapped.   LUN is often (incorrectly) used to refer to the logical unit itself especially when there is  only one logical unit on the target.
  • I_T nexus — the path from a specific initiator (I) to a specific target (T). In an FC fabric environment there may be many target ports visible to an initiator. Similarly, there may be many initiators which can “see” a given target port. Each of these initiator/target pairs constitutes a distinct I_T nexus.
  • I_T_L nexus — a path from a specific initiator via a specific target port to a specific logical unit. In a multipath environment a single logical unit may have multiple paths for fault tolerance, load balancing, etc. Each such path from host to logical unit constitutes a distinct I_T_L nexus. A single logical unit may have several I_T_L nexuses when it is mapped to multiple I_T nexuses using the same (or, rarely, different) logical unit numbers.
  • Command Descriptor Block (CDB) — the information passed  from initiator to target and logical unit that instructs the logical unit to perform some action (for example: read, write, or rewind).
  • SCSI status — the highest level of information passed from logical unit and target to initiator which indicates the completion status of a CDB.  These include GOOD status, CHECK CONDITION, etc.
  • Sense Data — the response data from logical unit/target to initiator when a REQUEST SENSE command is issued. REQUEST SENSE is issued when the SCSI status of a CDB is abnormal (e.g. CHECK CONDITION). When the SCSI standards define the specific Sense Data returned, Solaris prints an interpretation in the message block; otherwise it is reported as vendor specific and requires device-specific information to decode.
  • Sense Key (SK) — part of the Sense Data. Common values include NO SENSE, MEDIA ERROR, HARDWARE ERROR, UNIT ATTENTION, and SOFT ERROR (Solaris prints the standard's MEDIUM ERROR and RECOVERED ERROR keys as Media Error and Soft Error).
  • Additional Sense Code (ASC) — part of the Sense Data.   Many ASC are defined in the SCSI standard while some are vendor specific and defined only in documents specific to the device.
  • Additional Sense Code Qualifier (ASCQ) — part of the Sense Data. The ASCQ further refines the SK/ASC. There may be vendor-specific values defined in addition to those defined in the SCSI standards; see the vendor's Interface or Product manuals for these.
  • FRU — the Field Replaceable Unit code, part of the Sense Data. The FRU further refines the SK/ASC/ASCQ. Again, there may be vendor-specific values in addition to those defined in the SCSI standards.
  • Transport Media — this is a general term that defines the basic technology of the SCSI communication. Early SCSI used the SCSI Parallel Interface (SPI), while later media use Fibre Channel Protocol (FCP, SCSI over Fibre Channel) and more recently Serial Attached SCSI (SAS). Another transport medium is iSCSI, which is SCSI over Internet Protocol (typically Ethernet). The SCSI-3 standards separate the command/status interaction from the physical transport media, so a common description of commands and status can be used across all transports. There may be additional “transport errors” which are specific to a particular transport; those are not discussed in this document.

We also need to define a few Solaris storage driver terms used to describe the various drivers in the SCSI device stack:

  • Target Driver — the Solaris driver at the top of the stack.  Typical target drivers are sd/ssd (“disk”), st (tape), and sgen (“generic” typically used for media changers) depending on the SCSI device type of the logical unit.  Each instance of the target driver corresponds to one I_T_L nexus (one path from host to a specific logical unit).   The term “target driver” is historical and somewhat confusing since it corresponds to the logical unit, not the target port.
  • SCSI Transport — the interface below the target drivers.   Encompasses generic “glue logic” and the HBA driver specific to the hardware and transport media in use.
  • HBA Driver — the bottom of the SCSI transport stack.   This driver is specific to the hardware used (though one driver may handle many variants/generations of the hardware in question).
  • SCSI_VHCI (SCSI Virtual Host Controller Interface; also known as MPxIO and/or STMS, Sun StorEdge Traffic Manager Software) — a layered driver that sits below the SCSI device target drivers (e.g. sd/ssd) and above one or more physical HBAs to provide a virtualized host controller interface. It implements multipathing for I/O load balancing and path fault tolerance, and provides a uniform naming convention for the underlying logical units (“disks”). Only a single sd/ssd instance exists for each multipathed logical unit (disk). Contrast this with Symantec Veritas Dynamic Multipathing (VXDMP) and EMC PowerPath (EMCPP), whose drivers sit above the SCSI target driver and have an sd/ssd instance for each path to the logical unit.
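
To make the SK/ASC/ASCQ/FRU terms concrete, here is a minimal Python sketch of how those fields sit inside fixed-format sense data (response code 0x70 or 0x71). The function name, the sense key table (which follows this article's spelling and covers only the values it mentions), and the hand-built sample buffer are all assumptions for illustration; this is not how Solaris itself is implemented.

```python
# Sense key names, spelled as in this article; only the values it mentions.
SENSE_KEYS = {
    0x0: "No Sense",
    0x1: "Soft Error",
    0x3: "Media Error",
    0x4: "Hardware Error",
    0x6: "Unit Attention",
}


def decode_fixed_sense(sense: bytes) -> dict:
    """Pull SK/ASC/ASCQ/FRU out of fixed-format sense data (hypothetical helper)."""
    if len(sense) < 15:
        raise ValueError("sense data too short to carry ASC/ASCQ/FRU")
    response_code = sense[0] & 0x7F      # top bit is the VALID flag
    if response_code not in (0x70, 0x71):
        raise ValueError("not fixed-format sense data")
    key = sense[2] & 0x0F                # sense key is the low nibble of byte 2
    return {
        "sense_key": SENSE_KEYS.get(key, hex(key)),
        "asc": sense[12],
        "ascq": sense[13],
        "fru": sense[14],
    }


# Hand-built 18-byte buffer matching Example 1 below:
# Hardware Error, ASC 0x15, ASCQ 0x01, FRU 0x83.
sample = bytes([0x70, 0, 0x04, 0, 0, 0, 0, 0x0A,
                0, 0, 0, 0, 0x15, 0x01, 0x83, 0, 0, 0])
decoded = decode_fixed_sense(sample)
print(decoded)
```

Descriptor-format sense data (response codes 0x72/0x73) lays these fields out differently and is deliberately rejected by this sketch.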

Sample Error Blocks from /var/adm/messages

Example 1: Sample Solaris SCSI Log Messages with CHECK CONDITION SCSI status

Below is a typical message block that appears in /var/adm/messages during a SCSI operation:
 
scsi: [ID 107833 kern.warning] WARNING: /scsi_vhci/disk@g5000c50008f7fd73 (sd333):
Error for Command: write(10) Error Level: Retryable
scsi: [ID 107833 kern.notice] Requested Block: 223244214 Error Block: 223242161
scsi: [ID 107833 kern.notice] Vendor: SEAGATE Serial Number: 0808T3RAT4
scsi: [ID 107833 kern.notice] Sense Key: Hardware Error
scsi: [ID 107833 kern.notice] ASC: 0x15 (mechanical positioning error), ASCQ: 0x1, FRU: 0x83
  • Line 1 identifies the device issuing the message.  In this case it is a scsi_vhci device at WWN 5000c50008f7fd73 known internally as sd instance 333.  In this instance the exact path (I_T_L nexus) is masked by the scsi_vhci driver.  That usually is not relevant as these messages reflect a condition of the logical unit.
  • Line 2 identifies the operation code of the CDB (WRITE 10) and the sd driver “Error Level”, in this case Retryable. The Error Level stays Retryable until the retry count is exhausted, at which point it becomes Fatal and the error is reported to the caller (filesystem, application, etc.).
  • Line 3 identifies the Requested Block (the one appearing in the CDB) and the Error Block (the block in which the error occurred). If the Requested Block and Error Block are the same then no data was transferred; this is common when the command is not actually started because the target/logical unit needed to report some changed status with UNIT ATTENTION.
  • Line 4 contains the logical unit SCSI Vendor ID (SEAGATE) and the logical unit serial number (0808T3RAT4).
  • Line 5 contains the Sense Key (SK, Hardware Error)
  • Line 6 contains the Additional Sense Code (ASC, 0x15), the Additional Sense Code Qualifier (ASCQ, 0x1) and the FRU (0x83). Combined with the sense key these identify the type of condition being reported. In this case it is the SCSI standard ASC/ASCQ “mechanical positioning error”. This may reflect a problem with the “actuator” which positions the head, a recorded servo issue, or an environmental issue such as mechanical vibration.
The “Error Level” is an artifact of the Solaris target driver and lies outside the SCSI protocol. As long as the message block reports Error Level Retryable, no error is visible to the higher-level caller, which sees only the delay while retries are attempted. Once the retry count is exhausted, Error Level Fatal is reported and the higher-level caller receives an error notification.
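
The line-by-line reading above can be sketched as a small parser. This is only an illustration in Python: the `FIELDS` table and `parse_block` are hypothetical names, and the regular expressions are written against the sample block shown in Example 1, so other message layouts may need adjusting.

```python
import re

SAMPLE = """\
scsi: [ID 107833 kern.warning] WARNING: /scsi_vhci/disk@g5000c50008f7fd73 (sd333):
Error for Command: write(10) Error Level: Retryable
scsi: [ID 107833 kern.notice] Requested Block: 223244214 Error Block: 223242161
scsi: [ID 107833 kern.notice] Vendor: SEAGATE Serial Number: 0808T3RAT4
scsi: [ID 107833 kern.notice] Sense Key: Hardware Error
scsi: [ID 107833 kern.notice] ASC: 0x15 (mechanical positioning error), ASCQ: 0x1, FRU: 0x83
"""

# One pattern per field; each captures the value in group 1.
FIELDS = {
    "path": r"WARNING: (\S+) \(",
    "instance": r"\((\w+)\):",
    "command": r"Error for Command: (.+?)\s+Error Level:",
    "error_level": r"Error Level: (\w+)",
    "requested_block": r"Requested Block: (\d+)",
    "error_block": r"Error Block: (\d+)",
    "vendor": r"Vendor: (.+?)\s+Serial Number:",
    "serial": r"Serial Number: (\S+)",
    "sense_key": r"Sense Key: (.+)",
    "asc": r"\bASC: (0x[0-9a-fA-F]+)",
    "ascq": r"\bASCQ: (0x[0-9a-fA-F]+)",
    "fru": r"\bFRU: (0x[0-9a-fA-F]+)",
}


def parse_block(text: str) -> dict:
    """Turn one SCSI message block into a dict of its fields."""
    record = {}
    for name, pattern in FIELDS.items():
        match = re.search(pattern, text)
        if match:
            record[name] = match.group(1).strip()
    for name in ("requested_block", "error_block"):
        if name in record:
            record[name] = int(record[name])
    for name in ("asc", "ascq", "fru"):
        if name in record:
            record[name] = int(record[name], 16)  # base 16 accepts the 0x prefix
    return record


record = parse_block(SAMPLE)
# True when no data was transferred (Requested Block equals Error Block).
no_data_moved = record["requested_block"] == record["error_block"]
print(record["instance"], record["sense_key"], hex(record["asc"]), no_data_moved)
```

Here the two block numbers differ, so part of the write had been processed before the positioning error was reported.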

Example 2: Sample SCSI message block reporting a UNIT ATTENTION sense key

scsi: [ID 107833 kern.warning] WARNING: /scsi_vhci/disk@g5000c50008371683 (sd372):
Error for Command: read(10) Error Level: Retryable
scsi: [ID 107833 kern.notice] Requested Block: 363528051 Error Block: 363528051
scsi: [ID 107833 kern.notice] Vendor: ATA Serial Number: 5QJ068D2
scsi: [ID 107833 kern.notice] Sense Key: Unit Attention
scsi: [ID 107833 kern.notice] ASC: 0x29 (power on, reset, or bus reset occurred),  ASCQ: 0x0, FRU: 0x0
  • This is sd instance 372 on a scsi_vhci device. It was a READ 10 operation and the retry count was not exhausted. No data was transferred.
  • The Sense Key is UNIT ATTENTION and the ASC is 0x29 (reset). This is how a target/logical unit reports that it has been reset. Different ASCQ/FRU values indicate various reasons or sources; ASCQ: 0x0, FRU: 0x0 is the generic reset without further details.
  • UNIT ATTENTION on a pending or incoming command is the only mechanism the SCSI target/logical unit has to communicate some change in status at the target/logical unit.
  • UNIT ATTENTION does not reflect any error related to the command being reported. That command is merely the first available “victim” which can carry the SCSI status “CHECK CONDITION” and Sense Key “UNIT ATTENTION”. This is NOT a device error related to the SCSI command but only a notification of some status change, and it is not in and of itself a reason to suspect the target/logical unit is defective. The status is reported for each I_T_L nexus, so these messages may appear in bursts for a multipathed logical unit or for a target presenting multiple logical units. A simple reset, possibly initiated by operator action, can therefore produce a large number of them.
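
Because one reset can produce a burst of UNIT ATTENTION messages, it can help to count them per device before reading anything into them. The sketch below is a hedged Python illustration: `count_unit_attentions` is a hypothetical helper, the records are plain dicts, and the sd373/sd374 entries are invented purely to show a burst.

```python
from collections import Counter


def count_unit_attentions(records):
    """Count UNIT ATTENTION reports per device instance."""
    counts = Counter()
    for rec in records:
        # The samples in this article spell the key both
        # "Unit Attention" and "Unit_Attention".
        key = rec["sense_key"].replace("_", " ").lower()
        if key == "unit attention":
            counts[rec["instance"]] += 1
    return counts


# Illustrative records; only sd372 and sd333 come from this article's examples.
records = [
    {"instance": "sd372", "sense_key": "Unit Attention", "asc": 0x29},
    {"instance": "sd373", "sense_key": "Unit Attention", "asc": 0x29},
    {"instance": "sd374", "sense_key": "Unit_Attention", "asc": 0x29},
    {"instance": "sd333", "sense_key": "Hardware Error", "asc": 0x15},
]
burst = count_unit_attentions(records)
print(burst)
```

Several devices each reporting a single ASC 0x29 at about the same time points to one reset event rather than several failing drives. The Hardware Error record is left out of the count and still needs its own review.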

Example 3:  SCSI reporting an ASC related to “Predictive Failure Analysis” or “S.M.A.R.T trip”

 
scsi: [ID 107833 kern.warning] WARNING: /scsi_vhci/disk@g5000c50008f7f70f (sd355):
Error for Command: read capacity Error Level: Informational
scsi: [ID 107833 kern.notice] Requested Block: 0 Error Block: 0
scsi: [ID 107833 kern.notice] Vendor: SEAGATE Serial Number: 0808T3RRNY
scsi: [ID 107833 kern.notice] Sense Key: Soft Error
scsi: [ID 107833 kern.notice] ASC: 0x5d (drive operation marginal, service immediately \
  (failure prediction threshold exceeded)), ASCQ: 0x0, FRU: 0x10
In this scenario:
  • sd instance 355, accessed via scsi_vhci, reports the condition on a READ CAPACITY command. READ CAPACITY has no LBA argument, so the Requested Block and Error Block are reported as zeros.
  • The Sense Key is “Soft Error” and the ASC is 0x5d — a special ASC returned when the internal drive monitoring/diagnostics has identified one of several conditions the vendor believes is “out of range” and suggests immediate drive replacement.
  • This is  a determination made in the drive firmware according to vendor proprietary guidelines and should warrant immediate drive replacement to avoid potential data loss and/or drive failure.
  • This condition is normally reported once for each I_T_L nexus but will be reported again if  the connection is lost and reestablished (e.g. by rebooting).
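
Since the same predictive-failure condition can be logged once per I_T_L nexus and again after a reconnect, a replacement list is easier to read when keyed on the drive serial number. A minimal Python sketch follows, assuming records shaped like plain dicts; `drives_flagged_for_replacement` is a hypothetical name.

```python
PFA_ASC = 0x5D  # "failure prediction threshold exceeded", as in Example 3


def drives_flagged_for_replacement(records):
    """Return sorted, de-duplicated (vendor, serial) pairs that reported ASC 0x5d."""
    flagged = {(r["vendor"], r["serial"]) for r in records if r.get("asc") == PFA_ASC}
    return sorted(flagged)


records = [
    {"vendor": "SEAGATE", "serial": "0808T3RRNY", "asc": 0x5D},  # Example 3
    {"vendor": "SEAGATE", "serial": "0808T3RRNY", "asc": 0x5D},  # same drive, logged again
    {"vendor": "ATA", "serial": "5QJ068D2", "asc": 0x29},        # Example 2, reset notification
]
flagged = drives_flagged_for_replacement(records)
print(flagged)
```

The duplicate entry collapses to a single drive, and the reset notification from Example 2 is not flagged.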

Example 4: Sample SCSI messages related to a RAID-level administration operation

scsi: [ID 107833 kern.warning] WARNING: /pci@400/pci@0/pci@d/SUNW,emlxs@0/fp@0,0/ssd@w50000972085881b0,3a (ssd9830):
Error for Command: read(10) Error Level: Retryable
scsi: [ID 107833 kern.notice] Requested Block: 9773024 Error Block: 9773024
scsi: [ID 107833 kern.notice] Vendor: EMC Serial Number: 64!oH000F
scsi: [ID 107833 kern.notice] Sense Key: Unit_Attention
scsi: [ID 107833 kern.notice] ASC: 0x3f (reported LUNs data has changed), ASCQ: 0xe, FRU: 0x0
In this scenario:
  • ssd instance 9830, via an Emulex HBA (emlxs), is reporting a CHECK CONDITION with UNIT ATTENTION and ASC: 0x3f for a READ 10 issued to an EMC logical unit; no data was transferred.
  • In this particular case, the ASC: 0x3f “reported LUNs data has changed” likely reflects a change in target port level information, typically the array-side mapping or unmapping of a logical unit for this I_T nexus.
  • Solaris will respond by issuing a REPORT LUNS, updating the kernel information regarding this I_T nexus, and retrying this READ 10, which we expect to succeed.
  • Like most UNIT ATTENTION conditions this does not reflect a failure of any kind, merely a notification of change in configuration.   The READ 10 will be retried and under normal conditions will complete normally.
  • No defect is indicated here. In any case, this logical unit is an EMC RAID volume, so even if the ASC did suggest a defect and replacement, that would be the responsibility of the array support team (EMC in this case).
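
The device paths in Examples 1 and 4 have different shapes: the scsi_vhci path carries only a logical unit identifier, while the physical path names the target port and LUN. The Python sketch below pulls those pieces apart. It assumes the leaf node has the form `ssd@w<target port WWN>,<LUN in hex>` or `disk@g<identifier>`, as in the samples; `describe_path` and its key names are hypothetical.

```python
import re


def describe_path(path: str) -> dict:
    """Split the leaf node of a Solaris device path into its addressing parts."""
    leaf = path.rsplit("/", 1)[-1]            # e.g. "ssd@w50000972085881b0,3a"
    name, _, address = leaf.partition("@")
    info = {"multipathed": path.startswith("/scsi_vhci/"), "node": name}
    match = re.fullmatch(r"w([0-9a-fA-F]{16}),([0-9a-fA-F]+)", address)
    if match:
        # Physical path: one I_T_L nexus (target port WWN plus LUN).
        info["target_port_wwn"] = match.group(1)
        info["lun"] = int(match.group(2), 16)  # assumed to be hexadecimal
    elif address.startswith("g"):
        # scsi_vhci path: the individual I_T_L nexus is hidden.
        info["lu_id"] = address[1:]
    return info


emc = describe_path(
    "/pci@400/pci@0/pci@d/SUNW,emlxs@0/fp@0,0/ssd@w50000972085881b0,3a"
)
vhci = describe_path("/scsi_vhci/disk@g5000c50008f7fd73")
print(emc)
print(vhci)
```

Under that assumption, the Example 4 path addresses LUN 58 (0x3a) behind target port 50000972085881b0, whereas the Example 1 path identifies the logical unit but not which path carried the command.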
 
Ramdev
