Solaris Troubleshooting SDS/SVM : "device busy too long" error from sd / ssd

This message does not indicate a connectivity problem with the target device. It appears when the device has returned SCSI status 8 (“busy”) to the host, and a target can only return a SCSI status if it is able to communicate with the host.
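To make the status value concrete, here is a minimal sketch that maps a few well-known SCSI status bytes to names. The enum and the helper `ex_scsi_status_name()` are hypothetical names chosen for this illustration and are not taken from sd / ssd; only the numeric status codes come from the SCSI standard.

```c
#include <stdint.h>

/* A few status byte values from the SCSI Architecture Model. */
enum ex_scsi_status {
    EX_STATUS_GOOD                 = 0x00,
    EX_STATUS_CHECK_CONDITION      = 0x02,
    EX_STATUS_BUSY                 = 0x08, /* the status discussed in this article */
    EX_STATUS_RESERVATION_CONFLICT = 0x18,
    EX_STATUS_TASK_SET_FULL        = 0x28
};

/* Hypothetical helper: return a printable name for a SCSI status byte. */
const char *ex_scsi_status_name(uint8_t status)
{
    switch (status) {
    case EX_STATUS_GOOD:                 return "GOOD";
    case EX_STATUS_CHECK_CONDITION:      return "CHECK CONDITION";
    case EX_STATUS_BUSY:                 return "BUSY";
    case EX_STATUS_RESERVATION_CONFLICT: return "RESERVATION CONFLICT";
    case EX_STATUS_TASK_SET_FULL:        return "TASK SET FULL";
    default:                             return "UNKNOWN";
    }
}
```

Calling `ex_scsi_status_name(8)` yields "BUSY", which is the status that leads to the message described here.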

Here is an example of the error message. In this case, the host is running MPxIO (STMS) to a Fibre Channel array using the ssd target driver:

[date / time] [hostname] scsi: [ID 107833 kern.warning] WARNING: /scsi_vhci/ssd@g60000000000000000000000000000000 (ssd0):
[date / time] [hostname]       device busy too long

A default number of retries is used by sd / ssd for retrying a “busy” status, unless the driver has specific parameters for a target device which needs additional retries for a “busy” status (e.g. T3, 6120 and some other targets).

If the target device reports back a SCSI busy status, then sd / ssd retries that I/O up to its default number of retries (3 for ssd, 5 for sd), with a delay of 5 seconds (SSD_BSY_TIMEOUT) before each retry.

For example, from ssd.c:

/*
 * Set the busy retry count to the default value of ssd_retry_count.
 * This can be overridden by entries in ssd.conf or the device
 * config table.
 */
un->un_busy_retry_count = ssd_retry_count;

The driver waits 5 seconds before sending each retry. For example, from ssd.c:

/* Restart the command (w/o reset) */
ssd_requeue_cmd(un, bp, SSD_BSY_TIMEOUT);
action = JUST_RETURN;
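The retry count and the delay together bound how long an I/O can wait on a busy target before it is failed. The following is a rough sketch of that arithmetic only; `ex_busy_wait_seconds()` is a hypothetical helper, and it deliberately ignores command transport time and the time taken by the reset attempt.

```c
/*
 * Hypothetical helper: approximate total time, in seconds, spent waiting
 * between busy retries before the I/O is failed.
 * Assumes one fixed delay before every retry and nothing else.
 */
int ex_busy_wait_seconds(int busy_retry_count, int delay_seconds)
{
    if (busy_retry_count < 0 || delay_seconds < 0)
        return 0; /* treat nonsensical input as "no waiting" */
    return busy_retry_count * delay_seconds;
}
```

With the defaults quoted above, this gives roughly 15 seconds of waiting for ssd (3 retries at 5 seconds) and 25 seconds for sd (5 retries at 5 seconds).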

During the retries, there will be one attempt by sd / ssd to reset the target device. However, if the target device continues to respond with a busy status, then sd / ssd will eventually fail that I/O after all retries have been exhausted.

For example, from ssd.c:

/* Max retries reached; fail the command */
scsi_log(SSD_DEVINFO, ssd_label, CE_WARN,
    "device busy too long\n");
action = COMMAND_DONE_ERROR;

The code path for COMMAND_DONE_ERROR ends up returning EIO (errno 5) to the caller. For further details, see the source code in ssd_check_error().
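The overall behaviour (retry on busy, one reset attempt, then fail with EIO) can be modelled with a small simulation. This is a simplified sketch and not the driver's code: every name prefixed `ex_` is hypothetical, the 5 second delay is omitted, the model only distinguishes “busy” from every other status, and placing the single reset at the halfway retry is an assumption made for illustration.

```c
#include <errno.h>

#define EX_GOOD 0x00
#define EX_BUSY 0x08

/* Simulated target: returns a SCSI status for attempt N (0 = first try). */
typedef int (*ex_target_fn)(int attempt);

struct ex_result {
    int error;   /* 0 on success, EIO once busy retries are exhausted */
    int retries; /* number of retries issued */
    int resets;  /* number of reset attempts (at most one) */
};

/* Two sample targets used to exercise the model. */
int ex_always_busy(int attempt) { (void)attempt; return EX_BUSY; }
int ex_busy_twice(int attempt)  { return attempt < 2 ? EX_BUSY : EX_GOOD; }

struct ex_result ex_submit_io(ex_target_fn target, int busy_retry_limit)
{
    struct ex_result r = { 0, 0, 0 };
    int attempt = 0;

    for (;;) {
        if (target(attempt++) != EX_BUSY)
            return r;                 /* not busy: command completes */
        if (r.retries >= busy_retry_limit) {
            r.error = EIO;            /* "device busy too long" */
            return r;
        }
        /* Assumption: the single reset happens at the halfway retry. */
        if (r.resets == 0 && r.retries == busy_retry_limit / 2)
            r.resets++;
        r.retries++;                  /* the real driver waits 5s here */
    }
}
```

For example, `ex_submit_io(ex_always_busy, 3)` ends with `error == EIO` after three retries and one reset, while `ex_submit_io(ex_busy_twice, 3)` succeeds after two retries.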

If this error message is seen, the target device needs further investigation, to find out why it has reported a SCSI “busy” status for so long.

This behaviour was introduced to avoid a system hanging due to a target device which returned a SCSI “busy” status indefinitely. This way, after a reasonable amount of time, a volume management layer (VxVM or SDS) will get an I/O error for accessing one copy of the data, and will then be able to use the mirror copy instead.

For improved availability, that mirror copy of the data should be on a different target and not just a different LUN on the same target, in case that target is faulty and responds with a continuous “busy” status for all of its LUNs.
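The mirror fallback described above can be pictured with the sketch below. It is not how VxVM or SDS is implemented; `ex_mirror_read()` and the two sample mirror sides are hypothetical, and each side is reduced to a function that returns 0 on success or an errno value on failure.

```c
#include <errno.h>
#include <stddef.h>

/* One side of a mirror: returns 0 on success or an errno value. */
typedef int (*ex_read_fn)(void);

/* Sample sides: one whose target stayed busy too long, one healthy. */
int ex_side_busy_too_long(void) { return EIO; }
int ex_side_healthy(void)       { return 0; }

/*
 * Hypothetical helper: try each mirror side in order and return the index
 * of the first side that served the read, or -1 if every side failed.
 */
int ex_mirror_read(const ex_read_fn *sides, size_t nsides)
{
    for (size_t i = 0; i < nsides; i++) {
        if (sides[i] != NULL && sides[i]() == 0)
            return (int)i;
    }
    return -1;
}
```

If both sides sit on the same faulty target, both calls would return EIO and the read fails, which is why the second copy should live on a different target.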

Ramdev


