Description
As part of some hotplug testing, I had offlined one of the
devices in my rootpool mirror (though this could have been an
asynchronous device failure, too). I was quite shocked to find
that rebooting resulted in a completely unbootable system:
nothing but a cryptic "error 22" from spa_import_rootpool.
In any case, I tracked this down to the following bit of
code in spa_check_devstate():
    for (c = 0; c < children; c++) {
        char *physpath;
        if (nvlist_lookup_string(child[c], ZPOOL_CONFIG_PHYS_PATH,
            &physpath) != 0)
            return (EINVAL);
        if (spa_rootdev_validate(child[c])) {
            if (strstr(devpath_list, physpath) == NULL)
                return (EINVAL);
            label_path++;
        } else {
            char *blank;
            if (blank = strchr(dev, ' '))
                *blank = '\0';
            if (strcmp(physpath, dev) == 0)
                return (EINVAL);
            if (blank)
                *blank = ' ';
        }
    }
    grub_path = spa_count_devpath(devpath_list);
    if (label_path != grub_path)
        return (EINVAL);
spa_rootdev_validate() will return B_FALSE if the device is
offline, faulted, degraded, or removed. Some of the problems
here:
1. There is no reason why a degraded device should be treated
as if it were faulted.
2. This check is quite strange:
    if (strcmp(physpath, dev) == 0)
        return (EINVAL);
Apparently, if the device that is faulted or offline happens
to be the first device specified in the bootpath passed from
GRUB, then we fail immediately. But if it's not the first,
that's OK?
3. The code assumes that *ALL* devices must be present and usable:
    if (label_path != grub_path)
        return (EINVAL);
This is just bogus. There is no reason why missing half a mirror
should make the pool unbootable.
4. The error message from this scenario (or pretty much anything else
having to do with root pool failures) is completely incomprehensible.
*Any* failure that is specific to deriving the root pool configuration
should set a global string that can be used to provide a more
informative error message.
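
To make points 1, 3 and 4 concrete, here is a rough user-space sketch of
what a more forgiving check might look like. It is not the actual fix and
does not use the real nvlist-based configuration: rootdev_t, dev_state_t,
rootdev_usable(), check_devstate_sketch() and rootpool_errmsg are all
hypothetical names invented for illustration. The assumptions are that a
degraded device counts as usable, that an unusable child is skipped rather
than treated as fatal, and that one usable device named in the boot path is
enough to proceed.

```c
#include <errno.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for the per-child vdev state. */
typedef enum {
    DEV_HEALTHY, DEV_DEGRADED, DEV_OFFLINE, DEV_FAULTED, DEV_REMOVED
} dev_state_t;

/* Hypothetical stand-in for one child of the root mirror. */
typedef struct {
    const char *physpath;
    dev_state_t state;
} rootdev_t;

/* Hypothetical global holding a readable reason for the last failure. */
static char rootpool_errmsg[256];

/* Assumption: a degraded device can still serve I/O, so it is usable. */
static int
rootdev_usable(const rootdev_t *d)
{
    return (d->state == DEV_HEALTHY || d->state == DEV_DEGRADED);
}

/*
 * Return 0 if at least one usable child appears in the boot path list,
 * otherwise EINVAL with rootpool_errmsg describing the reason.
 */
static int
check_devstate_sketch(const rootdev_t *child, size_t children,
    const char *devpath_list)
{
    size_t c, usable = 0;

    rootpool_errmsg[0] = '\0';
    for (c = 0; c < children; c++) {
        if (child[c].physpath == NULL) {
            (void) snprintf(rootpool_errmsg, sizeof (rootpool_errmsg),
                "child %zu has no physical path", c);
            return (EINVAL);
        }
        /* An unusable half of the mirror is skipped, not fatal. */
        if (!rootdev_usable(&child[c]))
            continue;
        /* Same substring match as the quoted code. */
        if (strstr(devpath_list, child[c].physpath) != NULL)
            usable++;
    }
    if (usable == 0) {
        (void) snprintf(rootpool_errmsg, sizeof (rootpool_errmsg),
            "no usable root device matches boot path \"%s\"",
            devpath_list);
        return (EINVAL);
    }
    return (0);
}
```

With a two-way mirror where one side is offline, this sketch returns 0 as
long as the other side is healthy or degraded and is named in the boot
path. When it does fail, the caller has a string to print instead of a
bare "error 22".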
Using KMDB I whacked the return value from spa_check_devstate() to be 0,
and the system booted just fine, though then I ran into 6697301.
Booting with a removed drive, or pulling a drive and rebooting, works okay.
And what about FMA-diagnosed drives (faulty or degraded)?