Description
For a long time now we've had trouble mapping GUIDs to bus/target/lun
numbers (cXtYdZ) if a "devfsadm -C" has been run after enabling MPxIO.
We have also had to incur a two-reboot penalty when enabling or disabling MPxIO:
the first reboot lets the mpxio-upgrade service run and make the appropriate
changes to the host's /etc/vfstab file, after which the host is rebooted again.
We do actually have sufficient information available in the device tree to obviate
one of the reboots, and to solve the significant customer dissatisfier of device
mapping post-devfsadm.
Arbitrary, MPxIO-enabled device:
disk, instance #8
Driver properties:
name='inquiry-serial-no' type=string items=1 dev=none
value='000742B94YPC J4V94YPC'
name='lba-access-ok' type=boolean dev=(28,512)
name='pm-components' type=string items=3 dev=none
value='NAME=spindle-motor' + '0=off' + '1=on'
name='pm-hardware-state' type=string items=1 dev=none
value='needs-suspend-resume'
name='ddi-failfast-supported' type=boolean dev=none
name='ddi-kernel-ioctl' type=boolean dev=none
name='device-nblocks' type=int64 items=1 dev=none
value=0000000011174b81
Hardware properties:
name='devid' type=string items=1
value='id1,sd@n5000cca00510a7cc'
name='inquiry-revision-id' type=string items=1
value='SA02'
name='inquiry-product-id' type=string items=1
value='HUS1514SB xxxxx 146G'
name='inquiry-vendor-id' type=string items=1
value='HITACHI'
name='inquiry-device-type' type=int items=1
value=00000000
name='compatible' type=string items=4
value='scsiclass,00.vHITACHI.pHUS1514SB xxxxx 146G.rSA02' + 'scsiclass,00.vHITACHI.pHUS1514SB xxxxx 146G' + 'scsiclass,00' + 'scsiclass'
name='client-guid' type=string items=1
value='5000cca00510a7cc'
Paths from multipath bus adapters:
mpt#0 (online)
name='wwn' type=string items=1
value='5000cca00510a7cc'
name='target' type=int items=1
value=00000004
name='lun' type=int items=1
value=00000000
name='target-port' type=string items=1
value='5000cca00510a7cd'
name='path-class' type=string items=1
value='primary'
Device Minor Nodes:
dev=(28,512)
dev_path=/scsi_vhci/disk@g5000cca00510a7cc:a
spectype=blk type=minor
dev_link=/dev/dsk/c3t5000CCA00510A7CCd0s0
We can see from the above that combining the multipath information for the
HBA (mpt#0) with the target and lun numbers gives us a bus/target/lun triple.
On this particular system the BTL triple turns out to be 0/4/0, so
stmsboot -L could then report that /scsi_vhci/disk@g5000cca00510a7cc:a maps to
/dev/dsk/c0t4d0s0.
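As a minimal sketch of that last step, the triple can be formatted into a devlink name as follows. The helper name btl_to_devlink() is hypothetical (it is not taken from stmsboot_util), and it assumes the controller number has already been resolved for the HBA.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Format a controller/target/lun triple plus a slice number as a
 * /dev/dsk link name.  btl_to_devlink() is a hypothetical helper name.
 * Returns the number of characters written, or -1 if buf is too small.
 */
int
btl_to_devlink(char *buf, size_t len, int ctrl, int target, int lun,
    int slice)
{
    int n;

    assert(buf != NULL && len > 0);
    n = snprintf(buf, len, "/dev/dsk/c%dt%dd%ds%d",
        ctrl, target, lun, slice);
    return ((n < 0 || (size_t)n >= len) ? -1 : n);
}
```

Called with controller 0, target 4, lun 0 and slice 0, this fills the buffer with the /dev/dsk/c0t4d0s0 name quoted above.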
The same information is found with FC-attached devices:
disk, instance #45
Driver properties:
name='inquiry-serial-no' type=string items=1 dev=none
value='000633D01TU6 DG20P6801TU6'
name='pm-components' type=string items=3 dev=none
value='NAME=spindle-motor' + '0=off' + '1=on'
name='pm-hardware-state' type=string items=1 dev=none
value='needs-suspend-resume'
name='ddi-failfast-supported' type=boolean dev=none
name='ddi-kernel-ioctl' type=boolean dev=none
name='device-nblocks' type=int64 items=1 dev=none
value=0000000022ecb25c
Hardware properties:
name='devid' type=string items=1
value='id1,sd@n500000e012921d90'
name='inquiry-revision-id' type=string items=1
value='1303'
name='inquiry-product-id' type=string items=1
value='MAW3300FC xxxxx 300G'
name='inquiry-vendor-id' type=string items=1
value='FUJITSU'
name='inquiry-device-type' type=int items=1
value=00000000
name='compatible' type=string items=5
value='scsiclass,00.vFUJITSU.pMAW3300FC xxxxx 300G.r1303' + 'scsiclass,00.vFUJITSU.pMAW3300FC xxxxx 300G' + 'scsa,00.bvhci' + 'scsiclass,00' + 'scsiclass'
name='client-guid' type=string items=1
value='500000e012921d90'
Paths from multipath bus adapters:
fp#3 (online)
name='node-wwn' type=byte items=8
value=50.00.00.e0.12.92.1d.90
name='port-wwn' type=byte items=8
value=50.00.00.e0.12.92.1d.91
name='target-port' type=string items=1
value='500000e012921d91'
name='target' type=int items=1
value=000000ca
name='lun' type=int items=1
value=00000000
name='sam-lun' type=int64 items=1
value=0000000000000000
name='path-class' type=string items=1
value='primary'
Device Minor Nodes:
dev=(28,2880)
dev_path=/scsi_vhci/disk@g500000e012921d90:a
spectype=blk type=minor
dev_link=/dev/dsk/c0t500000E012921D90d0s0
For this FC-attached device we can see that /scsi_vhci/disk@g500000e012921d90:a maps
to a BTL triple of fp#3 / 202 {0xca} / 0. On this particular host, fp#3 is controller
c8, so we'd see /dev/dsk/c8t202d0.
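Note that the listing prints integer properties in bare hexadecimal, so the target value 000000ca has to be converted before it appears as t202 in the link name. A small sketch of that conversion, assuming we only have the printed string to work with (prop_hex_to_int() is a hypothetical helper name):

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/*
 * Convert an integer property value as printed in the listings above
 * (bare hexadecimal, e.g. "000000ca") to an int.
 * prop_hex_to_int() is a hypothetical helper name.
 * Returns 0 on success, -1 if the string is not a usable hex number.
 */
int
prop_hex_to_int(const char *val, int *out)
{
    char *end;
    unsigned long n;

    assert(val != NULL && out != NULL);
    errno = 0;
    n = strtoul(val, &end, 16);
    if (end == val || *end != '\0' || errno != 0 || n > 0x7fffffffUL)
        return (-1);
    *out = (int)n;
    return (0);
}
```

For the FC device above, "000000ca" converts to 202, and "00000004" from the SAS example converts to 4.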
The logic for avoiding the second reboot is a little more complex, but since the
only reason we need that second reboot is to verify that the device has an MPxIO-
accepted GUID, we can avoid it.
For instance, with the devices above we need to know the mapping from BTL
to disk@g{client-guid}:{slicenum}. We can determine the client-guid from
the devid using devid_to_guid(), we already know the slicenum, and we can work out
which driver will enumerate the device in MPxIO mode. We can therefore not only
update /etc/vfstab prior to reboot, but also create the device links.
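To illustrate, here is a rough sketch of composing the scsi_vhci path from a devid and a slice number. It works on the string form of the devid shown in the listings rather than calling devid_to_guid(), and it assumes (these are not taken from stmsboot_util) that the GUID is whatever follows "@n" in that string, that the client node is named "disk", and that slices 0-7 map to minor names 'a'-'h'. The function name vhci_path_from_devid() is hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Build the scsi_vhci minor path for a disk from the string form of its
 * devid and a slice number.
 *
 * Assumptions: the GUID is the text after "@n" in the devid string, the
 * client node is named "disk", and slices 0-7 map to 'a'-'h'.
 * vhci_path_from_devid() is a hypothetical helper name.
 * Returns 0 on success, -1 on bad input or a short buffer.
 */
int
vhci_path_from_devid(const char *devid, int slice, char *buf, size_t len)
{
    const char *guid;
    int n;

    assert(devid != NULL && buf != NULL);
    guid = strstr(devid, "@n");
    if (guid == NULL || guid[2] == '\0' || slice < 0 || slice > 7)
        return (-1);
    n = snprintf(buf, len, "/scsi_vhci/disk@g%s:%c",
        guid + 2, 'a' + slice);
    return ((n < 0 || (size_t)n >= len) ? -1 : 0);
}
```

With the devid 'id1,sd@n5000cca00510a7cc' and slice 0 this yields /scsi_vhci/disk@g5000cca00510a7cc:a, matching the dev_path in the first listing.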
usr/src/cmd/stmsboot/stmsboot_util.c::
 * client_name                          Return value
 * on sparc:
 *   .../fp@xxx/ssd@yyy                 CLIENT_TYPE_PHCI (fc)
 *   .../LSILogic,sas@xxx/sd@yyy        CLIENT_TYPE_PHCI (sas)
 *   .../scsi_vhci/ssd@yyy              CLIENT_TYPE_VHCI (fc)
 *   .../scsi_vhci/disk@yyy             CLIENT_TYPE_VHCI (sas)
 *   other                              CLIENT_TYPE_UNKNOWN
 * on x86:
 *   .../fp@xxx/disk@yyy                CLIENT_TYPE_PHCI (fc)
 *   .../pci1000,????@xxx/sd@yyy        CLIENT_TYPE_PHCI (sas)
 *   .../scsi_vhci/disk@yyy             CLIENT_TYPE_VHCI
 *   other                              CLIENT_TYPE_UNKNOWN
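The table above could be approximated by simple substring matching on the device path, roughly as sketched here. This is a simplification for illustration only: the real routine may well inspect the device tree rather than the path text, and the typedef and function names (client_type_t, classify_client()) are assumptions.

```c
#include <assert.h>
#include <string.h>

typedef enum {
    CLIENT_TYPE_UNKNOWN,
    CLIENT_TYPE_PHCI,
    CLIENT_TYPE_VHCI
} client_type_t;

/*
 * Classify a device path by substring, following the table above.
 * classify_client() is a hypothetical name; the matching is deliberately
 * naive and does not distinguish sparc from x86.
 */
client_type_t
classify_client(const char *path)
{
    assert(path != NULL);
    if (strstr(path, "/scsi_vhci/") != NULL)
        return (CLIENT_TYPE_VHCI);
    if (strstr(path, "/fp@") != NULL ||             /* fc */
        strstr(path, "/LSILogic,sas@") != NULL ||   /* sas, sparc */
        strstr(path, "/pci1000,") != NULL)          /* sas, x86 */
        return (CLIENT_TYPE_PHCI);
    return (CLIENT_TYPE_UNKNOWN);
}
```

For example, /scsi_vhci/disk@g5000cca00510a7cc:a classifies as CLIENT_TYPE_VHCI, while a path under .../pci1000,3150@0/ classifies as CLIENT_TYPE_PHCI.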
The HBA node itself (an mpt instance on an x86 host) carries the controller devlink:
pci1000,3150, instance #0
System software properties:
name='ddi-vhci-class' type=string items=1
value='scsi_vhci'
name='tape' type=string items=1
value='sctp'
name='mpxio-disable' type=string items=1
value='no'
Driver properties:
name='initiator-port' type=string items=1 dev=none
value='500605b0007c9dd0'
...
Device Minor Nodes:
dev=(169,0)
dev_path=/pci@0,0/pci10de,376@a/pci1000,3150@0:devctl
spectype=chr type=minor
dev=(169,1)
dev_path=/pci@0,0/pci10de,376@a/pci1000,3150@0:scsi
spectype=chr type=minor
dev_link=/dev/cfg/c1
So we now know that this mpt0 is /dev/cfg/c1, and any target/lun tuple which has
mpt0 as its parent (from the client multipath info) will be on c1.
Obtaining the controller number in general will require opening the dev_path
of the minor node ($mumble:fc for fp-attached, and $burble:scsi for SAS-attached)
and then retrieving the di_devlink_path() information. Some handwaving might
still be required, but it should be a fairly simple matter.
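Once the devlink string is in hand, pulling the controller number out of it is straightforward. A sketch, assuming the link always has the /dev/cfg/cN form seen above (ctrl_from_cfglink() is a hypothetical helper; how the string is obtained from libdevinfo is outside this sketch):

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Extract the controller number from an attachment-point devlink such as
 * "/dev/cfg/c1".  ctrl_from_cfglink() is a hypothetical helper name.
 * Returns the controller number, or -1 if the link has another form.
 */
int
ctrl_from_cfglink(const char *link)
{
    static const char prefix[] = "/dev/cfg/c";
    const char *digits;
    char *end;
    long n;

    assert(link != NULL);
    if (strncmp(link, prefix, sizeof (prefix) - 1) != 0)
        return (-1);
    digits = link + sizeof (prefix) - 1;
    if (*digits < '0' || *digits > '9')
        return (-1);
    n = strtol(digits, &end, 10);
    if (*end != '\0' || n >= 0x7fffffffL)
        return (-1);
    return ((int)n);
}
```

Given "/dev/cfg/c1" this returns 1, which is the controller number to use for every target/lun tuple whose parent is that HBA.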
Redesign of stmsboot(1M)
------------------------
** Guiding principles:
* only _one_ reboot should be required
* listing MPxIO-enabled devices should be _fast_
* minimise filesystem-dependent lookups
* use libdevinfo and devlinks
** Rationale
Starting from the proposition that stmsboot had decayed enough to require
redesigning and re-writing from scratch, I started with the overall effects
that we need to achieve with this command:
* enable MPxIO for all MPxIO-capable devices
* enable MPxIO for specific MPxIO-capable drivers
* enable MPxIO for specific MPxIO-capable HBA ports
* disable MPxIO for all MPxIO-capable devices
* disable MPxIO for specific MPxIO-capable drivers
* disable MPxIO for specific MPxIO-capable HBA ports
* update MPxIO settings for all MPxIO-capable drivers
* update MPxIO settings for specific MPxIO-capable drivers
* list the mapping between non-MPxIO and MPxIO-enabled devices
* list device GUIDs, if available
The existing code makes use of a shell script (/usr/sbin/stmsboot),
a private binary (/lib/mpxio/stmsboot_util) and an SMF service
(mpxio-upgrade) which runs on reboot.
The private binary does the heavy lifting, providing a way for the shell
script and SMF service to determine what a device's new MPxIO or non-MPxIO
mapping is. The private binary also walks through the device link entries in
/dev/rdsk when called with the -L or -l $controller options, printing any
device mappings. Finally, the private binary handles the task of re-writing
/etc/vfstab.
The shell script (stmsboot) is the user interface part of the facility. Its
chief task is to edit the driver.conf files for the supported drivers
(fp and mpt at this point), and to set the eeprom bootpath variable on the
x86/x64 platform if disabling or updating MPxIO configurations (failing to do
this would prevent an x86/x64 host from booting). The shell script also makes
backup copies of modified files, and creates a file with instructions on how
to recover a system which has failed to boot properly after running the
stmsboot script.
The SMF service is armed by the stmsboot script, and runs on reboot. It mounts
/usr and / as read-write, invokes the private stmsboot_util binary to rewrite
/etc/vfstab, updates the dump configuration and any SVM metadevice
(/dev/md) device mappings, and then reboots the system.
** What has changed
The new design makes use of a private cache of device data gathered from
libdevinfo functions, and obviates the requirement for a second reboot since
the vfstab rewriting function is reliable. In addition, the new design
provides a significant improvement in execution time when listing device
mappings: we don't need to trawl through device links on disk but instead use
libdevinfo functions to provide the required information.
The data that we store in the cache for each device attached to an
MPxIO-capable controller is:
* its devid,
* its physical path (e.g. /pci@0,0/pci10de,5c@9/pci108e,4131@1/sd@0,0),
* its devlink path (e.g. /dev/dsk/c2t0d0, which becomes c2t0d0),
* its MPxIO-enabled devlink path (e.g. /dev/rdsk/c3t500000E011637CF0d0,
which becomes c3t500000E011637CF0d0),
* whether MPxIO is enabled for the device in the running system
(as a boolean_t, B_TRUE or B_FALSE).
These are stored as nvlist properties:
#define NVL_DEVID "nvl-devid"
#define NVL_PATH "nvl-path"
#define NVL_PHYSPATH "nvl-physpath"
#define NVL_MPXPATH "nvl-mpxiopath"
#define NVL_MPXEN "nvl-mpxioenabled"
When we've found an MPxIO-capable device, we check whether it exists in our
cached version, and if not, we create an nvlist containing the above properties,
keyed by the device's devid. This nvlist is added to the global nvlist.
In order to speed operations later, we also add some inverse mappings to the
global nvlist:
devfspath -> devid
current devlink path -> devid
current MPxIO-enabled path -> devid
device physical path -> devid
This allows us to search for any of those paths and get the appropriate devid
back, the nvlist of which we can then query for the desired properties.
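The two-step lookup (path to devid, then devid to record) can be sketched with plain arrays standing in for the nvlists. This only illustrates the shape of the data: the struct and function names below are hypothetical, and the real cache uses nvlist properties rather than C structs.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* One cached device record; fields mirror the NVL_* properties above. */
typedef struct dev_rec {
    const char *devid;
    const char *path;       /* current devlink name, e.g. c2t0d0 */
    const char *physpath;   /* physical device path */
    const char *mpxpath;    /* MPxIO-enabled devlink name */
    int mpxen;              /* non-zero if MPxIO is enabled now */
} dev_rec_t;

/* Inverse mapping entry: any known path string -> devid. */
typedef struct inv_map {
    const char *key;
    const char *devid;
} inv_map_t;

/* Step 1: resolve a path of any kind to a devid, or NULL if unknown. */
const char *
lookup_devid(const inv_map_t *inv, size_t ninv, const char *key)
{
    size_t i;

    assert(inv != NULL && key != NULL);
    for (i = 0; i < ninv; i++) {
        if (strcmp(inv[i].key, key) == 0)
            return (inv[i].devid);
    }
    return (NULL);
}

/* Step 2: use the devid to fetch the full record, or NULL if unknown. */
const dev_rec_t *
lookup_rec(const dev_rec_t *recs, size_t nrecs, const inv_map_t *inv,
    size_t ninv, const char *key)
{
    const char *devid = lookup_devid(inv, ninv, key);
    size_t i;

    assert(recs != NULL);
    if (devid == NULL)
        return (NULL);
    for (i = 0; i < nrecs; i++) {
        if (strcmp(recs[i].devid, devid) == 0)
            return (&recs[i]);
    }
    return (NULL);
}
```

The point of the inverse entries is that a caller holding only an MPxIO devlink name, or only a physical path, reaches the same record through one extra lookup. A linear scan is used here purely for brevity.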
When the mpxio-upgrade service is invoked, we need to determine the mapping
for the root device in the currently running system and mount that device as
read-write in order to continue with the boot process. We do this by reading
the entry for / in /etc/vfstab and finding the physical path of that device in
the running system. We mount /devices/$physicalpath as read-write, then
re-invoke stmsboot_util to find the devlink (/dev/dsk...) path for /, which we
then remount. This two-step remount is required because the devlink facility
is not available to us at this early stage of the boot process, until we can
determine what the root device is and mount it as read-write.
Once root and /usr have been remounted, we can then invoke stmsboot_util to
re-write the vfstab. This is a fairly simple process of scanning through each
line of the file and finding those which start with /dev/dsk, determining
their mapping in the current system, and re-writing that line. As a safeguard,
the new version of the vfstab is written to /etc/mpxio, and we let the
mpxio-upgrade script take care of copying that file to /etc/vfstab. Once the
vfstab has been updated, we run dumpadm, and if necessary, metadevadm. Finally,
we re-generate the system's boot archive - which in fact is the longest single
operation of all!
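The per-line vfstab rewrite described above might look roughly like the following. This is a sketch under simplifying assumptions: the caller has already looked up the old and new device names for the line (for example c0t4d0 and c3t5000CCA00510A7CCd0), and both the /dev/dsk and /dev/rdsk fields are rewritten by replacing the name wherever it follows a '/' and precedes a slice suffix. The function name rewrite_vfstab_line() is hypothetical.

```c
#include <assert.h>
#include <string.h>

/*
 * Rewrite one vfstab line.  If the line starts with /dev/dsk, every
 * "/<oldname>s" is replaced by "/<newname>s", which covers both the
 * block and raw device fields.  Other lines are copied unchanged.
 * rewrite_vfstab_line() is a hypothetical helper name.
 * Returns 0 on success, -1 if out is too small.
 */
int
rewrite_vfstab_line(const char *line, const char *oldname,
    const char *newname, char *out, size_t outlen)
{
    size_t oldlen, newlen, used = 0;
    const char *p = line;
    int candidate;

    assert(line != NULL && oldname != NULL && newname != NULL);
    assert(out != NULL);
    oldlen = strlen(oldname);
    newlen = strlen(newname);
    candidate = (oldlen > 0 && strncmp(line, "/dev/dsk/", 9) == 0);

    while (*p != '\0') {
        if (candidate && p > line && p[-1] == '/' &&
            strncmp(p, oldname, oldlen) == 0 && p[oldlen] == 's') {
            if (used + newlen >= outlen)
                return (-1);
            memcpy(out + used, newname, newlen);
            used += newlen;
            p += oldlen;
        } else {
            if (used + 1 >= outlen)
                return (-1);
            out[used++] = *p++;
        }
    }
    if (used >= outlen)
        return (-1);
    out[used] = '\0';
    return (0);
}
```

Writing the result into a separate buffer mirrors the safeguard described above, where the new vfstab is produced alongside the original and only copied into place afterwards.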
After this, we can disable the mpxio-upgrade service and exit.
When the mpxio-upgrade script exits, the filesystem/usr service takes over and
the boot process completes normally - with the new device mappings already
active and working.