Description
Category
driver
Sub-Category
sata
Description
Most of the time (about 9 out of 10 boots), onnv-gate build 112 release bits
panic on an ASUS N4L-VM DH mainboard during boot, while mounting the zfs root
(in rare cases there is no panic and the system does boot):
NOTICE: error reading device label
NOTICE:
***************************************************
* This device in not bootable! *
* It is either offlined or detached or faulted. *
* Please try to boot from a different device. *
***************************************************
NOTICE: spa_import_rootpool: error 19
Cannot mount root on /pci@0,0/pci1043,8129@1f,2/disk@0,0:a fstype zfs
panic[cpu0]/thread=fffffffffbc2cfe0: vfs_mountroot: cannout mount root
genunix:vfs_mountroot+350 ()
genunix:main+f0 ()
unix:_locore_start+92 ()
build 112 debug bits are able to boot, most of the time. But there
have been rare cases where the same panic has been seen with build 112
debug bits.
-------------------
It seems that we get ahci interrupts (ahci_intr() -> ahci_intr_phyrdy_change())
before the ahci sata hba is registered (by a call to sata_hba_attach(), from
ahci_register_sata_hba_tran()). It seems the phyrdy change interrupt should
send a sata hba event, but sata_hba_event_notify() drops the event because
the sata_hba_list is empty (sata_hba_attach() has not yet been called)
at the point the ahci interrupt is received.
With build 111 no such ahci_intr_phyrdy_change() interrupt is generated.
And with build 111 (or older) the ASUS N4L-VM mainboard was able to boot
from zfs root.
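The dropped-event behaviour described above can be illustrated with a small stand-alone model. This is only a sketch: hba_entry, hba_list, model_hba_attach() and model_event_notify() are hypothetical stand-ins, not the sata framework's actual data structures or functions. It shows only that an event raised before registration has nowhere to go.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for one registered HBA in the framework's list. */
struct hba_entry {
	int			instance;
	unsigned int		pending_events;
	struct hba_entry	*next;
};

/* Hypothetical stand-in for sata_hba_list; empty until attach runs. */
struct hba_entry *hba_list = NULL;

/* Model of registration: link the HBA into the list. */
void
model_hba_attach(struct hba_entry *hba)
{
	hba->pending_events = 0;
	hba->next = hba_list;
	hba_list = hba;
}

/*
 * Model of event notification. Returns true if the event was recorded
 * against a registered HBA, false if no matching HBA exists yet and the
 * event is silently dropped.
 */
bool
model_event_notify(int instance, unsigned int event)
{
	struct hba_entry *hba;

	for (hba = hba_list; hba != NULL; hba = hba->next) {
		if (hba->instance == instance) {
			hba->pending_events |= event;
			return (true);
		}
	}
	return (false);
}
```

Calling model_event_notify() before model_hba_attach() returns false, which corresponds to the lost phyrdy change event; the same call after attach records the event.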
-------------------
Apparently this has changed with the putback for 6753962 "ahci does
not work with Asus M3N78 Pro/M3N-HT (nforce 780a) motherboard SATA
interfaces".
ahci_initialize_port() was changed to call ahci_port_reset() when the
"port task file" interface status has the AHCI_TFD_STS_BSY or
AHCI_TFD_STS_DRQ bit set. On my ASUS N4L-VM DH the "port task file"
data returned by the hardware at ahci_initialize_port() time is 0x80
(= AHCI_TFD_STS_BSY) ==> ahci_port_reset() is called. It seems the
hardware sends a "phyrdy_change" interrupt after that port reset,
before the ahci sata hba is registered with the sata framework, so
that the "phyrdy_change" gets lost in sata_hba_event_notify().
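As a minimal sketch of the check described above, the following decodes a "port task file" value the way the new ahci_initialize_port() condition is described. The macro and function names are local to this example; the bit positions are the standard ATA status bits (BSY = bit 7, DRQ = bit 3), and the driver's real definitions are assumed to match them.

```c
#include <stdbool.h>
#include <stdint.h>

/* ATA status bits in the low byte of the port task file data. */
#define	TFD_STS_BSY	(0x1u << 7)	/* 0x80: device busy */
#define	TFD_STS_DRQ	(0x1u << 3)	/* 0x08: data transfer requested */

/*
 * Returns true when the status byte has BSY or DRQ set, i.e. the case
 * in which the build 112 code is described as calling ahci_port_reset().
 */
bool
tfd_needs_port_reset(uint32_t port_task_file)
{
	return ((port_task_file & (TFD_STS_BSY | TFD_STS_DRQ)) != 0);
}
```

With the value 0x80 reported on the N4L-VM DH this returns true, so the port reset path is taken.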
Frequency
Often
Regression
Solaris 10
Steps to Reproduce
Try to boot build 112 from a S-ATA HDD on an ASUS N4L-VM DH mainboard (ICH7-M),
with the Intel chipset S-ATA controller configured for AHCI mode.
Expected Result
Kernel is able to find the s-ata boot drive on ahci, and is able to boot
Actual Result
Kernel panics with
NOTICE: error reading device label
NOTICE: spa_import_rootpool: error 19
Cannot mount root on /pci@0,0/pci1043,8129@1f,2/disk@0,0:a fstype zfs
Error Message(s)
NOTICE: error reading device label
NOTICE:
***************************************************
* This device in not bootable! *
* It is either offlined or detached or faulted. *
* Please try to boot from a different device. *
***************************************************
NOTICE: spa_import_rootpool: error 19
Cannot mount root on /pci@0,0/pci1043,8129@1f,2/disk@0,0:a fstype zfs
panic[cpu0]/thread=fffffffffbc2cfe0: vfs_mountroot: cannout mount root
genunix:vfs_mountroot+350 ()
genunix:main+f0 ()
unix:_locore_start+92 ()
Test Case
Workaround
Additional configuration information
ASUS N4L-VM DH mainboard (ICH7-M)
- S-ATA HDD, Samsung 250GB
- S-ATA DVD-writer device
SX:CE build 104, bfu'ed to onnv-gate build 112
Comments
The 'prtconf -v' and 'scanpci -v' output provided by the submitter is attached.
Some of the submitter's trace and analysis work is listed below.
---
[ahci_release_fail2.txt]
Here's an attachment, a log of a release kernel debug session with a failed
root fs mount. It shows that
port 0 *had* been recognized as
ahcictl_ports[0]->ahciport_device_type = 0x1 (SATA_DTYPE_ATADISK)
ahcictl_ports[0]->ahciport_port_state = 0x10 (SATA_STATE_READY)
and port 2 as
ahcictl_ports[2]->ahciport_device_type = 0x40 (SATA_DTYPE_ATAPI)
ahcictl_ports[2]->ahciport_port_state = 0x10 (SATA_STATE_READY)
before ahci`ahci_intr_phyrdy_change, and they change to
SATA_DTYPE_NONE / SATA_PSTATE_PWRON afterwards.
[ahci_release_fail3.txt]
It depends on the exact timing when the phyrdy_change interrupt arrives.
In case we receive the interrupt before ahci_register_sata_hba_tran()
line 1065 runs ...
1064
1065 ahci_ctlp->ahcictl_sata_hba_tran = sata_hba_tran;
1066
... we have ahci_ctlp->ahcictl_sata_hba_tran == NULL, so that in
ahci_intr_phyrdy_change(), line 5132 the phyrdy_change interrupt
is ignored. In this case the valid ahci_portp->ahciport_port_state
isn't modified to SATA_PSTATE_PWRON. The disk can be found.
5132 if ((ahci_ctlp->ahcictl_sata_hba_tran == NULL) ||
5133 (ahci_portp == NULL)) {
5134 /* The whole controller setup is not yet done. */
5135 mutex_exit(&ahci_ctlp->ahcictl_mutex);
5136 return (AHCI_SUCCESS);
5137 }
If we receive the phyrdy_change interrupt after line 1065 has run,
we have ahci_ctlp->ahcictl_sata_hba_tran != NULL, and
ahci_intr_phyrdy_change() changes the ahciport_port_state
to SATA_PSTATE_PWRON. The disk is gone...
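The two orderings can be sketched as a small stand-alone model. Everything here is hypothetical (the model_* names, the enum values, the single-struct layout); it mirrors only the early-return check quoted above and the state change described in the text, not the real driver logic.

```c
#include <stddef.h>

/* Placeholder values; not the real SATA_DTYPE_ and SATA_PSTATE_ constants. */
enum model_dtype { MODEL_DTYPE_NONE, MODEL_DTYPE_ATADISK };
enum model_pstate { MODEL_PSTATE_READY, MODEL_PSTATE_PWRON };

struct model_ctl {
	void			*hba_tran;	/* NULL until registration */
	enum model_dtype	dtype;
	enum model_pstate	pstate;
};

/* Model of the phyrdy change handler, including the NULL check. */
void
model_phyrdy_change(struct model_ctl *ctl)
{
	if (ctl->hba_tran == NULL)
		return;		/* setup not finished: interrupt ignored */

	/* Registered: the previously probed device information is lost. */
	ctl->dtype = MODEL_DTYPE_NONE;
	ctl->pstate = MODEL_PSTATE_PWRON;
}

/*
 * Run one boot with the interrupt arriving either before or after the
 * hba_tran assignment. Returns nonzero if the disk is still visible.
 */
int
model_boot(int intr_before_registration)
{
	static int tran_token;
	struct model_ctl ctl = { NULL, MODEL_DTYPE_ATADISK, MODEL_PSTATE_READY };

	if (intr_before_registration)
		model_phyrdy_change(&ctl);
	ctl.hba_tran = &tran_token;	/* corresponds to line 1065 */
	if (!intr_before_registration)
		model_phyrdy_change(&ctl);

	return (ctl.dtype == MODEL_DTYPE_ATADISK);
}
```

In this model, model_boot(1) keeps the disk and model_boot(0) loses it, which matches the intermittent nature of the panic.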
I've tried to trace in kmdb when the phyrdy change interrupt is asserted.
At the start of ahci_software_reset() the phyrdy change interrupt bit is 0;
on return from the function it has the value 1.
Single stepping shows that the phyrdy change interrupt is asserted
after line 3372:
3351 SET_FIS_TYPE(h2d_register_fisp, AHCI_H2D_REGISTER_FIS_TYPE);
3352
3353 /* Set Command Header in Command List */
3354 cmd_header = &ahci_portp->ahciport_cmd_list[slot];
3355 BZERO_DESCR_INFO(cmd_header);
3356 BZERO_PRD_BYTE_COUNT(cmd_header);
3357 SET_COMMAND_FIS_LENGTH(cmd_header, 5);
3358
3359 SET_WRITE(cmd_header, 1);
3360
3361 (void) ddi_dma_sync(ahci_portp->ahciport_cmd_tables_dma_handle[slot],
3362 0,
3363 ahci_cmd_table_size,
3364 DDI_DMA_SYNC_FORDEV);
3365
3366 (void) ddi_dma_sync(ahci_portp->ahciport_cmd_list_dma_handle,
3367 slot * sizeof (ahci_cmd_header_t),
3368 sizeof (ahci_cmd_header_t),
3369 DDI_DMA_SYNC_FORDEV);
3370
3371 /* Indicate to the HBA that a command is active. */
3372 ddi_put32(ahci_ctlp->ahcictl_ahci_acc_handle,
3373 (uint32_t *)AHCI_PORT_PxCI(ahci_ctlp, port),
3374 (0x1 << slot));
A difference between build 111 and build 112 is that ahci_find_dev_signature()
is using ahci_software_reset() instead of ahci_port_reset().
With build 111 ahci_software_reset() is not called at boot time.
When I revert that part of the 6753962 changes, the N4L-VM can be booted:
diff --git a/usr/src/uts/common/io/sata/adapters/ahci/ahci.c b/usr/src/uts/common/io/sata/adapters/ahci/ahci.c
--- a/usr/src/uts/common/io/sata/adapters/ahci/ahci.c
+++ b/usr/src/uts/common/io/sata/adapters/ahci/ahci.c
@@ -3846,25 +3846,21 @@ ahci_find_dev_signature(ahci_ctl_t *ahci
ahci_portp->ahciport_device_type = SATA_DTYPE_UNKNOWN;
- /* Issue a software reset to get the signature */
- if (ahci_software_reset(ahci_ctlp, ahci_portp, port)
- != AHCI_SUCCESS) {
- AHCIDBG1(AHCIDBG_INFO, ahci_ctlp,
- "ahci_find_dev_signature: software reset failed "
- "at port %d. cannot get signature.", port);
- ahci_portp->ahciport_port_state = SATA_PSTATE_FAILED;
+ /* Call port reset to check link status and get device signature */
+ (void) ahci_port_reset(ahci_ctlp, ahci_portp, port);
+
+ if (ahci_portp->ahciport_device_type == SATA_DTYPE_NONE) {
+ AHCIDBG1(AHCIDBG_INFO, ahci_ctlp,
+ "ahci_find_dev_signature: No device is found "
+ "at port %d", port);
return;
}
- /*
- * ahci_software_reset has started the port, so we need manually stop
- * the port again.
- */
- if (ahci_put_port_into_notrunning_state(ahci_ctlp, ahci_portp, port)
- != AHCI_SUCCESS) {
- AHCIDBG1(AHCIDBG_INFO, ahci_ctlp,
- "ahci_find_dev_signature: cannot stop port %d.", port);
- ahci_portp->ahciport_port_state = SATA_PSTATE_FAILED;
+ /* Check the port state */
+ if (ahci_portp->ahciport_port_state & SATA_PSTATE_FAILED) {
+ AHCIDBG2(AHCIDBG_ERRS, ahci_ctlp,
+ "ahci_find_dev_signature: port %d state 0x%x",
+ port, ahci_portp->ahciport_port_state);
return;
}
Further analysis and a solution were provided by the submitter.
The fix suggested by the submitter is attached in [ahci-linkpm.txt].
---
I think I now have an explanation for what happens on the ASUS N4L-VM DH:
The ASUS BIOS configures "Aggressive Link Power Management" on the
AHCI ports before it passes control to the OS bootstrap loader!
When I start a current onnv-gate release kernel with option "-kd",
step over the startup() call in main() and look at the AHCI P0CMD
register, I see:
[0]> ffa3bc00+100+18\X
0xffa3bd18: c040006
(AHCI memory mapped registers start at physical address ffa3bc00 on my box)
The P0CMD register has bits 26 (P0CMD.ALPE) and 27 (P0CMD.ASP) set.
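The register arithmetic and bit decoding can be checked with a short sketch. The macro and function names are made up for this example; the offsets (port blocks starting at 0x100, PxCMD at 0x18 within a block) follow the kmdb expression above, and the 0x80 per-port block size is taken from the AHCI register layout.

```c
#include <stdbool.h>
#include <stdint.h>

#define	PORT_REGS_BASE	0x100u	/* offset of the port 0 register block */
#define	PORT_REGS_SIZE	0x80u	/* size of each port's register block */
#define	PXCMD_OFFSET	0x18u	/* PxCMD within a port register block */

#define	PXCMD_ALPE	(UINT32_C(1) << 26)	/* aggressive link PM enable */
#define	PXCMD_ASP	(UINT32_C(1) << 27)	/* aggressive slumber/partial */

/* Offset of PxCMD for a given port, relative to the AHCI register base. */
uint32_t
pxcmd_offset(unsigned int port)
{
	return (PORT_REGS_BASE + port * PORT_REGS_SIZE + PXCMD_OFFSET);
}

/* True if aggressive link power management is enabled in a PxCMD value. */
bool
pxcmd_alpm_enabled(uint32_t pxcmd)
{
	return ((pxcmd & PXCMD_ALPE) != 0);
}
```

For port 0 the offset is 0x118, which added to ffa3bc00 gives the ffa3bd18 address shown by kmdb, and the value c040006 has both PXCMD_ALPE and PXCMD_ASP set.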
Now the code in ahci_initialize_port() always disables link power management,
at line 2964, but that happens *after* ahci_find_dev_signature() at
line 2934, so ahci_find_dev_signature() runs with aggressive link power
management enabled on the ASUS N4L-VM DH.
2923 /*
2924 * At the time being, only probe ports/devices and get the types of
2925 * attached devices during DDI_ATTACH. In fact, the device can be
2926 * changed during power state changes, but at the time being, we
2927 * don't support the situation.
2928 */
2929 if (ahci_ctlp->ahcictl_flags & AHCI_ATTACH) {
2930 /*
2931 * Till now we can assure a device attached to that HBA port
2932 * and work correctly. Now try to get the device signature.
2933 */
2934 ahci_find_dev_signature(ahci_ctlp, ahci_portp, port);
2935 } else {
2936
2937 /*
2938 * During the resume, we need to set the PxCLB, PxCLBU, PxFB
2939 * and PxFBU registers in case these registers were cleared
2940 * during the suspend.
2941 */
2942 AHCIDBG1(AHCIDBG_PM, ahci_ctlp,
2943 "ahci_initialize_port: port %d "
2944 "reset the port during resume", port);
2945 (void) ahci_port_reset(ahci_ctlp, ahci_portp, port);
2946
2947 AHCIDBG1(AHCIDBG_PM, ahci_ctlp,
2948 "ahci_initialize_port: port %d "
2949 "set PxCLB, PxCLBU, PxFB and PxFBU "
2950 "during resume", port);
2951
2952 /* Config Port Received FIS Base Address */
2953 ddi_put64(ahci_ctlp->ahcictl_ahci_acc_handle,
2954 (uint64_t *)AHCI_PORT_PxFB(ahci_ctlp, port),
2955 ahci_portp->ahciport_rcvd_fis_dma_cookie.dmac_laddress);
2956
2957 /* Config Port Command List Base Address */
2958 ddi_put64(ahci_ctlp->ahcictl_ahci_acc_handle,
2959 (uint64_t *)AHCI_PORT_PxCLB(ahci_ctlp, port),
2960 ahci_portp->ahciport_cmd_list_dma_cookie.dmac_laddress);
2961 }
2962
2963 /* Disable the interface power management */
2964 ahci_disable_interface_pm(ahci_ctlp, port);
Most likely the enabled aggressive link power management is responsible for
the "phyrdy change" that I observe.
The fix is probably to move the ahci_disable_interface_pm() call up a few
lines, so that it runs before ahci_find_dev_signature(). I'm now using an ahci
module changed with the attached patch. With that change onnv-gate is able to
boot on the ASUS N4L-VM DH mainboard.
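The effect of the reordering can be sketched with a toy model. This is not the attached patch: the model_* names are hypothetical, the model only clears the ALPE and ASP bits of PxCMD (the real ahci_disable_interface_pm() may do more), and it bakes in the assumption stated above that the phyrdy change appears only while aggressive link power management is still enabled.

```c
#include <stdbool.h>
#include <stdint.h>

#define	PXCMD_ALPE	(UINT32_C(1) << 26)
#define	PXCMD_ASP	(UINT32_C(1) << 27)

struct model_port {
	uint32_t	pxcmd;
	bool		phyrdy_change_seen;
};

/* Model of disabling link power management: clear ALPE and ASP. */
void
model_disable_interface_pm(struct model_port *p)
{
	p->pxcmd &= ~(PXCMD_ALPE | PXCMD_ASP);
}

/*
 * Model of signature detection. ASSUMPTION: a phyrdy change is raised
 * only when aggressive link power management is still enabled.
 */
void
model_find_dev_signature(struct model_port *p)
{
	if (p->pxcmd & PXCMD_ALPE)
		p->phyrdy_change_seen = true;
}

/* Build 112 order: probe first, then disable power management. */
bool
model_initialize_port_b112(uint32_t bios_pxcmd)
{
	struct model_port p = { bios_pxcmd, false };

	model_find_dev_signature(&p);
	model_disable_interface_pm(&p);
	return (p.phyrdy_change_seen);
}

/* Proposed order: disable power management first, then probe. */
bool
model_initialize_port_fixed(uint32_t bios_pxcmd)
{
	struct model_port p = { bios_pxcmd, false };

	model_disable_interface_pm(&p);
	model_find_dev_signature(&p);
	return (p.phyrdy_change_seen);
}
```

Starting from the BIOS-provided value c040006, the build 112 order reports a phyrdy change in this model and the reordered version does not.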