OpenSolaris

Bug ID 6824745
Synopsis ahci doesn't find sata disk at boot, after the putback for 6753962
State 10-Fix Delivered (Fix available in build)
Category:Subcategory driver:ahci
Keywords ALPM | ahci | opensolaris | oss-reques | oss-sponsor
Sponsor
Submitter keil
Responsible Engineer Ying Tian
Reported Against snv_112
Duplicate Of
Introduced In solaris_nevada
Commit to Fix snv_114
Fixed In snv_114
Release Fixed solaris_nevada(snv_114)
Related Bugs 6753962 , 6863457
Submit Date 1-April-2009
Last Update Date 13-May-2009
Description
Category
   driver
Sub-Category
   sata
Description
   Most of the time (in about 9 out of 10 boots) onnv-gate build 112 release
bits panic on an ASUS N4L-VM DH mainboard during boot, at the zfs root mount
(in rare cases there is no panic and the system boots):
NOTICE: error reading device label
NOTICE:
  ***************************************************
  *  This device in not bootable!                   *
  *  It is either offlined or detached or faulted.  *
  *  Please try to boot from a different device.    *
  ***************************************************
NOTICE: spa_import_rootpool: error 19
Cannot mount root on /pci@0,0/pci1043,8129@1f,2/disk@0,0:a fstype zfs
panic[cpu0]/thread=fffffffffbc2cfe0: vfs_mountroot: cannout mount root
genunix:vfs_mountroot+350 ()
genunix:main+f0 ()
unix:_locore_start+92 ()
Build 112 debug bits are able to boot most of the time, but in rare
cases the same panic has been seen with build 112 debug bits as well.
               -------------------
It seems that we get ahci interrupts (ahci_intr() -> ahci_intr_phyrdy_change())
before the ahci sata hba is registered (by a call to sata_hba_attach(), from 
ahci_register_sata_hba_tran()).  It seems the phyrdy change interrupt should
send a sata hba event, but sata_hba_event_notify() drops the event because
the sata_hba_list is empty (sata_hba_attach() has not yet been called)
at the point the ahci interrupt is received.
With build 111 no such ahci_intr_phyrdy_change() interrupt is generated.
And with build 111 (or older) the ASUS N4L-VM mainboard was able to boot
from zfs root.
               -------------------
Apparently this has changed with the putback for 6753962 "ahci does
not work with Asus M3N78 Pro/M3N-HT (nforce 780a) motherboard SATA
interfaces".
ahci_initialize_port() was changed to call ahci_port_reset() when the
"port task file" interface status has the AHCI_TFD_STS_BSY or
AHCI_TFD_STS_DRQ bit set.  On my ASUS N4L-VM DH the "port task file"
data returned by the hardware at ahci_initialize_port() time is 0x80
(= AHCI_TFD_STS_BSY) ==> ahci_port_reset() is called.  It seems the
hardware sends a "phyrdy_change" interrupt after that port reset,
before the ahci sata hba is registered with the sata framework, so
that the "phyrdy_change" gets lost in sata_hba_event_notify().
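
The reset condition described above can be sketched as a small check. This is only an illustration, not the driver's code: the MODEL_* names and the helper tfd_needs_port_reset() are hypothetical, the BSY value 0x80 is taken from the report, and the DRQ value 0x08 is an assumption based on the standard ATA status register layout.

```c
#include <stdint.h>

/* Hypothetical stand-ins for AHCI_TFD_STS_BSY / AHCI_TFD_STS_DRQ. */
#define	MODEL_TFD_STS_BSY	0x80u	/* value quoted in the report */
#define	MODEL_TFD_STS_DRQ	0x08u	/* assumed: ATA status register bit 3 */

/*
 * Returns nonzero when the port task file status has BSY or DRQ set,
 * which is the condition under which the report says
 * ahci_initialize_port() calls ahci_port_reset().
 */
int
tfd_needs_port_reset(uint32_t tfd)
{
	return ((tfd & (MODEL_TFD_STS_BSY | MODEL_TFD_STS_DRQ)) != 0);
}
```

With the value 0x80 seen on the N4L-VM DH, this check is true, so the port reset path is taken.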
Frequency
   Often
Regression
   Solaris 10
Steps to Reproduce
   Try to boot build 112 from an S-ATA HDD on an ASUS N4L-VM DH mainboard (ICH7-M),
with the Intel chipset S-ATA controller configured for AHCI mode.
Expected Result
   Kernel is able to find the s-ata boot drive on ahci, and is able to boot
Actual Result
   Kernel panics with 
NOTICE: error reading device label
NOTICE: spa_import_rootpool: error 19
Cannot mount root on /pci@0,0/pci1043,8129@1f,2/disk@0,0:a fstype zfs
Error Message(s)
   
NOTICE: error reading device label
NOTICE:
  ***************************************************
  *  This device in not bootable!                   *
  *  It is either offlined or detached or faulted.  *
  *  Please try to boot from a different device.    *
  ***************************************************
NOTICE: spa_import_rootpool: error 19
Cannot mount root on /pci@0,0/pci1043,8129@1f,2/disk@0,0:a fstype zfs
panic[cpu0]/thread=fffffffffbc2cfe0: vfs_mountroot: cannout mount root
genunix:vfs_mountroot+350 ()
genunix:main+f0 ()
unix:_locore_start+92 ()
Test Case
   
Workaround
   
Additional configuration information
   ASUS N4L-VM DH mainboard (ICH7-M)
- S-ATA HDD, Samsung 250GB
- S-ATA DVD-writer device
SX:CE build 104, bfu'ed to onnv-gate build 112
Work Around
N/A
Comments
The 'prtconf -v' and 'scanpci -v' output provided by the submitter is attached.
Some trace/analysis work from the submitter is listed below.

---

[ahci_release_fail2.txt]
Here's an attachment: a log of a release kernel debug session with a failed
root fs mount, which shows

port 0 *had* been recognized as a
    ahcictl_ports[0]->ahciport_device_type = 0x1  (SATA_DTYPE_ATADISK)
    ahcictl_ports[0]->ahciport_port_state = 0x10 (SATA_STATE_READY)

and port 2 as
    ahcictl_ports[2]->ahciport_device_type = 0x40 (SATA_DTYPE_ATAPI)
    ahcictl_ports[2]->ahciport_port_state = 0x10 (SATA_STATE_READY)

before ahci`ahci_intr_phyrdy_change, and they change to
SATA_DTYPE_NONE / SATA_PSTATE_PWRON afterwards.

[ahci_release_fail3.txt]
It depends on the exact timing when the phyrdy_change interrupt arrives.

In case we receive the interrupt before ahci_register_sata_hba_tran()
line 1065 runs ...

  1064	
  1065		ahci_ctlp->ahcictl_sata_hba_tran = sata_hba_tran;
  1066
	
... we have ahci_ctlp->ahcictl_sata_hba_tran == NULL, so the check at
line 5132 in ahci_intr_phyrdy_change() ignores the phyrdy_change
interrupt. In this case the valid ahci_portp->ahciport_port_state
isn't modified to SATA_PSTATE_PWRON, and the disk can be found.

  5132		if ((ahci_ctlp->ahcictl_sata_hba_tran == NULL) ||
  5133		    (ahci_portp == NULL)) {
  5134			/* The whole controller setup is not yet done. */
  5135			mutex_exit(&ahci_ctlp->ahcictl_mutex);
  5136			return (AHCI_SUCCESS);
  5137		}


If we receive the phyrdy_change interrupt after line 1065 has run,
we have ahci_ctlp->ahcictl_sata_hba_tran != NULL, and
ahci_intr_phyrdy_change() changes the ahciport_port_state
to SATA_PSTATE_PWRON. The disk is gone...
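
The two orderings can be sketched with a toy model. This is not the driver code: struct model_ctl, the MODEL_* states and both functions are hypothetical simplifications that only mirror the NULL check quoted at lines 5132-5137 and the assignment at line 1065.

```c
#include <stddef.h>

/* Hypothetical, simplified port states (not the real SATA_* values). */
enum model_state { MODEL_READY, MODEL_PWRON };

/* Hypothetical stand-in for the relevant parts of ahci_ctl_t / ahci_port_t. */
struct model_ctl {
	void *hba_tran;			/* models ahcictl_sata_hba_tran */
	enum model_state port_state;	/* models ahciport_port_state */
};

/* Models the early return in ahci_intr_phyrdy_change(). */
int
model_phyrdy_change(struct model_ctl *ctl)
{
	if (ctl->hba_tran == NULL)
		return (0);		/* setup not done: interrupt ignored */
	ctl->port_state = MODEL_PWRON;	/* probed state is overwritten */
	return (1);
}

/* Returns 1 if the probed READY state survives, 0 if the disk is "gone". */
int
model_disk_found(int irq_before_registration)
{
	static int dummy_tran;
	struct model_ctl ctl = { NULL, MODEL_READY };

	if (irq_before_registration)
		(void) model_phyrdy_change(&ctl);
	ctl.hba_tran = &dummy_tran;	/* models line 1065 */
	if (!irq_before_registration)
		(void) model_phyrdy_change(&ctl);
	return (ctl.port_state == MODEL_READY);
}
```

In this model, model_disk_found(1) keeps the READY state and model_disk_found(0) loses it, which matches the timing dependence described above.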

I've tried to trace in kmdb when the phyrdy change interrupt is asserted.

At the start of ahci_software_reset() the phyrdy change interrupt is 0;
at return from the function it has the value of 1.

Single stepping shows that the phyrdy change interrupt is asserted
after line 3372

  3351		SET_FIS_TYPE(h2d_register_fisp, AHCI_H2D_REGISTER_FIS_TYPE);
  3352	
  3353		/* Set Command Header in Command List */
  3354		cmd_header = &ahci_portp->ahciport_cmd_list[slot];
  3355		BZERO_DESCR_INFO(cmd_header);
  3356		BZERO_PRD_BYTE_COUNT(cmd_header);
  3357		SET_COMMAND_FIS_LENGTH(cmd_header, 5);
  3358	
  3359		SET_WRITE(cmd_header, 1);
  3360	
  3361		(void) ddi_dma_sync(ahci_portp->ahciport_cmd_tables_dma_handle[slot],
  3362		    0,
  3363		    ahci_cmd_table_size,
  3364		    DDI_DMA_SYNC_FORDEV);
  3365	
  3366		(void) ddi_dma_sync(ahci_portp->ahciport_cmd_list_dma_handle,
  3367		    slot * sizeof (ahci_cmd_header_t),
  3368		    sizeof (ahci_cmd_header_t),
  3369		    DDI_DMA_SYNC_FORDEV);
  3370	
  3371		/* Indicate to the HBA that a command is active. */
  3372		ddi_put32(ahci_ctlp->ahcictl_ahci_acc_handle,
  3373		    (uint32_t *)AHCI_PORT_PxCI(ahci_ctlp, port),
  3374		    (0x1 << slot));



A difference between build 111 and build 112 is that in build 112
ahci_find_dev_signature() uses ahci_software_reset() instead of
ahci_port_reset().

With build 111 ahci_software_reset() is not called at boot time.

When I revert that part of the 6753962 changes, the N4L-VM can be booted:

diff --git a/usr/src/uts/common/io/sata/adapters/ahci/ahci.c b/usr/src/uts/common/io/sata/adapters/ahci/ahci.c
--- a/usr/src/uts/common/io/sata/adapters/ahci/ahci.c
+++ b/usr/src/uts/common/io/sata/adapters/ahci/ahci.c
@@ -3846,25 +3846,21 @@ ahci_find_dev_signature(ahci_ctl_t *ahci

 	ahci_portp->ahciport_device_type = SATA_DTYPE_UNKNOWN;

-	/* Issue a software reset to get the signature */
-	if (ahci_software_reset(ahci_ctlp, ahci_portp, port)
-	    != AHCI_SUCCESS) {
-		AHCIDBG1(AHCIDBG_INFO, ahci_ctlp,
-		    "ahci_find_dev_signature: software reset failed "
-		    "at port %d. cannot get signature.", port);
-		ahci_portp->ahciport_port_state = SATA_PSTATE_FAILED;
+	/* Call port reset to check link status and get device signature */
+	(void) ahci_port_reset(ahci_ctlp, ahci_portp, port);
+
+	if (ahci_portp->ahciport_device_type == SATA_DTYPE_NONE) {
+		AHCIDBG1(AHCIDBG_INFO, ahci_ctlp,
+		    "ahci_find_dev_signature: No device is found "
+		    "at port %d", port);
 		return;
 	}

-	/*
-	 * ahci_software_reset has started the port, so we need manually stop
-	 * the port again.
-	 */
-	if (ahci_put_port_into_notrunning_state(ahci_ctlp, ahci_portp, port)
-	    != AHCI_SUCCESS) {
-		AHCIDBG1(AHCIDBG_INFO, ahci_ctlp,
-		    "ahci_find_dev_signature: cannot stop port %d.", port);
-		ahci_portp->ahciport_port_state = SATA_PSTATE_FAILED;
+	/* Check the port state */
+	if (ahci_portp->ahciport_port_state & SATA_PSTATE_FAILED) {
+		AHCIDBG2(AHCIDBG_ERRS, ahci_ctlp,
+		    "ahci_find_dev_signature: port %d state 0x%x",
+		    port, ahci_portp->ahciport_port_state);
 		return;
 	}
Further analysis and a proposed solution were provided by the submitter.
The submitter's suggested fix is attached in [ahci-linkpm.txt].

---

I think I now have an explanation for what happens on the ASUS N4L-VM DH:
The ASUS BIOS configures "Aggressive Link Power Management" on the
AHCI ports before it passes control to the OS bootstrap loader!

When I start a current onnv-gate release kernel with option "-kd",
step over the startup() call in main() and look at the AHCI P0CMD
register, I see:

[0]> ffa3bc00+100+18\X
0xffa3bd18:      c040006

(AHCI memory mapped registers start at physical address ffa3bc00 on my box)


The P0CMD register has bits 26 (P0CMD.ALPE) and 27 (P0CMD.ASP) set.
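
The address arithmetic and the bit test can be checked with a few lines of C. This is a minimal sketch: the macro and function names are hypothetical, the offsets 0x100 and 0x18 are the ones used in the kmdb command above, and the bit positions are the ones named in the report.

```c
#include <stdint.h>

/* Bit positions as named in the report. */
#define	MODEL_PXCMD_ALPE	(UINT32_C(1) << 26)	/* P0CMD.ALPE */
#define	MODEL_PXCMD_ASP		(UINT32_C(1) << 27)	/* P0CMD.ASP */

/* Same arithmetic as "ffa3bc00+100+18" in the kmdb session (hex offsets). */
uint32_t
model_p0cmd_addr(uint32_t ahci_base)
{
	return (ahci_base + 0x100u + 0x18u);
}

/* Nonzero if the ALPE bit is set in a P0CMD value. */
int
model_alpe_set(uint32_t pxcmd)
{
	return ((pxcmd & MODEL_PXCMD_ALPE) != 0);
}

/* Nonzero if the ASP bit is set in a P0CMD value. */
int
model_asp_set(uint32_t pxcmd)
{
	return ((pxcmd & MODEL_PXCMD_ASP) != 0);
}
```

For the observed value 0x0c040006, both helpers report the bit as set, and model_p0cmd_addr(0xffa3bc00) gives 0xffa3bd18, the address shown by kmdb.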


Now the code in ahci_initialize_port() always disables link power management,
at line 2964, but that happens *after* ahci_find_dev_signature() at
line 2934, so ahci_find_dev_signature() runs with aggressive link power
management enabled on the ASUS N4L-VM DH.

  2923		/*
  2924		 * At the time being, only probe ports/devices and get the types of
  2925		 * attached devices during DDI_ATTACH. In fact, the device can be
  2926		 * changed during power state changes, but at the time being, we
  2927		 * don't support the situation.
  2928		 */
  2929		if (ahci_ctlp->ahcictl_flags & AHCI_ATTACH) {
  2930			/*
  2931			 * Till now we can assure a device attached to that HBA port
  2932			 * and work correctly. Now try to get the device signature.
  2933			 */
  2934			ahci_find_dev_signature(ahci_ctlp, ahci_portp, port);
  2935		} else {
  2936
  2937			/*
  2938			 * During the resume, we need to set the PxCLB, PxCLBU, PxFB
  2939			 * and PxFBU registers in case these registers were cleared
  2940			 * during the suspend.
  2941			 */
  2942			AHCIDBG1(AHCIDBG_PM, ahci_ctlp,
  2943                      "ahci_initialize_port: port %d "
  2944                      "reset the port during resume", port);
  2945			(void) ahci_port_reset(ahci_ctlp, ahci_portp, port);
  2946
  2947			AHCIDBG1(AHCIDBG_PM, ahci_ctlp,
  2948                      "ahci_initialize_port: port %d "
  2949                      "set PxCLB, PxCLBU, PxFB and PxFBU "
  2950                      "during resume", port);
  2951
  2952			/* Config Port Received FIS Base Address */
  2953			ddi_put64(ahci_ctlp->ahcictl_ahci_acc_handle,
  2954                      (uint64_t *)AHCI_PORT_PxFB(ahci_ctlp, port),
  2955                      ahci_portp->ahciport_rcvd_fis_dma_cookie.dmac_laddress);
  2956	
  2957			/* Config Port Command List Base Address */
  2958			ddi_put64(ahci_ctlp->ahcictl_ahci_acc_handle,
  2959                      (uint64_t *)AHCI_PORT_PxCLB(ahci_ctlp, port),
  2960                      ahci_portp->ahciport_cmd_list_dma_cookie.dmac_laddress);
  2961		}
  2962	
  2963		/* Disable the interface power management */
  2964		ahci_disable_interface_pm(ahci_ctlp, port);


Most likely the enabled aggressive link power management is responsible for
the "phyrdy change" that I observe.


The fix is probably to move the ahci_disable_interface_pm() call up a few
lines, so that it runs before ahci_find_dev_signature().  I'm now using an
ahci module changed with the attached patch.  With that change onnv-gate is
able to boot on the ASUS N4L-VM DH mainboard.
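
The effect of the reordering can be illustrated with a toy model. This is not the attached patch and not the driver code: all names are hypothetical, and the rule "probing with link power management enabled produces a phyrdy change" simply encodes the hypothesis stated above, which has not been proven here.

```c
/* Hypothetical, simplified port model. */
struct model_port {
	int link_pm_enabled;	/* left enabled by the BIOS in this scenario */
	int phyrdy_change_seen;	/* set when the spurious interrupt "fires" */
};

/* Models ahci_disable_interface_pm(). */
void
model_disable_interface_pm(struct model_port *p)
{
	p->link_pm_enabled = 0;
}

/* Models ahci_find_dev_signature() under the stated hypothesis. */
void
model_find_dev_signature(struct model_port *p)
{
	if (p->link_pm_enabled)
		p->phyrdy_change_seen = 1;
}

/* Returns 1 if a phyrdy change occurs during port initialization. */
int
model_initialize_port(int disable_pm_first)
{
	struct model_port p = { 1, 0 };

	if (disable_pm_first)
		model_disable_interface_pm(&p);	/* proposed ordering */
	model_find_dev_signature(&p);
	model_disable_interface_pm(&p);		/* existing call, line 2964 */
	return (p.phyrdy_change_seen);
}
```

In this model the original ordering, model_initialize_port(0), sees the phyrdy change, while the proposed ordering, model_initialize_port(1), does not.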