OpenSolaris

Printable Version Enter a New Search
Bug ID 6630565
Synopsis ohci/ehci driver makes inappropriate use of drv_usecwait
State 10-Fix Delivered (Fix available in build)
Category:Subcategory driver:usb-ohci
Keywords
Responsible Engineer Raymond Chen
Reported Against
Duplicate Of
Introduced In solaris_nevada
Commit to Fix snv_81
Fixed In snv_81
Release Fixed solaris_nevada(snv_81)
Related Bugs
Submit Date 15-November-2007
Last Update Date 6-November-2008
Description
[johansen, 15Nov2007]

Jurassic, the NFS and e-mail server for ON developers, had a ton of spew in
its logfiles yesterday evening.  We discovered that as a result of running
a debug kernel, and having dtlogin touch various devices, we were loading
the ohci driver about once a minute.  Profiling the kernel at 997Hz, showed
a ton of stacks like this:

  dtlogin
              unix`tsc_read+0x3
              genunix`gethrtime+0xd
              unix`drv_usecwait+0x98
              ohci`ohci_take_control+0xbd
              ohci`ohci_init_ctlr+0x3b
              ohci`ohci_attach+0x122
              genunix`devi_attach+0x87
              genunix`attach_node+0x123
              genunix`i_ndi_config_node+0xe1
              genunix`i_ddi_attachchild+0x67
              genunix`devi_attach_node+0xfd
              genunix`devi_config_one+0x2bd
              genunix`ndi_devi_config_one+0xd8
              devfs`dv_find+0x20f
              devfs`devfs_lookup+0x69
              genunix`fop_lookup+0xf2
              genunix`lookuppnvp+0x351
              genunix`lookuppnat+0x125
              genunix`lookupnameat+0xb9
              genunix`cstatat_getvp+0x160
             1653

Looking at the paths passed to dv_find, we're able to see that we're trying
to connect USB floppy devices:
              
  /floppy@5/disk@0,0:g                                              1
  /floppy@5/disk@0,0:h                                              1
  /floppy@5/disk@0,0:i                                              1
  /floppy@5/disk@0,0:j                                              1
  /floppy@5/disk@0,0:k                                              1
  /pci108e,cb84@2/floppy@5/disk@0,0:g                               1
  /pci108e,cb84@2/floppy@5/disk@0,0:h                               1
  /pci108e,cb84@2/floppy@5/disk@0,0:i                               1
  /pci108e,cb84@2/floppy@5/disk@0,0:j                               1
  /pci108e,cb84@2/floppy@5/disk@0,0:k                               1

(Jurassic has no floppy disk drives)

Each one of these calls results in invocations of drv_usecwait.  As the
system runs around continuously binding these interrupts to cpus, it issues
many messages like this to the log:

Nov  9 08:26:16 jurassic-x4600 pcplusmp: [ID 803547 kern.info] pcplusmp: pciclas
s,0c0310 (ohci) instance 0 vector 0x15 ioapic 0x10 intin 0x15 is bound to cpu 4

The cpu number changes, depending upon which cpu is up next.  We also manage
to keep one cpu in the kernel competely busy (100% sys in mpstat) as we
run around busy waiting for the interrupt to bind to each CPU.  A quick
dtrace of a binding from CPUs 0-15 on jurassic shows us spending 105565000
microseconds busy waiting.  That's 105 seconds of system CPU time burned
as we assign an interrupt to each CPU!

Ed Pilatowicz also took a look at this problem and offered the following comments:

        looking at the ohci driver, it seems like it's a bug that
        ohci_take_control() uses drv_usecwait().  here's the code:
        ---8<---
                /* now wait for 5 seconds for InterruptRouting to go away */
                for (wait = 0; wait < 5000; wait++) {
                        if ((Get_OpReg(hcr_control) & HCR_CONTROL_IR) == 0)
                                break;
                        drv_usecwait(1000);
                }
        ---8<---

        given the level of granularity that this code is going for it seems
        like it shouldn't be using drv_usecwait() and busy waiting.  the
        the only reasons for using drv_usecwait() are if the caller is in
        interrupt context or if the caller needs very fine granularity.
        neither appear to be the case here.  looking at the ohci driver, it
        appears to use drv_usecwait() all over the place with large timeout
        values.  it seems like it could easily use some other mechanism.
        (delay(), cv_timedwait(), or a perhaps even a custom periodic timer
        via ddi_periodic_add().)

It looks like this loop of 5 second waiting is a big part of the problem, and
is wasting a terrific amount of kernel CPU time.
Work Around
[johansen, 15Nov2007]

In our case, disabling dtlogin and/or mod_uninstall_interval eliminated the problem.
Comments
N/A