OpenSolaris

Bug ID 6802889
Synopsis ioat_cmd_post() panics with "mutex not held"
State 10-Fix Delivered (Fix available in build)
Category:Subcategory kernel:tcp-ip
Keywords 2009.06-reviewed
Responsible Engineer Mark Johnson
Reported Against snv_101
Duplicate Of
Introduced In solaris_nevada
Commit to Fix snv_112
Fixed In snv_111a
Release Fixed solaris_nevada(snv_111a)
Related Bugs
Submit Date 9-February-2009
Last Update Date 21-April-2009
Description
While running SunRay 4.1 on OpenSolaris 2008.11 on an X2250 system we have had the system panic multiple times with this stack trace :

unix: panicsys+0x9b (0xfffffffffb916520,0xffffff003ee18650,panic_stack+0x1f10,1
)
unix: vpanic+0x15d (0xfffffffffb916520,0xffffff003ee18650,0xffffff0910da0680,0,
0xffffff003ee185a0,0xffffff003ee18690)
unix: panic+0x94 ()
unix: mutex_panic+0x73 ("mutex_exit: not owner",0xffffff0910da0680)
unix: mutex_vector_exit+0x41 (0xffffff0910da0680)
unix: mutex_exit (?)
ioat: ioat_cmd_post+0x225 (0xffffff090e9eb928,0xffffff0926d62080)
dcopy: dcopy_cmd_post+0x56 (0xffffff0926d62080)
unix: unix`dcopy_cmd_post (0xffffff0926d62080)
genunix: uioamove+0x1b2 (0xffffff09144a505c,0xd5,UIO_READ,0xffffff096da4eae8)
genunix: struioainit+0x7b (0xffffff09164aed20,0xffffff096da4eab0,
0xffffff096da4eae8)
genunix: strget+0x2af (0xffffff096db72320,0xffffff09164aed20,0xffffff096da4eae8,
1,0xffffff003ee18ae8)
genunix: kstrgetmsg+0x2ed (0xffffff094d713c40,0xffffff003ee18c00,
0xffffff096da4eae8,0xffffff003ee18bf3,0xffffff003ee18bec,0xffffffffffffffff,
0xffffff003ee18bf8)
sockfs: sotpi_recvmsg+0x392 (0xffffff094d712d48,0xffffff003ee18c60,
0xffffff003ee18e20)
sockfs: socktpi_read+0x79 (0xffffff094d713c40,0xffffff003ee18e20,0,
0xffffff0974911d50,0)
genunix: fop_read+0x6b (0xffffff094d713c40,0xffffff003ee18e20,0,
0xffffff0974911d50,0)
genunix: read+0x2b8 (0x3c,0xf3939770,0x2000)
genunix: read32+0x22 (0x3c,0xf3939770,0x2000)
unix: _sys_sysenter_post_swapgs+0x14b ()


Digging a little deeper into the dump, we find that ioat_cmd_post() is trying to
release a mutex in response to the first of its error tests:

    912         channel = (ioat_channel_t)private;
    913         priv = cmd->dp_private->pr_device_cmd_private;
    914 
    915         state = channel->ic_state;
    916         ring = channel->ic_ring;
    917 
    918         mutex_enter(&ring->cr_desc_mutex);
    919 
    920         /* if the channel has had a fatal failure, return failure */
    921         if (channel->ic_channel_state == IOAT_CHANNEL_IN_FAILURE) {
    922                 mutex_exit(&ring->cr_cmpl_mutex);
    923                 return (DCOPY_FAILURE);
    924         }
    925 
    926         /* make sure we have space for the descriptors */
    927         e = ioat_ring_reserve(channel, ring, cmd);
    928         if (e != DCOPY_SUCCESS) {
    929                 mutex_exit(&ring->cr_cmpl_mutex);
    930                 return (DCOPY_NORESOURCES);
    931         }
    932 
    933         /* if we support DCA, and the DCA flag is set, post a DCA desc */


Note that on line 918 we take the cr_desc_mutex, but in the error return on line 922 we try to release the cr_cmpl_mutex, which this thread does not hold. That is why we take the "mutex_exit: not owner" panic.

The same problem exists on line 929 in the other error return.
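
The mismatch can be illustrated in user space. The following is a minimal sketch, assuming POSIX error-checking mutexes as a stand-in for the kernel's mutex_enter()/mutex_exit(); ring_t, ring_init(), post_buggy() and post_fixed() are hypothetical names made up for this illustration, and this is not the driver code or the delivered fix. Where the kernel panics, the error-checking mutex simply returns EPERM.

```c
#define _XOPEN_SOURCE 700	/* expose PTHREAD_MUTEX_ERRORCHECK */
#include <assert.h>
#include <errno.h>
#include <pthread.h>

/* Hypothetical stand-in for the ring's two locks named in the report. */
typedef struct ring {
	pthread_mutex_t cr_desc_mutex;
	pthread_mutex_t cr_cmpl_mutex;
} ring_t;

/* Initialise both locks as error-checking so a bad unlock is reported. */
static int
ring_init(ring_t *ring)
{
	pthread_mutexattr_t attr;
	int rc;

	if ((rc = pthread_mutexattr_init(&attr)) != 0)
		return (rc);
	rc = pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK);
	if (rc == 0)
		rc = pthread_mutex_init(&ring->cr_desc_mutex, &attr);
	if (rc == 0)
		rc = pthread_mutex_init(&ring->cr_cmpl_mutex, &attr);
	(void) pthread_mutexattr_destroy(&attr);
	return (rc);
}

/* Same shape as the reported error path: lock one mutex, unlock the other. */
static int
post_buggy(ring_t *ring)
{
	int rc;

	(void) pthread_mutex_lock(&ring->cr_desc_mutex);
	/* Wrong lock: this thread never took cr_cmpl_mutex. */
	rc = pthread_mutex_unlock(&ring->cr_cmpl_mutex);
	/* Demo-only cleanup so the sketch does not leave the lock held. */
	(void) pthread_mutex_unlock(&ring->cr_desc_mutex);
	return (rc);
}

/* Error path that releases the same mutex it acquired. */
static int
post_fixed(ring_t *ring)
{
	int rc;

	(void) pthread_mutex_lock(&ring->cr_desc_mutex);
	rc = pthread_mutex_unlock(&ring->cr_desc_mutex);
	assert(rc == 0);
	return (rc);
}
```

In this sketch post_buggy() reports EPERM from the mismatched unlock, while post_fixed() returns 0. The same pairing rule applies to both error returns quoted above (lines 922 and 929).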
Although the fix for the panic is quite simple, the real question is why the
H/W had a fatal error. A message should have been sent to syslog
for this failure before the panic:

                cmn_err(CE_WARN, "channel(%d) fatal failure! "
                    "chanstat_lo=0x%X; chanerr=0x%X\n",
                    channel->ic_chan_num, status, chanerr);
                channel->ic_channel_state = IOAT_CHANNEL_IN_FAILURE;

Do you have a panic dump?  If so, what does the above error say if you
do a ::msgbuf?
There are a couple of dumps available. If you log in to upland.uk, they are under /var/crash.

The msgbuf shows a number of these warnings:

WARNING: channel(0) fatal failure! chanstat_lo=0x1039D83; chanerr=0x2
Work Around
Disable ioat in the BIOS.
Comments
N/A