Description
While running SunRay 4.1 on OpenSolaris 2008.11 on an X2250 system, we have seen the system panic multiple times with this stack trace:
unix: panicsys+0x9b (0xfffffffffb916520,0xffffff003ee18650,panic_stack+0x1f10,1
)
unix: vpanic+0x15d (0xfffffffffb916520,0xffffff003ee18650,0xffffff0910da0680,0,
0xffffff003ee185a0,0xffffff003ee18690)
unix: panic+0x94 ()
unix: mutex_panic+0x73 ("mutex_exit: not owner",0xffffff0910da0680)
unix: mutex_vector_exit+0x41 (0xffffff0910da0680)
unix: mutex_exit (?)
ioat: ioat_cmd_post+0x225 (0xffffff090e9eb928,0xffffff0926d62080)
dcopy: dcopy_cmd_post+0x56 (0xffffff0926d62080)
unix: unix`dcopy_cmd_post (0xffffff0926d62080)
genunix: uioamove+0x1b2 (0xffffff09144a505c,0xd5,UIO_READ,0xffffff096da4eae8)
genunix: struioainit+0x7b (0xffffff09164aed20,0xffffff096da4eab0,
0xffffff096da4eae8)
genunix: strget+0x2af (0xffffff096db72320,0xffffff09164aed20,0xffffff096da4eae8,
1,0xffffff003ee18ae8)
genunix: kstrgetmsg+0x2ed (0xffffff094d713c40,0xffffff003ee18c00,
0xffffff096da4eae8,0xffffff003ee18bf3,0xffffff003ee18bec,0xffffffffffffffff,
0xffffff003ee18bf8)
sockfs: sotpi_recvmsg+0x392 (0xffffff094d712d48,0xffffff003ee18c60,
0xffffff003ee18e20)
sockfs: socktpi_read+0x79 (0xffffff094d713c40,0xffffff003ee18e20,0,
0xffffff0974911d50,0)
genunix: fop_read+0x6b (0xffffff094d713c40,0xffffff003ee18e20,0,
0xffffff0974911d50,0)
genunix: read+0x2b8 (0x3c,0xf3939770,0x2000)
genunix: read32+0x22 (0x3c,0xf3939770,0x2000)
unix: _sys_sysenter_post_swapgs+0x14b ()
Digging a little deeper into the dump, we find that ioat_cmd_post() is trying to
release a mutex in response to the first of its error tests:
912 channel = (ioat_channel_t)private;
913 priv = cmd->dp_private->pr_device_cmd_private;
914
915 state = channel->ic_state;
916 ring = channel->ic_ring;
917
918 mutex_enter(&ring->cr_desc_mutex);
919
920 /* if the channel has had a fatal failure, return failure */
921 if (channel->ic_channel_state == IOAT_CHANNEL_IN_FAILURE) {
922 mutex_exit(&ring->cr_cmpl_mutex);
923 return (DCOPY_FAILURE);
924 }
925
926 /* make sure we have space for the descriptors */
927 e = ioat_ring_reserve(channel, ring, cmd);
928 if (e != DCOPY_SUCCESS) {
929 mutex_exit(&ring->cr_cmpl_mutex);
930 return (DCOPY_NORESOURCES);
931 }
932
933 /* if we support DCA, and the DCA flag is set, post a DCA desc */
Note that on line 918 we take cr_desc_mutex, but in the error return on line 922 we try to release cr_cmpl_mutex, which this thread does not hold. That is why we take the "mutex_exit: not owner" panic.
The same problem exists on line 929 in the other error return.
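The mismatch can be reproduced outside the kernel. What follows is a minimal user-space sketch, not the driver code: it assumes POSIX error-checking mutexes as a stand-in for kernel mutexes, and ring_model_t, ring_model_init(), post_buggy() and post_fixed() are hypothetical names made up for illustration. Only the two mutex names come from the source above. An error-checking pthread mutex reports EPERM where the kernel panics with "mutex_exit: not owner".

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <errno.h>
#include <pthread.h>

/* Hypothetical user-space stand-in for the ring's two mutexes. */
typedef struct ring_model {
	pthread_mutex_t cr_desc_mutex;
	pthread_mutex_t cr_cmpl_mutex;
} ring_model_t;

/* Create both mutexes as error-checking so a bad unlock is reported. */
static void
ring_model_init(ring_model_t *ring)
{
	pthread_mutexattr_t attr;
	int e;

	e = pthread_mutexattr_init(&attr);
	assert(e == 0);
	e = pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK);
	assert(e == 0);
	e = pthread_mutex_init(&ring->cr_desc_mutex, &attr);
	assert(e == 0);
	e = pthread_mutex_init(&ring->cr_cmpl_mutex, &attr);
	assert(e == 0);
	(void) pthread_mutexattr_destroy(&attr);
}

/*
 * Same shape as the reported error path: lock cr_desc_mutex, then
 * unlock cr_cmpl_mutex. Returns the result of the mismatched unlock.
 */
static int
post_buggy(ring_model_t *ring, int in_failure)
{
	int e = 0;

	(void) pthread_mutex_lock(&ring->cr_desc_mutex);
	if (in_failure) {
		/* wrong mutex: this thread never locked cr_cmpl_mutex */
		e = pthread_mutex_unlock(&ring->cr_cmpl_mutex);
	}
	/* release the mutex we really hold so the model stays usable */
	(void) pthread_mutex_unlock(&ring->cr_desc_mutex);
	return (e);
}

/* Every return path releases the mutex that was actually taken. */
static int
post_fixed(ring_model_t *ring, int in_failure)
{
	(void) pthread_mutex_lock(&ring->cr_desc_mutex);
	if (in_failure) {
		return (pthread_mutex_unlock(&ring->cr_desc_mutex));
	}
	/* ... descriptor work would happen here ... */
	return (pthread_mutex_unlock(&ring->cr_desc_mutex));
}
```

In this model post_buggy() with the failure flag set gets EPERM back from the unlock, while post_fixed() returns 0 on both paths. By analogy, the likely driver fix is for the error returns on lines 922 and 929 to release cr_desc_mutex, though that is an inference from the excerpt rather than a confirmed patch.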
Although the panic fix is quite simple, the real question is why the
H/W had a fatal error. There should have been a message sent to syslog
for this failure before the panic:
cmn_err(CE_WARN, "channel(%d) fatal failure! "
"chanstat_lo=0x%X; chanerr=0x%X\n",
channel->ic_chan_num, status, chanerr);
channel->ic_channel_state = IOAT_CHANNEL_IN_FAILURE;
Do you have a panic dump? If so, what does the above error say if you
do a ::msgbuf?
There are a couple of dumps available - if you log in to upland.uk they are under /var/crash.
The msgbuf shows a number of these:
WARNING: channel(0) fatal failure! chanstat_lo=0x1039D83; chanerr=0x2