|
Description
|
While fixing bug 5106644 in xge, I found a problem with buffers allocated by desballoc(). Once such buffers have been loaned to the upper layers, GLD and the NIC driver lose track of them, so the user can unplumb the interface and remove the driver freely. The system panics when the loaned buffers are released after the unplumb/rem_drv.
I can reproduce it easily with a modified ttcp (provided by Jerry Chu):
(You can find this tool here: /net/jurassic.sfbay/export/rahi/hkchu/tools/ttcp/i386)
1. Start the receive-side `ttcp` with a 1MB receive window and 10 connections: `ttcp -s -b1m -M 10 -r`.
2. Start the transmit-side `ttcp` on the other machine: `ttcp -s -b1m -M 10 -t remotehost`.
3. Press `Ctrl-Z` to suspend the receive-side ttcp. (Make sure the receive-side ttcp is holding some buffers.)
4. Run `ifconfig xge0 unplumb` and `rem_drv xge` on the receive side.
5. Resume the receive-side ttcp. The system will panic.
I've reproduced panics with e1000g and ixgb. This is a race condition panic.
Here are the core dump stacks:
XGE:
panic[cpu2]/thread=ffffffff881a1ae0:
mutex_enter: bad mutex, lp=ffffffff8a95cb80 owner=30000000181e8 thread=ffffffff881a1ae0
fffffe800098eaa0 unix:mutex_panic+71 ()
fffffe800098eb00 unix:mutex_vector_enter+56 ()
fffffe800098eb20 xge:xgell_rx_recycle+1a ()
fffffe800098eb40 genunix:dblk_lastfree_desb+1a ()
fffffe800098eb70 genunix:freeb+7d ()
fffffe800098ebb0 genunix:struiocopyout+83 ()
fffffe800098ec50 genunix:kstrgetmsg+6e5 ()
fffffe800098ed10 sockfs:sotpi_recvmsg+1a8 ()
fffffe800098ed70 sockfs:socktpi_read+77 ()
fffffe800098ed80 genunix:fop_read+b ()
fffffe800098eeb0 genunix:read+2b4 ()
fffffe800098eec0 genunix:read32+e ()
fffffe800098ef10 unix:sys_syscall32+101 ()
E1000g:
panic[cpu0]/thread=ffffffff80bfd8a0:
BAD TRAP: type=d (#gp General protection) rp=fffffe8000791a20 addr=feea8414
ttcp.jerry:
#gp General protection
addr=0xfeea8414
pid=478, pc=0xfffffffffb9cf545, sp=0xfffffe8000791b10, eflags=0x10282
cr0: 8005003b<pg,wp,ne,et,ts,mp,pe> cr4: 6f0<xmme,fxsr,pge,mce,pae,pse>
cr2: feea8414 cr3: f114000 cr8: c
rdi: deadbeefdeadbeef rsi: ffffffff87fa2900 rdx: ffffffff844b4028
rcx: 0 r8: fffffe8000791c14 r9: ffffffff80bfd8a0
rax: 0 rbx: ffffffff87fa2900 rbp: fffffe8000791b20
r10: fb0082c51b6001ff r11: 0 r12: ffffffff87fa2900
r13: 0 r14: fffffe8000791e50 r15: fffffe8000791c14
fsb: ffffffff80000000 gsb: fffffffffbc22920 ds: 43
es: 43 fs: 0 gs: 1c3
trp: d err: 0 rip: fffffffffb9cf545
cs: 28 rfl: 10282 rsp: fffffe8000791b10
ss: 30
fffffe8000791930 unix:real_mode_end+48c1 ()
fffffe8000791a10 unix:trap+913 ()
fffffe8000791a20 unix:_cmntrap+13f ()
fffffe8000791b20 genunix:dblk_lastfree_desb+15 ()
fffffe8000791b50 genunix:freeb+7b ()
fffffe8000791b90 genunix:struiocopyout+5c ()
fffffe8000791c40 genunix:kstrgetmsg+77e ()
fffffe8000791d10 sockfs:sotpi_recvmsg+253 ()
fffffe8000791d70 sockfs:socktpi_read+87 ()
fffffe8000791d80 genunix:fop_read+b ()
fffffe8000791eb0 genunix:read+19a ()
fffffe8000791ec0 genunix:read32+e ()
fffffe8000791f10 unix:sys_syscall32+101 ()
Looking through the code path, GLD (both v2 and v3) does not know whether desballoc buffers from the driver are still held by the upper layers. mac_stop() has no way to return a value to its caller. The NIC driver neither counts its loaned buffers nor prevents itself from being unloaded. This is a defect in the NIC driver, although GLD could help out if mac_stop() returned a value for it to check.
My suggestions on the fix:
1. Fix it in the NIC driver. NIC drivers that use (d)esballoc buffers should count the number of buffers that have been sent up. The driver should check that all of those buffers have been freed before it allows itself to be unloaded. This avoids the problem entirely within the NIC driver.
2. Fix it in both the NIC driver and GLD. GLD could check the return value of mac_m_stop(), or wait in mac_stop() for the driver to report "ready to unplumb", and the NIC driver could count its buffers and refuse to be unplumbed immediately. According to the Nemo spec, the interface should be "in a reset/quiesced state that interface can be unregistered". I prefer this approach.
3. Enhance GLD to manage the memory used by NIC drivers. Most zero-copy NIC drivers have to handle esballoc buffers carefully, and they do so in similar ways. If GLD handled such buffers and recycled them outside the NIC driver, the problem would be eliminated, and new NIC drivers would be easier to write. Is this one of Nemo's aims?
See bug 5106644 for the other common bug involving esballoc buffers in GLD NIC drivers.
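The buffer counting in suggestions 1 and 2 could look roughly like the sketch below. It is a user-space model only, not code from xge or e1000g: drv_state_t, rx_buf_t, rx_loan(), rx_recycle() and drv_detach() are hypothetical names, malloc()/free() stand in for desballoc() and the frtn_t free callback, and returning EBUSY is an assumption about how the driver would refuse to stop or detach.

```c
#include <errno.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical per-instance driver state (not taken from any real driver). */
typedef struct drv_state {
	atomic_int rx_loaned;	/* buffers currently held by upper layers */
} drv_state_t;

/* Hypothetical receive buffer; stands in for a desballoc()-backed mblk. */
typedef struct rx_buf {
	drv_state_t *owner;
	void (*free_func)(struct rx_buf *);	/* models the frtn_t callback */
} rx_buf_t;

/* Free callback: runs when the upper layer finally releases the buffer. */
static void
rx_recycle(rx_buf_t *buf)
{
	atomic_fetch_sub(&buf->owner->rx_loaned, 1);
	free(buf);
}

/* Loan a buffer upstream, counting it before it leaves the driver. */
static rx_buf_t *
rx_loan(drv_state_t *drv)
{
	rx_buf_t *buf = malloc(sizeof (*buf));

	if (buf == NULL)
		return (NULL);
	buf->owner = drv;
	buf->free_func = rx_recycle;
	atomic_fetch_add(&drv->rx_loaned, 1);
	return (buf);
}

/* Models the stop/detach path: refuse while any buffer is still on loan. */
static int
drv_detach(drv_state_t *drv)
{
	if (atomic_load(&drv->rx_loaned) != 0)
		return (EBUSY);
	return (0);
}
```

With a counter like this, the reproduction above would fail at the rem_drv step (or block there, if the driver chose to wait) instead of panicking later in the free callback. A real driver would also have to close the window between the count reaching zero and the callback returning.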
|