Category
solaris/ppp (Solaris Networking)
Sub-Category
ppp_kernel
Description
After connecting using ppp, sppp0 is established, and works to a point. I can ping, and do some generally low bandwidth things, like ssh or telnet. I can even load some web pages.
But - as soon as I attempt to load a number of pages at the one time, the ppp interface stops working.
In addition to this, once it's stopped working, an ifconfig -a will hang part way through displaying the sppp0 interface entry.
I have verified using an external host that when the issue occurs, we are indeed no longer sending packets out.
For what it's worth, I'm more than happy to collect whatever data you might like me to collect.
Frequency
Always
Regression
No
Steps to Reproduce
Start ppp interface, connect to theregister.co.uk and 'open in new tab' a bunch of the stories. by the time you have started opening the 6th tab, it'll be stopped.
If there is an already running snoop, it can been seen that there are still requests going out from snoops point if view, but no return packets.
Expected Result
One might expect that it would actually keep working and display the requested pages.
Actual Result
sppp0 interface stops working, thus, webpages etc all stop responding.
To restart the interface, one needs to pkill pppd (at times with a -9), possibly clean up interfaces with ifconfig unplumb then re-connect.
Error Message(s)
No error messages as such - Things just stop progressing.
Test Case
N/A
Workaround
Additional configuration information
Opensolaris 2008.11, updated to kernel 111a using pkg image-update from pkg.opensolaris.org/dev
Nokia 6310 USB connected phone
Phone, cable and service provider demonstrated to work fine on a different laptop running Nevada 110.
I thought it might have been a hardware issue, however, I first encountered this issue on an HP tx2-1015au, which is ATI chipset, AMD cpu, and have since encountered it on an HP (compaq) 2510p, which is an all intel affair.
More oddly, it seems to work just fine in real nevada (sxce) build 110 on an MSI Megabook S270, but not in OpenSolaris on either of my HP units.
PPP is stuck because USB is stuck. Looking at the stream in the
attached system dump, I see this:
> 0xffffff014f8ef140::queue
ADDR MODULE FLAGS NBLK
ffffff014f8ef140 usbsacm 244020 449 ffffff015dec6700
There are hundreds of messages on the queue, and nothing is moving.
Looking at the flags and transmit state on the driver, I see this:
> 0xffffff0154266900::print usbser_port_t port_state port_flags port_wq_data_cnt port_wq_thread port_flowc
port_state = 0x4
port_flags = 0x44
port_wq_data_cnt = 0xfc78
{
port_wq_thread.thr_cv = {
_opaque = 0x1
}
port_wq_thread.thr_flags = 0x1
port_wq_thread.thr_port = 0xffffff0154266900
port_wq_thread.thr_func = usbser_wq_thread
port_wq_thread.thr_arg = 0xffffff0154266958
}
port_flowc = 0
In other words, the port is open, no flow control has been asserted,
and the write queue thread is just sleeping away. I have to guess
that either the hardware itself is broken and has simply stopped
transmitting, or that the complex thread wakeup logic in usbser.c
("GSD") is broken.
I suspect the latter, as there've been *many* reports now of people
with stuck USB serial ports, and it seems unlikely that lots of people
all have defective gear.
Plus, the same hardware is known to work on other systems.
Thus, transferring over to USB team to investigate.
Work Around
There's report that if mulit-core is disabled(see psradm(1M)), there will be no hang. I don't have chance to verify it. But this could be a choice before the root cause is found.
Comments
Transferring to ppp/ppp_kernel initially for investigation. I
suspect, though, that this is another USB-related problem, as
we've seen several reports of USB trouble.
I sent a request to the submitter for more information on the
system (kernel) state at the time of the hangs.
From the coredump, we can see that the usbsacm w_queue is pending on a condition variable, which is the write queue thread's cv, thr->thr_cv:
ffffff0004abfc60 fffffffffbc2c030 0 0 60 ffffff0154266958
PC: _resume_from_idle+0xf1 TASKQ: usbsacm_usbser_taskq
stack pointer for thread ffffff0004abfc60: ffffff0004abfb30
[ ffffff0004abfb30 _resume_from_idle+0xf1() ]
swtch+0x147()
cv_wait+0x61()
usbser_wq_thread+0x64()
taskq_thread+0x193()
thread_start+8()
The thr_flags is 0x01, THR_RUNNING and no USBSER_THR_WAKE flag set, which causes the usbser_wq_thread() to step into the cv_wait branch. The usbser_thr_wake() is called in various places to set USBSER_THR_WAKE flag. usbser_rsrv, usbser_wsrv, usbser_tx_cb, usbser_status_cb, usbser_restart or usbser_restore_ports_state will call this function to trigger sending signal to the pending cv.
Both usbser_wq_thread and usbser_rq_thread could be waken to do further process and hence clear USBSER_THR_WAKE flag. There seems to be some competition.
Submitter confirms that disabling a core like this:
psradm -f 1
... will work around the problem. Reenabling the core causes
the link to choke.
usbser_wq_thread() is used to serve write queue messages. This thread is blocked for new messages, data transmit completion, status callback or break/delay expiration. That is, these funcations, usbser_wsrv, usbser_tx_cb, usbser_status_cb and usbser_restart, can wake up usbser_wq_thread.
1).usbsacm doesn't use status callback. Hence, it has nothing to do with usbser_status_cb here.
2).usbser_wsrv can always send a wake up signal as long as the port is still open.
3).TCSBRK ioctl or M_BREAK msg can cause the usbsacm send break request to device, if that device support break. If successful, usbser_restart is called after a while to wake up usbser_wq_thread.
4). usbser_tx_cb is called by DSD after completion of a message. As long as device is still online and port is open and not busy(doing tx,break,delay or ctrl), it will wake up the usbser_wq_thread.
It can be seen from the coredump that the device is online, port is open(port_state = 0x04) and not busy(port_act = 0). There seems no reason the usbser_wq_thread is blocked on a cv.
Every DSD has its own queue to hold data messages from GSD.
> 0xffffff014f872c80::print usbsacm_port_t acm_rx_mp acm_tx_mp acm_bulkin_state acm_bulkout_stat
e acm_mctlout acm_mctlin
acm_rx_mp = 0 <-- No data pending on rx queue
acm_tx_mp = 0 <-- No data pending on tx queue
acm_bulkin_state = 0x2 <-- busy
acm_bulkout_state = 0x1 <-- idle
acm_mctlout = 0x3 <-- USB_CDC_ACM_CONTROL_DTR & USB_CDC_ACM_CONTROL_RTS
acm_mctlin = 0
The usbsacm still functions normally. No data messages is pending on it. All the messages are pending at the GSD's queue.
Copied from my eamil to code reviewers:
-------- ¿¿¿¿ --------
¿¿: code review for CR6839931,6840438 and 6840063
¿¿: Wed, 27 May 2009 14:49:53 +0800
¿¿¿: Lei Chen
¿¿¿: usb-team
Hi team,
Please help me review the fix for CR6839931 and 6840438.
In the webrev, I also made a preliminary change to usbser.c to fix CR6840063. I'm still not sure of the root cause. But I think usbser should wake up the wq_thread whenever the GSD is not doing any TX or control operation. The GSD should not check its RX status because this could block further TX. Besides, the current implementation wakes up tx_thread at an unknown time by usbser_wsrv when a mblk is sent from upper module. I changed it to wake up the thread immediately after a mblk is put on write queue. I have sent the fix to submitter to verify. Please correct me if you think the above logic is not suitable.
http://greatwall.prc/~lc152243/CR6839931_6840438/
http://greatwall.prc/~lc152243/CR6840438_closed/
Thanks,
Lei Chen
From the attached log, we can see it proves my guess that the usbser_wq_thread is not waked up because of the USBSER_ACT_RX is set in pp->port_act, i.e.:
line 5067-- usbs[0].0: usbser_tx_cb: act=3 curthread=ffffff000fd03c60
The set of USBSER_ACT_RX bit prevents usbser_wq_thread from waking up. After this line, no usbser_tx_cb is called anymore. This means no data get sent out after this line though the messages continue to be pushed from upper modules. Only when the usbser_wsrv is called, the usbser_wq_thread wakes up again.