Description
When run under CentOS 3.8 within BrandZ, Emacs locks up after opening
an initial X11 window. A trace of the system calls occurring during the
lockup is attached.
STEPS TO DUPLICATE
-------------------------
Install CentOS 3.8 within BrandZ, then install the 'emacs' editor. Ssh into the
CentOS system with X11 forwarding and run 'emacs'.
EXPECTED VS ACTUAL RESULTS
-------------------------------------
(No Answer)
ERROR MESSAGES
---------------------
(No Answer)
SOURCE CODE
------------------
(No Answer)
SYSTEM INFORMATION
--------------------------
Hardware Platforms (AMD64-1, Non-Sun, ASUS K8V Delux, AMD Athlon., x64, 1GB, 100GB or more)
Operating Platforms (Sol1, Solaris 10 7/07, Solaris, JDS 3 Sun Java Desktop System Release 3, English)
The synopsis is a bit misleading. emacs isn't "locked up" so much
as it is spinning in a useless infinite loop.
The attached strace output matches the truss output I get. An
infinite series of:
[...]
read(4, "0102 :EA\0\0\0\0AE13FC\0".., 32) = 32
write(4, " +\001\0", 4) = 4
fstat64(4, 0x080469E0) = 0
read(4, 0x08046C10, 32) Err#11 EAGAIN
uucopy(0x08046AF0, 0x080469A0, 1) = 0
pollsys(0x080468E0, 1, 0x00000000, 0x00000000) = 1
uucopy(0x080469A0, 0x08046AF0, 1) = 0
fstat64(4, 0x080469E0) = 0
read(4, "0102 ;EA\0\0\0\0AE13FC\0".., 32) = 32
[...]
pstack shows:
negril# pstack 18423
18423: emacs
ceaddbc5 pollsys (80468e0, 1, 0, 0)
cea982ee pselect (5, 80469a0, ceb28260, ceb28260, 0, 0) + 19e
cea985fe select (5, 80469a0, 0, 0, 0) + 7e
cebd3132 lx_select (5, 8046af0, 0, 0, 0, 8046b88) + 1c6
cebcebbe lx_emulate (8046a6c) + 1f6
cebe08dd lx_handler (ce61cce4, 4, 0, ce571c6f, 5, 8046af0) + 51
ce498038 select (8495ff0, 8046c10, 20, 0, ce390b60, 0) + 28
ce572bc7 _XRead (8495ff0, 8046c10, 20, 80469a0, ceb28260, ceb28260) + a7
ce57370d _XReply (8495ff0, 8046c10, 0, 1, baa00201, 0) + bd
ce56eb25 XSync (8495ff0, 0, 8046ca8, 80bdc48, 8436d50, 2000070) + 65
080c9b2e ???????? (8436d50, 2000070, 0, 0, 821840c, 4821849c)
080bdc48 ???????? (8436d50, 8046d54, 1, 482183d8, 8046d50, 8195330)
08059d3d ???????? (48436d50, 1828b42c, 8046d54, 8139fb5, 1828b42c, 583fdadc)
08139f9b ???????? (2, 8046d50, 8046db8, 8164b66, 1829927c, 48436d50)
08164b03 ???????? (38218258, 482182a8, 5, 18420964, 8046e80, 0)
0813a39e ???????? (4821822c, 1, 8046e84, 182e3c8c, 74, 8046e70)
08139eac ???????? (2, 8046e80, 8046ef8, 813ee87, 18420964, 583fe2c4)
08164b03 ???????? (3820d0a4, 4820d0c8, 3, 1833c0bc, 8046fa0, 0)
0813a39e ???????? (4820d074, 1, 8046fa4, 8297c60, 1828b42c, 481d05dc)
08139eac ???????? (2, 8046fa0, 8297c60, 0, 1833c0bc, 583fe2c4)
08164b03 ???????? (3820c254, 4820c2bc, 4, 18419ccc, 80470c0, 0)
0813a39e ???????? (4820c238, 0, 80470c4, 8297c60, 1828b42c, 0)
08139eac ???????? (1, 80470c0, 8297c60, 0, 18419ccc, 18419ccc)
08164b03 ???????? (38231e64, 4823234c, 5, 1844840c, 80471f0, 0)
0813a39e ???????? (48231e4c, 0, 80471f4, 8297c60, 8047134, 0)
08139eac ???????? (1, 80471f0, 8297c60, 0, 1844840c, 3845a844)
08164b03 ???????? (38230f04, 48230f94, 5, 3, 80472d0, 0)
0813a39e ???????? (48230eec, 0, 80472ac, 816b0e0, 14000000, ce3ea0d8)
0813a25c ???????? (48230eec, 1828b42c, 1, 8047334, cebd8b8c, 11)
081391b2 ???????? (5839f514, 0, 8047508, 813811a, 8047430, 0)
080dd7e3 ???????? (8047430, 0, 804be63, 20, 182a2c84, 1828b42c)
0813811a ???????? (80dd7d0, 182a2c84, 80dd480, 0, ce3e9cd0, ceb7d710)
080dd820 ???????? (1828b42c, 80475d0, 8047540, 0, 18299674, 1828b42c)
08137c89 ???????? (18299674, 80dd7f0, 1828b42c, 0, 1, 804772c)
080dd749 ???????? (182c122c, 1828b42c, 8188280, 8124030, 0, 0)
080dd20e ???????? (80dd370, 582e27ac, 0, 32, 0, 0)
080dd343 ???????? (8188925, 0, 1, 0, 3, 804772c)
080dbb79 ???????? (1, 80479fc, 8047a04, 0, ce4f7ab8, ceb7d020)
ce3d779a __libc_start_main (80db290, 1, 80479fc, 818278c, 81827d4, ceb74cc0) + da
0804f451 XMapRaised () + 39
It may be worth noting that this happens with standard X remote display
as well as X forwarding over ssh.
The strace log included with the bug report is truncated. It shows the
endless spinning, but not what led up to it. Comparing the full strace
output on BrandZ and on a real Linux system reveals one interesting
difference: the Linux instance periodically receives SIGIOs while the
BrandZ instance does not.
Most of the SIGIOs received by the process are also sent by the process,
with a kill() system call. The exception is the first SIGIO, which arrives
almost immediately after the Linux process sets F_SETOWN on the X socket.
No such signal is received after setting the same flag on the BrandZ
process.
If I manually send the BrandZ emacs process a SIGIO, it unhangs and starts
sending/receiving the same stream of SIGIOs.
So, the question now is: why are we not receiving that first SIGIO?
If I run the following client-server program on a Linux box, the client gets
a SIGIO promptly, whereas under BrandZ it just keeps running without ever
receiving one:
beach> cat eserv.c
#include <sys/types.h>
#include <unistd.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>

/* Echo server: sends back whatever each client writes. */
int
main(int a, char* av[])
{
    int sd, sd2;
    ssize_t len;
    struct sockaddr_un saddr, saddr2;
    const char data[] = "/tmp/.emacs";
    char buf[256];
    socklen_t sz;

    sd = socket(PF_UNIX, SOCK_STREAM, 0);
    memset(&saddr, 0, sizeof(saddr));
    saddr.sun_family = AF_UNIX;
    snprintf(saddr.sun_path, sizeof(saddr.sun_path), "%s", data);
    unlink(data);    /* remove a stale socket left by a previous run */
    bind(sd, (struct sockaddr *)&saddr, sizeof(saddr));
    listen(sd, 5);
    for (;;) {
        sz = sizeof(saddr2);
        sd2 = accept(sd, (struct sockaddr *)&saddr2, &sz);
        while ((len = recv(sd2, buf, sizeof(buf), 0)) > 0)
            send(sd2, buf, len, 0);
        close(sd2);
        sleep(1);
    }
}
beach> cat eclnt.c
#include <sys/types.h>
#include <unistd.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <poll.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <sys/filio.h>    /* Solaris only (FIOASYNC); drop this line on Linux */

char buf[]="this program is to test the emacs locking up in brandz envv";
char rbuf[512];
struct pollfd pfd;

int
main(int a, char* av[])
{
    int sd;
    int flags = 0;
    struct sockaddr_un saddr;
    const char data[] = "/tmp/.emacs";
    int i = 0;
    int ret;
    pid_t pid = getpid();
    int on=1;    /* argument for the Solaris FIOASYNC variant */

    sd = socket(PF_UNIX, SOCK_STREAM, 0);
    memset(&saddr, 0, sizeof(saddr));
    saddr.sun_family = AF_UNIX;
    snprintf(saddr.sun_path, sizeof(saddr.sun_path), "%s", data);
    connect(sd, (struct sockaddr *)&saddr, sizeof(saddr));
    flags = fcntl(sd, F_GETFL);
    flags |= (O_NONBLOCK | O_NDELAY);
    fcntl(sd, F_SETFL, flags);
    while (i<1000) {
        write(sd, buf, sizeof(buf));
        if ((ret = read(sd, rbuf, sizeof(rbuf))) < 0) {
            /* Echo has not come back yet: wait for it, then retry. */
            pfd.fd = sd;
            pfd.events = POLLIN;
            pfd.revents = 0;
            poll(&pfd, 1, -1);
            read(sd, rbuf, sizeof(rbuf));
        }
        if (i >= 10) {
            /* From the 11th pass on, request SIGIO for this socket. */
            fcntl(sd, F_SETOWN, pid);
            flags |= (O_NONBLOCK | O_NDELAY | O_ASYNC);
            fcntl(sd, F_SETFL, flags);
        }
        i++;
        sleep(1);
    }
}
beach>
The linux man page for fcntl (http://www.linuxmanpages.com/man2/fcntl.2.php) says
<begin>
F_SETOWN
Set the process ID or process group that will receive SIGIO and SIGURG signals for events on file descriptor fd. Process groups are specified using negative values. (F_SETSIG can be used to specify a different signal instead of SIGIO).
If you set the O_ASYNC status flag on a file descriptor (either by providing this flag with the open(2) call, or by using the F_SETFL command of fcntl), a SIGIO signal is sent whenever input or output becomes possible on that file descriptor.
<end>
So SIGIO is sent to the process only if both F_SETOWN and O_ASYNC are set on
the fd. If I comment out the F_SETOWN line in eclnt.c, no SIGIO is sent even
on a Linux system.
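The F_SETOWN plus O_ASYNC requirement can be checked in a single process, without the separate server. The following is only a minimal sketch, assuming a Linux system with glibc; sigio_arrives() and on_sigio() are hypothetical names that do not appear in the test programs above. It uses a socketpair in place of the /tmp/.emacs socket and reports whether SIGIO showed up after input became available.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE             /* exposes O_ASYNC on glibc */
#endif
#include <fcntl.h>
#include <signal.h>
#include <string.h>
#include <sys/socket.h>
#include <time.h>
#include <unistd.h>

static volatile sig_atomic_t sigio_seen;

static void
on_sigio(int signo)
{
    (void)signo;
    sigio_seen = 1;
}

/*
 * Hypothetical helper (not part of eclnt.c): set O_ASYNC on one end of a
 * socketpair, optionally set the owner with F_SETOWN, then write to the
 * other end. Returns 1 if SIGIO arrived, 0 if not, -1 on a setup error.
 */
int
sigio_arrives(int set_owner)
{
    int sv[2], flags, i;
    struct sigaction sa;
    struct timespec tick = { 0, 10 * 1000 * 1000 };    /* 10 ms */

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = on_sigio;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGIO, &sa, NULL) < 0)
        return -1;
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
        return -1;
    sigio_seen = 0;

    if (set_owner && fcntl(sv[0], F_SETOWN, getpid()) < 0)
        goto fail;
    flags = fcntl(sv[0], F_GETFL);
    if (flags < 0 || fcntl(sv[0], F_SETFL, flags | O_NONBLOCK | O_ASYNC) < 0)
        goto fail;

    /* Make input available on sv[0]; this is what should raise SIGIO. */
    if (write(sv[1], "x", 1) != 1)
        goto fail;
    for (i = 0; i < 20 && !sigio_seen; i++)
        nanosleep(&tick, NULL);

    close(sv[0]);
    close(sv[1]);
    return sigio_seen ? 1 : 0;
fail:
    close(sv[0]);
    close(sv[1]);
    return -1;
}
```

On Linux, sigio_arrives(1) is expected to report the signal and sigio_arrives(0) is not, which matches the behaviour described above for eclnt.c with and without the F_SETOWN line.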
Now under BrandZ, the F_SETOWN call fails because of 6556585.
So I had to apply the binary patch suggested in that bug report, after
which the call succeeded.
But the F_SETFL call, which strace shows as:
fcntl64(4, F_SETFL, O_RDWR|O_NONBLOCK|O_ASYNC) = 0
is reported by truss from the global zone as:
fcntl(3, F_SETFL, FNONBLOCK) = 0
with the O_ASYNC flag dropped.
I believe this is the cause of the problem.
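A quick way to look at the flag from inside the process is to set O_ASYNC with F_SETFL and read it back with F_GETFL. This is a sketch only: async_flag_sticks() is a hypothetical helper, and whether F_GETFL inside the branded zone actually reflects the dropped flag is an assumption that has not been verified here. The truss comparison from the global zone remains the real evidence.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE             /* exposes O_ASYNC on glibc */
#endif
#include <fcntl.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Hypothetical diagnostic: returns 1 if O_ASYNC is still visible through
 * F_GETFL after being set with F_SETFL, 0 if it was lost, -1 on error.
 */
int
async_flag_sticks(void)
{
    int sv[2], flags, result = -1;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
        return -1;

    flags = fcntl(sv[0], F_GETFL);
    if (flags >= 0 &&
        fcntl(sv[0], F_SETFL, flags | O_NONBLOCK | O_ASYNC) == 0) {
        /* Read the status flags back and test the O_ASYNC bit. */
        flags = fcntl(sv[0], F_GETFL);
        if (flags >= 0)
            result = (flags & O_ASYNC) ? 1 : 0;
    }

    close(sv[0]);
    close(sv[1]);
    return result;
}
```

On a real Linux system the flag is expected to survive the round trip, so a different answer under BrandZ would point at the F_SETFL translation.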
The above programs also run natively on Solaris (after removing O_ASYNC from
the client code). No SIGIO is generated on Solaris either. But if I add the
following line
ioctl(sd, FIOASYNC, (char *)&on);
after the fcntl(F_SETFL) call in the client code, I do get SIGPOLL, as follows:
ioctl(3, FIOSETOWN, 0xFFBFEF9C) = 0
fcntl(3, F_SETFL, 0x00000086) = 0
ioctl(3, FIOASYNC, 0xFFBFF004) = 0
...
write(3, " t h i s p r o g r a m".., 60) = 60
read(3, 0x00020DD4, 512) Err#11 EAGAIN
Received signal #22, SIGPOLL [default]
beach>
So I believe lx_fcntl_setfl() in lx/lx_brand/common/fcntl.c needs to be extended
so that it does not ignore O_ASYNC.
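One possible shape for such an extension is sketched below. This is not the actual lx_fcntl_setfl() code, whose signature and flag translation are not shown in this report: setfl_with_async() and async_enabled_after_setfl() are hypothetical names, and the caller is assumed to have already translated the remaining flags and to pass the O_ASYNC request as a separate boolean. The idea, taken from the Solaris experiment above, is to follow the F_SETFL with an FIOASYNC ioctl.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE             /* exposes O_ASYNC on glibc */
#endif
#include <fcntl.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <unistd.h>
#ifdef __sun
#include <sys/filio.h>          /* FIOASYNC lives here on Solaris */
#endif

/*
 * Hypothetical stand-in for the emulation step: apply the already
 * translated status flags, then switch asynchronous notification on or
 * off with FIOASYNC. Returns 0 on success, -1 on failure.
 */
int
setfl_with_async(int fd, int native_flags, int want_async)
{
    int on = want_async ? 1 : 0;

    if (fcntl(fd, F_SETFL, native_flags) < 0)
        return -1;
    return ioctl(fd, FIOASYNC, &on) < 0 ? -1 : 0;
}

/*
 * Example use on a socketpair. Returns 1 if O_ASYNC is visible through
 * F_GETFL afterwards (the expected result on Linux), 0 if not, -1 on error.
 */
int
async_enabled_after_setfl(void)
{
    int sv[2], flags, result = -1;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
        return -1;
    if (setfl_with_async(sv[0], O_NONBLOCK, 1) == 0) {
        flags = fcntl(sv[0], F_GETFL);
        if (flags >= 0)
            result = (flags & O_ASYNC) ? 1 : 0;
    }
    close(sv[0]);
    close(sv[1]);
    return result;
}
```

Passing want_async as 0 turns notification back off, so the same path would also cover a later F_SETFL that clears O_ASYNC.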