I have a multi-threaded server (thread pool) that is handling a large number of requests (up to 500/sec for one node), using 20 threads. There's a listener thread that accepts incoming connections and queues them for the handler threads to process. Once the response is ready, the threads then write out to the client and close the socket. All seemed to be fine until recently, a test client program started hanging randomly after reading the response. After a lot of digging, it seems that the close() from the server is not actually disconnecting the socket. I've added some debugging prints to the code with the file descriptor number and I get this type of output.
Processing request for 21
Writing to 21
Closing 21
The return value of close() is 0, or there would be another debug statement printed. After this output with a client that hangs, lsof is showing an established connection.
SERVER 8160 root 21u IPv4 32754237 TCP localhost:9980->localhost:47530 (ESTABLISHED)
CLIENT 17747 root 12u IPv4 32754228 TCP localhost:47530->localhost:9980 (ESTABLISHED)
It's as if the server never sends the shutdown sequence to the client, and this state hangs until the client is killed, leaving the server side in a close wait state
SERVER 8160 root 21u IPv4 32754237 TCP localhost:9980->localhost:47530 (CLOSE_WAIT)
Also if the client has a timeout specified, it will timeout instead of hanging. I can also manually run
call close(21)
in the server from gdb, and the client will then disconnect. This happens maybe once in 50,000 requests, but might not happen for extended periods.
Linux version: 2.6.21.7-2.fc8xen
Centos version: 5.4 (Final)
socket actions are as follows
SERVER:
int client_socket;
struct sockaddr_in client_addr;
socklen_t client_len = sizeof(client_addr);
while(true) {
client_socket = accept(incoming_socket, (struct sockaddr *)&client_addr, &client_len);
if (client_socket == -1)
continue;
/* insert into queue here for threads to process */
}
Then the thread picks up the socket and builds the response.
/* get client_socket from queue */
/* processing request here */
/* now set to blocking for write; was previously set to non-blocking for reading */
int flags = fcntl(client_socket, F_GETFL);
if (flags < 0)
abort();
if (fcntl(client_socket, F_SETFL, flags|O_NONBLOCK) < 0)
abort();
server_write(client_socket, response_buf, response_length);
server_close(client_socket);
server_write and server_close.
void server_write( int fd, char const *buf, ssize_t len ) {
printf("Writing to %d
", fd);
while(len > 0) {
ssize_t n = write(fd, buf, len);
if(n <= 0)
return;// I don't really care what error happened, we'll just drop the connection
len -= n;
buf += n;
}
}
void server_close( int fd ) {
for(uint32_t i=0; i<10; i++) {
int n = close(fd);
if(!n) {//closed successfully
return;
}
usleep(100);
}
printf("Close failed for %d
", fd);
}
CLIENT:
Client side is using libcurl v 7.27.0
CURL *curl = curl_easy_init();
CURLcode res;
curl_easy_setopt( curl, CURLOPT_URL, url);
curl_easy_setopt( curl, CURLOPT_WRITEFUNCTION, write_callback );
curl_easy_setopt( curl, CURLOPT_WRITEDATA, write_tag );
res = curl_easy_perform(curl);
Nothing fancy, just a basic curl connection. Client hangs in tranfer.c (in libcurl) because the socket is not perceived as being closed. It's waiting for more data from the server.
Things I've tried so far:
Shutdown before close
shutdown(fd, SHUT_WR);
char buf[64];
while(read(fd, buf, 64) > 0);
/* then close */
Setting SO_LINGER to close forcibly in 1 second
struct linger l;
l.l_onoff = 1;
l.l_linger = 1;
if (setsockopt(client_socket, SOL_SOCKET, SO_LINGER, &l, sizeof(l)) == -1)
abort();
These have made no difference. Any ideas would be greatly appreciated.
EDIT -- This ended up being a thread-safety issue inside a queue library causing the socket to be handled inappropriately by multiple threads.
See Question&Answers more detail:
os