Solving TCP connection timeouts on Linux
- 2020-05-30 21:45:16
- OfStack
Recently, connection timeouts have been occurring frequently on the production line. Let's first look at how the connection timeout exception is produced in Java.
The timeout exception in Java
Client-side exception: connect timed out
java.net.SocketTimeoutException: connect timed out
at java.net.PlainSocketImpl.socketConnect(Native Method)
at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:345)
at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
at java.net.Socket.connect(Socket.java:589)
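For reference, this is the exception you get from Socket.connect with an explicit timeout. A minimal sketch that reproduces it (the host 10.255.255.1 and the port are hypothetical; any address whose SYN goes unanswered within the timeout will do):
import java.net.InetSocketAddress;
import java.net.Socket;
import java.net.SocketTimeoutException;

public class ConnectTimeoutDemo {
    public static void main(String[] args) throws Exception {
        try (Socket socket = new Socket()) {
            // 2000 ms connect timeout: if no SYN/ACK comes back in time,
            // the native code shown below throws SocketTimeoutException
            socket.connect(new InetSocketAddress("10.255.255.1", 8080), 2000);
        } catch (SocketTimeoutException e) {
            e.printStackTrace(); // "connect timed out"
        }
    }
}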
The exception we keep seeing is connect timed out. So how does Java actually produce it? The answer is in the JDK's native code:
PlainSocketImpl.c
    while (1) {
        jlong newTime;
#ifndef USE_SELECT
        {
            struct pollfd pfd;
            pfd.fd = fd;
            pfd.events = POLLOUT;

            errno = 0;
            connect_rv = NET_Poll(&pfd, 1, timeout);
        }
#else
        {
            fd_set wr, ex;
            struct timeval t;

            t.tv_sec = timeout / 1000;
            t.tv_usec = (timeout % 1000) * 1000;

            FD_ZERO(&wr);
            FD_SET(fd, &wr);
            FD_ZERO(&ex);
            FD_SET(fd, &ex);

            errno = 0;
            connect_rv = NET_Select(fd+1, 0, &wr, &ex, &t);
        }
#endif

        if (connect_rv >= 0) {
            break;
        }
        if (errno != EINTR) {
            break;
        }

        /*
         * The poll was interrupted so adjust timeout and
         * restart
         */
        newTime = JVM_CurrentTimeMillis(env, 0);
        timeout -= (newTime - prevTime);
        if (timeout <= 0) {
            connect_rv = 0;
            break;
        }
        prevTime = newTime;

    } /* while */

    if (connect_rv == 0) {
        JNU_ThrowByName(env, JNU_JAVANETPKG "SocketTimeoutException",
                        "connect timed out");

        /*
         * Timeout out but connection may still be established.
         * At the high level it should be closed immediately but
         * just in case we make the socket blocking again and
         * shutdown input & output.
         */
        SET_BLOCKING(fd);
        JVM_SocketShutdown(fd, 2);
        return;
    }
You can see that connect ends up calling NET_Poll or NET_Select, which on Linux map to poll/select.
When the timeout expires, connect_rv is 0. There is one subtle point: even though the timeout is passed into poll/select, the call can still be interrupted by a signal. In that case connect_rv comes back as -1 with errno == EINTR, so the JVM recalculates the remaining timeout and loops again, making sure the full timeout slice has been consumed before leaving the loop.
newTime = JVM_CurrentTimeMillis(env, 0);
timeout -= (newTime - prevTime);
if (timeout <= 0) {
    connect_rv = 0;
    break;
}
Once the remaining time is used up, connect_rv is set to 0, and the "connect timed out" exception is thrown only when connect_rv is 0.
So what exactly is a connect timeout?
It means the client sent a SYN packet, the server did not reply with a SYN/ACK within the time you specified, and poll/select returned 0.
Why would the server not reply? The SYN is answered in the kernel, so either the packet was lost at the network layer, or the kernel's backlog queue is full. The backlog itself will not be described in detail in this article.
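As a rough local illustration of the backlog-full case (a sketch only; the exact behavior depends on kernel settings such as net.ipv4.tcp_syncookies and tcp_abort_on_overflow), a listener that never calls accept() and uses a tiny backlog will quickly stop answering new SYNs, and later clients hit the connect timeout:
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.Socket;

public class BacklogFullDemo {
    public static void main(String[] args) throws Exception {
        // Listen with a backlog of 1 and never call accept(),
        // so the accept queue fills up almost immediately.
        ServerSocket server = new ServerSocket(0, 1);
        int port = server.getLocalPort();

        for (int i = 0; i < 10; i++) {
            try (Socket s = new Socket()) {
                // Once the backlog is full the kernel typically drops new SYNs,
                // so these connects hang and time out, just like on the production line.
                s.connect(new InetSocketAddress("127.0.0.1", port), 1000);
                System.out.println("connection " + i + " established");
            } catch (Exception e) {
                System.out.println("connection " + i + " failed: " + e);
            }
        }
    }
}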
At the time, the production line had at most a little over 1000 connections, so the backlog queue size was checked:
cat /proc/sys/net/ipv4/tcp_max_syn_backlog
It is 8192 on the production line, and there are nowhere near enough client connections to fill a backlog queue of that size. But a tcp_max_syn_backlog of 8192 does not mean the server actually started listening with a backlog of 8192, so you have to check the backlog actually in effect on the port:
ss -lt
You can see that Send-Q is 128 for port 8080: the server side originally called listen with a backlog of 128.
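The output looks roughly like this (illustrative values; for a listening socket the Send-Q column shows the effective backlog):
State    Recv-Q   Send-Q   Local Address:Port   Peer Address:Port
LISTEN   0        128      *:8080               *:*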
Looking at the Tomcat configuration, which uses the default BIO connector:
<Connector executor="tomcatThreadPool"
port="8080"
protocol="HTTP/1.1"
acceptCount="5000"
connectionTimeout="25000"
maxHttpHeaderSize="8192"
useBodyEncodingForURI="true"
enableLookups="false"
redirectPort="8443"
URIEncoding="UTF-8"
maxThreads="500"
maxKeepAliveRequests="1000"
keepAliveTimeout="30000"
/>
The production line sets acceptCount (default 100) to 5000, which is seriously inconsistent with the Send-Q value seen through ss.
Analysis of the kernel code shows that the backlog is not controlled by tcp_max_syn_backlog alone; the listen backlog is also capped by net.core.somaxconn.
Check it:
cat /proc/sys/net/core/somaxconn
The value is 128, so the cause is found. Fix it by adding the following to /etc/sysctl.conf:
net.core.somaxconn = 8192
Then run sysctl -p /etc/sysctl.conf to reload the settings, and the change takes effect.
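To make the relationship concrete: the backlog an application asks for is only a request, and the kernel effectively uses min(application backlog, net.core.somaxconn). A small sketch (hypothetical port; run ss -lt against the running process to see the actual Send-Q):
import java.net.ServerSocket;

public class BacklogClampDemo {
    public static void main(String[] args) throws Exception {
        // Ask for a backlog of 5000, as Tomcat does with acceptCount="5000".
        // With net.core.somaxconn=128 the kernel silently clamps this to 128,
        // which is exactly the Send-Q value that ss -lt reports.
        // After raising somaxconn and restarting, Send-Q reflects the requested value.
        ServerSocket server = new ServerSocket(8080, 5000);
        System.out.println("listening on " + server.getLocalPort()
                + ", requested backlog 5000, effective backlog = min(5000, somaxconn)");
        Thread.sleep(Long.MAX_VALUE); // keep listening so ss can inspect the socket
    }
}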
One question remains: there were only a little over 1000 connections and 500 worker threads. The backlog queue is drained by socket accept() calls, and normally a dedicated thread sits in serverSocket.accept(); the server load was not high, so with only 1000-odd connections the backlog queue should not have filled up. Code is the truth, so let's look at the Tomcat source.
Before accepting, the Acceptor thread calls countUpOrAwaitConnection(); once the number of accepted connections reaches the configured limit (tied to the number of worker threads in the BIO connector), it blocks and accept() stops being called.
countUpOrAwaitConnection();

Socket socket = null;
try {
    // Accept the next incoming connection from the server
    // socket
    socket = serverSocketFactory.acceptSocket(serverSocket);
} catch (IOException ioe) {
    countDownConnection();
    // Introduce delay if necessary
    errorDelay = handleExceptionWithDelay(errorDelay);
    // re-throw
    throw ioe;
}
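Conceptually, countUpOrAwaitConnection is a connection-count limiter: it counts accepted-but-not-yet-closed connections and blocks the Acceptor when the limit is reached, so accept() stops draining the listen backlog. A simplified sketch of that behavior (not Tomcat's actual implementation, which uses a LimitLatch inside AbstractEndpoint):
import java.util.concurrent.Semaphore;

// One permit per allowed concurrent connection.
public class ConnectionLimiter {
    private final Semaphore permits;

    public ConnectionLimiter(int maxConnections) { // e.g. 500, matching maxThreads in BIO
        this.permits = new Semaphore(maxConnections);
    }

    // Called by the Acceptor before serverSocket.accept();
    // blocks when all permits are in use, so no further connections are accepted.
    public void countUpOrAwaitConnection() throws InterruptedException {
        permits.acquire();
    }

    // Called when a connection is closed (or accept fails).
    public void countDownConnection() {
        permits.release();
    }
}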
In other words, once there are more than 628 concurrent connections (500 busy worker threads plus the 128-slot backlog), the backlog queue can fill up and clients start seeing connect timeouts.
Thank you for reading. I hope this helps, and thank you for your support of this site!