A story of EADDRINUSE and ECONNRESET errors
The ECONNRESET error was silenced before node.js 0.8.20, but now it is not (which is good). You may think: hmm, but how often will I see it? Not very often. Wrong! It may happen more often than you think, because many load balancers use it to close connections (on purpose, not in a clean way!).
Have you ever seen an EADDRINUSE error in node.js? Almost everyone has. But I’m not thinking about the one related to the listen() method, which you see when trying to start a server. I’m thinking about the connect one, which you get when making an outbound connection. It happens only when you have heavy outbound traffic, especially HTTP(S).
You may not think that those errors have anything to do with the nasty internals of TCP, but they do… And in Windows Azure Web Sites there is only one universal way of getting rid of them…
If you do not want to read everything, go to My recommendations.
The background
Let’s assume you have a node.js server running on Windows Azure Web Sites that uses SQL Azure, Azure Table Storage, Apple Push Server, Google GCM and the Facebook API. You start to see ECONNRESET connect errors when traffic gets bigger (> 20 req/s), but only for Azure Table Storage requests (often there are several of them for every incoming request).
Looking for solution - first try
So, ECONNRESET starts to appear. Have I just upgraded to node.js >= 0.8.20? Yes, so I have to handle it. But wait! This error means the connection was not dropped in a clean way, but why? Do I have long-lived connections? Table Storage uses HTTP(S), so it should make a connection and then close it. In node.js 0.8 and 0.10 this is true, but only when there are no more than 5 concurrent connections. Gotcha!
The simplest thing to do may be to increase the maxSockets value in the global Agent (the Agent object in node.js is responsible for socket reuse in HTTP connections). Increasing it from 5 to 200 will help in two ways: each connection will be closed after use (no ECONNRESET), and throughput improves (the 6th request does not wait for a free socket).
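As a minimal sketch, raising the limit on the global agents could look like this. It uses only the globalAgent.maxSockets property of the core http and https modules; the value 200 is the one from the text, not a tuned recommendation.

```javascript
const http = require('http');
const https = require('https');

// Raise the per-host socket limit on both global agents.
// 200 is the value used in the text above; adjust it to your workload.
http.globalAgent.maxSockets = 200;
https.globalAgent.maxSockets = 200;
```

Both assignments are needed because HTTP and HTTPS requests go through separate global agents.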
There is one other way: you can opt out of connection pooling by setting agent to false on the request. Many libs do this. I’ll explain later why this is a bad idea (in high-traffic scenarios; otherwise it doesn’t hurt).
Warning! Agent will work differently in node 0.12 and above (see this issue). It will work properly in more situations, but the default is still not ideal and some problems may still occur, though less often.
The result
The result of increasing maxSockets: instead of ECONNRESET there is now EADDRINUSE as the error, and CPU usage has increased (making a secure socket is costly).
Looking for solution - second try
Why does EADDRINUSE happen? After some short research I understood that it means there is no available port for outbound connections. What? No available ephemeral ports?! There are thousands of them and I’m not making that many connections from one VM, so why do I hit the limit? Welcome to TIME_WAIT hell.
I won’t explain here what TIME_WAIT is. There is a great article explaining it fully. What is important is that a socket port closed properly by my server (I made the connection and I closed it on my end) will wait 4 minutes and cannot be reused (to connect to the same server) during that time. In a fully controlled VM you can lower the time to, say, 1 minute and also increase the pool of ephemeral ports. Neither of those tricks is possible in Windows Azure Web Sites. So what to do?
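A rough back-of-the-envelope calculation shows how quickly this limit can be reached. The numbers below are assumptions for illustration only: 16384 ephemeral ports (the default dynamic range on recent Windows versions; the Web Sites sandbox may allow fewer), the 4-minute TIME_WAIT mentioned above, and a hypothetical 4 Table Storage calls per incoming request.

```javascript
// Rough estimate of how many new outbound connections per second
// can be sustained before ephemeral ports run out.
function maxNewConnectionsPerSecond(ephemeralPorts, timeWaitSeconds) {
  // Each closed connection keeps its port busy for timeWaitSeconds.
  return ephemeralPorts / timeWaitSeconds;
}

// Assumed values, see the paragraph above.
const EPHEMERAL_PORTS = 16384;  // assumption: default Windows dynamic port range
const TIME_WAIT_SECONDS = 240;  // 4 minutes, as stated in the text
const CALLS_PER_REQUEST = 4;    // hypothetical "several" outbound calls per request

const sustainable = maxNewConnectionsPerSecond(EPHEMERAL_PORTS, TIME_WAIT_SECONDS);
const needed = 20 * CALLS_PER_REQUEST; // 20 req/s of incoming traffic

console.log(Math.floor(sustainable)); // 68
console.log(needed > sustainable);    // true
```

Under these assumptions, opening a fresh connection for every outbound call needs about 80 new connections per second, while only about 68 per second can be sustained, so the port pool is eventually exhausted.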
To library creators: never opt out of the connection pool without an option to change that, because in a high-traffic scenario I have to patch your code because of the TCP TIME_WAIT state!
We have to use a connection pool. It means we may get ECONNRESET errors, but why do they happen so often? Because load balancers do not like long-running idle connections, and because they do not want to have the TIME_WAIT issue themselves (as the closing side), they close connections abnormally. The Windows Azure load balancer does this after 60 seconds of inactivity.
The maxSockets dilemma
So we know we want a connection pool and we have to handle ECONNRESET errors (it is best to create a new socket, retry the request, and then return the freed socket to the pool). But how to set the maxSockets value properly? Setting it to infinity may seem right, but in Web Sites one VM may run many sites, so it is best to set a limit. It cannot be too low either, as we would once again get socket throttling.
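The retry part can be sketched as a small generic wrapper. This is a minimal illustration, not a replacement for a tested retry module: retryOnReset and its parameters are hypothetical names, and it assumes the wrapped operation is safe to repeat and reports failures through a Node-style callback with an error whose code is ECONNRESET.

```javascript
// Hypothetical helper: runs operation(callback) and repeats it when the
// connection was reset, up to maxRetries additional attempts.
function retryOnReset(operation, maxRetries, callback) {
  let attempt = 0;

  function run() {
    operation(function (err, result) {
      const wasReset = Boolean(err) && err.code === 'ECONNRESET';
      if (wasReset && attempt < maxRetries) {
        attempt += 1;
        return run(); // try again
      }
      // Success, a different error, or retries exhausted.
      callback(err, result);
    });
  }

  run();
}

// Usage with a fake operation that fails once, then succeeds.
let calls = 0;
function fakeRequest(done) {
  calls += 1;
  if (calls === 1) {
    const err = new Error('socket hang up');
    err.code = 'ECONNRESET';
    return done(err);
  }
  done(null, 'ok');
}

let outcome;
retryOnReset(fakeRequest, 2, function (err, result) {
  outcome = err ? 'failed' : result;
});
console.log(outcome, calls); // ok 2
```

Only ECONNRESET triggers a retry here; any other error is passed straight to the caller, which keeps the wrapper from hiding real failures.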
The best solution in Web Sites seems to be to have sockets close automatically when idle for 55 seconds (to limit the ECONNRESET issue as much as possible and to free ephemeral ports that are no longer needed) and to set the max sockets limit to 500 or a similarly high value (but not extremely high, so throttling may still occur, but now as a safety measure). Of course, set maxSockets to a lower value if the upstream server is not able to handle such load (Azure Table Storage can handle several thousand).
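In current Node.js versions (not the 0.8/0.10 releases this post is about), a dedicated agent along these lines might look as follows. This is a sketch under assumptions: keepAlive, maxSockets, maxFreeSockets and timeout are options of the core https.Agent, but whether the timeout reliably closes idle pooled sockets depends on the Node.js version, so verify it on yours. The value 50 for maxFreeSockets and the host name are placeholders.

```javascript
const https = require('https');

// Dedicated agent for one upstream service instead of the global one.
const tableStorageAgent = new https.Agent({
  keepAlive: true,    // keep sockets in the pool between requests
  maxSockets: 500,    // high limit, kept only as a safety throttle
  maxFreeSockets: 50, // assumption: how many idle sockets to keep per host
  timeout: 55000      // socket inactivity timeout, just under the 60 s cut-off
});

// Pass the agent with every request to that service.
// 'example.invalid' is a placeholder host, not a real endpoint.
const requestOptions = {
  host: 'example.invalid',
  path: '/',
  agent: tableStorageAgent
};
```

Using a separate agent per upstream service keeps one busy dependency from using up the socket budget of the others.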
My recommendations
The points below are true for the 0.8 and 0.10 versions of node.js. Except for handling ECONNRESET errors (and maybe idle-time-based closing), they won’t be needed in 0.12 and above. Take a look at the maxFreeSockets value in the new Agent.
- If you write your own library, do not turn off connection pooling. Allow setting a custom Agent, or at least use the global one (default behaviour).
- Retry the request on ECONNRESET errors if needed. It is best to use some generic module for it.
- Limit connection idle time to 55 seconds (especially in Windows Azure Web Sites). This means using your own Agent.
- Set maxSockets to a high (but not too high) value as a safety limit.
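For library authors, the first point can be sketched like this. createClient and its option names are hypothetical; the only real API involved is the agent option of http.request, where leaving it undefined selects the global agent and false disables pooling.

```javascript
const http = require('http');

// Hypothetical library factory: the caller may inject an Agent.
function createClient(options) {
  options = options || {};

  // Builds the options object that would be passed to http.request().
  function buildRequestOptions(path) {
    const requestOptions = { host: options.host, path: path };
    if (options.agent !== undefined) {
      // Caller-supplied Agent (or an explicit false to opt out of pooling).
      requestOptions.agent = options.agent;
    }
    // With no agent key, http.request() falls back to http.globalAgent.
    return requestOptions;
  }

  return { buildRequestOptions: buildRequestOptions };
}

// Usage: one client with a custom agent, one with the default.
// 'example.invalid' is a placeholder host.
const customAgent = new http.Agent({ maxSockets: 500 });
const pooled = createClient({ host: 'example.invalid', agent: customAgent });
const fallback = createClient({ host: 'example.invalid' });

console.log(pooled.buildRequestOptions('/').agent === customAgent); // true
console.log('agent' in fallback.buildRequestOptions('/'));          // false
```

The library never forces agent to false on its own, so the decision about pooling stays with the application.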