A story of EADDRINUSE and ECONNRESET errors

The ECONNRESET error was silenced before node.js 0.8.20, but now it is not (which is good). You think: hmm, but how often will I see it? Not very often. Wrong! It may happen more often than you think, as many load balancers use it to close connections (on purpose, not in a clean way!).

Have you ever seen an EADDRINUSE error in node.js? Almost everyone has. But I'm not thinking about the one related to the listen() method, which you see when trying to start a server. I'm thinking about the connect one, when you try to make an outbound connection. It happens only when you have heavy outbound traffic, especially HTTP(S).

You may not think that these errors have anything to do with the nasty internals of TCP, but they do... And in Windows Azure Web Sites there is only one universal way of getting rid of them...

If you do not want to read everything, go to My recommendations.

The background

Let's assume you have a node.js server running on Windows Azure Web Sites that uses SQL Azure, Azure Table Storage, Apple Push Server, Google GCM and the Facebook API. You start to see ECONNRESET connect errors when traffic gets bigger (> 20 req/s), but only for Azure Table Storage requests (there are often several of them for every incoming request).

Looking for solution - first try

So, ECONNRESET starts to appear. Have I just upgraded to node.js >= 0.8.20? Yes, so we have to handle it. But wait! This error means the connection was not dropped in a clean way, but why? Do I have long-lived connections? Table Storage uses HTTP(S), so it should make a connection and then close it. In node.js 0.8 and 0.10 this is true, but only when there are fewer than 5 concurrent connections. Gotcha!

The simplest thing to do may be to increase the maxSockets value in the global Agent (the Agent object in node.js is responsible for socket reuse in HTTP connections). Increasing it from 5 to 200 helps in two ways: each connection is closed after use (no ECONNRESET), and throughput goes up (the 6th request does not wait for a free socket).
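As a minimal sketch of that change, assuming the libraries you use go through the global agents (200 is simply the value from the text, not a tuned number):

```javascript
var http = require('http');
var https = require('https');

// Raise the per-host socket limit on both global agents.
// In node.js 0.8 and 0.10 the default is 5.
http.globalAgent.maxSockets = 200;
https.globalAgent.maxSockets = 200;
```

This should run once at startup, before the first outbound request is made.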

There is one other way: you can opt out of connection pooling by setting agent to false on the request. Many libs do this. I'll explain later why this is a bad idea (in high-traffic scenarios; otherwise it doesn't hurt).
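For reference, this is roughly what such an opt-out looks like. The helper name unpooledOptions and the host are hypothetical, used only for illustration:

```javascript
var https = require('https');

// Hypothetical helper: builds request options that bypass the Agent,
// so every request opens (and then closes) its own socket.
function unpooledOptions(host, path) {
  return {
    host: host,
    path: path,
    method: 'GET',
    agent: false // opt out of connection pooling for this request
  };
}

// Usage (not executed here; the host is a placeholder):
// https.get(unpooledOptions('example.com', '/'), function (res) {
//   res.resume();
// });
```

If you find agent: false hard-coded in a library you depend on, that is the line to look at when the problems described later show up.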

Warning! Agent will work differently in node 0.12 and above (see this issue). It will work properly in more situations, but the default is still not ideal and some problems may still occur, just less often.

The result

The result of increasing maxSockets: instead of ECONNRESET there is now an EADDRINUSE error, and CPU usage has increased (making a secure socket is costly).

Looking for solution - second try

Why does EADDRINUSE happen? After some short research I understood it means there is no available port for outbound connections. What? No available ephemeral ports?! There are thousands of them and I'm not making that many connections from one VM, so why do I hit the limit? Welcome to TIME_WAIT hell.

I won't explain here what TIME_WAIT is. There is a great article explaining it fully. What is important is that the local port of a socket I closed properly (I made the connection and I closed it on my end) will wait 4 minutes and cannot be reused during that time (to connect to the same server). In a fully controlled VM you can lower the time to, say, 1 minute and also increase the pool of ephemeral ports. None of those tricks are possible in Windows Azure Web Sites. So what to do?
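A rough back-of-the-envelope calculation shows how quickly the limit is reached. It assumes the default Windows dynamic port range (49152-65535) and that all connections go to one remote address and port. Neither number comes from the measurements in this story, and the Web Sites sandbox may allow fewer ports:

```javascript
// Assumptions (not from the article): the default Windows dynamic port
// range 49152-65535, and every connection going to the same remote endpoint.
var ephemeralPorts = 65535 - 49152 + 1; // 16384 ports
var timeWaitSeconds = 4 * 60;           // 4 minutes, as stated above

// Each closed connection blocks its local port for timeWaitSeconds, so the
// sustainable rate of new connections is ports / wait time.
var maxNewConnectionsPerSecond = ephemeralPorts / timeWaitSeconds;

console.log(ephemeralPorts);                         // 16384
console.log(Math.floor(maxNewConnectionsPerSecond)); // 68
```

At 20 incoming requests per second with, say, four Table Storage calls each (the four is an illustrative guess), that is 80 new connections per second, already above this estimate.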

To library creators: never opt out of the connection pool without an option to change it, because in a high-traffic scenario I have to patch your code because of the TCP TIME_WAIT state!

We have to use a connection pool. It means we may get ECONNRESET errors, but why do they happen so often? Because load balancers do not like long-running idle connections, and because they do not want to have the TIME_WAIT issue themselves (as the closing side), they close connections abnormally. The Windows Azure load balancer does this after 60 seconds of inactivity.

The maxSockets dilemma

So we know we want to have a connection pool and we have to handle ECONNRESET errors (it is best to create a new socket, retry the request and then move the free socket to the pool). But how to set the maxSockets value properly? Setting it to infinity may seem right, but in Web Sites one VM may run many sites, so it is best to set a limit. It cannot be too low either, as we would once again get socket throttling.

The best solution in Web Sites seems to be to have sockets close automatically when idle for 55 seconds (to limit the ECONNRESET issue as much as possible and to free ephemeral ports when they are not needed) and to set the max sockets limit to 500 or a similarly high value (but not extremely high, so throttling may still occur, but now as a safety element). Of course, set maxSockets to a lower value if the upstream server is not able to handle such load (Azure Table Storage can handle several thousand).
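A sketch of such an Agent, written against the Agent options found in newer node.js versions (keepAlive, maxSockets, maxFreeSockets, timeout). The 0.8/0.10 Agent does not accept these options, so there you need your own Agent implementation. The name tableAgent and the maxFreeSockets value are illustrative assumptions, and as far as I know the agent destroys a free socket when its timeout fires, but check this against your version:

```javascript
var https = require('https');

// Sketch for newer node.js versions; not valid for the 0.8/0.10 Agent.
var tableAgent = new https.Agent({
  keepAlive: true,    // keep sockets in the pool between requests
  maxSockets: 500,    // safety limit per host, the value from the text
  maxFreeSockets: 50, // illustrative: idle sockets kept per host
  timeout: 55000      // socket inactivity timeout in ms, just under Azure's 60 s
});

// Usage (not executed here): pass the agent in the request options,
// e.g. { host: '...', path: '/', agent: tableAgent }.
```

Using a dedicated Agent per upstream service keeps its limits independent from the global one.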

My recommendations

The points below are true for the 0.8 and 0.10 versions of node.js. Except for handling ECONNRESET errors (and maybe idle-time-based closing), they won't be needed in 0.12 and above. Take a look at the maxFreeSockets value in the new Agent.

  1. If you write your own library, do not turn off connection pooling. Allow setting a custom Agent, or at least use the global one (the default behaviour).
  2. Retry the request on ECONNRESET errors if needed. It is best to use some generic module for it.
  3. Limit connection idle time to 55 seconds (especially in Windows Azure Web Sites). This means using your own Agent.
  4. Set maxSockets to a high (but not too high) value as a safety limit.
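Point 2 can be sketched with a small callback-style helper. The names retryOnReset and fakeRequest are hypothetical, and a real module would probably add a delay between attempts. Keep in mind that retrying is only safe for requests that can be repeated without side effects:

```javascript
// Hypothetical helper: runs `operation(callback)` and retries it when the
// error code is ECONNRESET, up to `retries` extra attempts.
function retryOnReset(operation, retries, callback) {
  operation(function (err, result) {
    if (err && err.code === 'ECONNRESET' && retries > 0) {
      // The pooled socket was dropped by the remote side; try again.
      return retryOnReset(operation, retries - 1, callback);
    }
    callback(err, result);
  });
}

// Usage with a fake operation that fails twice before succeeding.
var attempts = 0;
function fakeRequest(done) {
  attempts += 1;
  if (attempts < 3) {
    var err = new Error('socket hang up');
    err.code = 'ECONNRESET';
    return done(err);
  }
  done(null, 'ok');
}

retryOnReset(fakeRequest, 3, function (err, result) {
  console.log(err, result, attempts); // prints: null ok 3
});
```

Errors other than ECONNRESET are passed straight to the callback, so the usual error handling stays in one place.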
