Tuesday, 18 June 2019

Troubleshooting AWS lambda "too many open files" errors

Some time ago - in the not so distant past - I was offered the opportunity to assist a colleague with troubleshooting a system that was facing some performance issues.

Problem 1:  Dedicated caching server running out of memory and constantly swapping

This cache had been put in place to reduce the need to call out to other services for data that might be needed several times in a given time window.

The developers who had set up this system had moved on to other teams or companies, so we didn't have much context to go by as to whether there was any pattern to the distribution of requests for the data that could have hinted at a sensible expiry policy.

Solution 1:  Get a bigger boat cache

Left to rely on the cache's hit-rate metric to tell us whether it was actually fit for purpose, we decided to replace the caching server with the next larger instance size - and also to specify the recommended reserved-memory parameters, which seemed to have been missed in the existing setup.

With the new cache in place, swapping did not return as an issue.  However, errors were still showing up in the logs for the lambda that was utilising the cache, and experience says that we should never leave a job half-done.

Problem 2: Lambda running out of available file handles / sockets

The logs were showing a range of different types of errors that may or may not have been related:
  • Unable to perform a DNS lookup
  • Unable to load a class
  • Unable to connect to the cache
One of the error messages provided a vital hint: "too many open files".

This was my first experience of properly getting my hands dirty with AWS lambdas, so it was interesting to learn that AWS can keep an instance of the lambda around for much longer than a single invocation, reusing it for subsequent calls.

Our lambda was being called several times per minute, which was enough to keep some instances of it alive long enough to run out of resources.
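
A minimal sketch of what that reuse means in practice (the class name and payload types here are made up, and it assumes the standard aws-lambda-java-core library): anything created outside the handler method survives from one invocation to the next while the environment stays warm.

// Sketch only - not the original code. State created outside handleRequest
// survives across invocations while AWS keeps the same environment warm,
// which is also why leaked descriptors pile up instead of being reclaimed.
import com.amazonaws.services.lambda.runtime.Context;
import com.amazonaws.services.lambda.runtime.RequestHandler;

import java.util.concurrent.atomic.AtomicInteger;

public class ReuseDemoHandler implements RequestHandler<String, String> {

    // Initialised once per execution environment, not once per invocation.
    private static final AtomicInteger INVOCATION_COUNT = new AtomicInteger();

    @Override
    public String handleRequest(String input, Context context) {
        // On a warm environment this number keeps climbing call after call.
        return "Invocation #" + INVOCATION_COUNT.incrementAndGet()
                + " in this execution environment";
    }
}

On a cold start the counter resets to 1; on a warm environment it keeps climbing - and so does anything else the code leaks.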

A bit of reading of the documentation revealed that we should be able to utilise over a thousand file handles / sockets in our lambda (the documented limit is 1,024 file descriptors) - which should be plenty.

Diagnosing the issue: Measuring open file handles at runtime

I had a theory at this point that the "Unable to load a class" issue might have been triggering the other issues, so I introduced some diagnostic logging that would output the total number of open file descriptors on each invocation of the lambda.  Once this was in place, it appeared that around three more files / sockets were open after each invocation - so the leak had nothing to do with the occasional class-loading failure.
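
The exact logging isn't important, but one way to get that number on a lambda (which runs on Amazon Linux) is to count the entries under /proc/self/fd - something along these lines, with the class and method names made up for illustration:

// Hypothetical sketch of the file-descriptor diagnostic: count the entries
// under /proc/self/fd, which lists every descriptor the current process holds.
import java.io.File;

public final class FdDiagnostics {

    private FdDiagnostics() {
    }

    // Returns the number of open file descriptors for this process,
    // or -1 if /proc is not available. The directory listing itself briefly
    // holds a descriptor open, so treat the number as approximate and watch
    // the trend rather than the absolute value.
    public static int openFileDescriptorCount() {
        File[] entries = new File("/proc/self/fd").listFiles();
        return entries == null ? -1 : entries.length;
    }
}

Logging this value at the start and end of each invocation is enough to show whether the count keeps climbing from one call to the next.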

The next revelation came when I noticed that the lambda loaded a config file on each invocation, but showed no sign of ever closing that file afterwards.
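
The pattern is a familiar one; assuming a plain java.util.Properties file (the path below is made up), the leak - and the try-with-resources fix - look roughly like this:

// Hypothetical reconstruction of the config-file leak; the actual file and
// loading code are not shown here, and CONFIG_PATH is a made-up location.
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Properties;

public final class ConfigLoader {

    private static final Path CONFIG_PATH = Paths.get("/var/task/config.properties");

    private ConfigLoader() {
    }

    // Leaky version: the InputStream is never closed, so each invocation that
    // runs this on a warm lambda holds on to one more file descriptor.
    public static Properties loadConfigLeaky() throws IOException {
        InputStream in = Files.newInputStream(CONFIG_PATH);
        Properties props = new Properties();
        props.load(in);
        return props;
    }

    // Fixed version: try-with-resources closes the stream even if load() throws.
    public static Properties loadConfig() throws IOException {
        try (InputStream in = Files.newInputStream(CONFIG_PATH)) {
            Properties props = new Properties();
            props.load(in);
            return props;
        }
    }
}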

Tidying up just that particular resource handling didn't make the problem go away.

Solution 2: Properly clean up resources

After some more digging around I realised that the caching client had its own connection pool that was set up as part of each call, but never shut down afterwards.  This wasn't trivial to change, as the setup for the caching client was nested within a few layers of Builder classes, but some refactoring enabled us to hook the client's shutdown into the lifecycle of the lambda invocation.
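
The client itself isn't named here, so the sketch below uses a made-up CacheClient stand-in, but the shape of the change was roughly this: whatever gets built for an invocation (along with its connection pool) is closed before the handler returns.

// Hypothetical sketch of tying the caching client's lifecycle to the
// invocation. CacheClient and newCacheClient() are stand-ins for the real
// (unnamed) client; RequestHandler and Context come from aws-lambda-java-core.
import com.amazonaws.services.lambda.runtime.Context;
import com.amazonaws.services.lambda.runtime.RequestHandler;

public class CacheBackedHandler implements RequestHandler<String, String> {

    // Minimal stand-in so the sketch is self-contained.
    interface CacheClient extends AutoCloseable {
        String lookup(String key);

        @Override
        void close(); // no checked exception, so the finally block stays simple
    }

    // Placeholder for the real builder chain; returns a trivial in-memory
    // client purely to keep the example runnable.
    private CacheClient newCacheClient() {
        return new CacheClient() {
            @Override
            public String lookup(String key) {
                return "value-for-" + key;
            }

            @Override
            public void close() {
                // The real client would shut down its connection pool here.
            }
        };
    }

    @Override
    public String handleRequest(String key, Context context) {
        // The client (and its connection pool) is built for this invocation...
        CacheClient client = newCacheClient();
        try {
            return client.lookup(key);
        } finally {
            // ...and shut down before the handler returns, so a warm execution
            // environment doesn't keep accumulating open sockets across calls.
            client.close();
        }
    }
}

In our case the try / finally lived a few layers down, behind the Builder classes, but the principle is the same: the scope of the connection pool matches the scope of the invocation.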

Monitoring of the logs showed that the open files count was now stable.


Conclusion

Lambdas are just like any other code we write.  We need to pay attention to the lifecycle of the resources they use and ensure that those resources are cleaned up once we have finished with them.

Next steps

Cache entry expiry and cache right-sizing

At the time of writing this post the new, larger cache is growing steadily and showing no sign of levelling off, so there is no reason to assume that the swapping issue won't eventually return.

The current cache logic lacks a default expiry for entries that become stale.  For some of the entries involved there is no reason to expect that the values will be re-used very often (some might not be encountered more than once a week / month / name your favourite time unit).

Cache connection pooling optimisation
If AWS Lambdas supported shutdown hooks, or some other mechanism for detecting when an instance is being abandoned, then we could update the lambda to set up the cache connection pool once at initialisation and close it on shutdown - but not today.
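
The initialisation half is easy enough - a hypothetical sketch (again with a made-up CacheClient stand-in) builds the client once per execution environment; the part that's missing is a reliable place to close it.

// Hypothetical sketch of init-time setup: the client and its pool would be
// created once per execution environment instead of once per invocation.
// The missing half is a reliable point at which to close it before AWS
// reclaims the environment, which is why we didn't go this way.
import com.amazonaws.services.lambda.runtime.Context;
import com.amazonaws.services.lambda.runtime.RequestHandler;

public class InitTimeCacheHandler implements RequestHandler<String, String> {

    // Made-up stand-in for the real caching client.
    interface CacheClient {
        String lookup(String key);
    }

    // Built once, during the init phase, and reused by every invocation that
    // lands on this execution environment.
    private final CacheClient client = key -> "value-for-" + key;

    @Override
    public String handleRequest(String key, Context context) {
        return client.lookup(key);
    }
}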