AWS re:Invent 2020: Handling errors in a serverless world


Hi, my name is Josh Kahn, and I'm a principal solutions architect at AWS focused on serverless. In our session today, we're going to explore opportunities to enhance reliability and add visibility to errors that you might encounter in your serverless applications. No matter how hard we try, no matter how much we debug, errors are bound to happen. Trust me, I've run into plenty myself. Maybe your application receives an unexpected input, or a developer accidentally introduces some erroneous code. Things happen.

This is an error message that I've run into more times than I care to count. If you're not familiar with it, you might see this "internal server error" when you're using Amazon API Gateway and the downstream service, in our case generally a Lambda function, throws an error or returns a malformed response. In that case, API Gateway is going to return this rather innocuous internal server error. As a developer, that's a pretty hard error message to deal with; there's not a lot of detail here. For the end client, that's fine: we don't want to tell the client too much about the underlying infrastructure and what might have caused the failure in the application. But as a developer, unless I'm intimately familiar with the architecture, it can be hard to troubleshoot an error like this without diving into lots and lots of CloudWatch logs. Lambda can generate a lot of log messages, particularly at scale, and while CloudWatch has introduced a lot of great tools over the years that let you dig into those logs more easily, if the error message wasn't actually logged by the developer, or the error somehow got swallowed in the application, it becomes even harder, if not impossible, to find.

In our session today, we're going to talk about how to add reliability to your serverless architectures, first by understanding the automatic retry behaviors that are built into these services. We're also going to talk about the configuration options that are available for those retry behaviors, so that you have better control over what happens when an error occurs. We're going to talk about how to add visibility to errors within your Lambda functions by looking at the different invocation types that Lambda offers; the error handling approach and the configuration options differ based on the invocation type. And we're going to take an approach that minimizes developer impact. That not only means our developers don't have to spend time building this into every single Lambda function they write, it also means we get more consistency in how errors are handled across the application.

Please note that this is a 300-level session. We're going to be discussing architectural principles and also looking at code samples. This talk is not intended to be an introduction to Lambda or to serverless. We're also looking at just one particular vein of observability, which is visibility into errors; if you're interested in a more general approach to observability, or a discussion of Lambda or serverless in general, there are some other great sessions available at re:Invent this year.

Throughout the talk, we're going to use this very simplistic serverless e-commerce architecture to explore the different invocation types that are available in a serverless application, and for each one we'll talk about the error handling approach and the configuration options available. The first is a synchronous invocation.
In this case, we have a mobile application making a web API-type call to Amazon API Gateway, and API Gateway then invoking a Lambda function that writes to a DynamoDB table. The mobile application is actually waiting for a response to this operation, and API Gateway has a 30-second timeout, so your Lambda function needs to respond within that time because the client is waiting around for the response; it's synchronous. For those familiar, this is also where you might see that internal server error message I showed earlier.

The second invocation type is asynchronous. In this case, we have Amazon EventBridge writing a message to a queue internally managed by the Lambda service, not your function. When EventBridge successfully writes that message to the service, Lambda essentially responds with "hey, I got it," and EventBridge moves on. It's not waiting for your function to actually process that event or message. That's meaningful because the work is going to happen later down the road, whenever your function is available and free to do the work; again, it happens asynchronously.

The next type of invocation is poll-based, and we're actually going to break poll-based invocations into two different categories. In both cases, the Lambda service, not your function, is polling some event source for messages or records that need to be processed, and in both cases records are delivered in batches, groups of records rather than a single event, that your function is responsible for processing. The first category of poll-based invocation is streaming sources, typically things like DynamoDB Streams or Kinesis. The second category is SQS, where again the Lambda service is polling for messages on the queue and messages are delivered in batches. The error handling approach and the configuration options available for those two poll-based event sources differ, so we're going to break them out separately in the talk.

The last item on the screen is AWS Step Functions, and we're going to spend a little bit of time on Step Functions towards the end. Step Functions gives you an incredible amount of flexibility and configurability right within the Amazon States Language that you use to define your state machine; you can define things like error handling and retry behavior right within the state machine definition itself.

Let's get started by talking about synchronous invocations. For each invocation type, you'll notice we follow a pattern in the information we cover. First, we talk about the retry behavior that's built into the system. In the case of synchronous invocations, there is no built-in retry behavior: if your function fails for whatever reason, the client, in our e-commerce example the mobile application, is just going to get an error message. The client can retry if it would like to, but the service itself is not going to automatically retry.

Now, we have a few goals for our error handling, and no matter the invocation type these goals are pretty consistent, so we won't talk about them in depth for each one. The primary goal across all of these errors is to log them, and to log them in a consistent, structured fashion that allows us to generate metrics, and alarms from those metrics. We also want to log the messages and the errors in such a way that we can derive insights from them even at very large scale, which lets our developers easily determine what's causing an error and even build capabilities like error budgets to understand the service level of the services you're managing and building.
We also want to handle and return from errors in such a way that we get useful tracing of our distributed system from tools like AWS X-Ray. As you build these serverless architectures, it's incredibly important to understand where a failure actually occurs.

For synchronous invocations in particular, it's also important to return a sensible, sanitized error message to the caller. There are some types of errors the client could actually recover from, versus others, like a back-end code problem or misconfiguration, that the client can't recover from even if it retries. So when it makes sense, we want to return an obfuscated, generic error message like "internal server error," and in other cases we might want to return an error message that indicates, for example, that an item wasn't found in our e-commerce database. We're going to differentiate between those two types of errors. The first category we'll call client-type errors: a bad payload comes in from the client, or it's trying to find an item that doesn't exist in the database. The second type we'll call server or function-type errors, errors the client can't necessarily recover from. In our e-commerce architecture that would be, for example, our Lambda function not being able to write to the DynamoDB table because it doesn't have put-item permission; it's going to throw an AccessDenied error. Also, you can only run so many concurrent invocations of Lambda functions in your account in a given region at any time, and anything beyond that gets throttled; essentially, a throttle is an error as well. We want to be able to report on those things independently.

For each invocation type, we're going to talk about the actual code that your developer would need to write in the Lambda function, as well as an approach for building centralized middleware that can be used across all functions of a given invocation type. If you're not familiar with middleware, it's custom code that wraps the function handler, and that code can be called before and/or after the function handler itself. There are libraries that support building middleware across all of the runtimes that Lambda supports; a popular one for the Node.js runtime is a library called Middy.js, and it gives you a lot of flexibility to build some great middleware. We're taking the approach of using middleware throughout this talk because it centralizes how errors are handled and keeps that handling consistent across error types without your developers needing to do a lot of work.

In the case of our synchronous invocation, the Lambda function needs to differentiate between those client and function-type errors. For a client-type error, something like a book not being found in our e-commerce catalog, it's going to throw a custom error type, something like BookNotFoundError, and you might equate that with an HTTP error code in the 400 range, maybe 404 in this case. Any other type of error, like AccessDenied, the function isn't even going to try to catch; we just let it bubble up to the middleware.
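To make that concrete, here is a minimal sketch of what such custom client-error types might look like in Python. BookNotFoundError is the name used in the talk; the ClientFacingError base class and the status codes are illustrative additions, not part of any library:

```python
class ClientFacingError(Exception):
    """Base class for errors the caller can act on (4xx-style errors)."""
    status_code = 400

    def __init__(self, message):
        super().__init__(message)
        self.message = message


class BookNotFoundError(ClientFacingError):
    """Raised when a requested book does not exist in the catalog."""
    status_code = 404


# Anything else (AccessDenied from DynamoDB, throttles, plain bugs) is simply
# not caught in the handler; it bubbles up to the middleware as a
# server/function-type error.
```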
In that middleware layer, we catch any errors or exceptions that occur in the Lambda function handler itself. First, the middleware grabs that error and logs it to Amazon CloudWatch using something called embedded metric format, or EMF. EMF is a capability that was launched not all that long ago that lets you ingest data in the form of log messages and generate metrics from them. That's a lower-latency and lower-cost approach than you might have used historically with the PutMetricData operation. EMF is really powerful: it gives us metrics, it gives us the structured logging that we're going to use later, and you'll see we use EMF repeatedly throughout all of these invocation types to log the error, the service that's having the problem, and so on.

In the synchronous case, the middleware then inspects the error that it caught. If it's one of those custom errors, something the client should be aware of, it can return successfully from the Lambda function. When I say return successfully, I mean the Lambda function still completes; it doesn't throw an error, but it returns an error status code, again maybe 404 in the case of a record not being found, with a useful error message for the client. If it's any other type of error, something happening on the back end that the client can't recover from, we want to fail that Lambda invocation by throwing or raising an exception.

Let's take a look at how to implement that in Python. While I've implemented this in Python for the talk today, you could implement this type of approach using any of the runtimes that Lambda supports. For our synchronous implementation, and actually for all of the different invocation types, we're using an open-source project called AWS Lambda Powertools. Powertools is a suite of utilities for Lambda functions that makes it easy to adopt best practices such as tracing, structured logging, and custom metrics. It also gives us the ability to define our own custom middleware, and that's what we're doing here.

At the top of the code sample we have our function handler. It's really pretty simple, but it has a decorator, @error_handler, just above it that ties it to the middleware down below. In the function handler, we go perform the find-book operation; if no results come back from the database, we raise a BookNotFoundError with a little bit of detail about what happened, and otherwise we return a successful status code and a JSON payload describing the book. In the error handler itself, we first instantiate a Powertools logger, which gives us a lot of extra goodness and detail out of the box, and then we essentially wrap a big try/except block around the function handler, so that if any exception is thrown we catch it, log it using the Powertools logger, and then look at what error was actually thrown. If it's a 400, client-type error, we can return successfully with a status code of something like 404 and a body containing the error message. But if it's an actual failure of the function, we want to raise a useful exception that we can then log and that our developers can diagnose.
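Here is a minimal sketch of what that handler and middleware might look like, assuming AWS Lambda Powertools for Python (its Logger, single_metric, and middleware factory utilities) and the custom error classes sketched earlier. The find_book helper, the event shape, and the metric and function names are illustrative; this is a reconstruction of the described pattern, not the exact code shown in the session:

```python
import json

from aws_lambda_powertools import Logger
from aws_lambda_powertools.metrics import MetricUnit, single_metric
from aws_lambda_powertools.middleware_factory import lambda_handler_decorator

logger = Logger(service="bookstore")


@lambda_handler_decorator
def error_handler(handler, event, context):
    """Middleware: catch anything the handler raises, log it in a structured
    way plus an EMF metric, then either return a client-friendly response or
    fail the invocation."""
    try:
        return handler(event, context)
    except Exception as err:
        # Structured log line with the stack trace, plus a CloudWatch metric
        # emitted via embedded metric format
        logger.exception("Error handling request")
        with single_metric(
            name="FunctionError", unit=MetricUnit.Count, value=1, namespace="Bookstore"
        ) as metric:
            metric.add_dimension(name="service", value="bookstore")

        # Client-type errors: complete successfully with a 4xx response body
        if isinstance(err, ClientFacingError):
            return {
                "statusCode": err.status_code,
                "body": json.dumps({"message": err.message}),
            }

        # Anything else is a function failure: re-raise so the invocation fails
        raise


@error_handler
def lambda_handler(event, context):
    book_id = event["pathParameters"]["bookId"]
    book = find_book(book_id)  # hypothetical data-access helper
    if book is None:
        raise BookNotFoundError(f"Book {book_id} was not found")
    return {"statusCode": 200, "body": json.dumps(book)}
```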
That difference between returning successfully from your function and failing the invocation makes a big difference when we talk about tracing. This is an example using AWS X-Ray where I have three different functions fronted by Amazon API Gateway. The function at the top of the service map always fails; it always raises an exception, and you can see it's annotated as failing with that yellow circle all the way around it. Of the other two functions, one fails about 50 percent of the time because the book it's looking for isn't in the database. You'll see, though, that it always has a green circle around it, which means the function is still completing successfully. But once you reach API Gateway, you'll notice that about 50 percent of the time the API returns successfully with a 200-type status code; about 25 percent of the time it's yellow, which means a warning and a client-type error; and the other 25 percent are server-side errors, and those are red. API Gateway will generally return something like internal server error in that case. So by returning properly from our Lambda function, we can get really useful traces out of tools like X-Ray. One thing to note for the Node.js fans out there: if you implement your Node.js function with an asynchronous handler, you'll want to use promise rejection when the function fails, rather than just throwing an exception, so that it shows up properly in the trace.

Next, let's talk about asynchronous invocations. In our e-commerce example this was EventBridge publishing a message, but Amazon SNS and Amazon S3 are also very popular asynchronous event sources. If your Lambda function fails while processing an asynchronous event, the Lambda service will automatically retry two times. You are charged for those two additional invocations of your function, but it does retry on failure automatically, and you have a couple of options to change that behavior. First, you can use the maximum retry attempts configuration to lower the number of times the function will automatically be retried, down to zero or one retries. You can also set the maximum event age, so if you only want to retry an event while it's still recent, you can do that as well. So you have some flexibility here in terms of how often your function will automatically retry. Now, if the event source fails to deliver the event to the Lambda service, or for some reason your function gets throttled, the event source will retry publishing that message over hours to even days using exponential backoff. So it's very important to understand the error behavior and how the system retries, to avoid a lot of unnecessary invocations of your function.
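As a rough sketch of where those knobs live (not from the session), the retry attempts, maximum event age, and the on-failure destination covered next are all part of a function's event invoke configuration. For example, with boto3; the function name, ARN, and values are illustrative:

```python
import boto3

lambda_client = boto3.client("lambda")

# Asynchronous invocation settings for a function: fewer automatic retries,
# a shorter maximum event age, and an on-failure destination (discussed next).
lambda_client.put_function_event_invoke_config(
    FunctionName="process-order",        # illustrative function name
    MaximumRetryAttempts=1,              # allowed range is 0-2; the default is 2
    MaximumEventAgeInSeconds=3600,       # stop retrying events older than 1 hour
    DestinationConfig={
        "OnFailure": {
            "Destination": "arn:aws:sqs:us-east-1:123456789012:failed-orders"
        }
    },
)
```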
Asynchronous invocation also gives us an additional tool in the toolbox that we didn't have for synchronous invocations: a feature called Lambda destinations, released at re:Invent 2019. Essentially, what we can do with destinations is route the result of a function execution, whether it's a success or, in our case, a failure, automatically to an Amazon SNS topic, an SQS queue, an EventBridge event bus, or even another Lambda function, without any additional code. The payload that gets routed to that destination includes not just the original event or payload, but also some additional metadata like the number of retry attempts, the function name, and so on. When you go to configure your Lambda destination, you can use something like AWS SAM, as we're showing on the screen, and you configure that destination as a property of the serverless function. That's important to note, because a little later in the talk we're going to configure another destination, and we'll have to configure it in a different way.

All right, so how do we handle errors in the asynchronous invocation? In your function itself, generally just throw the error and let the middleware handle it. If your function does call some downstream service, you might want to consider implementing retry logic or some sort of circuit breaker pattern, but in most cases throwing the error is just fine. In the middleware, we want to catch all the errors, again log them to CloudWatch using EMF, and then re-raise the error so that the function fails. By re-raising, once you've reached the maximum number of retries that event will be published to your on-failure destination, again with the original event data and some additional metadata. The implementation looks very similar to the synchronous one; again we're using AWS Lambda Powertools to build the middleware. Here we're showing how to add some additional detail to the messages that get logged to CloudWatch by adding the Lambda context, and we're also showing how you could break out different types of errors; in this example we're separating AWS service errors raised by the boto library and handling them slightly differently than any other generalized exception.

Next, let's talk about the first of our poll-based invocations: streaming event sources like DynamoDB Streams or Kinesis. In these cases there are messages sitting on a stream, and the Lambda service will keep retrying for as long as those messages are on the stream if the function fails. Again, it's important to remember that these messages are delivered in batches; it's not just one message, it's generally a batch. You can control which messages get delivered in that batch using the maximum record age and maximum retry attempts settings, which are similar to the asynchronous options of the same name, but there's also a feature called bisect batch on function error. Essentially, what that does is halve the batch and invoke the function again on each half every time the batch fails, and eventually, as we halve and halve and halve, you get down to just the one record, or the small number of records, that are actually failing being processed on their own, with the rest of the batch processed successfully. If you do choose to use that bisect-batch approach, it's important that your function be idempotent, which means it can run multiple times without additional side effects. So bisecting the batch is a very powerful feature, but it's important to understand what it means. It does increase your function invocations, but those extra invocations are not counted toward the retry limit.
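As a rough sketch (again, not from the session), these stream settings, including the on-failure destination we come back to in a moment, live on the event source mapping rather than on the function. With boto3; the names, ARNs, and values are illustrative:

```python
import boto3

lambda_client = boto3.client("lambda")

# Retry and failure settings for a stream source (Kinesis / DynamoDB Streams)
# are configured on the event source mapping itself, not on the function.
lambda_client.create_event_source_mapping(
    FunctionName="process-order-stream",  # illustrative function name
    EventSourceArn="arn:aws:kinesis:us-east-1:123456789012:stream/orders",
    StartingPosition="LATEST",
    BatchSize=100,
    MaximumRetryAttempts=2,               # stop retrying a failing batch after 2 attempts
    MaximumRecordAgeInSeconds=3600,       # skip records older than 1 hour
    BisectBatchOnFunctionError=True,      # split failing batches to isolate bad records
    DestinationConfig={
        "OnFailure": {
            "Destination": "arn:aws:sqs:us-east-1:123456789012:orders-stream-failures"
        }
    },
)
```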
The handling in this case is very similar to our asynchronous invocation: we throw errors, catch them in the middleware, log them, and re-raise the error. We also have access to Lambda destinations here, so we can send any failing batch to an on-failure destination, but there are two things that are important to know. First, your on-failure destination can only be an SQS queue or an SNS topic in this case. Second, that destination does not receive the original payload; instead, it receives metadata about the stream itself, and you can use that metadata to go look up the batch of records that failed. If you're just trying to process the events that show up in your failure queue the same way you would with an asynchronous invocation, it's not going to work that way; you have to go look up that batch on the stream again. And here's something I learned the hard way: when you go to configure your destination for a streaming source, you actually configure that destination on the stream's event source mapping, not on the function. You can see that in this AWS SAM template, where we have a destination config associated with our DynamoDB stream, not with the function itself.

Our second poll-based invocation type is SQS. Again, we've broken SQS out from streaming event sources because the approach to handling errors and the configuration options available are different. Like the streaming event sources, SQS delivers messages in batches to your Lambda function, and those messages will be retried automatically by the Lambda service for as long as the message is on the queue. The one major thing that's different here is that when messages in that batch are successfully processed, your Lambda function can actually go and delete those messages from the queue so they're not included in a future batch. Again, though, the default retry behavior is to automatically retry for as long as the message is on the queue. You do have some configuration options here, like using dead-letter queues, either for the Lambda function itself or for your SQS queue. You also have control of the batch size, from one message all the way up to 10 records in a single batch; we recently also introduced a batch window for time-based batching; and you can control the visibility timeout on the queue itself.

Our approach to handling errors in this case is a little more complex than the others we've talked about. What I would recommend is that in your Lambda function you have a separate record handler, another method that can process each record in the batch individually. When those records are processed successfully, great; but if there's an error, catch it in that record handler method, and then return the results as a group, the successfully processed records and the erroneous ones, to your middleware. The middleware can look at the results of the entire batch. If there are no errors, the middleware can simply return, and in that case the Lambda service will remove all of the messages in that batch from the SQS queue; it's really powerful and simple. But if there is an error, the middleware should iterate through the results, delete the messages that were successfully processed from the queue, and then raise an error indicating that there are erroneous messages left in the batch. You might choose to let your function reprocess those, or immediately move them to a dead-letter queue; that's up to you, based on your business needs. Now, that sounds a little bit complex. It's not terribly hard to implement, but the really good news is that projects like Lambda Powertools give you the ability to handle partial failures when processing batches of messages from SQS event sources. Here is a simple example that demonstrates how to do that using Lambda Powertools in Python.
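A minimal sketch of that approach, assuming the sqs_batch_processor utility from Powertools for Python v1 (later versions offer a newer BatchProcessor API); the record shape and the save_order helper are illustrative, not the session's exact code:

```python
import json

from aws_lambda_powertools import Logger
from aws_lambda_powertools.utilities.batch import sqs_batch_processor

logger = Logger(service="orders")


def record_handler(record):
    """Process a single SQS record; an exception here marks only this record as failed."""
    order = json.loads(record["body"])
    save_order(order)  # hypothetical business logic
    return order["orderId"]


@sqs_batch_processor(record_handler=record_handler)
def lambda_handler(event, context):
    # The decorator calls record_handler for every record in the batch.
    # On partial failure it deletes the successfully processed messages from
    # the queue and raises, so only the failing messages are retried.
    return {"statusCode": 200}
```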
The same capability is also available in the Java version of Powertools, and there's a readily available plugin for Middy.js that you can use to implement the exact same approach to handling errors in SQS records, or batches of records, without having to implement too much on your own. So this has been done for you in a lot of cases, and it makes it a lot easier and more efficient to handle batches of messages from SQS when there are errors.

Let's also talk about Step Functions before we finish up today. As I mentioned earlier, Step Functions is incredibly powerful and very configurable, and we can define the retry behavior as well as error handling right within the state machine itself. I could even implement a centralized error handler right within the state machine definition, so I get consistency right within the state machine, and we can customize this to meet the business needs of the complex process that we're orchestrating.

Let's look at a simple example. Here we have a pretty simple state machine with a task called "Call API" which invokes some Lambda function, and if you look at the JSON on the left, you'll see that we've implemented the retry behavior and try/catch-type logic right within the state machine definition. If that Lambda function returns a TooManyRequestsException error, the state machine will automatically retry that step, or that task, up to two more times with a one-second interval in between. So your retry behavior is defined right within the state machine; the developer of the function doesn't need to know anything about it. After that maximum number of attempts, or if another error occurs in the function, execution falls through to the catch block. Here you can see that if we've exceeded the maximum number of attempts for TooManyRequestsException, it moves over to a "wait and try later" state and presumably tries again. If another error occurs, like ServerUnavailableException, it moves to a different state in the state machine. And if an error occurs that we haven't explicitly named, we can catch it using something called States.ALL, which catches any other error that might occur so we can handle it too. Again, Step Functions gives you a lot of flexibility and configurability here to build in the retry logic and the try/catch-type logic that makes sense for your state machine.
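A sketch of what that kind of state definition might look like in Amazon States Language; the state names, error names, and values here are illustrative, reconstructing what's described above rather than the exact slide:

```json
{
  "Call API": {
    "Type": "Task",
    "Resource": "arn:aws:lambda:us-east-1:123456789012:function:call-api",
    "Retry": [
      {
        "ErrorEquals": ["TooManyRequestsException"],
        "IntervalSeconds": 1,
        "MaxAttempts": 2
      }
    ],
    "Catch": [
      {
        "ErrorEquals": ["TooManyRequestsException"],
        "Next": "Wait and Try Later"
      },
      {
        "ErrorEquals": ["ServerUnavailableException"],
        "Next": "Handle Server Unavailable"
      },
      {
        "ErrorEquals": ["States.ALL"],
        "Next": "Catch-All Error Handler"
      }
    ],
    "End": true
  }
}
```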
We've spent a lot of time on logging today; we talked about logging for pretty much every invocation type. So what does that do for us? It gives us more visibility into our errors. By using CloudWatch embedded metric format, you get metrics, and you can build dashboards and alarms off of those metrics. Because these are structured log messages, you can also, as you can see in the screenshot on the right, do things like run a query in CloudWatch Logs Insights to find all of the error messages in a given time period, so your developers can dive in and figure out what went wrong. And of course, by returning properly from our Lambda function, we can use tools like AWS X-Ray to trace a distributed system, understand what's happening, and find the failures.

All of this is just a start. It gives you a lot more insight into your errors, and you're handling them more consistently, but again, you can also go build things like CloudWatch dashboards and alarms so that service owners are aware of errors exceeding some threshold, and you can use techniques like chaos engineering to generate those errors and make sure your error handling behaves the way you want it to.

I'd like to thank you for taking the time to listen to our talk today. We have some other resources and documentation available to you on Serverless Land; there's a short link as well as a QR code that you can use to access that information. Again, thank you, and please do take a moment to fill out the session survey; we really value your feedback. Thanks, and have a good day.
